2025-04-01

Title: PowerGNN: A Topology-Aware Graph Neural Network for Electricity Grids

Authors: Dhruv Suri, Mohak Mangal
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2503.22721
Pdf URL: https://arxiv.org/pdf/2503.22721
Copy Paste: [[2503.22721]] PowerGNN: A Topology-Aware Graph Neural Network for Electricity Grids(https://arxiv.org/abs/2503.22721)
Keywords: generation
Abstract: The increasing penetration of renewable energy sources introduces significant variability and uncertainty in modern power systems, making accurate state prediction critical for reliable grid operation. Conventional forecasting methods often neglect the power grid's inherent topology, limiting their ability to capture complex spatio-temporal dependencies. This paper proposes a topology-aware Graph Neural Network (GNN) framework for predicting power system states under high renewable integration. We construct a graph-based representation of the power network, modeling buses and transmission lines as nodes and edges, and introduce a specialized GNN architecture that integrates GraphSAGE convolutions with Gated Recurrent Units (GRUs) to model both spatial and temporal correlations in system dynamics. The model is trained and evaluated on the NREL 118 test system using realistic, time-synchronous renewable generation profiles. Our results show that the proposed GNN outperforms baseline approaches including fully connected neural networks, linear regression, and rolling mean models, achieving substantial improvements in predictive accuracy. The GNN achieves average RMSEs of 0.13 to 0.17 across all predicted variables and demonstrates consistent performance across spatial locations and operational conditions. These results highlight the potential of topology-aware learning for scalable and robust power system forecasting in future grids with high renewable penetration.
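A minimal sketch of the spatial-then-temporal idea described in this abstract, assuming scalar features per bus, a hypothetical 3-bus line network, and untrained hand-picked weights. The function names, the weights, and the toy network are illustrative assumptions and are not taken from the paper.

```python
import math

def sage_mean_layer(features, neighbors, w_self, w_neigh):
    """One GraphSAGE-style mean-aggregation step on scalar bus features."""
    out = {}
    for bus, x in features.items():
        nbrs = neighbors.get(bus, [])
        # Average the features of directly connected buses (0.0 if isolated).
        agg = sum(features[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        # Mix self and neighborhood signals, then apply a ReLU.
        out[bus] = max(0.0, w_self * x + w_neigh * agg)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(h, x, p):
    """Scalar GRU update (one common convention: z gates the new candidate)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h)             # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h)             # reset gate
    cand = math.tanh(p["wh"] * x + p["uh"] * (r * h))  # candidate state
    return (1.0 - z) * h + z * cand

# Hypothetical 3-bus line network: 0 - 1 - 2.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}
# Hypothetical, untrained GRU weights shared by all buses.
PARAMS = {"wz": 0.5, "uz": 0.5, "wr": 0.5, "ur": 0.5, "wh": 1.0, "uh": 1.0}

def encode(snapshots):
    """Run the spatial layer on each time step, then the GRU over time."""
    hidden = {bus: 0.0 for bus in NEIGHBORS}
    for features in snapshots:
        spatial = sage_mean_layer(features, NEIGHBORS, 0.5, 0.5)
        hidden = {bus: gru_step(hidden[bus], spatial[bus], PARAMS) for bus in hidden}
    return hidden

states = encode([{0: 1.0, 1: 2.0, 2: 3.0}, {0: 1.2, 1: 1.8, 2: 3.1}])
```

In a real model the per-bus hidden state would feed a prediction head, and all weights would be learned; the sketch only shows how topology enters through the neighbor average before the recurrent update.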

Title: A Spatial-temporal Deep Probabilistic Diffusion Model for Reliable Hail Nowcasting with Radar Echo Extrapolation

Authors: Haonan Shi, Long Tian, Jie Tao, Yufei Li, Liming Wang, Xiyang Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.22724
Pdf URL: https://arxiv.org/pdf/2503.22724
Copy Paste: [[2503.22724]] A Spatial-temporal Deep Probabilistic Diffusion Model for Reliable Hail Nowcasting with Radar Echo Extrapolation(https://arxiv.org/abs/2503.22724)
Keywords: generative
Abstract: Hail is a considerable contributor to meteorological disasters, and there is a great need to mitigate its socioeconomic effects through precise forecasts with high resolution, long lead times, and local details across large landscapes. Existing medium-range weather forecasting methods primarily rely on changes in upper air currents and cloud layers to predict precipitation events such as heavy rainfall; they are unsuitable for hail nowcasting because hail is mainly caused by low-altitude, local strong convection associated with terrain. Additionally, radar captures the status of low cloud layers, such as water vapor, droplets, and ice crystals, providing rich signals suitable for hail nowcasting. To this end, we introduce a Spatial-Temporal gEnerAtive Model called SteamCast for hail nowcasting with radar echo extrapolation. It is a deep probabilistic diffusion model based on spatial-temporal representations including radar echoes as well as their position/time embeddings, which we trained on a historical reanalysis archive from the Yan'an Meteorological Bureau in China, where crops such as apples suffer greatly from hail damage. Considering the short-term nature of hail, SteamCast provides 30-minute nowcasts at 6-minute intervals for a single radar reflectivity variable, across 9 different vertical angles, on a latitude-longitude grid with approximately 1 km × 1 km resolution per pixel in Yan'an City, China. By successfully fusing the spatial-temporal features of radar echoes, SteamCast delivers competitive, and in some cases superior, results compared to other deep learning-based models such as PredRNN and VMRNN.

Title: Reasoning Beyond Limits: Advances and Open Problems for LLMs

Authors: Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.22732
Pdf URL: https://arxiv.org/pdf/2503.22732
Copy Paste: [[2503.22732]] Reasoning Beyond Limits: Advances and Open Problems for LLMs(https://arxiv.org/abs/2503.22732)
Keywords: generation, generative
Abstract: Recent generative reasoning breakthroughs have transformed how large language models (LLMs) tackle complex problems by dynamically retrieving and refining information while generating coherent, multi-step thought processes. Techniques such as inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation have been successfully applied to models like DeepSeek-R1, OpenAI's o1 & o3, GPT-4o, Qwen-32B, and various Llama variants, resulting in enhanced reasoning capabilities. In this paper, we provide a comprehensive analysis of the top 27 LLM models released between 2023 and 2025 (including models such as Mistral AI Small 3 24B, DeepSeek-R1, Search-o1, QwQ-32B, and phi-4). Then, we present an extensive overview of training methodologies that spans general training approaches, mixture-of-experts (MoE) and architectural innovations, retrieval-augmented generation (RAG), chain-of-thought and self-improvement techniques, as well as test-time compute scaling, distillation, and reinforcement learning (RL) methods. Finally, we discuss the key challenges in advancing LLM capabilities, including improving multi-step reasoning without human supervision, overcoming limitations in chained tasks, balancing structured prompts with flexibility, and enhancing long-context retrieval and external tool integration.

Title: Cyborg Data: Merging Human with AI Generated Training Data

Authors: Kai North, Christopher Ormerod
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.22736
Pdf URL: https://arxiv.org/pdf/2503.22736
Copy Paste: [[2503.22736]] Cyborg Data: Merging Human with AI Generated Training Data(https://arxiv.org/abs/2503.22736)
Keywords: generative
Abstract: Automated scoring (AS) systems used in large-scale assessment have traditionally used small statistical models that require a large quantity of hand-scored data to make accurate predictions, which can be time-consuming and costly. Generative Large Language Models are trained on many tasks and have shown impressive abilities to generalize to new tasks with little to no data. While these models require substantially more computational power to make predictions, they still require some fine-tuning to meet operational standards. Evidence suggests that these models can exceed human-human levels of agreement even when fine-tuned on small amounts of data. With this in mind, we propose a model distillation pipeline in which a large generative model, a Teacher, teaches a much smaller model, a Student. The Teacher, trained on a small subset of the training data, is used to provide scores on the remaining training data, which is then used to train the Student. We call the resulting dataset "Cyborg Data", as it combines human and machine-scored responses. Our findings show that Student models trained on "Cyborg Data" show performance comparable to training on the entire dataset, while only requiring 10% of the original hand-scored data.
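A minimal sketch of the data flow described in this abstract: keep human scores for a small fraction of responses, fit a Teacher on them, and let the Teacher score the rest. The nearest-length "Teacher", the toy corpus, and all function names are hypothetical stand-ins chosen for illustration; the paper's Teacher is a large generative model.

```python
import random

def fit_nearest_length_teacher(texts, scores):
    """Toy stand-in for the Teacher: copy the score of the hand-scored
    response whose word count is closest (hypothetical, not the paper's model)."""
    examples = [(len(t.split()), s) for t, s in zip(texts, scores)]

    def teacher(text):
        n = len(text.split())
        return min(examples, key=lambda e: abs(e[0] - n))[1]

    return teacher

def build_cyborg_data(responses, human_scores, fit_teacher, fraction=0.10, seed=0):
    """Keep human scores for `fraction` of the data; Teacher-score the rest."""
    order = list(range(len(responses)))
    random.Random(seed).shuffle(order)
    n_human = max(1, round(fraction * len(order)))
    human_idx, machine_idx = order[:n_human], order[n_human:]
    teacher = fit_teacher([responses[i] for i in human_idx],
                          [human_scores[i] for i in human_idx])
    data = [(responses[i], human_scores[i], "human") for i in human_idx]
    data += [(responses[i], teacher(responses[i]), "machine") for i in machine_idx]
    return data

# Hypothetical corpus: 20 responses whose true score depends on length.
responses = ["word " * n for n in range(1, 21)]
human_scores = [0 if n <= 10 else 1 for n in range(1, 21)]
cyborg = build_cyborg_data(responses, human_scores, fit_nearest_length_teacher)
```

A Student model would then be trained on `cyborg`, treating human-scored and machine-scored rows alike; the source tag is kept only so the two kinds of labels can be told apart during analysis.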

Title: Uncertainty-Aware Graph Self-Training with Expectation-Maximization Regularization

Authors: Emily Wang, Michael Chen, Chao Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.22744
Pdf URL: https://arxiv.org/pdf/2503.22744
Copy Paste: [[2503.22744]] Uncertainty-Aware Graph Self-Training with Expectation-Maximization Regularization(https://arxiv.org/abs/2503.22744)
Keywords: generation
Abstract: In this paper, we propose a novel uncertainty-aware graph self-training approach for semi-supervised node classification. Our method introduces an Expectation-Maximization (EM) regularization scheme to incorporate an uncertainty mechanism during pseudo-label generation and model retraining. Unlike conventional graph self-training pipelines that rely on fixed pseudo-labels, our approach iteratively refines label confidences with an EM-inspired uncertainty measure. This ensures that the predictive model focuses on reliable graph regions while gradually incorporating ambiguous nodes. Inspired by prior work on uncertainty-aware self-training techniques \cite{wang2024uncertainty}, our framework is designed to handle noisy graph structures and feature spaces more effectively. Through extensive experiments on several benchmark graph datasets, we demonstrate that our method outperforms strong baselines by a margin of up to 2.5% in accuracy while maintaining lower variance in performance across multiple runs.
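A minimal sketch of uncertainty-gated pseudo-labeling, assuming that confidence is measured as one minus the normalized entropy of a node's predicted class distribution. The abstract does not specify the uncertainty measure, so this choice, the threshold, and the function names are illustrative assumptions.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def confidence(p):
    """1 - normalized entropy: 1.0 for a one-hot prediction, 0.0 for uniform."""
    return 1.0 - entropy(p) / math.log(len(p))

def select_pseudo_labels(probs, threshold):
    """Keep only nodes whose prediction is confident enough.

    Returns {node: (pseudo_label, confidence)}; the confidence can serve as a
    per-node loss weight when the model is retrained.
    """
    selected = {}
    for node, p in probs.items():
        c = confidence(p)
        if c >= threshold:
            label = max(range(len(p)), key=p.__getitem__)
            selected[node] = (label, c)
    return selected

# Hypothetical class probabilities for three unlabeled nodes.
probs = {
    "a": [0.98, 0.01, 0.01],
    "b": [0.34, 0.33, 0.33],
    "c": [0.05, 0.90, 0.05],
}
selected = select_pseudo_labels(probs, threshold=0.5)
```

Lowering the threshold over successive rounds is one simple way to admit ambiguous nodes gradually, which matches the behavior the abstract describes without claiming to be the authors' schedule.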

Title: Graph-Based Uncertainty-Aware Self-Training with Stochastic Node Labeling

Authors: Tom Liu, Anna Wu, Chao Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.22745
Pdf URL: https://arxiv.org/pdf/2503.22745
Copy Paste: [[2503.22745]] Graph-Based Uncertainty-Aware Self-Training with Stochastic Node Labeling(https://arxiv.org/abs/2503.22745)
Keywords: generation
Abstract: Self-training has become a popular semi-supervised learning technique for leveraging unlabeled data. However, the over-confidence of pseudo-labels remains a key challenge. In this paper, we propose a novel graph-based uncertainty-aware self-training (GUST) framework to combat over-confidence in node classification. Drawing inspiration from the uncertainty integration idea introduced by Wang et al. \cite{wang2024uncertainty}, our method largely diverges from previous self-training approaches by focusing on stochastic node labeling grounded in the graph topology. Specifically, we deploy a Bayesian-inspired module to estimate node-level uncertainty, incorporate these estimates into the pseudo-label generation process via an expectation-maximization (EM)-like step, and iteratively update both node embeddings and adjacency-based transformations. Experimental results on several benchmark graph datasets demonstrate that our GUST framework achieves state-of-the-art performance, especially in settings where labeled data is extremely sparse.

Title: Ignite Forecasting with SPARK: An Efficient Generative Framework for Refining LLMs in Temporal Knowledge Graph Forecasting

Authors: Gongzhu Yin, Hongli Zhang, Yi Luo, Yuchen Yang, Kun Lu, Chao Meng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.22748
Pdf URL: https://arxiv.org/pdf/2503.22748
Copy Paste: [[2503.22748]] Ignite Forecasting with SPARK: An Efficient Generative Framework for Refining LLMs in Temporal Knowledge Graph Forecasting(https://arxiv.org/abs/2503.22748)
Keywords: generation, generative
Abstract: Temporal Knowledge Graph (TKG) forecasting is crucial for predicting future events using historical data. With the surge of Large Language Models (LLMs), recent studies have begun exploring their integration into TKG forecasting and achieved some success. However, they still face limitations such as limited input length, inefficient output generation, and resource-intensive refinement, which undermine their performance and practical applicability. To address these limitations, we introduce SPARK, a Sequence-level Proxy-Adapting framework for Refining LLMs in TKG forecasting. Inspired by inference-time algorithms adopted in controlling generation, SPARK offers a cost-effective, plug-and-play solution through two key innovations: (1) Beam Sequence-Level Generation, which reframes TKG forecasting as a top-K sequence-level generation task, using beam search for efficiently generating next-entity distribution in a single forward pass. (2) TKG Adapter for Refinement, which employs traditional TKG models as trainable proxy adapters to leverage global graph information and refine LLM outputs, overcoming both the input length and the resource-intensive fine-tuning problems. Experiments across diverse datasets validate SPARK's forecasting performance, robust generalization capabilities, and high efficiency. We release source codes at this https URL.
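A minimal sketch of treating next-entity prediction as top-K sequence generation, assuming a hypothetical table-driven next-token model and an additive log-score bonus as a stand-in for an adapter. The combination rule, the token table, and the function names are illustrative assumptions; the sketch also loops over decoding steps and does not model the single-forward-pass efficiency the abstract mentions.

```python
import math

EOS = "<eos>"

def beam_search(next_logprobs, beam_width, max_len):
    """Return up to `beam_width` finished (tokens, log-prob) pairs, best first."""
    beams, finished = [((), 0.0)], []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq).items():
                item = (seq + (tok,), score + lp)
                # Sequences ending in EOS are complete; others stay on the beam.
                (finished if tok == EOS else candidates).append(item)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]

def entity_distribution(finished, adapter_bonus=None):
    """Normalize sequence scores into a distribution over entity names."""
    adapter_bonus = adapter_bonus or {}
    weights = {}
    for seq, score in finished:
        name = " ".join(seq[:-1])  # drop the EOS token
        weights[name] = math.exp(score + adapter_bonus.get(name, 0.0))
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical next-token probabilities over three candidate entities.
TABLE = {
    (): {"New": 0.6, "Paris": 0.4},
    ("New",): {"York": 0.7, "Delhi": 0.3},
    ("Paris",): {EOS: 1.0},
    ("New", "York"): {EOS: 1.0},
    ("New", "Delhi"): {EOS: 1.0},
}

def toy_logprobs(seq):
    return {tok: math.log(p) for tok, p in TABLE[seq].items()}

top_k = beam_search(toy_logprobs, beam_width=3, max_len=3)
dist = entity_distribution(top_k)
refined = entity_distribution(top_k, adapter_bonus={"New Delhi": 2.0})
```

Scoring whole entity names rather than single tokens is what lets multi-token entities such as "New York" compete fairly with single-token ones; the bonus dictionary shows where graph-derived evidence could re-rank the language model's candidates.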

Title: Patronus: Bringing Transparency to Diffusion Models with Prototypes

Authors: Nina Weng, Aasa Feragen, Siavash Bigdeli
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.22782
Pdf URL: https://arxiv.org/pdf/2503.22782
Copy Paste: [[2503.22782]] Patronus: Bringing Transparency to Diffusion Models with Prototypes(https://arxiv.org/abs/2503.22782)
Keywords: generation, generative
Abstract: Diffusion-based generative models, such as Denoising Diffusion Probabilistic Models (DDPMs), have achieved remarkable success in image generation, but their step-by-step denoising process remains opaque, leaving critical aspects of the generation mechanism unexplained. To address this, we introduce Patronus, an interpretable diffusion model inspired by ProtoPNet. Patronus integrates a prototypical network into DDPMs, enabling the extraction of prototypes and conditioning of the generation process on their prototype activation vector. This design enhances interpretability by showing the learned prototypes and how they influence the generation process. Additionally, the model supports downstream tasks like image manipulation, enabling more transparent and controlled modifications. Moreover, Patronus could reveal shortcut learning in the generation process by detecting unwanted correlations between learned prototypes. Notably, Patronus operates entirely without any annotations or text prompts. This work opens new avenues for understanding and controlling diffusion models through prototype-based interpretability. Our code is available at this https URL.

Title: DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

Authors: Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen, Yibo Fan, Shengen Yan, Guohao Dai, Yu Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.22796
Pdf URL: https://arxiv.org/pdf/2503.22796
Copy Paste: [[2503.22796]] DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers(https://arxiv.org/abs/2503.22796)
Keywords: generation
Abstract: Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT's attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.

Title: SIGHT: Single-Image Conditioned Generation of Hand Trajectories for Hand-Object Interaction

Authors: Alexey Gavryushin, Florian Redhardt, Gaia Di Lorenzo, Luc Van Gool, Marc Pollefeys, Kaichun Mo, Xi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.22869
Pdf URL: https://arxiv.org/pdf/2503.22869
Copy Paste: [[2503.22869]] SIGHT: Single-Image Conditioned Generation of Hand Trajectories for Hand-Object Interaction(https://arxiv.org/abs/2503.22869)
Keywords: generation
Abstract: We introduce a novel task of generating realistic and diverse 3D hand trajectories given a single image of an object, which could be involved in a hand-object interaction scene or pictured by itself. When humans grasp an object, appropriate trajectories naturally form in our minds to use it for specific tasks. Hand-object interaction trajectory priors can greatly benefit applications in robotics, embodied AI, augmented reality and related fields. However, synthesizing realistic and appropriate hand trajectories given a single object or hand-object interaction image is a highly ambiguous task, requiring correct identification of the object of interest and possibly even of the correct interaction among many possible alternatives. To tackle this challenging problem, we propose the SIGHT-Fusion system, consisting of a curated pipeline for extracting visual features of hand-object interaction details from egocentric videos involving object manipulation, and a diffusion-based conditional motion generation model processing the extracted features. We train our method on video data with corresponding hand trajectory annotations, without supervision in the form of action labels. For the evaluation, we establish benchmarks utilizing the first-person FPHAB and HOI4D datasets, testing our method against various baselines and using multiple metrics. We also introduce task simulators for executing the generated hand trajectories and reporting task success rates as an additional metric. Experiments show that our method generates more appropriate and realistic hand trajectories than baselines and presents promising generalization capability on unseen objects. The accuracy of the generated hand trajectories is confirmed in a physics simulation setting, showcasing the authenticity of the created sequences and their applicability in downstream uses.

Title: Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Authors: Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Subjects: cs.LG, cs.AI, cs.CL, cs.PF
Abstract URL: https://arxiv.org/abs/2503.22879
Pdf URL: https://arxiv.org/pdf/2503.22879
Copy Paste: [[2503.22879]] Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models(https://arxiv.org/abs/2503.22879)
Keywords: generation
Abstract: State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for input $x$, combined with a per-state-group quantization for input-dependent parameters $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms several state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering 4$\times$ memory reduction with only a $1.6\%$ average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: this https URL.
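A minimal sketch of why sorting channels before grouping can help low-bit quantization, assuming symmetric 8-bit quantization, equal-size contiguous groups over the sorted order, and hypothetical calibrated per-channel maxima. The abstract does not give the clustering procedure, so the equal-size chunking, the numbers, and the function names are illustrative assumptions; the offline weight rearrangement it mentions is omitted.

```python
def group_scales(channel_max, n_groups, bits=8):
    """Sort channels by calibrated max magnitude, split the sorted order into
    `n_groups` contiguous chunks, and give each chunk one symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    order = sorted(range(len(channel_max)), key=lambda c: channel_max[c])
    size = -(-len(order) // n_groups)  # ceiling division
    scales = [0.0] * len(channel_max)
    for start in range(0, len(order), size):
        chunk = order[start:start + size]
        # Fall back to 1.0 if every channel in the chunk is all zeros.
        scale = max(channel_max[c] for c in chunk) / qmax or 1.0
        for c in chunk:
            scales[c] = scale
    return order, scales

def quantize(x, scale, bits=8):
    """Round to the nearest integer level and clamp to the symmetric range."""
    qmax = 2 ** (bits - 1) - 1
    return max(-qmax, min(qmax, round(x / scale)))

def dequantize(q, scale):
    return q * scale

# Hypothetical calibrated maxima for four channels with very different ranges.
channel_max = [0.1, 8.0, 0.2, 10.0]
order, scales = group_scales(channel_max, n_groups=2)
tensor_scale = max(channel_max) / 127  # one scale for the whole tensor

x = 0.15  # a value from channel 2, whose range is small
err_grouped = abs(dequantize(quantize(x, scales[2]), scales[2]) - x)
err_tensor = abs(dequantize(quantize(x, tensor_scale), tensor_scale) - x)
```

With a single tensor-wide scale, small-range channels are forced onto a grid sized for the largest channel; grouping channels of similar magnitude gives them a finer grid, so `err_grouped` comes out smaller than `err_tensor` in this toy case.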

Title: AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs

Authors: Yi-Ting Shen, Sungmin Eum, Doheon Lee, Rohit Shete, Chiao-Yi Wang, Heesung Kwon, Shuvra S. Bhattacharyya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.22884
Pdf URL: https://arxiv.org/pdf/2503.22884
Copy Paste: [[2503.22884]] AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs(https://arxiv.org/abs/2503.22884)
Keywords: generation
Abstract: Composed pose retrieval (CPR) enables users to search for human poses by specifying a reference pose and a transition description, but progress in this field is hindered by the scarcity and inconsistency of annotated pose transitions. Existing CPR datasets rely on costly human annotations or heuristic-based rule generation, both of which limit scalability and diversity. In this work, we introduce AutoComPose, the first framework that leverages multimodal large language models (MLLMs) to automatically generate rich and structured pose transition descriptions. Our method enhances annotation quality by structuring transitions into fine-grained body part movements and introducing mirrored/swapped variations, while a cyclic consistency constraint ensures logical coherence between forward and reverse transitions. To advance CPR research, we construct and release two dedicated benchmarks, AIST-CPR and PoseFixCPR, supplementing prior datasets with enhanced attributes. Extensive experiments demonstrate that training retrieval models with AutoComPose yields superior performance over human-annotated and heuristic-based methods, significantly reducing annotation costs while improving retrieval quality. Our work pioneers the automatic annotation of pose transitions, establishing a scalable foundation for future CPR research.

Title: Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models

Authors: Ron Vainshtein, Zohar Rimon, Shie Mannor, Chen Tessler
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.22886
Pdf URL: https://arxiv.org/pdf/2503.22886
Copy Paste: [[2503.22886]] Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models(https://arxiv.org/abs/2503.22886)
Keywords: generation
Abstract: Recent advancements in imitation learning have led to transformer-based behavior foundation models (BFMs) that enable multi-modal, human-like control for humanoid agents. While excelling at zero-shot generation of robust behaviors, BFMs often require meticulous prompt engineering for specific tasks, potentially yielding suboptimal results. We introduce "Task Tokens", a method to effectively tailor BFMs to specific tasks while preserving their flexibility. Our approach leverages the transformer architecture of BFMs to learn a new task-specific encoder through reinforcement learning, keeping the original BFM frozen. This allows incorporation of user-defined priors, balancing reward design and prompt engineering. By training a task encoder to map observations to tokens, used as additional BFM inputs, we guide performance improvement while maintaining the model's diverse control characteristics. We demonstrate Task Tokens' efficacy across various tasks, including out-of-distribution scenarios, and show their compatibility with other prompting modalities. Our results suggest that Task Tokens offer a promising approach for adapting BFMs to specific control tasks while retaining their generalization capabilities.

Title: Bi-Level Multi-View fuzzy Clustering with Exponential Distance

Authors: Kristina P. Sinaga
Subjects: cs.CV, cs.LG, math.PR
Abstract URL: https://arxiv.org/abs/2503.22932
Pdf URL: https://arxiv.org/pdf/2503.22932
Copy Paste: [[2503.22932]] Bi-Level Multi-View fuzzy Clustering with Exponential Distance(https://arxiv.org/abs/2503.22932)
Keywords: generation
Abstract: In this study, we propose extensions of fuzzy c-means (FCM) clustering to multi-view environments. First, we introduce an exponential multi-view FCM (E-MVFCM). E-MVFCM is a centralized MVC that takes heat-kernel coefficients (H-KC) and weight factors into account. Secondly, we propose an exponential bi-level multi-view fuzzy c-means clustering (EB-MVFCM). Unlike E-MVFCM, EB-MVFCM computes feature and weight factors automatically and simultaneously. Like E-MVFCM, EB-MVFCM presents explicit forms of the H-KC to simplify the generation of the heat-kernel $\mathcal{K}(t)$ in powers of the proper time $t$ during the clustering process. All the features used in this study, including tools and functions of the proposed algorithms, will be made available at this https URL.

Title: From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

Authors: Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.22976
Pdf URL: https://arxiv.org/pdf/2503.22976
Copy Paste: [[2503.22976]] From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D(https://arxiv.org/abs/2503.22976)
Keywords: generation
Abstract: Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
摘要：LVLM的最新进展提高了视力语言的理解，但他们仍然在空间感知上挣扎，限制了他们推理复杂的3D场景的能力。与以前将3D表示形式纳入模型以提高空间理解的方法不同，我们旨在通过利用空间相关的图像数据来释放VLM的潜力。为此，我们介绍了一个新颖的2D空间数据生成和注释管道，该管道以3D地面真相构建的场景数据构建。该管道可以创建各种空间任务，从基本的感知任务到更复杂的推理任务。利用这条管道，我们构建了SPAR-7M，这是一个大规模数据集，该数据集是由数千个场景在多个公共数据集中产生的。此外，我们推出了SPAR基础基准，这是一种基准测试，旨在与现有的空间基准相比，对空间功能进行更全面的评估，从而支持单视图和多视图输入。对SPAR-7M和大规模2D数据集进行的培训使我们的模型能够在2D空间基准上实现最先进的性能。 3D特定于任务数据集的进一步微调会产生竞争结果，从而强调了我们数据集在增强空间推理方面的有效性。

Title: indiSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy

Authors: Ashesh Ashesh, Florian Jug
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.22983
Pdf URL: https://arxiv.org/pdf/2503.22983
Copy Paste: [[2503.22983]] indiSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy(https://arxiv.org/abs/2503.22983)
Keywords: restoration
Abstract: Fluorescence microscopy, while being a key driver for progress in the life sciences, is also subject to technical limitations. To overcome them, computational multiplexing techniques have recently been proposed, which allow multiple cellular structures to be captured in a single image and later be unmixed. Existing image decomposition methods are trained on a set of superimposed input images and the respective unmixed target images. It is critical to note that the relative strength (mixing ratio) of the superimposed images for a given input is a priori unknown. However, existing methods are trained on a fixed intensity ratio of superimposed inputs, making them not cognizant of the range of relative intensities that can occur in fluorescence microscopy. In this work, we propose a novel method called indiSplit that is cognizant of the severity of the above-mentioned mixing ratio. Our idea is based on InDI, a popular iterative method for image restoration, and an ideal starting point to embrace the unknown mixing ratio in any given input. We introduce (i) a suitably trained regressor network that predicts the degradation level (mixing asymmetry) of a given input image and (ii) a degradation-specific normalization module, enabling degradation-aware inference across all mixing ratios. We show that this method solves two relevant tasks in fluorescence microscopy, namely image splitting and bleedthrough removal, and empirically demonstrate the applicability of indiSplit on five public datasets. We will release all sources under a permissive license.
摘要：荧光显微镜虽然是生命科学进展的关键驱动力，但也受到技术限制。为了克服它们，最近提出了计算多路复用技术，该技术允许在单个图像中捕获多个蜂窝结构，然后将其毫无混合。现有的图像分解方法在一组叠加的输入图像和各个未混合目标图像上训练。至关重要的是要注意，给定输入的叠加图像的相对强度（混合比）是先验未知的。但是，现有方法是按照叠加输入的固定强度比的训练，使其不认识到荧光显微镜中可能发生的相对强度范围。在这项工作中，我们提出了一种称为Indisplit的新方法，该方法意识到上述混合比的严重程度。我们的想法基于INDI，这是一种流行的图像恢复方法，也是在任何给定输入中包含未知混合率的理想起点。我们介绍了（i）一个经过适当训练的回归网络，该网络可预测给定输入图像的降解水平（混合不对称）和（ii）一个特定于降解的归一化模块，从而使所有混合率跨所有混合率降解了推理。我们表明，该方法解决了荧光显微镜中的两个相关任务，即图像分裂和拆卸，并经验证明了Indisplit在$ 5 $公共数据集上的适用性。我们将根据允许许可发布所有资源。

Title: On Geometrical Properties of Text Token Embeddings for Strong Semantic Binding in Text-to-Image Generation

Authors: Hoigi Seo, Junseo Bang, Haechang Lee, Joohoon Lee, Byung Hyun Lee, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23011
Pdf URL: https://arxiv.org/pdf/2503.23011
Copy Paste: [[2503.23011]] On Geometrical Properties of Text Token Embeddings for Strong Semantic Binding in Text-to-Image Generation(https://arxiv.org/abs/2503.23011)
Keywords: generation
Abstract: Text-to-Image (T2I) models often suffer from text-image misalignment in complex scenes involving multiple objects and attributes. Semantic binding aims to mitigate this issue by accurately associating the generated attributes and objects with their corresponding noun phrases (NPs). Existing methods rely on text or latent optimizations, yet the factors influencing semantic binding remain underexplored. Here we investigate the geometrical properties of text token embeddings and their cross-attention (CA) maps. We show, both empirically and theoretically, that the geometrical properties of token embeddings, specifically both angular distances and norms, play a crucial role in CA map differentiation. Then, we propose TeeMo, a training-free text embedding-aware T2I framework with strong semantic binding. TeeMo consists of Causality-Aware Projection-Out (CAPO) for distinct inter-NP CA maps and Adaptive Token Mixing (ATM) with our loss to enhance inter-NP separation while maintaining intra-NP cohesion in CA maps. Extensive experiments confirm that TeeMo consistently outperforms prior art across diverse baselines and datasets.
摘要：文本对图像（T2I）模型通常在涉及多个对象和属性的复杂场景中遭受文本图像错位的困扰。语义绑定旨在通过将生成的属性和对象与相应的名词短语（NP）准确关联来减轻此问题。现有方法依赖文本或潜在的优化，但是影响语义绑定的因素仍未得到充分探索。在这里，我们研究了文本令牌嵌入的几何特性及其交叉注意（CA）图。我们从经验和理论上表明，令牌嵌入的几何特性，特别是角度距离和范数，在CA图的区分中起着至关重要的作用。然后，我们提出TeeMo，这是一种具有强大语义绑定的无训练文本嵌入感知的T2I框架。TeeMo包括因果感知的投影消除（CAPO），用于得到可区分的NP间CA图，以及自适应令牌混合（ATM），并结合我们的损失，以增强NP间的分离，同时保持CA图中NP内部的凝聚力。广泛的实验证实，TeeMo在各种基线和数据集上始终优于先前的方法。

Title: MeshCraft: Exploring Efficient and Controllable Mesh Generation with Flow-based DiTs

Authors: Xianglong He, Junyi Chen, Di Huang, Zexiang Liu, Xiaoshui Huang, Wanli Ouyang, Chun Yuan, Yangguang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23022
Pdf URL: https://arxiv.org/pdf/2503.23022
Copy Paste: [[2503.23022]] MeshCraft: Exploring Efficient and Controllable Mesh Generation with Flow-based DiTs(https://arxiv.org/abs/2503.23022)
Keywords: generation
Abstract: In the domain of 3D content creation, achieving optimal mesh topology through AI models has long been a pursuit for 3D artists. Previous methods, such as MeshGPT, have explored the generation of ready-to-use 3D objects via mesh auto-regressive techniques. While these methods produce visually impressive results, their reliance on token-by-token predictions in the auto-regressive process leads to several significant limitations. These include extremely slow generation speeds and an uncontrollable number of mesh faces. In this paper, we introduce MeshCraft, a novel framework for efficient and controllable mesh generation, which leverages continuous spatial diffusion to generate discrete triangle faces. Specifically, MeshCraft consists of two core components: 1) a transformer-based VAE that encodes raw meshes into continuous face-level tokens and decodes them back to the original meshes, and 2) a flow-based diffusion transformer conditioned on the number of faces, enabling the generation of high-quality 3D meshes with a predefined number of faces. By utilizing the diffusion model for the simultaneous generation of the entire mesh topology, MeshCraft achieves high-fidelity mesh generation at significantly faster speeds compared to auto-regressive methods. Specifically, MeshCraft can generate an 800-face mesh in just 3.2 seconds (35× faster than existing baselines). Extensive experiments demonstrate that MeshCraft outperforms state-of-the-art techniques in both qualitative and quantitative evaluations on the ShapeNet dataset and demonstrates superior performance on the Objaverse dataset. Moreover, it integrates seamlessly with existing conditional guidance strategies, showcasing its potential to relieve artists from the time-consuming manual work involved in mesh creation.
摘要：在创建3D内容的领域中，长期以来，通过AI模型实现最佳网格拓扑一直是3D艺术家的追求。以前的方法（例如MeshGPT）已经通过网格自回归技术探索了现成的3D对象的生成。尽管这些方法在视觉上产生了令人印象深刻的结果，但它们对自回归过程中逐令牌预测的依赖导致了几个重大局限性。这些包括极慢的生成速度和无法控制的网格面数。在本文中，我们介绍了MeshCraft，这是一个新颖的框架，用于高效且可控的网格生成，该框架利用连续的空间扩散来产生离散的三角形面。具体来说，MeshCraft由两个核心组件组成：1）基于Transformer的VAE，将原始网格编码为连续的面级令牌，并将它们解码回原始网格，以及2）基于流的扩散Transformer，以面的数量为条件，从而能够生成具有预定面数的高质量3D网格。通过利用扩散模型同时生成整个网格拓扑，与自回归方法相比，MeshCraft以明显更快的速度实现高保真网格的生成。具体而言，MeshCraft仅需3.2秒即可生成800面的网格（比现有基线快35倍）。广泛的实验表明，在ShapeNet数据集的定性和定量评估中，MeshCraft优于最先进的技术，并且在Objaverse数据集上表现出了出色的性能。此外，它与现有的条件引导策略无缝集成，展示了其使艺术家摆脱网格创建中耗时的手动工作的潜力。

Title: Shape and Texture Recognition in Large Vision-Language Models

Authors: Sagi Eppel, Mor Bismut, Alona Faktor
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23062
Pdf URL: https://arxiv.org/pdf/2503.23062
Copy Paste: [[2503.23062]] Shape and Texture Recognition in Large Vision-Language Models(https://arxiv.org/abs/2503.23062)
Keywords: generation
Abstract: Shape and texture recognition is fundamental to visual perception. The ability to identify shapes regardless of orientation, texture, or context, and to recognize textures independently of their associated objects, is essential for general visual understanding of the world. We introduce the Large Shape & Textures dataset (LAS&T), a giant collection of diverse shapes and textures automatically extracted from real-world images. This dataset is used to evaluate how effectively leading Large Vision-Language Models (LVLMs) understand shapes, textures, and materials in both 2D and 3D scenes. For shape recognition, we test models' ability to match identical shapes that differ in orientation, texture, color, or environment. Our results show that LVLMs' shape identification capabilities remain significantly below human performance. Single alterations (orientation, texture) cause minor decreases in matching accuracy, while multiple changes precipitate dramatic drops. LVLMs appear to rely predominantly on high-level and semantic features and struggle with abstract shapes lacking clear class associations. For texture and material recognition, we evaluate models' ability to identify identical textures and materials across different objects and environments. Interestingly, leading LVLMs approach human-level performance in recognizing materials in 3D scenes, yet substantially underperform humans when identifying simpler 2D textures. The LAS&T dataset and benchmark, the largest and most diverse resource for shape and texture evaluation, is freely available with generation and testing scripts.
摘要：形状和纹理识别是视觉感知的基础。识别形状的能力无论取向，纹理或上下文如何，并独立于其相关对象识别纹理，对于对世界的一般视觉理解至关重要。我们介绍了大型和纹理数据集（LAS＆T），这是一个自动从实际图像中提取的各种形状和纹理的巨大集合。该数据集用于评估如何有效地领导大型视觉模型（LVLM）了解2D和3D场景中的形状，纹理和材料。对于形状识别，我们测试了模型匹配方向，纹理，颜色或环境不同形状的能力。我们的结果表明，LVLMS的形状识别能力仍然显着低于人类绩效。单个变化（方向，纹理）导致匹配精度的较小降低，而多个变化会导致剧烈降低。 LVLM似乎主要依赖于高级和语义特征，以及与缺乏明确阶级关联的抽象形状的斗争。对于纹理和材料识别，我们评估了模型在不同对象和环境中识别相同纹理和材料的能力。有趣的是，领先的LVLM在识别3D场景中识别材料方面的效果，但在识别较简单的2D纹理时，人体的表现显而易见。 LAS＆T数据集和基准测试是形状和纹理评估的最大，最多样化的资源，可以通过生成和测试脚本自由使用。

Title: Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation

Authors: Guohong Huang, Ling-An Zeng, Zexin Zheng, Shengbo Gu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23121
Pdf URL: https://arxiv.org/pdf/2503.23121
Copy Paste: [[2503.23121]] Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation(https://arxiv.org/abs/2503.23121)
Keywords: generation
Abstract: We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5% of the inference time. Code is available at this https URL.
摘要：我们提出了一种新的方法，用于生成文本引导的人-物交互（HOI），该方法以计算高效的方式实现显式的关节级交互建模。以前的方法将整个人体表示为单个令牌，因此很难捕获细粒度的关节级交互，并导致不切实际的HOI。但是，将每个关节视为一个令牌将产生超过20倍的令牌，从而增加计算开销。为了应对这些挑战，我们引入了一个高效的显式关节级交互模型（EJIM）。EJIM具有双分支HOI Mamba，可分别高效地对时空HOI信息进行建模，以及用于将文本语义和对象几何形状整合到人体和对象运动中的双分支条件注入器。此外，我们设计了动态交互块和渐进式掩蔽机制，以迭代地过滤掉无关的关节，以确保准确而细致的交互建模。公共数据集上的广泛定量和定性评估表明，EJIM仅使用5％的推理时间，就大幅超过了以前的工作。代码可在此 https URL 获取。

Title: Evaluating Compositional Scene Understanding in Multimodal Generative Models

Authors: Shuhao Fu, Andrew Jun Lee, Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor W. Webb
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23125
Pdf URL: https://arxiv.org/pdf/2503.23125
Copy Paste: [[2503.23125]] Evaluating Compositional Scene Understanding in Multimodal Generative Models(https://arxiv.org/abs/2503.23125)
Keywords: generation, generative
Abstract: The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many ($>5$) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.
摘要：视觉世界在根本上是构图。视觉场景由对象的组成及其关系来定义。因此，计算机视觉系统必须反映和利用这种组成性，以获得强大而可推广的场景理解。虽然在开发通用，多模式的模型（包括文本形象模型和多模式视觉模型）方面已经取得了重大进展，但尚不清楚这些系统是否能够准确地生成和解释涉及多个对象和关系组成的场景。在这项工作中，我们介绍了当前文本图像（DALL-E 3）和多模式视觉语言模型（GPT-4V，GPT-4V，Claude Sonnet 3.5，Qwen2-vl-72b，以及Intervl2.5-38b）的评估，并与人类参与这些系统的参与者，评估了文本图像（DALL-E 3）和多模式视觉语言模型（GPT-4V，GPT-4V，GPT-4V，Claude Sonnet 3.5，QPT-4V，Claude-4O，以及Intervl2.5-38b），以及这些系统的参与者，对人类的参与者，我们介绍了对当前文本图像（DALL-E 3）和多模式视觉语言模型的评估。结果表明，这些系统表现出一些解决组成和关系任务的能力，显示了对上一代多模式模型的显着改进，但性能远低于人类参与者的水平，尤其是对于涉及许多（$> 5 $）对象和多重关系的更复杂的场景。这些结果突出了需要进一步进步，以对视觉场景的组成理解。

Title: Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery

Authors: Boyi Ma, Yanguang Zhao, Jie Wang, Guankun Wang, Kun Yuan, Tong Chen, Long Bai, Hongliang Ren
Subjects: cs.CV, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2503.23130
Pdf URL: https://arxiv.org/pdf/2503.23130
Copy Paste: [[2503.23130]] Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery(https://arxiv.org/abs/2503.23130)
Keywords: generation
Abstract: DeepSeek-V3, a recently emerging Large Language Model (LLM), demonstrates outstanding performance in general scene understanding, question-answering (QA), and text generation tasks, owing to its efficient training paradigm and strong reasoning capabilities. In this study, we investigate the dialogue capabilities of DeepSeek-V3 in robotic surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and Detailed Description. The Single Phrase QA tasks further include sub-tasks such as surgical instrument recognition, action understanding, and spatial position analysis. We conduct extensive evaluations using publicly available datasets, including EndoVis18 and CholecT50, along with their corresponding dialogue data. Our comprehensive evaluation results indicate that, when provided with specific prompts, DeepSeek-V3 performs well in surgical instrument and tissue recognition tasks. However, DeepSeek-V3 exhibits significant limitations in spatial position analysis and struggles to understand surgical actions accurately. Additionally, our findings reveal that, under general prompts, DeepSeek-V3 lacks the ability to effectively analyze global surgical concepts and fails to provide detailed insights into surgical scenarios. Based on our observations, we argue that DeepSeek-V3 is not ready for vision-language tasks in surgical contexts without fine-tuning on surgery-specific datasets.
摘要：DeepSeek-V3是最近新兴的大型语言模型（LLM），在一般场景理解，提问（QA）和文本生成任务中表现出出色的表现，这是由于其有效的培训范式和强大的推理能力。在这项研究中，我们研究了DeepSeek-V3在机器人手术方案中的对话能力，重点介绍了诸如单词QA，Visual QA和详细描述之类的任务。单词QA任务进一步包括子任务，例如手术仪器识别，动作理解和空间位置分析。我们使用公开可用的数据集进行了广泛的评估，包括Restovis18和Cholect50，以及它们相应的对话数据。我们的全面评估结果表明，在提供特定的提示时，DeepSeek-V3在手术仪器和组织识别任务中表现良好，但是，DeepSeek-V3在空间位置分析中表现出很大的限制，并努力努力准确了解手术动作。此外，我们的发现表明，在一般提示下，DeepSeek-V3缺乏有效分析全球外科手术概念的能力，并且无法提供对手术场景的详细见解。根据我们的观察，我们认为DeepSeek-V3还没有准备好在手术环境中进行视觉任务，而不会在特定于手术的数据集中进行微调。

Title: Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL

Authors: Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, Sercan Ö. Arik
Subjects: cs.LG, cs.AI, cs.DB, cs.PL
Abstract URL: https://arxiv.org/abs/2503.23157
Pdf URL: https://arxiv.org/pdf/2503.23157
Copy Paste: [[2503.23157]] Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL(https://arxiv.org/abs/2503.23157)
Keywords: generation
Abstract: Text-to-SQL is a challenging task involving multiple reasoning-intensive subtasks, including natural language understanding, database schema comprehension, and precise SQL query formulation. Existing approaches often rely on handcrafted reasoning paths with inductive biases that can limit their overall effectiveness. Motivated by the recent success of reasoning-enhanced models such as DeepSeek R1 and OpenAI o1, which effectively leverage reward-driven self-exploration to enhance reasoning capabilities and generalization, we propose a novel set of partial rewards tailored specifically for the Text-to-SQL task. Our reward set includes schema-linking, AI feedback, n-gram similarity, and syntax check, explicitly designed to address the reward sparsity issue prevalent in reinforcement learning (RL). Leveraging group relative policy optimization (GRPO), our approach explicitly encourages large language models (LLMs) to develop intrinsic reasoning skills necessary for accurate SQL query generation. With models of different sizes, we demonstrate that RL-only training with our proposed rewards consistently achieves higher accuracy and superior generalization compared to supervised fine-tuning (SFT). Remarkably, our RL-trained 14B-parameter model significantly outperforms larger proprietary models, e.g., o3-mini by 4% and Gemini-1.5-Pro-002 by 3% on the BIRD benchmark. These results highlight the efficacy of our proposed RL-training framework with partial rewards for enhancing both accuracy and reasoning capabilities in Text-to-SQL tasks.
摘要：文本到SQL是一项具有挑战性的任务，涉及多个推理密集型子任务，包括自然语言理解，数据库架构理解和精确的SQL查询配方。现有的方法通常依靠具有归纳偏见的手工推理路径，从而限制其整体效率。以推理增强模型（例如DeepSeek R1和Openai O1）的成功的激励，它们有效地利用了奖励驱动的自我探索来增强推理能力和概括，我们提出了一套专门针对文本到SQL任务的新型部分奖励。我们的奖励集包括架构链接，AI反馈，n-gram相似性和语法检查，这些奖励旨在解决奖励稀疏性问题（RL）普遍存在的奖励稀疏问题。利用小组相对政策优化（GRPO），我们的方法明确鼓励大型语言模型（LLMS）发展准确的SQL查询生成所需的内在推理技能。借助不同尺寸的模型，我们证明，与监督的微调（SFT）相比，我们提出的奖励的仅RL训练始终达到更高的准确性和卓越的概括。值得注意的是，我们经过RL训练的14B参数模型极大地胜过更大的专有模型，例如O3米尼在4％和双子座1.5-Pro-002上乘以鸟基准为3％。这些突出了我们提议的RL训练框架的功效，并具有部分奖励，以增强文本到SQL任务中的准确性和推理能力。
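
Illustration (not from the paper): the abstract names n-gram similarity and a syntax check among its partial rewards but does not say how each is computed. The sketch below is one plausible reading, assuming a bigram Jaccard overlap on whitespace tokens and a syntax check that asks SQLite to parse the query. The function names, the equal weights, and the use of SQLite are all assumptions made for this example.

```python
import sqlite3


def ngram_set(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(pred_sql, gold_sql, n=2):
    """Jaccard overlap of token n-grams; an assumed stand-in for the paper's metric."""
    pred = ngram_set(pred_sql.lower().split(), n)
    gold = ngram_set(gold_sql.lower().split(), n)
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


def syntax_ok(sql):
    """True if SQLite can parse the query; missing tables are not syntax errors."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error as exc:
        message = str(exc).lower()
        return not ("syntax error" in message or "incomplete input" in message)
    finally:
        conn.close()


def partial_reward(pred_sql, gold_sql, w_ngram=0.5, w_syntax=0.5):
    """Weighted sum of the two partial rewards (weights are hypothetical)."""
    return (w_ngram * ngram_similarity(pred_sql, gold_sql)
            + w_syntax * float(syntax_ok(pred_sql)))
```

A dense signal of this kind gives a nonzero reward to a query that parses and partly overlaps the reference even when its execution result is wrong, which is the sparsity problem the abstract says the partial rewards target.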

Title: A GAN-Enhanced Deep Learning Framework for Rooftop Detection from Historical Aerial Imagery

Authors: Pengyu Chen, Sicheng Wang, Cuizhen Wang, Senrong Wang, Beiao Huang, Lu Huang, Zhe Zang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23200
Pdf URL: https://arxiv.org/pdf/2503.23200
Copy Paste: [[2503.23200]] A GAN-Enhanced Deep Learning Framework for Rooftop Detection from Historical Aerial Imagery(https://arxiv.org/abs/2503.23200)
Keywords: super-resolution, generative
Abstract: Accurate rooftop detection from historical aerial imagery is vital for examining long-term urban development and human settlement patterns. However, black-and-white analog photographs pose significant challenges for modern object detection frameworks due to their limited spatial resolution, lack of color information, and archival degradation. To address these limitations, this study introduces a two-stage image enhancement pipeline based on Generative Adversarial Networks (GANs): image colorization using DeOldify, followed by super-resolution enhancement with Real-ESRGAN. The enhanced images were then used to train and evaluate rooftop detection models, including Faster R-CNN, DETReg, and YOLOv11n. Results show that combining colorization with super-resolution substantially improves detection performance, with YOLOv11n achieving a mean Average Precision (mAP) exceeding 85%. This reflects an improvement of approximately 40% over original black-and-white images and 20% over images enhanced through colorization alone. The proposed method effectively bridges the gap between archival imagery and contemporary deep learning techniques, enabling more reliable extraction of building footprints from historical aerial photographs.
摘要：从历史航空图像中进行准确的屋顶检测对于检查长期城市发展和人类定居模式至关重要。但是，由于空间分辨率有限，颜色信息缺乏和档案退化，黑白模拟照片对现代对象检测框架构成了重大挑战。为了解决这些局限性，这项研究基于生成对抗网络（GAN）引入了两阶段图像增强管道：使用Deoldify的图像着色，然后使用Real-Esrgan进行超分辨率增强。然后使用增强的图像训练和评估屋顶检测模型，包括更快的R-CNN，Detreg和Yolov11n。结果表明，将着色与超分辨率结合起来大大提高了检测性能，而Yolov11n的平均平均精度（MAP）超过85％。这反映了与原始黑白图像相比，大约40％的改善，而仅通过着色而增强的图像比图像提高了20％。所提出的方法有效地弥合了档案图像与当代深度学习技术之间的差距，从而使从历史航空照片中更可靠地提取建筑足迹。

Title: Synthetic Art Generation and DeepFake Detection A Study on Jamini Roy Inspired Dataset

Authors: Kushal Agrawal, Romi Banerjee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23226
Pdf URL: https://arxiv.org/pdf/2503.23226
Copy Paste: [[2503.23226]] Synthetic Art Generation and DeepFake Detection A Study on Jamini Roy Inspired Dataset(https://arxiv.org/abs/2503.23226)
Keywords: generation, generative
Abstract: The intersection of generative AI and art is a fascinating area that brings both exciting opportunities and significant challenges, especially when it comes to identifying synthetic artworks. This study takes a unique approach by examining diffusion-based generative models in the context of Indian art, specifically focusing on the distinctive style of Jamini Roy. To explore this, we fine-tuned Stable Diffusion 3 and used techniques like ControlNet and IPAdapter to generate realistic images. This allowed us to create a new dataset that includes both real and AI-generated artworks, which is essential for a detailed analysis of what these models can produce. We employed various qualitative and quantitative methods, such as Fourier domain assessments and autocorrelation metrics, to uncover subtle differences between synthetic images and authentic pieces. A key takeaway from recent research is that existing methods for detecting deepfakes face considerable challenges, especially when the deepfakes are of high quality and tailored to specific cultural contexts. This highlights a critical gap in current detection technologies for high-quality, culturally specific deepfakes. This work not only sheds light on the increasing complexity of generative models but also sets a crucial foundation for future research aimed at effective detection of synthetic art.
摘要：生成AI和艺术的交集是一个引人入胜的领域，既带来令人兴奋的机会又带来重大挑战，尤其是在识别合成艺术品时。这项研究通过在印度艺术的背景下检查基于扩散的生成模型，采用独特的方法，特别关注Jamini Roy的独特风格。为了探讨这一点，我们微调了稳定的扩散3，并使用了ControlNet和iPadapter（例如ControlNet和iPadapter）来生成逼真的图像。这使我们能够创建一个包括真实和AI生成的艺术品的新数据集，这对于对这些模型可以产生的内容进行详细分析至关重要。我们采用了各种定性和定量方法，例如傅立叶域评估和自相关指标，以发现合成图像和真实作品之间的细微差异。最近的研究的一个关键要点是，现有的检测深烟的方法面临着巨大的挑战，尤其是当深层味道具有高质量并针对特定文化背景下量身定制时。这凸显了当前检测技术的危险差距，特别是鉴于上述挑战，在很难检测到高质量和文化的深层效果的情况下。这项工作不仅阐明了生成模型的复杂性日益复杂，而且为未来的研究奠定了至关重要的基础，旨在有效检测合成艺术。
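
Illustration (not from the paper): the abstract mentions Fourier domain assessments and autocorrelation metrics without giving the exact computations. The sketch below shows two common forms of such measurements on a grayscale image array, assuming a radially averaged power spectrum and a normalized circular autocorrelation obtained through the Wiener-Khinchin relation. The function names are hypothetical.

```python
import numpy as np


def radial_power_spectrum(img):
    """Average the 2D power spectrum over rings of equal integer radius."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(spectrum) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    # Integer distance of every frequency bin from the shifted DC bin.
    radius = np.hypot(y - h // 2, x - w // 2).astype(int)
    sums = np.bincount(radius.ravel(), weights=power.ravel())
    counts = np.bincount(radius.ravel())
    return sums / np.maximum(counts, 1)


def autocorrelation(img):
    """Circular autocorrelation of the mean-removed image, normalized to 1 at lag 0."""
    centered = img - img.mean()
    power = np.abs(np.fft.fft2(centered)) ** 2
    ac = np.fft.ifft2(power).real
    return ac / ac[0, 0]
```

Comparing these curves between authentic and generated images is one way to look for periodic or high-frequency artifacts; `autocorrelation` assumes the input is not constant, since the lag-0 value would then be zero.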

Title: Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus

Authors: Claas Beger, Carl-Leander Henneking
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23229
Pdf URL: https://arxiv.org/pdf/2503.23229
Copy Paste: [[2503.23229]] Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus(https://arxiv.org/abs/2503.23229)
Keywords: generation
Abstract: Large Language Models provide significant new opportunities for the generation of high-quality written works. However, their employment in the research community is inhibited by their tendency to hallucinate invalid sources and their lack of direct access to a knowledge base of relevant scientific articles. In this work, we present Citegeist: an application pipeline using dynamic Retrieval Augmented Generation (RAG) on the arXiv Corpus to generate a related work section and other citation-backed outputs. For this purpose, we employ a mixture of embedding-based similarity matching, summarization, and multi-stage filtering. To adapt to the continuous growth of the document base, we also present an optimized way of incorporating new and modified papers. To enable easy utilization in the scientific community, we release both a website (this https URL) and an implementation harness that works with several different LLM implementations.
摘要：大型语言模型为产生高质量的书面作品提供了重要的新机会。但是，他们在研究界的就业受到幻觉造成无效资源的趋势的抑制，并且缺乏直接获得相关科学文章的知识基础。在这项工作中，我们提出了CiteGeist：使用动态检索增强生成（RAG）上的应用程序管道，以生成相关的工作部分和其他引用支持的输出。为此，我们采用了基于嵌入的相似性匹配，汇总和多阶段过滤的混合物。为了适应文档基础的持续增长，我们还提出了一种合并新论文和修改后的优化方法。为了在科学界易于利用，我们将同时发布一个网站（此HTTPS URL），以及与几种不同LLM实现的实施线束。

Title: SalesRLAgent: A Reinforcement Learning Approach for Real-Time Sales Conversion Prediction and Optimization

Authors: Nandakishor M
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23303
Pdf URL: https://arxiv.org/pdf/2503.23303
Copy Paste: [[2503.23303]] SalesRLAgent: A Reinforcement Learning Approach for Real-Time Sales Conversion Prediction and Optimization(https://arxiv.org/abs/2503.23303)
Keywords: generation
Abstract: Current approaches to sales conversation analysis and conversion prediction typically rely on Large Language Models (LLMs) combined with basic retrieval augmented generation (RAG). These systems, while capable of answering questions, fail to accurately predict conversion probability or provide strategic guidance in real time. In this paper, we present SalesRLAgent, a novel framework leveraging specialized reinforcement learning to predict conversion probability throughout sales conversations. Unlike systems from this http URL, Mendable, Inkeep, and others that primarily use off-the-shelf LLMs for content generation, our approach treats conversion prediction as a sequential decision problem, training on synthetic data generated using GPT-4o to develop a specialized probability estimation model. Our system incorporates Azure OpenAI embeddings (3072 dimensions), turn-by-turn state tracking, and meta-learning capabilities to understand its own knowledge boundaries. Evaluations demonstrate that SalesRLAgent achieves 96.7% accuracy in conversion prediction, outperforming LLM-only approaches by 34.7% while offering significantly faster inference (85 ms vs. 3450 ms for GPT-4). Furthermore, integration with existing sales platforms shows a 43.2% increase in conversion rates when representatives utilize our system's real-time guidance. SalesRLAgent represents a fundamental shift from content generation to strategic sales intelligence, providing moment-by-moment conversion probability estimation with actionable insights for sales professionals.
摘要：当前的销售对话分析和转换预测方法通常依赖于大型语言模型（LLM）以及基本检索增强发电（RAG）。这些系统虽然能够回答问题，但无法准确预测转化概率或实时提供战略指导。在本文中，我们介绍了Salesragent，这是一个新颖的框架，利用专门的强化学习来预测整个销售对话的转换概率。与此HTTP URL的系统不同，Mendable，Messeep和其他主要使用现成的LLM进行内容生成的系统不同，我们的方法将转换预测视为顺序决策问题，对使用GPT-4O生成的合成数据培训来开发专业的概率估计模型。我们的系统结合了Azure OpenAI嵌入（3072个维度），转弯状态跟踪以及元学习能力，以了解其自身的知识边界。评估表明，Salesragent在转化预测中的精度达到了96.7％，超过唯一的LLM方法的方法比仅34.7％，同时提供的推断速度明显更快（GPT-4的85ms和3450ms）。此外，与现有销售平台的集成显示，当代表利用我们系统的实时指导时，转化率增加了43.2％。 Salesragent代表了从内容产生到战略销售情报的基本转变，并为销售专业人员提供了可行的见解。

Title: MoCha: Towards Movie-Grade Talking Character Synthesis

Authors: Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, Wenhu Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23307
Pdf URL: https://arxiv.org/pdf/2503.23307
Copy Paste: [[2503.23307]] MoCha: Towards Movie-Grade Talking Character Synthesis(https://arxiv.org/abs/2503.23307)
Keywords: generation
Abstract: Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking-head generation, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability and generalization.
摘要：视频发电的最新进步取得了令人印象深刻的动作现实主义，但他们经常忽略角色驱动的讲故事，这是自动电影《动画生成》的至关重要的任务。我们介绍了会说话的角色，这是一项更现实的任务，可以直接从语音和文本中生成会说话的角色动画。与说话的人不同，会说话的角色旨在产生面部区域以外的一个或多个角色的完整肖像。在本文中，我们提出了Mocha，这是第一个产生会说话的角色的同类产品。为了确保视频和语音之间的精确同步，我们提出了一种语音视频窗口注意机制，可以有效地对齐语音和视频令牌。为了解决大型语音标记的视频数据集的稀缺性，我们引入了一种联合培训策略，该策略利用语音标记和文本标记的视频数据，从而大大改善了各种角色动作的概括。我们还设计了具有字符标签的结构化及时模板，这是首次以转向对话为基础的对话，以赋予AI生成的字符，从而与电影连贯性进行上下文感知的对话。广泛的定性和定量评估，包括人类的偏好研究和基准比较，表明摩卡咖啡为AI生成的电影讲故事设定了新的标准，实现了优越的现实主义，表现力，可控性和概括性。

Title: HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation

Authors: Hongwei Zheng, Han Li, Wenrui Dai, Ziyang Zheng, Chenglin Li, Junni Zou, Hongkai Xiong
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.23331
Pdf URL: https://arxiv.org/pdf/2503.23331
Copy Paste: [[2503.23331]] HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation(https://arxiv.org/abs/2503.23331)
Keywords: generation, generative
Abstract: Existing 2D-to-3D human pose estimation (HPE) methods attempt to handle the occlusion issue by enriching information such as temporal and visual cues in the lifting stage. In this paper, we argue that these methods ignore the limitation of the sparse skeleton 2D input representation, which fundamentally restricts the 2D-to-3D lifting and worsens the occlusion issue. To address these limitations, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical 2D dense poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on single-frame-based 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter and computational complexity and can also complement them to further enhance performance and robustness.
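Abstract (translated from Chinese): Existing 2D-to-3D human pose estimation (HPE) methods try to handle occlusion by enriching information such as temporal and visual cues in the lifting stage. In this paper, we argue that these methods overlook the limitation of the sparse 2D skeleton input representation, which fundamentally restricts 2D-to-3D lifting and worsens the occlusion problem. To address this, we propose a novel two-stage generative densification method, the Hierarchical Pose AutoRegressive Transformer (HiPART), which generates hierarchical dense 2D poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module that quantizes the highly dense 2D pose into hierarchical tokens, and we propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With the generated hierarchical poses as input to 2D-to-3D lifting, the proposed method is highly robust in occluded scenarios and achieves state-of-the-art performance on single-frame 3D HPE. It also outperforms numerous multi-frame methods while reducing parameter count and computational complexity, and it can complement them to further improve performance and robustness.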

Title: TraceMark-LDM: Authenticatable Watermarking for Latent Diffusion Models via Binary-Guided Rearrangement

Authors: Wenhao Luo, Zhangyi Shen, Ye Yao, Feng Ding, Guopu Zhu, Weizhi Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23332
Pdf URL: https://arxiv.org/pdf/2503.23332
Copy Paste: [[2503.23332]] TraceMark-LDM: Authenticatable Watermarking for Latent Diffusion Models via Binary-Guided Rearrangement(https://arxiv.org/abs/2503.23332)
Keywords: generation
Abstract: Image generation algorithms are increasingly integral to diverse aspects of human society, driven by their practical applications. However, insufficient oversight of artificial intelligence generated content (AIGC) can facilitate the spread of malicious content and increase the risk of copyright infringement. Among the diverse range of image generation models, the Latent Diffusion Model (LDM) is currently the most widely used, dominating the Text-to-Image model market. Currently, most attribution methods for LDMs rely on directly embedding watermarks into the generated images or their intermediate noise, a practice that compromises both the quality and the robustness of the generated content. To address these limitations, we introduce TraceMark-LDM, a novel algorithm that integrates watermarking to attribute generated images while guaranteeing non-destructive performance. Unlike current methods, TraceMark-LDM leverages watermarks as guidance to rearrange random variables sampled from a Gaussian distribution. To mitigate potential deviations caused by inversion errors, the elements with small absolute values are grouped and rearranged. Additionally, we fine-tune the LDM encoder to enhance the robustness of the watermark. Experimental results show that images synthesized using TraceMark-LDM exhibit superior quality and attribution accuracy compared to state-of-the-art (SOTA) techniques. Notably, TraceMark-LDM demonstrates exceptional robustness against various common attack methods, consistently outperforming SOTA methods.
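Abstract (translated from Chinese): Image generation algorithms are becoming an integral part of many aspects of human society. However, insufficient oversight of artificial intelligence generated content (AIGC) can facilitate the spread of malicious content and increase the risk of copyright infringement. Among image generation models, the Latent Diffusion Model (LDM) is currently the most widely used and accounts for most of the text-to-image model market. At present, most attribution methods for LDMs embed watermarks directly into the generated images or their intermediate noise, which harms both the quality and the robustness of the generated content. To address these limitations, we introduce TraceMark-LDM, a novel algorithm that integrates watermarking to attribute generated images while guaranteeing non-destructive performance. Unlike current methods, TraceMark-LDM uses the watermark as guidance for rearranging random variables sampled from a Gaussian distribution. To mitigate potential deviations caused by inversion errors, the small-magnitude elements are grouped and rearranged. In addition, we fine-tune the LDM encoder to enhance the robustness of the watermark. Experimental results show that images synthesized with TraceMark-LDM exhibit higher quality and attribution accuracy than state-of-the-art (SOTA) techniques. Notably, TraceMark-LDM shows excellent robustness against various common attacks, consistently outperforming SOTA methods.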

Title: Object Isolated Attention for Consistent Story Visualization

Authors: Xiangyang Luo, Junhao Cheng, Yifan Xie, Xin Zhang, Tao Feng, Zhou Liu, Fei Ma, Fei Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23353
Pdf URL: https://arxiv.org/pdf/2503.23353
Copy Paste: [[2503.23353]] Object Isolated Attention for Consistent Story Visualization(https://arxiv.org/abs/2503.23353)
Keywords: generation
Abstract: Open-ended story visualization is a challenging task that involves generating coherent image sequences from a given storyline. One of the main difficulties is maintaining character consistency while creating natural and contextually fitting scenes--an area where many existing methods struggle. In this paper, we propose an enhanced Transformer module that uses separate self attention and cross attention mechanisms, leveraging prior knowledge from pre-trained diffusion models to ensure logical scene creation. The isolated self attention mechanism improves character consistency by refining attention maps to reduce focus on irrelevant areas and highlight key features of the same character. Meanwhile, the isolated cross attention mechanism independently processes each character's features, avoiding feature fusion and further strengthening consistency. Notably, our method is training-free, allowing the continuous generation of new characters and storylines without re-tuning. Both qualitative and quantitative evaluations show that our approach outperforms current methods, demonstrating its effectiveness.
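Abstract (translated from Chinese): Open-ended story visualization is a challenging task that involves generating coherent image sequences from a given storyline. One of the main difficulties is maintaining character consistency while creating natural, contextually fitting scenes, an area where many existing methods struggle. In this paper, we propose an enhanced Transformer module that uses separate self-attention and cross-attention mechanisms and leverages prior knowledge from pre-trained diffusion models to ensure logical scene creation. The isolated self-attention mechanism improves character consistency by refining attention maps to reduce focus on irrelevant areas and highlight key features of the same character. Meanwhile, the isolated cross-attention mechanism processes each character's features independently, avoiding feature fusion and further strengthening consistency. Notably, our method is training-free, allowing new characters and storylines to be generated continuously without re-tuning. Both qualitative and quantitative evaluations show that our approach outperforms current methods, demonstrating its effectiveness.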

Title: ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts

Authors: Linfeng Tang, Yeda Wang, Zhanchuan Cai, Junjun Jiang, Jiayi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23356
Pdf URL: https://arxiv.org/pdf/2503.23356
Copy Paste: [[2503.23356]] ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts(https://arxiv.org/abs/2503.23356)
Keywords: restoration
Abstract: Current image fusion methods struggle to address the composite degradations encountered in real-world imaging scenarios and lack the flexibility to accommodate user-specific requirements. In response to these challenges, we propose a controllable image fusion framework with language-vision prompts, termed ControlFusion, which adaptively neutralizes composite degradations. On the one hand, we develop a degraded imaging model that integrates physical imaging mechanisms, including the Retinex theory and atmospheric scattering principle, to simulate composite degradations, thereby providing potential for addressing real-world complex degradations from the data level. On the other hand, we devise a prompt-modulated restoration and fusion network that dynamically enhances features with degradation prompts, enabling our method to accommodate composite degradation of varying levels. Specifically, considering individual variations in quality perception of users, we incorporate a text encoder to embed user-specified degradation types and severity levels as degradation prompts. We also design a spatial-frequency collaborative visual adapter that autonomously perceives degradations in source images, thus eliminating the complete dependence on user instructions. Extensive experiments demonstrate that ControlFusion outperforms SOTA fusion methods in fusion quality and degradation handling, particularly in countering real-world and compound degradations with various levels.
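Abstract (translated from Chinese): Current image fusion methods struggle to handle the composite degradations encountered in real-world imaging scenarios and lack the flexibility to accommodate user-specific requirements. In response to these challenges, we propose a controllable image fusion framework with language-vision prompts, termed ControlFusion, which adaptively neutralizes composite degradations. On the one hand, we develop a degraded imaging model that integrates physical imaging mechanisms, including the Retinex theory and the atmospheric scattering principle, to simulate composite degradations, thereby making it possible to address complex real-world degradations at the data level. On the other hand, we devise a prompt-modulated restoration and fusion network that dynamically enhances features with degradation prompts, enabling our method to handle composite degradation of varying levels. Specifically, considering individual differences in users' perception of quality, we incorporate a text encoder that embeds user-specified degradation types and severity levels as degradation prompts. We also design a spatial-frequency collaborative visual adapter that autonomously perceives degradations in source images, removing the complete dependence on user instructions. Extensive experiments show that ControlFusion outperforms SOTA fusion methods in fusion quality and degradation handling, particularly when countering real-world and compound degradations of various levels.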

Title: VideoFusion: A Spatio-Temporal Collaborative Network for Mutli-modal Video Fusion and Restoration

Authors: Linfeng Tang, Yeda Wang, Meiqi Gong, Zizhuo Li, Yuxin Deng, Xunpeng Yi, Chunyu Li, Han Xu, Hao Zhang, Jiayi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23359
Pdf URL: https://arxiv.org/pdf/2503.23359
Copy Paste: [[2503.23359]] VideoFusion: A Spatio-Temporal Collaborative Network for Mutli-modal Video Fusion and Restoration(https://arxiv.org/abs/2503.23359)
Keywords: restoration
Abstract: Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos. This primarily stems from two factors: 1) the scarcity of large-scale multi-sensor video datasets, limiting research in video fusion, and 2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. This paper addresses both problems. First, we construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible video pairs comprising 153,797 frames, filling the data gap for the video fusion community. Second, we propose VideoFusion, a multi-modal video fusion model that fully exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from (potentially degraded) multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is employed to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward-backward temporal contexts to reinforce cross-frame feature representations. Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios, effectively mitigating temporal inconsistency and interference.
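Abstract (translated from Chinese): Compared with images, videos align better with real-world acquisition scenarios and carry valuable temporal cues. However, existing multi-sensor fusion research mainly integrates complementary context from multiple images rather than from videos. This stems primarily from two factors: 1) the scarcity of large-scale multi-sensor video datasets, which limits research on video fusion, and 2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. This paper addresses both problems. First, we construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible video pairs comprising 153,797 frames, filling the data gap for the video fusion community. Second, we propose VideoFusion, a multi-modal video fusion model that fully exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from (potentially degraded) multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is used to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward and backward temporal contexts and reinforce cross-frame feature representations. Extensive experiments show that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios, effectively mitigating temporal inconsistency and interference.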

Title: FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning

Authors: Hang Guo, Yawei Li, Taolin Zhang, Jiangshan Wang, Tao Dai, Shu-Tao Xia, Luca Benini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23367
Pdf URL: https://arxiv.org/pdf/2503.23367
Copy Paste: [[2503.23367]] FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning(https://arxiv.org/abs/2503.23367)
Keywords: generation
Abstract: Visual Autoregressive (VAR) modeling has gained popularity for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, so complexity and runtime grow dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our key finding is that the majority of latency arises from the large-scale step, where most tokens have already converged. Leveraging this observation, we develop a cached token pruning strategy that forwards only pivotal tokens for scale-specific modeling while using cached tokens from previous scale steps to restore the pruned slots. This significantly reduces the number of forwarded tokens and improves efficiency at larger resolutions. Experiments show that the proposed FastVAR can further speed up FlashAttention-accelerated VAR by 2.7× with a negligible performance drop of <1%. We further extend FastVAR to zero-shot generation of higher-resolution images. In particular, FastVAR can generate one 2K image with a 15GB memory footprint in 1.5s on a single NVIDIA 3090 GPU. Code is available at this https URL.
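Abstract (translated from Chinese): Visual Autoregressive (VAR) modeling has become popular for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, so complexity and runtime grow sharply with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our key finding is that most of the latency comes from the large-scale step, where most tokens have already converged. Building on this observation, we develop a cached token pruning strategy that forwards only pivotal tokens for scale-specific modeling while using cached tokens from previous scale steps to restore the pruned slots. This greatly reduces the number of forwarded tokens and improves efficiency at larger resolutions. Experiments show that FastVAR can further accelerate FlashAttention-accelerated VAR by 2.7× with a performance drop of <1%. We further extend FastVAR to zero-shot generation of higher-resolution images. In particular, FastVAR can generate one 2K image in 1.5s on a single NVIDIA 3090 GPU. Code is available at this https URL.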
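The cached token pruning idea in the FastVAR abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes a 1-D token sequence instead of a 2-D token map, nearest-neighbour upsampling of the cache, and a hypothetical pivotal-token criterion (distance from the mean token). The names `upsample_nearest`, `pivotal_indices`, and `pruned_forward` are illustrative only.

```python
from typing import Callable, List, Tuple

Token = List[float]


def upsample_nearest(cached: List[Token], n: int) -> List[Token]:
    """Stretch a cached token sequence to length n by nearest-neighbour repetition."""
    m = len(cached)
    return [list(cached[i * m // n]) for i in range(n)]


def pivotal_indices(tokens: List[Token], keep: int) -> List[int]:
    """Assumed criterion: keep the `keep` tokens farthest from the mean token."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    score = [sum((t[d] - mean[d]) ** 2 for d in range(dim)) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:keep])


def pruned_forward(
    tokens: List[Token],
    cached_prev: List[Token],
    layer: Callable[[Token], Token],
    keep: int,
) -> Tuple[List[Token], List[int]]:
    """Run `layer` only on pivotal tokens; fill pruned slots from the cache."""
    kept = pivotal_indices(tokens, keep)
    # Start from the upsampled cache of the previous scale step ...
    out = upsample_nearest(cached_prev, len(tokens))
    # ... and overwrite only the slots that were actually forwarded.
    for i in kept:
        out[i] = layer(tokens[i])
    return out, kept


tokens = [[0.0], [0.1], [5.0], [0.2]]
cached_prev = [[10.0], [20.0]]
out, kept = pruned_forward(tokens, cached_prev, lambda t: [2 * x for x in t], keep=2)
```

In this sketch the layer cost scales with `keep` rather than with the full token count, which is the source of the speed-up the abstract describes.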

Title: Towards Physically Plausible Video Generation via VLM Planning

Authors: Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23368
Pdf URL: https://arxiv.org/pdf/2503.23368
Copy Paste: [[2503.23368]] Towards Physically Plausible Video Generation via VLM Planning(https://arxiv.org/abs/2503.23368)
Keywords: generation
Abstract: Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. As the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: this https URL.
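Abstract (translated from Chinese): Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the community's attention to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos because they inherently lack an understanding of physics, which leads to incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we use a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results show that our framework can produce physically plausible motion, and comparative evaluations highlight the notable advantage of our approach over existing methods. More video results are available on our Project Page: this https URL.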

Title: Map Feature Perception Metric for Map Generation Quality Assessment and Loss Optimization

Authors: Chenxing Sun, Jing Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23370
Pdf URL: https://arxiv.org/pdf/2503.23370
Copy Paste: [[2503.23370]] Map Feature Perception Metric for Map Generation Quality Assessment and Loss Optimization(https://arxiv.org/abs/2503.23370)
Keywords: generation, generative, quality assessment
Abstract: In intelligent cartographic generation tasks empowered by generative models, the authenticity of synthesized maps constitutes a critical determinant. Concurrently, the selection of appropriate evaluation metrics to quantify map authenticity emerges as a pivotal research challenge. Current methodologies predominantly adopt computer vision-based image assessment metrics to compute discrepancies between generated and reference maps. However, conventional visual similarity metrics, including L1, L2, SSIM, and FID, primarily operate on pixel-level comparisons, inadequately capturing cartographic global features and spatial correlations and consequently inducing semantic-structural artifacts in generated outputs. This study introduces a novel Map Feature Perception (MFP) Metric designed to evaluate global characteristics and spatial congruence between synthesized and target maps. Diverging from pixel-wise metrics, our approach extracts elemental-level deep features that comprehensively encode cartographic structural integrity and topological relationships. Experimental validation demonstrates MFP's superior capability in evaluating cartographic semantic features, with classification-enhanced implementations outperforming conventional loss functions across diverse generative frameworks. When employed as an optimization objective, our metric achieves performance gains ranging from 2% to 50% across multiple benchmarks compared to traditional L1, L2, and SSIM baselines. This investigation concludes that explicit consideration of cartographic global attributes and spatial coherence substantially enhances generative model optimization, thereby significantly improving the geographical plausibility of synthesized maps.
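Abstract (translated from Chinese): In intelligent cartographic generation tasks empowered by generative models, the authenticity of synthesized maps is a critical determinant. At the same time, selecting appropriate evaluation metrics to quantify map authenticity is a pivotal research challenge. Current methods mainly adopt computer vision-based image assessment metrics to compute discrepancies between generated and reference maps. However, conventional visual similarity metrics, including L1, L2, SSIM, and FID, operate primarily on pixel-level comparisons and fail to adequately capture cartographic global features and spatial correlations, which induces semantic-structural artifacts in the generated outputs. This study introduces a novel Map Feature Perception (MFP) Metric designed to evaluate global characteristics and spatial congruence between synthesized and target maps. Unlike pixel-wise metrics, our approach extracts element-level deep features that comprehensively encode cartographic structural integrity and topological relationships. Experimental validation shows that MFP is better at evaluating cartographic semantic features, with classification-enhanced implementations outperforming conventional loss functions across diverse generative frameworks. When used as an optimization objective, our metric achieves performance gains ranging from 2% to 50% across multiple benchmarks compared with traditional L1, L2, and SSIM baselines. This investigation concludes that explicitly considering cartographic global attributes and spatial coherence substantially enhances generative model optimization, thereby significantly improving the geographical plausibility of synthesized maps.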

Title: JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Authors: Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua
Subjects: cs.CV, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.23377
Pdf URL: https://arxiv.org/pdf/2503.23377
Copy Paste: [[2503.23377]] JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization(https://arxiv.org/abs/2503.23377)
Keywords: generation
Abstract: This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. Further, we specifically devise a robust metric for evaluating the synchronization between generated audio-video pairs in real-world complex content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at this https URL.
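Abstract (translated from Chinese): This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built on the powerful Diffusion Transformer (DiT) architecture, JavisDiT can generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos that span diverse scenes and complex real-world scenarios. In addition, we devise a robust metric for evaluating the synchronization between generated audio-video pairs in complex real-world content. Experimental results show that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at this https URL.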

Title: COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation

Authors: Fanding Huang, Jingyan Jiang, Qinting Jiang, Hebei Li, Faisal Nadeem Khan, Zhi Wang
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.23388
Pdf URL: https://arxiv.org/pdf/2503.23388
Copy Paste: [[2503.23388]] COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation(https://arxiv.org/abs/2503.23388)
Keywords: generation
Abstract: Recent vision-language models (VLMs) face significant challenges in test-time adaptation to novel domains. While cache-based methods show promise by leveraging historical information, they struggle with both caching unreliable feature-label pairs and indiscriminately using single-class information during querying, significantly compromising adaptation accuracy. To address these limitations, we propose COSMIC (Clique-Oriented Semantic Multi-space Integration for CLIP), a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching and graph-based querying mechanisms. Our framework introduces two key innovations: Dual Semantics Graph (DSG) and Clique Guided Hyper-class (CGH). The Dual Semantics Graph constructs complementary semantic spaces by incorporating textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships. Building upon these dual graphs, the Clique Guided Hyper-class component leverages structured class relationships to enhance prediction robustness through correlated class selection. Extensive experiments demonstrate COSMIC's superior performance across multiple benchmarks, achieving significant improvements over state-of-the-art methods: 15.81% gain on out-of-distribution tasks and 5.33% on cross-domain generation with CLIP RN-50. Code is available at this http URL.
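Abstract (translated from Chinese): Recent vision-language models (VLMs) face significant challenges in test-time adaptation to novel domains. Although cache-based methods show promise by leveraging historical information, they struggle both with caching unreliable feature-label pairs and with indiscriminately using single-class information during querying, which significantly compromises adaptation accuracy. To address these limitations, we propose COSMIC (Clique-Oriented Semantic Multi-space Integration for CLIP), a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching and graph-based querying mechanisms. Our framework introduces two key innovations: the Dual Semantics Graph (DSG) and the Clique Guided Hyper-class (CGH). The Dual Semantics Graph constructs complementary semantic spaces by incorporating textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships. Building on these dual graphs, the Clique Guided Hyper-class component leverages structured class relationships to improve prediction robustness through correlated class selection. Extensive experiments demonstrate COSMIC's superior performance across multiple benchmarks, with significant improvements over state-of-the-art methods: a 15.81% gain on out-of-distribution tasks and 5.33% on cross-domain generation with CLIP RN-50. Code is available at this http URL.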

Title: A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models

Authors: Leander Girrbach, Stephan Alaniz, Genevieve Smith, Zeynep Akata
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2503.23398
Pdf URL: https://arxiv.org/pdf/2503.23398
Copy Paste: [[2503.23398]] A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models(https://arxiv.org/abs/2503.23398)
Keywords: generation, generative
Abstract: With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents the first large-scale study on gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes in household roles, and underrepresent women in financial related activities. Women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical labor scenarios.
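Abstract (translated from Chinese): With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents the first large-scale study of gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or with multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes in household roles, and underrepresent women in finance-related activities. Women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical labor scenarios.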
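The per-prompt proportion described in the abstract above is simple arithmetic. The following sketch shows one way it could be computed. It is an illustration only: the label strings (`"male"`, `"female"`, anything else treated as filtered out) and the function name `gender_proportions` are assumptions, not the study's actual pipeline.

```python
from collections import Counter
from typing import Dict, List


def gender_proportions(
    labels_by_prompt: Dict[str, List[str]],
) -> Dict[str, Dict[str, float]]:
    """Return the male/female share of the retained images for each prompt.

    Images whose label is neither "male" nor "female" (for example, no person
    detected or mixed genders) are treated as filtered out and ignored.
    """
    result: Dict[str, Dict[str, float]] = {}
    for prompt, labels in labels_by_prompt.items():
        counts = Counter(label for label in labels if label in ("male", "female"))
        total = counts["male"] + counts["female"]
        if total == 0:
            continue  # nothing left after filtering; skip the prompt
        result[prompt] = {
            "male": counts["male"] / total,
            "female": counts["female"] / total,
        }
    return result


# Hypothetical detector output for two prompts.
example = {
    "a person cooking": ["female", "female", "male", "none"],
    "an empty street": ["none"],
}
proportions = gender_proportions(example)
```

Prompts can then be grouped into concepts and their proportions averaged to compare, for instance, household versus finance-related activities.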

Title: Diffusion Meets Few-shot Class Incremental Learning

Authors: Junsu Kim, Yunhoe Ku, Dongyoon Han, Seungryul Baek
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23402
Pdf URL: https://arxiv.org/pdf/2503.23402
Copy Paste: [[2503.23402]] Diffusion Meets Few-shot Class Incremental Learning(https://arxiv.org/abs/2503.23402)
Keywords: generation, generative
Abstract: Few-shot class-incremental learning (FSCIL) is challenging because training data are extremely limited while the model must both reduce catastrophic forgetting and learn new information. We propose Diffusion-FSCIL, a novel approach that employs a text-to-image diffusion model as a frozen backbone. Our conjecture is that FSCIL can be tackled using a large generative model's capabilities, benefiting from 1) generation ability via large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize the representation capability, we propose to extract multiple complementary diffusion features that serve as latent replay, with slight support from feature distillation to prevent generative biases. Our framework realizes efficiency through 1) using a frozen backbone; 2) minimal trainable components; 3) batch processing of multiple feature extractions. Extensive experiments on CUB-200, miniImageNet, and CIFAR-100 show that Diffusion-FSCIL surpasses state-of-the-art methods, preserving performance on previously learned classes and adapting effectively to new ones.
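Abstract (translated from Chinese): Few-shot class-incremental learning (FSCIL) is challenging because training data are extremely limited while the goal is to reduce catastrophic forgetting and learn new information. We propose Diffusion-FSCIL, a novel approach that uses a text-to-image diffusion model as a frozen backbone. Our conjecture is that FSCIL can be tackled with the capabilities of a large generative model, benefiting from 1) generation ability gained through large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize representation capability, we propose extracting multiple complementary diffusion features that act as latent replay, with slight support from feature distillation to prevent generative biases. Our framework achieves efficiency through 1) a frozen backbone; 2) minimal trainable components; 3) batch processing of multiple feature extractions. Extensive experiments on CUB-200, miniImageNet, and CIFAR-100 show that Diffusion-FSCIL surpasses state-of-the-art methods, preserving performance on previously learned classes and adapting effectively to new ones.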

Title: GMapLatent: Geometric Mapping in Latent Space

Authors: Wei Zeng, Xuebin Chang, Jianghao Su, Xiang Gu, Jian Sun, Zongben Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23407
Pdf URL: https://arxiv.org/pdf/2503.23407
Copy Paste: [[2503.23407]] GMapLatent: Geometric Mapping in Latent Space(https://arxiv.org/abs/2503.23407)
Keywords: generation, generative
Abstract: Cross-domain generative models based on encoder-decoder AI architectures have attracted much attention in generating realistic images, where domain alignment is crucial for generation accuracy. Domain alignment methods usually deal directly with the initial distribution; however, mismatched or mixed clusters can lead to mode collapse and mixture problems in the decoder, compromising model generalization capabilities. In this work, we innovate a cross-domain alignment and generation model that introduces a canonical latent space representation based on geometric mapping to align the cross-domain latent spaces in a rigorous and precise manner, thus avoiding mode collapse and mixture in the encoder-decoder generation architectures. We name this model GMapLatent. The core of the method is to seamlessly align latent spaces with strict cluster correspondence constraints using the canonical parameterizations of cluster-decorated latent spaces. We first (1) transform the latent space to a canonical parameter domain by composing barycenter translation, optimal transport merging and constrained harmonic mapping, and then (2) compute geometric registration with cluster constraints over the canonical parameter domains. This process realizes a bijective (one-to-one and onto) mapping between newly transformed latent spaces and generates a precise alignment of cluster pairs. Cross-domain generation is then achieved through the aligned latent spaces embedded in the encoder-decoder pipeline. Experiments on gray-scale and color images validate the efficiency, efficacy and applicability of GMapLatent, and demonstrate that the proposed model has superior performance over existing models.
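Abstract (translated from Chinese): Cross-domain generative models based on encoder-decoder AI architectures have attracted much attention for generating realistic images, where domain alignment is crucial for generation accuracy. Domain alignment methods usually deal directly with the initial distribution; however, mismatched or mixed clusters can lead to mode collapse and mixture problems in the decoder, compromising the model's generalization capability. In this work, we develop a cross-domain alignment and generation model that introduces a canonical latent space representation based on geometric mapping to align the cross-domain latent spaces rigorously and precisely, thus avoiding mode collapse and mixture in encoder-decoder generation architectures. We name this model GMapLatent. The core of the method is to seamlessly align latent spaces under strict cluster correspondence constraints using the canonical parameterizations of cluster-decorated latent spaces. We first (1) transform the latent space to a canonical parameter domain by composing barycenter translation, optimal transport merging, and constrained harmonic mapping, and then (2) compute geometric registration with cluster constraints over the canonical parameter domains. This process realizes a bijective (one-to-one and onto) mapping between the newly transformed latent spaces and produces a precise alignment of cluster pairs. Cross-domain generation is then achieved through the aligned latent spaces embedded in the encoder-decoder pipeline. Experiments on gray-scale and color images validate the efficiency, efficacy, and applicability of GMapLatent and show that the proposed model outperforms existing models.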

Title: VideoGen-Eval: Agent-based System for Video Generation Evaluation

Authors: Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23452
Pdf URL: https://arxiv.org/pdf/2503.23452
Copy Paste: [[2503.23452]] VideoGen-Eval: Agent-based System for Video Generation Evaluation(https://arxiv.org/abs/2503.23452)
Keywords: generation
Abstract: The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase the model's capabilities, fixed evaluation operators struggling with Out-of-Distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge the gap, we propose VideoGen-Eval, an agent evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporal-dense dimensions, to achieve a dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models; among them, 8 cutting-edge models are selected for quantitative evaluation by both the agent and humans. Extensive experiments validate that our proposed agent-based evaluation system demonstrates strong alignment with human preferences and reliably completes the evaluation, and they confirm the diversity and richness of the benchmark.
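Abstract (translated from Chinese): The rapid advancement of video generation has left existing evaluation systems inadequate for assessing state-of-the-art models, mainly because simple prompts cannot showcase a model's capabilities, fixed evaluation operators struggle with Out-of-Distribution (OOD) cases, and computed metrics are misaligned with human preferences. To bridge the gap, we propose VideoGen-Eval, an agent evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporal-dense dimensions, to achieve dynamic, flexible, and expandable video generation evaluation. In addition, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models, of which 8 cutting-edge models are selected for quantitative evaluation by the agent and by humans. Extensive experiments validate that our agent-based evaluation system aligns strongly with human preferences and completes the evaluation reliably, and they confirm the diversity and richness of the benchmark.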

Title: Efficient Token Compression for Vision Transformer with Spatial Information Preserved

Authors: Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.23455
Pdf URL: https://arxiv.org/pdf/2503.23455
Copy Paste: [[2503.23455]] Efficient Token Compression for Vision Transformer with Spatial Information Preserved(https://arxiv.org/abs/2503.23455)
Keywords: restoration
Abstract: Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, achieving significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64× speed-up with only a 0.2% drop in accuracy on ImageNet-1k. Moreover, by compressing segmenter models and comparing with existing methods, we demonstrate the superior performance of our approach in terms of efficiency and effectiveness. Code and models have been made available at this https URL.
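Abstract (translated from Chinese): Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and using shortcut connections, we merge tokens efficiently while preserving important information and enabling pruned tokens to be restored. In addition, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and improving compression efficiency. We also use gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, which achieves significant speed-ups with minimal accuracy degradation compared with state-of-the-art methods. For instance, on DeiT-Small we achieve a 1.64× speed-up with only a 0.2% drop in accuracy on ImageNet-1k. Moreover, by compressing segmenter models and comparing with existing methods, we demonstrate the superior efficiency and effectiveness of our approach. Code and models have been made available at this https URL.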
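The merge and reconstruct matrices mentioned in the Prune and Merge abstract can be pictured with a small pure-Python sketch. This is an illustrative guess at the mechanics, not the paper's code. It assumes a (K×N) merge matrix that maps N tokens to K merged tokens, an (N×K) reconstruct matrix that maps them back, and a shortcut that simply adds the original tokens to the restored ones. In the paper these matrices are trainable, whereas here they are fixed by hand, and all function names are hypothetical.

```python
from typing import List, Optional

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (p x q) and b (q x r)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_tokens(tokens: Matrix, merge_matrix: Matrix) -> Matrix:
    """Compress N tokens (N x D) into K tokens (K x D) with a (K x N) matrix."""
    return matmul(merge_matrix, tokens)


def reconstruct_tokens(
    merged: Matrix, reconstruct_matrix: Matrix, shortcut: Optional[Matrix] = None
) -> Matrix:
    """Expand K tokens back to N slots; optionally add a shortcut branch."""
    restored = matmul(reconstruct_matrix, merged)
    if shortcut is not None:
        restored = [
            [r + s for r, s in zip(row_r, row_s)]
            for row_r, row_s in zip(restored, shortcut)
        ]
    return restored


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
merge_matrix = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # average token pairs
reconstruct_matrix = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # copy back

merged = merge_tokens(tokens, merge_matrix)
restored = reconstruct_tokens(merged, reconstruct_matrix)
restored_with_shortcut = reconstruct_tokens(merged, reconstruct_matrix, shortcut=tokens)
```

Because merging and reconstruction are ordinary dense matrix products, the operation maps onto standard hardware kernels, which is one plausible reading of the "hardware-compatible" claim.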

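For the Prune and Merge entry above, the pruning half of the idea can be illustrated with a small sketch. The abstract does not give the scoring formula, so the score used here (the absolute product of an attention weight and a gradient) and the names `prune_tokens` and `keep_ratio` are assumptions for illustration only, not the authors' method; the trainable merge and reconstruct matrices are not shown.

```python
def prune_tokens(tokens, attention, gradients, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order.

    Assumption: importance is |attention * gradient| per token. This is a
    stand-in for a gradient-weighted attention score, not the paper's formula.
    """
    scores = [abs(a * g) for a, g in zip(attention, gradients)]
    # Always keep at least one token.
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by score (highest first) and take the top k.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore the original token order
    return [tokens[i] for i in keep], keep


kept, kept_idx = prune_tokens(
    ["a", "b", "c", "d"],
    attention=[0.1, 0.4, 0.3, 0.2],
    gradients=[1.0, 0.5, 2.0, 0.1],
)
print(kept, kept_idx)
```

Returning the kept indices alongside the tokens is a design choice that would let a later step put restored tokens back in their original positions.
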
Title: TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Authors: Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, Ying Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23461
Pdf URL: https://arxiv.org/pdf/2503.23461
Copy Paste: [[2503.23461]] TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes(https://arxiv.org/abs/2503.23461)
Keywords: generation, generative
Abstract: This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often render distorted and blurred visual text or miss some visual text entirely. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.

Title: OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model

Authors: Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Alois C. Knoll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23463
Pdf URL: https://arxiv.org/pdf/2503.23463
Copy Paste: [[2503.23463]] OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model(https://arxiv.org/abs/2503.23463)
Keywords: generation
Abstract: We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving. OpenDriveVLA builds upon open-source pre-trained large Vision-Language Models (VLMs) to generate reliable driving actions, conditioned on 3D environmental perception, ego vehicle states, and driver commands. To bridge the modality gap between driving visual representations and language embeddings, we propose a hierarchical vision-language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Besides, OpenDriveVLA models the dynamic relationships between the ego vehicle, surrounding agents, and static road elements through an autoregressive agent-env-ego interaction process, ensuring both spatially and behaviorally informed trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question-answering tasks. Qualitative analyses further illustrate OpenDriveVLA's superior capability to follow high-level driving commands and robustly generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving. We will release our code to facilitate further research in this domain.

Title: A Survey on Unlearnable Data

Authors: Jiahao Li, Yiqiang Chen, Yunbing Xing, Yang Gu, Xiangyuan Lan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23536
Pdf URL: https://arxiv.org/pdf/2503.23536
Copy Paste: [[2503.23536]] A Survey on Unlearnable Data(https://arxiv.org/abs/2503.23536)
Keywords: generation
Abstract: Unlearnable data (ULD) has emerged as an innovative defense technique to prevent machine learning models from learning meaningful patterns from specific data, thus protecting data privacy and security. By introducing perturbations to the training data, ULD degrades model performance, making it difficult for unauthorized models to extract useful representations. Despite the growing significance of ULD, existing surveys predominantly focus on related fields, such as adversarial attacks and machine unlearning, with little attention given to ULD as an independent area of study. This survey fills that gap by offering a comprehensive review of ULD, examining unlearnable data generation methods, public benchmarks, evaluation metrics, theoretical foundations and practical applications. We compare and contrast different ULD approaches, analyzing their strengths, limitations, and trade-offs related to unlearnability, imperceptibility, efficiency and robustness. Moreover, we discuss key challenges, such as balancing perturbation imperceptibility with model degradation and the computational complexity of ULD generation. Finally, we highlight promising future research directions to advance the effectiveness and applicability of ULD, underscoring its potential to become a crucial tool in the evolving landscape of data protection in machine learning.

Title: Enhancing Creative Generation on Stable Diffusion-based Models

Authors: Jiyeon Han, Dahee Kwon, Gayoung Lee, Junho Kim, Jaesik Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23538
Pdf URL: https://arxiv.org/pdf/2503.23538
Copy Paste: [[2503.23538]] Enhancing Creative Generation on Stable Diffusion-based Models(https://arxiv.org/abs/2503.23538)
Keywords: generation, generative
Abstract: Recent text-to-image generative models, particularly Stable Diffusion and its distilled variants, have achieved impressive fidelity and strong text-image alignment. However, their creative capability remains constrained, as including 'creative' in prompts seldom yields the desired results. This paper introduces C3 (Creative Concept Catalyst), a training-free approach designed to enhance creativity in Stable Diffusion-based models. C3 selectively amplifies features during the denoising process to foster more creative outputs. We offer practical guidelines for choosing amplification factors based on two main aspects of creativity. C3 is the first study to enhance creativity in diffusion models without extensive computational costs. We demonstrate its effectiveness across various Stable Diffusion-based models.

Title: DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution

Authors: Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy Ren, Chun-Le Guo, Chongyi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23580
Pdf URL: https://arxiv.org/pdf/2503.23580
Copy Paste: [[2503.23580]] DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution(https://arxiv.org/abs/2503.23580)
Keywords: super-resolution, generation, generative
Abstract: Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recent development of diffusion transformer (DiT) has witnessed overwhelming performance over the traditional UNet-based architecture in image generation, which also raises the question: Can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images like ControlNet, we integrate the LR embeddings into the original attention mechanism of DiT, allowing for the bidirectional flow of information between the LR latent and the generated latent. The sufficient interaction of these two streams allows the LR stream to evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. Additionally, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT's limited ability to capture local information. These simple but effective designs endow the DiT model with superior performance in Real-ISR, which is demonstrated by extensive experiments. Project Page: this https URL.

Title: Make Autoregressive Great Again: Diffusion-Free Graph Generation with Next-Scale Prediction

Authors: Samuel Belkadi, Steve Hong, Marian Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23612
Pdf URL: https://arxiv.org/pdf/2503.23612
Copy Paste: [[2503.23612]] Make Autoregressive Great Again: Diffusion-Free Graph Generation with Next-Scale Prediction(https://arxiv.org/abs/2503.23612)
Keywords: generation, generative
Abstract: Autoregressive models are popular generative models due to their speed and properties. However, they require an explicit sequence order, which contradicts the unordered nature of graphs. In contrast, diffusion models maintain permutation invariance and enable one-shot generation but require up to thousands of denoising steps and additional features, leading to high computational costs. Inspired by recent breakthroughs in image generation, especially the success of visual autoregressive methods, we propose MAG, a novel diffusion-free graph generation framework based on next-scale prediction. By leveraging a hierarchy of latent representations, the model progressively generates scales of the entire graph without the need for explicit node ordering. Extensive experiments on both generic and molecular graph datasets demonstrate that MAG delivers competitive performance compared to state-of-the-art methods, achieving up to three orders of magnitude in speedup during inference.

Title: Graph-Eq: Discovering Mathematical Equations using Graph Generative Models

Authors: Nisal Ranasinghe, Damith Senanayake, Saman Halgamuge
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23617
Pdf URL: https://arxiv.org/pdf/2503.23617
Copy Paste: [[2503.23617]] Graph-Eq: Discovering Mathematical Equations using Graph Generative Models(https://arxiv.org/abs/2503.23617)
Keywords: generative
Abstract: The ability to discover meaningful, accurate, and concise mathematical equations that describe datasets is valuable across various domains. Equations offer explicit relationships between variables, enabling deeper insights into underlying data patterns. Most existing equation discovery methods rely on genetic programming, which iteratively searches the equation space but is often slow and prone to overfitting. By representing equations as directed acyclic graphs, we leverage graph neural networks to learn the underlying semantics of equations and generate new, previously unseen equations. Although graph generative models have been shown to be successful in discovering new types of graphs in many fields, their application in discovering equations remains largely unexplored. In this work, we propose Graph-Eq, a deep graph generative model designed for efficient equation discovery. Graph-Eq uses a conditional variational autoencoder (CVAE) to learn a rich latent representation of the equation space by training it on a large corpus of equations in an unsupervised manner. Instead of directly searching the equation space, we employ Bayesian optimization to efficiently explore this learned latent space. We show that the encoder-decoder architecture of Graph-Eq is able to accurately reconstruct input equations. Moreover, we show that the learned latent representation can be sampled and decoded into valid equations, including new and previously unseen equations in the training data. Finally, we assess Graph-Eq's ability to discover equations that best fit a dataset by exploring the latent space using Bayesian optimization. Latent space exploration is done on 20 datasets with known ground-truth equations, and Graph-Eq is shown to successfully discover the ground-truth equation in the majority of datasets.

Title: Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging

Authors: Amar Kumar, Anita Kriz, Barak Pertzov, Tal Arbel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23618
Pdf URL: https://arxiv.org/pdf/2503.23618
Copy Paste: [[2503.23618]] Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging(https://arxiv.org/abs/2503.23618)
Keywords: generation
Abstract: Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text, with emerging applications in medical imaging. In this work, we are the first to investigate the question: 'Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?' By evaluating our proposed method on a chest x-ray dataset, we show that these models can generate high-resolution, precisely edited images compared to methods that rely on Structural Causal Models (SCMs) according to numerous metrics. For the first time, we demonstrate that fine-tuned VLMs can reveal hidden data relationships that were previously obscured due to available metadata granularity and model capacity limitations. Our experiments demonstrate both the potential of these models to reveal underlying dataset properties while also exposing the limitations of fine-tuned VLMs for accurate image editing and susceptibility to biases and spurious correlations.

Title: Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation

Authors: Zahra TehraniNasab, Amar Kumar, Tal Arbel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23623
Pdf URL: https://arxiv.org/pdf/2503.23623
Copy Paste: [[2503.23623]] Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation(https://arxiv.org/abs/2503.23623)
Keywords: generation, generative
Abstract: Text-to-image diffusion models have demonstrated a remarkable ability to generate photorealistic images from natural language prompts. These high-resolution, language-guided synthesized images are essential for the explainability of disease or exploring causal relationships. However, their potential for disentangling and controlling latent factors of variation in specialized domains like medical imaging remains under-explored. In this work, we present the first investigation of the power of pre-trained vision-language foundation models, once fine-tuned on medical image datasets, to perform latent disentanglement for factorized medical image generation and interpolation. Through extensive experiments on chest X-ray and skin datasets, we illustrate that fine-tuned, language-guided Stable Diffusion inherently learns to factorize key attributes for image generation, such as the patient's anatomical structures or disease diagnostic features. We devise a framework to identify, isolate, and manipulate key attributes through latent space trajectory traversal of generative models, facilitating precise control over medical image synthesis.

Title: DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance

Authors: Junjie Zheng, Zihao Chen, Chaofan Ding, Xinhan Di
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23660
Pdf URL: https://arxiv.org/pdf/2503.23660
Copy Paste: [[2503.23660]] DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance(https://arxiv.org/abs/2503.23660)
Keywords: generation
Abstract: Current movie dubbing technology can generate the desired voice from a given speech prompt, ensuring good synchronization between speech and visuals while accurately conveying the intended emotions. However, in movie dubbing, key aspects such as adapting to different dubbing styles, handling dialogue, narration, and monologue effectively, and understanding subtle details like the age and gender of speakers, have not been well studied. To address this challenge, we propose a multi-modal large language model framework. First, it utilizes multimodal Chain-of-Thought (CoT) reasoning methods on visual inputs to understand dubbing styles and fine-grained attributes. Second, it generates high-quality dubbing through large speech generation models, guided by multimodal conditions. Additionally, we have developed a movie dubbing dataset with CoT annotations. The evaluation results demonstrate a performance improvement over state-of-the-art methods across multiple datasets. In particular, SPK-SIM and EMO-SIM increase from 82.48% to 89.74% and from 66.24% to 78.88%, respectively, for dubbing setting 2.0 on the V2C Animation dataset; LSE-D and MCD-SL decrease from 14.79 to 14.63 and from 5.24 to 4.74, respectively, for dubbing setting 2.0 on the Grid dataset; and SPK-SIM increases from 64.03 to 83.42 while WER decreases from 52.69% to 23.20% for the initial reasoning setting on the proposed CoT-Movie-Dubbing dataset, in comparison with state-of-the-art models.

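The DeepDubber-V1 entry above reports word error rate (WER) among its metrics. As a reminder of how that figure is conventionally computed, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; the paper's own evaluation pipeline is not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference and simple whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.
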
Title: Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity

Authors: Kotaro Inoue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23667
Pdf URL: https://arxiv.org/pdf/2503.23667
Copy Paste: [[2503.23667]] Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity(https://arxiv.org/abs/2503.23667)
Keywords: generation
Abstract: Due to their high versatility in tasks such as image captioning, document analysis, and automated content generation, multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields. In particular, they have been shown to surpass specialized models in Optical Character Recognition (OCR). Nevertheless, their performance under different image conditions remains insufficiently investigated, and individual character recognition is not guaranteed due to their reliance on contextual cues. In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities to determine the conditions for accurate recognition. Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi. Additionally, we observe a very weak correlation between visual complexity and misrecognitions, whereas a conventional OCR-specific model exhibits no correlation. These results suggest that image resolution and visual complexity may play an important role in the reliable application of multimodal LLMs to OCR tasks that require precise character-level accuracy.

Title: Expanding-and-Shrinking Binary Neural Networks

Authors: Xulong Shi, Caiyi Sun, Zhi Qi, Liu Hao, Xiaodong Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23709
Pdf URL: https://arxiv.org/pdf/2503.23709
Copy Paste: [[2503.23709]] Expanding-and-Shrinking Binary Neural Networks(https://arxiv.org/abs/2503.23709)
Keywords: generative
Abstract: While binary neural networks (BNNs) offer significant benefits in terms of speed, memory and energy, they encounter substantial accuracy degradation in challenging tasks compared to their real-valued counterparts. Due to the binarization of weights and activations, the possible values of each entry in the feature maps generated by BNNs are strongly constrained. To tackle this limitation, we propose the expanding-and-shrinking operation, which enhances binary feature maps with negligible increase of computation complexity, thereby strengthening the representation capacity. Extensive experiments conducted on multiple benchmarks reveal that our approach generalizes well across diverse applications ranging from image classification, object detection to generative diffusion model, while also achieving remarkable improvement over various leading binarization algorithms based on different architectures including both CNNs and Transformers.

Title: HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

Authors: Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, Wu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23715
Pdf URL: https://arxiv.org/pdf/2503.23715
Copy Paste: [[2503.23715]] HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation(https://arxiv.org/abs/2503.23715)
Keywords: generation
Abstract: Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first large-scale dataset for HOI Generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates the hallucinations of individual MLLMs. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation. Project webpage is available at this https URL.

Title: Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space

Authors: Yi Liu, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23717
Pdf URL: https://arxiv.org/pdf/2503.23717
Copy Paste: [[2503.23717]] Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space(https://arxiv.org/abs/2503.23717)
Keywords: generative
Abstract: Cloud removal (CR) remains a challenging task in remote sensing image processing. Although diffusion models (DM) exhibit strong generative capabilities, their direct applications to CR are suboptimal, as they generate cloudless images from random noise, ignoring inherent information in cloudy inputs. To overcome this drawback, we develop a new CR model EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a modular framework with updatable modules and an elucidated design space, based on a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. Leveraging our framework, we redesign key MRDM modules to boost CR performance, including restructuring the denoiser via a preconditioning technique, reorganizing the training process, and improving the sampling process by introducing deterministic and stochastic samplers. To achieve multi-temporal CR, we further develop a denoising network for simultaneously denoising sequential images. Experiments on mono-temporal and multi-temporal datasets demonstrate the superior performance of EMRDM. Our code is available at this https URL.

Title: KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language

Authors: Yoonshik Kim, Jaeyoon Jung
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.23730
Pdf URL: https://arxiv.org/pdf/2503.23730
Copy Paste: [[2503.23730]] KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language(https://arxiv.org/abs/2503.23730)
Keywords: generative
Abstract: The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions, each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at this https URL.

Title: Every Painting Awakened: A Training-free Framework for Painting-to-Animation Generation

Authors: Lingyu Liu, Yaxiong Wang, Li Zhu, Zhedong Zheng
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.23736
Pdf URL: https://arxiv.org/pdf/2503.23736
Copy Paste: [[2503.23736]] Every Painting Awakened: A Training-free Framework for Painting-to-Animation Generation(https://arxiv.org/abs/2503.23736)
Keywords: generation
Abstract: We introduce a training-free framework specifically designed to bring real-world static paintings to life through image-to-video (I2V) synthesis, addressing the persistent challenge of aligning these motions with textual guidance while preserving fidelity to the original artworks. Existing I2V methods, primarily trained on natural video datasets, often struggle to generate dynamic outputs from static paintings. It remains challenging to generate motion while maintaining visual consistency with real-world paintings. This results in two distinct failure modes: either static outputs due to limited text-based motion interpretation or distorted dynamics caused by inadequate alignment with real-world artistic styles. We leverage the advanced text-image alignment capabilities of pre-trained image models to guide the animation process. Our approach introduces synthetic proxy images through two key innovations: (1) Dual-path score distillation: We employ a dual-path architecture to distill motion priors from both real and synthetic data, preserving static details from the original painting while learning dynamic characteristics from synthetic frames. (2) Hybrid latent fusion: We integrate hybrid features extracted from real paintings and synthetic proxy images via spherical linear interpolation in the latent space, ensuring smooth transitions and enhancing temporal consistency. Experimental evaluations confirm that our approach significantly improves semantic alignment with text prompts while faithfully preserving the unique characteristics and integrity of the original paintings. Crucially, by achieving enhanced dynamic effects without requiring any model training or learnable parameters, our framework enables plug-and-play integration with existing I2V methods, making it an ideal solution for animating real-world paintings. More animated examples can be found on our project website.
摘要：我们介绍了一个专门设计的无训练框架，该框架旨在通过图像到视频（I2V）综合将现实世界中的静态绘画栩栩如生，以解决将这些动作与文本指导保持一致的挑战，同时将其保留到原始艺术品上。现有的I2V方法主要在自然视频数据集中训练，通常很难从静态绘画中产生动态输出。在保持现实世界绘画的视觉一致性的同时，产生运动仍然具有挑战性。这导致了两种不同的故障模式：由于基于文本的运动解释有限，要么是由于与现实世界艺术风格不足引起的静态输出。我们利用预训练的图像模型的高级文本图像对齐功能来指导动画过程。我们的方法通过两个关键创新引入合成代理图像：（1）双路径评分蒸馏：我们采用双路径体系结构来将运动先验从真实和合成数据中提炼出来，从而从原始绘画中保留静态细节，同时从合成框架中学习动态特性。（2）混合潜在融合：我们整合了从真实绘画和合成代理图像中通过球形线性插值在潜在空间中提取的混合特征，从而确保了平滑的过渡并增强了时间的一致性。实验评估证实，我们的方法可以通过文本提示显着提高语义对齐，同时忠实地保留了原始画的独特特征和完整性。至关重要的是，通过在不需要任何模型培训或可学习的参数的情况下实现增强的动态效果，我们的框架可以与现有的I2V方法进行插入式播放集成，从而使其成为为真实世界绘画而动画的理想解决方案。可以在我们的项目网站上找到更多动画示例。
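
The hybrid latent fusion step described in this abstract relies on spherical linear interpolation (slerp). The following is a minimal sketch of generic slerp, assuming latents are flat lists of floats blended with a single factor `t`. The function name `slerp`, the `eps` threshold, and the fallback to linear interpolation are illustrative assumptions; the abstract does not say how the paper applies the interpolation.

```python
import math

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between vectors a and b at fraction t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps or norm_b < eps:
        # A zero vector has no direction: fall back to linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    # Cosine of the angle between the vectors, clamped for numerical safety.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # (Anti)parallel vectors: the slerp weights are undefined, so use lerp.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]

# Hypothetical 2-D "latents" standing in for a real painting and a synthetic proxy.
real_latent = [1.0, 0.0]
proxy_latent = [0.0, 1.0]
fused = slerp(real_latent, proxy_latent, 0.5)
```

Unlike plain linear interpolation, slerp keeps the norm of the blended vector constant when both inputs have equal norm, which is one common reason for preferring it when mixing Gaussian-like latents.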

Title: Time-Series Forecasting via Topological Information Supervised Framework with Efficient Topological Feature Learning

Authors: ZiXin Lin, Nur Fariha Syaqina Zulkepli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23757
Pdf URL: https://arxiv.org/pdf/2503.23757
Copy Paste: [[2503.23757]] Time-Series Forecasting via Topological Information Supervised Framework with Efficient Topological Feature Learning(https://arxiv.org/abs/2503.23757)
Keywords: generative
Abstract: Topological Data Analysis (TDA) has emerged as a powerful tool for extracting meaningful features from complex data structures, driving significant advancements in fields such as neuroscience, biology, machine learning, and financial modeling. Despite its success, the integration of TDA with time-series prediction remains underexplored due to three primary challenges: the limited utilization of temporal dependencies within topological features, computational bottlenecks associated with persistent homology, and the deterministic nature of TDA pipelines restricting generalized feature learning. This study addresses these challenges by proposing the Topological Information Supervised (TIS) Prediction framework, which leverages neural networks and Conditional Generative Adversarial Networks (CGANs) to generate synthetic topological features, preserving their distribution while significantly reducing computational time. We propose a novel training strategy that integrates topological consistency loss to improve the predictive accuracy of deep learning models. Specifically, we introduce two state-of-the-art models, TIS-BiGRU and TIS-Informer, designed to capture short-term and long-term temporal dependencies, respectively. Comparative experimental results demonstrate the superior performance of TIS models over conventional predictors, validating the effectiveness of integrating topological information. This work not only advances TDA-based time-series prediction but also opens new avenues for utilizing topological features in deep learning architectures.
摘要：拓扑数据分析（TDA）已成为从复杂的数据结构中提取有意义的特征，在神经科学，生物学，机器学习和财务建模等领域的重大进步中提取有意义的功能的强大工具。尽管取得了成功，但由于三个主要挑战，TDA与时间序列预测的集成仍然没有得到充实的态度：在拓扑特征中对时间依赖性的利用率有限，与持续同源性相关的计算瓶颈以及TDA管道的确定性性质限制了一般特征学习。这项研究通过提出拓扑信息监督（TIS）预测框架来解决这些挑战，该框架利用神经网络和有条件的生成对抗网络（CGAN）生成合成的拓扑特征，并保留它们的分布，同时大大减少计算时间。我们提出了一种新颖的培训策略，该策略将拓扑一致性损失整合在一起，以提高深度学习模型的预测准确性。具体而言，我们介绍了两种最先进的模型，即tis-bigru和tis-informer，旨在分别捕获短期和长期的时间依赖性。比较实验结果表明，TIS模型的性能优于常规预测因子，从而验证了整合拓扑信息的有效性。这项工作不仅可以提高基于TDA的时间序列预测，而且为在深度学习体系结构中利用拓扑特征开辟了新的途径。

Title: Accelerating High-Efficiency Organic Photovoltaic Discovery via Pretrained Graph Neural Networks and Generative Reinforcement Learning

Authors: Jiangjie Qiu, Hou Hei Lam, Xiuyuan Hu, Wentao Li, Siwei Fu, Fankun Zeng, Hao Zhang, Xiaonan Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23766
Pdf URL: https://arxiv.org/pdf/2503.23766
Copy Paste: [[2503.23766]] Accelerating High-Efficiency Organic Photovoltaic Discovery via Pretrained Graph Neural Networks and Generative Reinforcement Learning(https://arxiv.org/abs/2503.23766)
Keywords: generative
Abstract: Organic photovoltaic (OPV) materials offer a promising avenue toward cost-effective solar energy utilization. However, optimizing donor-acceptor (D-A) combinations to achieve high power conversion efficiency (PCE) remains a significant challenge. In this work, we propose a framework that integrates large-scale pretraining of graph neural networks (GNNs) with a GPT-2 (Generative Pretrained Transformer 2)-based reinforcement learning (RL) strategy to design OPV molecules with potentially high PCE. This approach produces candidate molecules with predicted efficiencies approaching 21%, although further experimental validation is required. Moreover, we conducted a preliminary fragment-level analysis to identify structural motifs recognized by the RL model that may contribute to enhanced PCE, thus providing design guidelines for the broader research community. To facilitate continued discovery, we are building the largest open-source OPV dataset to date, expected to include nearly 3,000 donor-acceptor pairs. Finally, we discuss plans to collaborate with experimental teams on synthesizing and characterizing AI-designed molecules, which will provide new data to refine and improve our predictive and generative models.
摘要：有机光伏（OPV）材料为实现具有成本效益的太阳能利用提供了有希望的途径。但是，优化捐助者（D-A）组合以实现高功率转化效率（PCE）仍然是一个重大挑战。在这项工作中，我们提出了一个框架，该框架将图形神经网络（GNNS）与GPT-2（生成预验证的变压器2）基于基于pce的GPT-2（GNNS）（GNNS）（GNNS）（GNNS）（GNNS）（GNNS）（GNNS）（RL）策略（RL）策略。尽管需要进一步的实验验证，但这种方法会产生候选分子，预测的效率接近21 \％。此外，我们进行了初步的片段级分析，以确定RL模型认可的结构图案，该模型可能有助于增强PCE，从而为更广泛的研究社区提供了设计指南。为了促进持续的发现，我们正在建立迄今为止最大的开源OPV数据集，预计将包括近3,000个捐助者对。最后，我们讨论了与实验团队合成和表征AI设计的分子的计划，该分子将提供新的数据来完善和改善我们的预测性和生成性模型。

Title: On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices

Authors: Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, Seulki Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23796
Pdf URL: https://arxiv.org/pdf/2503.23796
Copy Paste: [[2503.23796]] On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices(https://arxiv.org/abs/2503.23796)
Keywords: generation, generative
Abstract: We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, the proposed On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device, comparable to those produced by high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available at a GitHub repository(this https URL).
摘要：我们提出了Device Sora，这是第一个用于基于扩散的基于扩散的在设备上的文本到视频生成的无模型培训解决方案，可在智能手机级设备上有效运行。为了解决基于扩散的文本到视频生成在计算和内存限制的移动设备上的挑战，拟议的内设备Sora将三种新技术应用于预训练的视频生成模型。首先，线性比例LEAP（LPL）通过有效的基于LEAP的方法来减少视频扩散所需的过度降解步骤。其次，时间维令牌合并（TDTM）通过沿时间维度合并连续令牌来最大程度地减少注意力层中的密集令牌处理。第三，与动态加载（CI-DL）的同时推断将大型模型动态分配到较小的块中，并将它们加载到内存中以进行并发模型推理，从而有效地解决了有限的设备内存的挑战。我们在iPhone 15 Pro上实施了设备的Sora，实验评估表明，它能够在设备上生成高质量的视频，与高端GPU生成的视频相当。这些结果表明，在资源约束的移动设备上，evice sora启用了有效且高质量的视频生成。我们将拟议的“设备上的Sora”视为朝着最先进的生成技术民主化的重要第一步，从而在商品移动设备和嵌入式设备上实现了视频，而无需进行资源密集的重新训练以进行模型优化（压缩）。代码实现可在GITHUB存储库（此HTTPS URL）上获得。
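
The Temporal Dimension Token Merging (TDTM) idea in this abstract, merging consecutive tokens along the temporal dimension before attention, can be illustrated with a small sketch. It assumes tokens are stored as `frames[frame][token][channel]` and that merging means averaging groups of consecutive frames. The averaging rule, the `group` parameter, and the function name are assumptions for illustration, not the paper's stated procedure.

```python
def merge_temporal_tokens(frames, group=2):
    """Average each run of `group` consecutive frames, token by token.

    frames: list of frames, each a list of token vectors (lists of floats).
    Returns a shorter list of frames; a trailing partial group is averaged as-is.
    """
    merged = []
    for start in range(0, len(frames), group):
        chunk = frames[start:start + group]
        n_tokens = len(chunk[0])
        dim = len(chunk[0][0])
        merged_frame = [
            [sum(frame[tok][d] for frame in chunk) / len(chunk) for d in range(dim)]
            for tok in range(n_tokens)
        ]
        merged.append(merged_frame)
    return merged

# Three hypothetical frames, one 2-channel token each.
frames = [[[0.0, 2.0]], [[2.0, 4.0]], [[10.0, 10.0]]]
merged = merge_temporal_tokens(frames, group=2)
```

Halving the number of frames in this way halves the sequence length seen by an attention layer, which is where the computational saving described in the abstract would come from.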

Title: Learned Image Compression and Restoration for Digital Pathology

Authors: SeonYeong Lee, EonSeung Seong, DongEon Lee, SiYeoul Lee, Yubin Cho, Chunsu Park, Seonho Kim, MinKyoung Seo, YoungSin Ko, MinWoo Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23862
Pdf URL: https://arxiv.org/pdf/2503.23862
Copy Paste: [[2503.23862]] Learned Image Compression and Restoration for Digital Pathology(https://arxiv.org/abs/2503.23862)
Keywords: restoration
Abstract: Digital pathology images play a crucial role in medical diagnostics, but their ultra-high resolution and large file sizes pose significant challenges for storage, transmission, and real-time visualization. To address these issues, we propose CLERIC, a novel deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC integrates a learnable lifting scheme and advanced convolutional techniques to enhance compression efficiency while preserving critical pathological details. Our framework employs a lifting-scheme transform in the analysis stage to decompose images into low- and high-frequency components, enabling more structured latent representations. These components are processed through parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) to improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform for effective image reconstruction, ensuring high-fidelity restoration of fine-grained tissue structures. We evaluate CLERIC on a digital pathology image dataset and compare its performance against state-of-the-art learned image compression (LIC) models. Experimental results demonstrate that CLERIC achieves superior rate-distortion (RD) performance, significantly reducing storage requirements while maintaining high diagnostic image quality. Our study highlights the potential of deep learning-based compression in digital pathology, facilitating efficient data management and long-term storage while ensuring seamless integration into clinical workflows and AI-assisted diagnostic systems. Code and models are available at: this https URL.
摘要：数字病理图像在医学诊断中起着至关重要的作用，但是它们的超高分辨率和大型文件尺寸在存储，传输和实时可视化方面构成了重大挑战。为了解决这些问题，我们提出了一种新颖的基于深度学习的图像压缩框架的牧师，专为整个幻灯片图像（WSIS）设计。牧师整合了可学习的提升方案和先进的卷积技术，以提高压缩效率，同时保留关键的病理细节。我们的框架在分析阶段采用了升压 - 旋转转换，将图像分解为低频和高频组件，从而实现了更结构的潜在表示。这些组件通过并行编码器进行处理，这些编码器结合了可变形的残留块（DRB）和复发性残留块（R2B），以改善特征提取和空间适应性。合成阶段应用了反向提升变换，以进行有效的图像重建，从而确保高保真恢复细粒的组织结构。我们在数字病理图像数据集上评估牧师，并将其性能与最新的学术图像压缩（LIC）模型进行比较。实验结果表明，牧师实现了较高的速率（RD）性能，可显着降低存储要求，同时保持高诊断图像质量。我们的研究强调了基于深度学习的压缩在数字病理学中的潜力，促进有效的数据管理和长期存储，同时确保无缝集成到临床工作流程和AI辅助诊断系统中。代码和模型可在以下网址提供：此HTTPS URL。
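
The lifting-scheme transform mentioned in this abstract splits a signal into low- and high-frequency parts and can be inverted exactly. The sketch below uses the classical fixed Haar-style predict and update steps on a 1-D list of even length. CLERIC's lifting scheme is learnable and operates on images, so this shows only the general mechanism, and the function names are illustrative.

```python
def lifting_forward(signal):
    """One level of Haar-style lifting on an even-length list of floats."""
    even = signal[0::2]
    odd = signal[1::2]
    # Predict step: the detail (high-frequency) part is what `even` fails to predict.
    high = [o - e for o, e in zip(odd, even)]
    # Update step: adjust `even` so the low-frequency part holds pairwise means.
    low = [e + h / 2 for e, h in zip(even, high)]
    return low, high

def lifting_inverse(low, high):
    """Undo lifting_forward by reversing the update and predict steps."""
    even = [l - h / 2 for l, h in zip(low, high)]
    odd = [h + e for h, e in zip(high, even)]
    signal = []
    for e, o in zip(even, odd):
        signal.extend([e, o])
    return signal

low, high = lifting_forward([4.0, 6.0, 10.0, 2.0])
restored = lifting_inverse(low, high)
```

Because the inverse simply replays the same steps in reverse order with opposite signs, reconstruction stays exact whatever predict and update functions are used. This property is what allows those steps to be replaced by learned ones.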

Title: MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach

Authors: Xin Zhang, Siting Huang, Xiangyang Luo, Yifan Xie, Weijiang Yu, Heng Chang, Fei Ma, Fei Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23888
Pdf URL: https://arxiv.org/pdf/2503.23888
Copy Paste: [[2503.23888]] MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach(https://arxiv.org/abs/2503.23888)
Keywords: generation
Abstract: Face editing modifies the appearance of a face and plays a key role in the customization and enhancement of personal images. Although many works have achieved remarkable success in text-driven face editing, they still face significant challenges, as none of them simultaneously offers diversity, controllability, and flexibility. To address this challenge, we propose MuseFace, a text-driven face editing framework that relies solely on text prompts to enable face editing. Specifically, MuseFace integrates a Text-to-Mask diffusion model and a semantic-aware face editing model, capable of directly generating fine-grained semantic masks from text and performing face editing. The Text-to-Mask diffusion model provides diversity and flexibility to the framework, while the semantic-aware face editing model ensures its controllability. Our framework can create fine-grained semantic masks, making precise face editing possible and significantly enhancing the controllability and flexibility of face editing models. Extensive experiments demonstrate that MuseFace achieves superior high-fidelity performance.
摘要：面部编辑修改了面部的外观，这在定制和增强个人图像中起着关键作用。尽管许多工作在文本驱动的面部编辑中取得了巨大的成功，但它们仍然面临重大挑战，因为它们都没有同时满足多样性，可控性和灵活性的特征。为了应对这一挑战，我们提出了一个由文本驱动的面部编辑框架的MuseFace，它仅依赖文本提示来启用面部编辑。具体而言，MuseFace集成了文本对掩蔽扩散模型和语义吸引的面部编辑模型，该模型能够直接从文本中直接生成细粒度的语义面具并进行表面编辑。文本对掩码扩散模型提供\ textit {多样性}和\ textit {fortimie}的框架，而语义吸引的面部编辑模型可确保框架的\ textit {可控性}。我们的框架可以创建精细的语义面具，使精确的面部编辑成为可能，并显着增强面部编辑模型的可控性和灵活性。广泛的实验表明Museface可以达到卓越的高保真表现。

Title: DiffScale: Continuous Downscaling and Bias Correction of Subseasonal Wind Speed Forecasts using Diffusion Models

Authors: Maximilian Springenberg, Noelia Otero, Yuxin Xue, Jackie Ma
Subjects: cs.LG, cs.AI, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.23893
Pdf URL: https://arxiv.org/pdf/2503.23893
Copy Paste: [[2503.23893]] DiffScale: Continuous Downscaling and Bias Correction of Subseasonal Wind Speed Forecasts using Diffusion Models(https://arxiv.org/abs/2503.23893)
Keywords: generative
Abstract: Renewable resources are strongly dependent on local and large-scale weather situations. Skillful subseasonal to seasonal (S2S) forecasts (beyond two weeks and up to two months) can offer significant socioeconomic advantages to the energy sector. This study aims to enhance wind speed predictions using a diffusion model with classifier-free guidance to downscale S2S forecasts of surface wind speed. We propose DiffScale, a diffusion model that super-resolves spatial information for continuous downscaling factors and lead times. Leveraging weather priors as guidance for the generative process of diffusion models, we adopt the perspective of conditional probabilities on sampling super-resolved S2S forecasts. We aim to directly estimate the density associated with the target S2S forecasts at different spatial resolutions and lead times without auto-regression or sequence prediction, resulting in an efficient and flexible model. Synthetic experiments were designed to super-resolve wind speed S2S forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) from a coarse resolution to the finer resolution of ERA5 reanalysis data, which serves as a high-resolution target. The innovative aspect of DiffScale lies in its flexibility to downscale by arbitrary scaling factors, enabling it to generalize across various grid resolutions and lead times without retraining the model while correcting model errors, making it a versatile tool for improving S2S wind speed forecasts. We achieve a significant improvement in prediction quality, outperforming baselines up to week 3.
摘要：可再生资源在很大程度上取决于当地和大规模的天气情况。熟练的季节性（S2S）预测（超过两个星期和两个月）可以为能源部门提供重要的社会经济优势。这项研究旨在使用具有无分类器指导的扩散模型来提高风速预测，以降低表面风速的降级S2S预测。我们提出了Diffscale，这是一个扩散模型，该模型是连续缩小因子和交货时间的超级空间信息。利用天气先验作为扩散模型生成过程的指导，我们在抽样超级分辨S2的预测中采用了条件概率的观点。我们的目标是直接估计与目标S2的预测相关的密度，并在没有自动回归或序列预测的情况下，在不同的空间分辨率和交货时间下，导致了有效且灵活的模型。设计合成实验可从欧洲中范围的天气预报中心（ECMWF）进行超级溶解风速S2的预测，从粗分辨率到ERA5重新分析数据的精细分辨率，该数据可作为高分辨率目标。 DIFFSCALE的创新方面在于它的灵活性对下限任意缩放因素，使其能够跨越各种网格分辨率和交货时间 - 在纠正模型错误的同时，使其成为改善S2S风速预测的多功能工具。我们在预测质量方面取得了重大改进，在第3周之前，我们的表现优于基线。

Title: Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

Authors: Qihan Huang, Long Chan, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jingyuan Chen, Chang Yao, Jie Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23905
Pdf URL: https://arxiv.org/pdf/2503.23905
Copy Paste: [[2503.23905]] Boosting MLLM Reasoning with Text-Debiased Hint-GRPO(https://arxiv.org/abs/2503.23905)
Keywords: generation
Abstract: MLLM reasoning has drawn widespread research interest for its excellent problem-solving capability. Current reasoning methods fall into two types: PRM, which supervises the intermediate reasoning steps, and ORM, which supervises the final results. Recently, DeepSeek-R1 has challenged the traditional view that PRM outperforms ORM by demonstrating strong generalization performance using an ORM method (i.e., GRPO). However, current GRPO algorithms for MLLMs still struggle with challenging and complex multimodal reasoning tasks (e.g., mathematical reasoning). In this work, we reveal two problems that impede the performance of GRPO on MLLMs: low data utilization and text-bias. Low data utilization means that GRPO cannot acquire positive rewards to update the MLLM on difficult samples, and text-bias is the phenomenon in which the MLLM bypasses the image condition and relies solely on the text condition for generation after GRPO training. To tackle these problems, this work proposes Hint-GRPO, which improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration, which mitigates text-bias by calibrating the token prediction logits with the image condition at test time. Experimental results on three base MLLMs across eleven datasets demonstrate that our proposed methods advance the reasoning capability of the original MLLMs by a large margin, exhibiting superior performance to existing MLLM reasoning methods. Our code is available at this https URL.
摘要：MLLM推理因其出色的解决问题的能力而引起了广泛的研究。当前的推理方法分为两种类型：PRM，该方法监督了中间推理步骤，而ORM则是监督最终结果。最近，DeepSeek-R1挑战了传统观点，即PRM优于ORM，该观点使用ORM方法（即GRPO）表现出强大的概括性能。但是，当前的MLLM的GRPO算法仍然难以处理具有挑战性且复杂的多模式推理任务（例如，数学推理）。在这项工作中，我们揭示了两个问题，这些问题阻碍了GRPO在MLLM上的性能：低数据利用和文本偏置。低数据利用是指GRPO无法获得正面的奖励来更新困难样本的MLLM，而文本偏置是一种现象，MLLM绕过图像条件，仅依靠文本条件来进行GRPO培训后的生成条件。为了解决这些问题，这项工作提出了提示GRPO，该提示通过适应性地提供了不同难度的样本的提示，并提高了数据利用率，并且文本偏置校准通过校准具有图像时间状态的图像条件来减轻文本偏置。实验在十一数据集的三个基本MLLM上的结果表明，我们提出的方法通过很大的边缘提高了原始MLLM的推理能力，表现出优于现有MLLM推理方法的性能。我们的代码可在此HTTPS URL上找到。
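
The low-data-utilization problem named in this abstract can be made concrete with the group-relative advantage that GRPO computes from outcome rewards. The sketch below assumes the common formulation in which each sampled response's reward is standardized against the mean and standard deviation of its group. The use of the population standard deviation and the `eps` constant are assumptions, and implementations differ on such details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]

# A difficult sample: every sampled response is wrong, so all rewards are zero.
hard_sample = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# An easier sample: one response out of four is correct.
mixed_sample = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

When every response in a group receives the same reward, all advantages are zero and the sample contributes no gradient signal. Under this formulation, that is the situation on difficult samples which adaptive hints are meant to avoid.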

Title: FineCausal: A Causal-Based Framework for Interpretable Fine-Grained Action Quality Assessment

Authors: Ruisheng Han, Kanglei Zhou, Amir Atapour-Abarghouei, Xiaohui Liang, Hubert P. H. Shum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23911
Pdf URL: https://arxiv.org/pdf/2503.23911
Copy Paste: [[2503.23911]] FineCausal: A Causal-Based Framework for Interpretable Fine-Grained Action Quality Assessment(https://arxiv.org/abs/2503.23911)
Keywords: quality assessment
Abstract: Action quality assessment (AQA) is critical for evaluating athletic performance, informing training strategies, and ensuring safety in competitive sports. However, existing deep learning approaches often operate as black boxes and are vulnerable to spurious correlations, limiting both their reliability and interpretability. In this paper, we introduce FineCausal, a novel causal-based framework that achieves state-of-the-art performance on the FineDiving-HM dataset. Our approach leverages a Graph Attention Network-based causal intervention module to disentangle human-centric foreground cues from background confounders, and incorporates a temporal causal attention module to capture fine-grained temporal dependencies across action stages. This dual-module strategy enables FineCausal to generate detailed spatio-temporal representations that not only achieve state-of-the-art scoring performance but also provide transparent, interpretable feedback on which features drive the assessment. Despite its strong performance, FineCausal requires extensive expert knowledge to define causal structures and depends on high-quality annotations, challenges that we discuss and address as future research directions. Code is available at this https URL.
摘要：动作质量评估（AQA）对于评估运动表现，为培训策略提供信息以及确保竞争性运动的安全至关重要。但是，现有的深度学习方法通常是黑匣子，并且容易受到虚假相关性的影响，从而限制了它们的可靠性和解释性。在本文中，我们介绍了Finecausal，这是一种基于因果关系的新型框架，可在罚款HM数据集中实现最先进的性能。我们的方法利用基于图形的因果干预模块将以人为中心的前景线索与背景混杂因素解开，并结合了一个暂时的因果注意模块，以捕获跨动作阶段的细粒度的时间依赖性。这种双模型策略使FineCausal能够生成详细的时空表示，不仅实现了最先进的评分性能，而且还提供了透明，可解释的反馈，可以通过哪些功能推动评估。尽管表现出色，但Finecausal仍需要广泛的专家知识来定义因果结构，并取决于高质量的注释，这是我们讨论和解决的挑战，作为未来的研究方向。代码可在此HTTPS URL上找到。

Title: Green MLOps to Green GenOps: An Empirical Study of Energy Consumption in Discriminative and Generative AI Operations

Authors: Adrián Sánchez-Mompó, Ioannis Mavromatis, Peizheng Li, Konstantinos Katsaros, Aftab Khan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23934
Pdf URL: https://arxiv.org/pdf/2503.23934
Copy Paste: [[2503.23934]] Green MLOps to Green GenOps: An Empirical Study of Energy Consumption in Discriminative and Generative AI Operations(https://arxiv.org/abs/2503.23934)
Keywords: generative
Abstract: This study presents an empirical investigation into the energy consumption of Discriminative and Generative AI models within real-world MLOps pipelines. For Discriminative models, we examine various architectures and hyperparameters during training and inference and identify energy-efficient practices. For Generative AI, Large Language Models (LLMs) are assessed, focusing primarily on energy consumption across different model sizes and varying service requests. Our study employs software-based power measurements, ensuring ease of replication across diverse configurations, models, and datasets. We analyse multiple models and hardware setups to uncover correlations among various metrics, identifying key contributors to energy consumption. The results indicate that for Discriminative models, optimising architectures, hyperparameters, and hardware can significantly reduce energy consumption without sacrificing performance. For LLMs, energy efficiency depends on balancing model size, reasoning complexity, and request-handling capacity, as larger models do not necessarily consume more energy when utilisation remains low. This analysis provides practical guidelines for designing green and sustainable ML operations, emphasising energy consumption and carbon footprint reductions while maintaining performance. This paper can serve as a benchmark for accurately estimating total energy use across different types of AI models.
摘要：这项研究介绍了对现实世界中MLOPS管道中判别和生成AI模型的能源消耗的实证研究。对于判别模型，我们在训练和推理过程中检查了各种体系结构和超参数，并确定节能实践。对于生成的AI，评估了大型语言模型（LLM），主要关注不同模型尺寸和各种服务请求的能源消耗。我们的研究采用了基于软件的功率测量，可确保跨不同配置，模型和数据集的复制性。我们分析了多种模型和硬件设置，以发现各种指标之间的相关性，从而确定了能源消耗的关键因素。结果表明，对于判别模型，优化体系结构，超参数和硬件可以大大减少能耗而无需牺牲性能。对于LLM，能源效率取决于平衡模型大小，推理复杂性和请求处理能力，因为当利用率保持较低时，较大的模型不一定会消耗更多的能量。该分析提供了设计绿色和可持续的ML操作的实用指南，强调能源消耗和碳足迹减少，同时保持性能。本文可以用作准确估计不同类型AI模型的总能源使用的基准。

Title: JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation

Authors: Fangda Chen, Shanshan Zhao, Chuanfu Xu, Long Lan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23951
Pdf URL: https://arxiv.org/pdf/2503.23951
Copy Paste: [[2503.23951]] JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation(https://arxiv.org/abs/2503.23951)
Keywords: generation
Abstract: Recent text-to-video advancements have enabled coherent video synthesis from prompts and expanded to fine-grained control over appearance and motion. However, existing methods either suffer from concept interference due to feature domain mismatch caused by naive decoupled optimizations or exhibit appearance contamination induced by spatial feature leakage resulting from the entanglement of motion and appearance in reference video reconstructions. In this paper, we propose JointTuner, a novel adaptive joint training framework, to alleviate these issues. Specifically, we develop Adaptive LoRA, which incorporates a context-aware gating mechanism, and integrate the gated LoRA components into the spatial and temporal Transformers within the diffusion model. These components enable simultaneous optimization of appearance and motion, eliminating concept interference. In addition, we introduce the Appearance-independent Temporal Loss, which decouples motion patterns from intrinsic appearance in reference video reconstructions through an appearance-agnostic noise prediction task. The key innovation lies in adding frame-wise offset noise to the ground-truth Gaussian noise, perturbing its distribution, thereby disrupting spatial attributes associated with frames while preserving temporal coherence. Furthermore, we construct a benchmark comprising 90 appearance-motion customized combinations and 10 multi-type automatic metrics across four dimensions, facilitating a more comprehensive evaluation for this customization task. Extensive experiments demonstrate the superior performance of our method compared to current advanced approaches.
摘要：最近的文本到视频进步已从提示中启用了连贯的视频综合，并扩展到对外观和运动的细粒度控制。但是，现有方法要么受到特征域不匹配而引起的概念干扰，要么是由幼稚脱钩的优化引起的，或者由于参考视频重建中的运动和外观而引起的空间特征泄漏引起的外观污染。在本文中，我们提出了一种新型的自适应联合培训框架联合塔纳，以减轻这些问题。具体而言，我们开发了自适应洛拉，该洛拉结合了上下文感知的门控机制，并将封闭式的Lora组件整合到扩散模型中的空间和颞型变压器中。这些组件能够同时优化外观和运动，从而消除了概念干扰。此外，我们介绍了与外观无关的时间损失，该损失将运动模式从参考视频重建中的固有外观通过外观敏锐的噪声预测任务解除。关键的创新在于将框架的偏移噪声添加到地面高斯噪声中，使其分布扰动，从而破坏了与帧相关的空间属性，同时保留了时间连贯性。此外，我们在四个维度上构建了一个基准，其中包括90个外观定制组合和10个多类型自动指标，从而促进了对此定制任务的更全面评估。与当前的高级方法相比，广泛的实验证明了我们方法的出色性能。
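
The frame-wise offset noise described in this abstract can be sketched as follows. The assumptions here are that the ground-truth noise is stored as `noise[frame][position]`, and that the offset is a single Gaussian scalar per frame shared by all positions in that frame. The `strength` parameter and the function name are hypothetical, and the abstract does not give the exact form of the offset.

```python
import random

def add_framewise_offset(noise, strength=0.1, rng=None):
    """Add one shared random offset to every value within each frame of noise."""
    rng = rng or random.Random()
    perturbed = []
    for frame in noise:
        offset = rng.gauss(0.0, strength)  # one scalar per frame (assumption)
        perturbed.append([value + offset for value in frame])
    return perturbed

# Hypothetical ground-truth Gaussian noise: 3 frames with 4 positions each.
rng = random.Random(0)
base_noise = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
target_noise = add_framewise_offset(base_noise, strength=0.1, rng=rng)
```

In this sketch each frame's values shift by a common amount, so differences between positions within a frame are unchanged while the per-frame mean is perturbed independently across frames.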

Title: AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference

Authors: Kai Huang, Hao Zou, Bochen Wang, Ye Xi, Zhen Xie, Hao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23956
Pdf URL: https://arxiv.org/pdf/2503.23956
Copy Paste: [[2503.23956]] AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference(https://arxiv.org/abs/2503.23956)
Keywords: generation
Abstract: Recent advancements in Large Visual Language Models (LVLMs) have gained significant attention due to their remarkable reasoning capabilities and proficiency in generalization. However, processing a large number of visual tokens and generating long-context outputs impose substantial computational overhead, leading to excessive demands on the key-value (KV) cache. To address this critical bottleneck, we propose AirCache, a novel KV cache compression method aimed at accelerating LVLM inference. This work systematically investigates the correlations between visual and textual tokens within the attention mechanisms of LVLMs. Our empirical analysis reveals considerable redundancy in cached visual tokens, wherein strategically eliminating these tokens preserves model performance while significantly accelerating context generation. Inspired by these findings, we introduce an elite observation window for assessing the importance of visual components in the KV cache, focusing on stable inter-modal relevancy modeling with enhanced multi-perspective consistency. Additionally, we develop an adaptive layer-wise budget allocation strategy that capitalizes on the strength and skewness of token importance distribution, showcasing superior efficiency compared to uniform allocation. Comprehensive evaluations across multiple LVLMs and benchmarks demonstrate that our method achieves comparable performance to the full cache while retaining only 10% of the visual KV cache, thereby reducing decoding latency by 29% to 66% across various batch sizes and input prompt lengths. Notably, as cache retention rates decrease, our method exhibits increasing performance advantages over existing approaches.
摘要：大型视觉语言模型（LVLM）的最新进展由于其出色的推理能力和概括性而引起了重大关注。但是，处理大量的视觉令牌并生成长篇小说输出施加了大量的计算开销，从而导致对密钥值（KV）缓存的过度需求。为了解决这个关键的瓶颈，我们提出了Aircache，这是一种旨在加速LVLMS推断的新型KV缓存压缩方法。这项工作系统地研究了LVLMS的注意机制中的视觉和文本令牌之间的相关性。我们的经验分析揭示了缓存的视觉令牌的大量冗余，其中从策略上消除这些令牌可以保留模型性能，同时显着加速了上下文的生成。受这些发现的启发，我们引入了一个精英观察窗口，用于评估KV缓存中视觉组件的重要性，重点是稳定的模式间相关模型，具有增强的多观点一致性。此外，我们制定了一种自适应层的预算分配策略，该策略利用了代币重要性分布的强度和偏度，与统一分配相比，提高了效率的效率。对多个LVLM和基准进行的全面评估表明，我们的方法可以达到与完整缓存的可比性，同时仅保留了10％的视觉KV缓存，从而在各种批次大小和及时的输入长度中将解码潜伏期降低了29％至66％。值得注意的是，随着缓存保留率的降低，我们的方法表现出比现有方法提高的性能优势。
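
The core operation behind the KV cache compression described in this abstract is keeping only the most important cached visual tokens. A minimal sketch, assuming each visual token already has a scalar importance score, is shown below. How AirCache derives those scores (the elite observation window) and how it sets per-layer budgets are not reproduced, and `prune_visual_kv` and `keep_ratio` are illustrative names.

```python
def prune_visual_kv(scores, keep_ratio=0.1):
    """Return the positions of the visual tokens to keep, in original order.

    scores: one importance score per cached visual token (higher = more important).
    keep_ratio: fraction of tokens to retain; at least one token is always kept.
    """
    budget = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the retained cache entries stay aligned.
    return sorted(ranked[:budget])

# Hypothetical importance scores for ten cached visual tokens.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.0, 0.4, 0.6]
kept_positions = prune_visual_kv(scores, keep_ratio=0.3)
```

A layer-wise allocation strategy such as the one described in the abstract would call a function like this with a different `keep_ratio` for each layer instead of a uniform one.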

Title: Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning

Authors: Bizhe Bai, Jianjian Cao, Yadan Luo, Tao Che
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23959
Pdf URL: https://arxiv.org/pdf/2503.23959
Copy Paste: [[2503.23959]] Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning(https://arxiv.org/abs/2503.23959)
Keywords: generation
Abstract: Grounded Conversation Generation (GCG) is an emerging vision-language task that requires models to generate natural language responses seamlessly intertwined with corresponding object segmentation masks. Recent models, such as GLaMM and OMG-LLaVA, achieve pixel-level grounding but incur significant computational costs due to processing a large number of visual tokens. Existing token pruning methods, like FastV and PyramidDrop, fail to preserve the local visual features critical for accurate grounding, leading to substantial performance drops in GCG tasks. To address this, we propose Adaptive Local-Aware Token Pruning (ALTP), a simple yet effective framework that accelerates GCG models by prioritizing local object information. ALTP introduces two key components: (1) Detail Density Capture (DDC), which uses superpixel segmentation to retain tokens in object-centric regions, preserving fine-grained details, and (2) Dynamic Density Formation (DDF), which dynamically allocates tokens based on information density, ensuring higher retention in semantically rich areas. Extensive experiments on the GranDf dataset demonstrate that ALTP significantly outperforms existing token pruning methods, such as FastV and PyramidDrop, on both GLaMM and OMG-LLaVA models. Notably, when applied to GLaMM, ALTP achieves a 90% reduction in visual tokens with a 4.9% improvement in AP50 and a 5.0% improvement in Recall compared to PyramidDrop. Similarly, on OMG-LLaVA, ALTP improves AP by 2.1% and mIOU by 3.0% at a 90% token reduction compared with PDrop.
摘要：扎根的对话生成（GCG）是一项新兴的视觉语言任务，需要模型才能与相应的对象分割掩码无缝地交织在一起的自然语言响应。最近的模型，例如Glamm和Omg-llava，由于处理了大量的视觉令牌而获得了像素级接地，但由于处理了大量的视觉令牌而产生了巨大的计算成本。现有的令牌修剪方法，例如FASTV和PYRAMIDDROP，无法保留对准确接地至关重要的局部视觉特征，从而导致GCG任务的大量性能下降。为了解决这个问题，我们提出了自适应的本地感知令牌修剪（ALTP），这是一个简单而有效的框架，通过优先考虑本地对象信息来加速GCG模型。 ALTP介绍了两个关键组件：（1）详细密度捕获（DDC），该详细信息捕获（DDC），该组件使用超级像素分割来保留以对象为中心区域的令牌，保留细粒细节，以及（2）动态密度形成（DDF），该动态密度形成（DDF），该动态分配基于信息密度，以确保更高的信息密度，从而在较高的语言上具有较高的语言疗法较高的疗法较高的水平。在GrandF数据集上进行的广泛实验表明，ALTP在Glamm和Omg-llava模型上都显着胜过现有的令牌修剪方法，例如FastV和PyramidDrop。值得注意的是，当应用于Glamm时，ALTP的视觉令牌降低了90％，与金字塔式电视相比，AP50提高了4.9％，召回率提高了5.0％。同样，与PDROP相比，在OMG-LALAVA上，ALTP以90％的令牌减少了AP，并将MIOU提高3.0％。

Title: DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model

Authors: Ming Yuan, Sichao Wang, Chuang Zhang, Lei He, Qing Xu, Jianqiang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23993
Pdf URL: https://arxiv.org/pdf/2503.23993
Copy Paste: [[2503.23993]] DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model(https://arxiv.org/abs/2503.23993)
Keywords: generation
Abstract: The depth completion task is a critical problem in autonomous driving, involving the generation of dense depth maps from sparse depth maps and RGB images. Most existing methods employ a spatial propagation network to iteratively refine the depth map after obtaining an initial dense depth. In this paper, we propose DenseFormer, a novel method that integrates the diffusion model into the depth completion task. By incorporating the denoising mechanism of the diffusion model, DenseFormer generates the dense depth map by progressively refining an initial random depth distribution through multiple iterations. We propose a feature extraction module that leverages a feature pyramid structure, along with multi-layer deformable attention, to effectively extract and integrate features from sparse depth maps and RGB images, which serve as the guiding condition for the diffusion process. Additionally, this paper presents a depth refinement module that applies multi-step iterative refinement across various ranges to the dense depth results generated by the diffusion process. The module utilizes image features enriched with multi-scale information and sparse depth input to further enhance the accuracy of the predicted depth map. Extensive experiments on the KITTI outdoor scene dataset demonstrate that DenseFormer outperforms classical depth completion methods.

Title: HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

Authors: Boyuan Wang, Xiaofeng Wang, Chaojun Ni, Guosheng Zhao, Zhiqin Yang, Zheng Zhu, Muyang Zhang, Yukun Zhou, Xinze Chen, Guan Huang, Lihong Liu, Xingang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24026
Pdf URL: https://arxiv.org/pdf/2503.24026
Copy Paste: [[2503.24026]] HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation(https://arxiv.org/abs/2503.24026)
Keywords: generation
Abstract: Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on the dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, we introduce a novel LAMA loss. Together, these contributions improve FID by 62.4% and R-precision for top1, top2, and top3 by 41.8%, 26.3%, and 18.3%, respectively, advancing both Text-to-Pose control accuracy and FID. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-3D motion lifting.

Title: TransMamba: Flexibly Switching between Transformer and Mamba

Authors: Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.24067
Pdf URL: https://arxiv.org/pdf/2503.24067
Copy Paste: [[2503.24067]] TransMamba: Flexibly Switching between Transformer and Mamba(https://arxiv.org/abs/2503.24067)
Keywords: generation
Abstract: Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx), and thus could dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design the Memory converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at TransPoints where the transformation happens. The TransPoint scheduling is also thoroughly explored for further improvements. We conducted extensive experiments demonstrating that TransMamba achieves superior training efficiency and performance compared to baselines, and validated the deeper consistency between Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.

Title: DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description

Authors: Adrienne Deganutti, Simon Hadfield, Andrew Gilbert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24096
Pdf URL: https://arxiv.org/pdf/2503.24096
Copy Paste: [[2503.24096]] DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description(https://arxiv.org/abs/2503.24096)
Keywords: generation
Abstract: Audio Description is a narrated commentary designed to aid vision-impaired audiences in perceiving key visual elements in a video. While short-form video understanding has advanced rapidly, maintaining coherent long-term visual storytelling remains an unsolved problem. Existing methods rely solely on frame-level embeddings, effectively describing object-based content but lacking contextual information across scenes. We introduce DANTE-AD, an enhanced video description model leveraging a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses both frame and scene level embeddings to improve long-term contextual understanding. We propose a novel, state-of-the-art method for sequential cross-attention to achieve contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods across traditional NLP metrics and LLM-based evaluations.

Title: Level the Level: Balancing Game Levels for Asymmetric Player Archetypes With Reinforcement Learning

Authors: Florian Rupp, Kai Eckert
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.24099
Pdf URL: https://arxiv.org/pdf/2503.24099
Copy Paste: [[2503.24099]] Level the Level: Balancing Game Levels for Asymmetric Player Archetypes With Reinforcement Learning(https://arxiv.org/abs/2503.24099)
Keywords: generation
Abstract: Balancing games, especially those with asymmetric multiplayer content, requires significant manual effort and extensive human playtesting during development. For this reason, this work focuses on generating balanced levels tailored to asymmetric player archetypes, where the disparity in abilities is balanced entirely through the level design. For instance, while one archetype may have an advantage over another, both should have an equal chance of winning. We therefore conceptualize game balancing as a procedural content generation problem and build on and extend a recently introduced method that uses reinforcement learning to balance tile-based game levels. We evaluate the method on four different player archetypes and demonstrate its ability to balance a larger proportion of levels compared to two baseline approaches. Furthermore, our results indicate that as the disparity between player archetypes increases, the required number of training steps grows, while the model's accuracy in achieving balance decreases.

Title: Learning a Canonical Basis of Human Preferences from Binary Ratings

Authors: Kailas Vodrahalli, Wei Wei, James Zou
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.24150
Pdf URL: https://arxiv.org/pdf/2503.24150
Copy Paste: [[2503.24150]] Learning a Canonical Basis of Human Preferences from Binary Ratings(https://arxiv.org/abs/2503.24150)
Keywords: generative
Abstract: Recent advances in generative AI have been driven by alignment techniques such as reinforcement learning from human feedback (RLHF). RLHF and related techniques typically involve constructing a dataset of binary or ranked choice human preferences and subsequently fine-tuning models to align with these preferences. This paper shifts the focus to understanding the preferences encoded in such datasets and identifying common human preferences. We find that a small subset of 21 preference categories (selected from a set of nearly 5,000 distinct preferences) captures >89% of preference variation across individuals. This small set of preferences is analogous to a canonical basis of human preferences, similar to established findings that characterize human variation in psychology or facial recognition studies. Through both synthetic and empirical evaluations, we confirm that our low-rank, canonical set of human preferences generalizes across the entire dataset and within specific topics. We further demonstrate our preference basis' utility in model evaluation, where our preference categories offer deeper insights into model alignment, and in model training, where we show that fine-tuning on preference-defined subsets successfully aligns the model accordingly.

Title: Predicting Targeted Therapy Resistance in Non-Small Cell Lung Cancer Using Multimodal Machine Learning

Authors: Peiying Hua, Andrea Olofson, Faraz Farhadi, Liesbeth Hondelink, Gregory Tsongalis, Konstantin Dragnev, Dagmar Hoegemann Savellano, Arief Suriawinata, Laura Tafe, Saeed Hassanpour
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.24165
Pdf URL: https://arxiv.org/pdf/2503.24165
Copy Paste: [[2503.24165]] Predicting Targeted Therapy Resistance in Non-Small Cell Lung Cancer Using Multimodal Machine Learning(https://arxiv.org/abs/2503.24165)
Keywords: generation
Abstract: Lung cancer is the primary cause of cancer death globally, with non-small cell lung cancer (NSCLC) emerging as its most prevalent subtype. Among NSCLC patients, approximately 32.3% have mutations in the epidermal growth factor receptor (EGFR) gene. Osimertinib, a third-generation EGFR-tyrosine kinase inhibitor (TKI), has demonstrated remarkable efficacy in the treatment of NSCLC patients with activating and T790M resistance EGFR mutations. Despite its established efficacy, drug resistance poses a significant challenge for patients to fully benefit from osimertinib. The absence of a standard tool to accurately predict TKI resistance, including that of osimertinib, remains a critical obstacle. To bridge this gap, in this study, we developed an interpretable multimodal machine learning model designed to predict patient resistance to osimertinib among late-stage NSCLC patients with activating EGFR mutations, achieving a c-index of 0.82 on a multi-institutional dataset. This machine learning model harnesses readily available data routinely collected during patient visits and medical assessments to facilitate precision lung cancer management and informed treatment decisions. By integrating various data types such as histology images, next generation sequencing (NGS) data, demographics data, and clinical records, our multimodal model can generate well-informed recommendations. Our experiment results also demonstrated the superior performance of the multimodal model over single modality models (c-index 0.82 compared with 0.75 and 0.77), thus underscoring the benefit of combining multiple modalities in patient outcome prediction.
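
Note on the metric: for readers unfamiliar with the c-index reported above (0.82 for the multimodal model versus 0.75 and 0.77 for single-modality models), the following is a minimal Python sketch of Harrell's concordance index for right-censored data. It is not the authors' evaluation code. The function name, the toy data, and the choice to skip pairs with tied times are illustrative assumptions.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's c-index for right-censored data (illustrative sketch).

    times  : observed time for each patient
    events : 1 if the event (e.g. resistance) was observed, 0 if censored
    risks  : predicted risk score; higher means an earlier event is expected
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Assumption: skip tied times; also skip pairs whose earlier
        # patient was censored, because their true order is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier event received the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one is censored.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of comparable patient pairs.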

Title: Many-to-Many Matching via Sparsity Controlled Optimal Transport

Authors: Weijie Liu, Han Bao, Makoto Yamada, Zenan Huang, Nenggan Zheng, Hui Qian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.24204
Pdf URL: https://arxiv.org/pdf/2503.24204
Copy Paste: [[2503.24204]] Many-to-Many Matching via Sparsity Controlled Optimal Transport(https://arxiv.org/abs/2503.24204)
Keywords: generation
Abstract: Many-to-many matching seeks to match multiple points in one set and multiple points in another set, which is a basis for a wide range of data mining problems. It can be naturally recast in the framework of Optimal Transport (OT). However, existing OT methods either lack the ability to accomplish many-to-many matching or necessitate careful tuning of a regularization parameter to achieve satisfactory results. This paper proposes a novel many-to-many matching method to explicitly encode many-to-many constraints while preventing the degeneration into one-to-one matching. The proposed method consists of the following two components. The first component is the matching budget constraints on each row and column of a transport plan, which specify how many points can be matched to a point at most. The second component is the deformed $q$-entropy regularization, which encourages a point to meet the matching budget maximally. While the deformed $q$-entropy was initially proposed to sparsify a transport plan, we employ it to avoid the degeneration into one-to-one matching. We optimize the objective via a penalty algorithm, which is efficient and theoretically guaranteed to converge. Experimental results on various tasks demonstrate that the proposed method achieves good performance by gleaning meaningful many-to-many matchings.

Title: Pre-training with 3D Synthetic Data: Learning 3D Point Cloud Instance Segmentation from 3D Synthetic Scenes

Authors: Daichi Otsuka, Shinichi Mae, Ryosuke Yamada, Hirokatsu Kataoka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24229
Pdf URL: https://arxiv.org/pdf/2503.24229
Copy Paste: [[2503.24229]] Pre-training with 3D Synthetic Data: Learning 3D Point Cloud Instance Segmentation from 3D Synthetic Scenes(https://arxiv.org/abs/2503.24229)
Keywords: generation, generative
Abstract: In recent years, the research community has witnessed growing use of 3D point cloud data owing to its high applicability in various real-world applications. A 3D point cloud makes it possible to take actual size into account and supports spatial understanding. Application fields include the mechanical control of robots, vehicles, and other real-world systems. Along this line, we aim to improve 3D point cloud instance segmentation, which has emerged as a particularly promising approach for these applications. However, creating 3D point cloud datasets entails enormous costs compared to 2D image datasets. To train a 3D point cloud instance segmentation model, it is necessary not only to assign categories but also to provide detailed annotations for each point in the large-scale 3D space. Meanwhile, the recent increase in generative models for the 3D domain has spurred proposals for using a generative model to create 3D point cloud data. In this work, we propose pre-training with 3D synthetic data to train a 3D point cloud instance segmentation model, based on a generative model for 3D scenes represented by point cloud data. We directly generate 3D point cloud data with Point-E and insert the generated data into a 3D scene. Although other accurate 3D generation models have appeared more recently, as of 2025, even Point-E, an early 3D generative model, can effectively support pre-training with 3D synthetic data. In the experimental section, we compare our pre-training method with baseline methods; the results indicate improved performance, demonstrating the efficacy of 3D generative models for 3D point cloud instance segmentation.

Title: Beyond a Single Mode: GAN Ensembles for Diverse Medical Data Generation

Authors: Lorenzo Tronchin, Tommy Löfstedt, Paolo Soda, Valerio Guarrasi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.24258
Pdf URL: https://arxiv.org/pdf/2503.24258
Copy Paste: [[2503.24258]] Beyond a Single Mode: GAN Ensembles for Diverse Medical Data Generation(https://arxiv.org/abs/2503.24258)
Keywords: generation, generative
Abstract: The advancement of generative AI, particularly in medical imaging, confronts the trilemma of ensuring high fidelity, diversity, and efficiency in synthetic data generation. While Generative Adversarial Networks (GANs) have shown promise across various applications, they still face challenges like mode collapse and insufficient coverage of real data distributions. This work explores the use of GAN ensembles to overcome these limitations, specifically in the context of medical imaging. By solving a multi-objective optimisation problem that balances fidelity and diversity, we propose a method for selecting an optimal ensemble of GANs tailored for medical data. The selected ensemble is capable of generating diverse synthetic medical images that are representative of true data distributions and computationally efficient. Each model in the ensemble brings a unique contribution, ensuring minimal redundancy. We conducted a comprehensive evaluation using three distinct medical datasets, testing 22 different GAN architectures with various loss functions and regularisation techniques. By sampling models at different training epochs, we crafted 110 unique configurations. The results highlight the capability of GAN ensembles to enhance the quality and utility of synthetic medical images, thereby improving the efficacy of downstream tasks such as diagnostic modelling.

Title: FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics

Authors: Yixuan Li, Yu Tian, Yipo Huang, Wei Lu, Shiqi Wang, Weisi Lin, Anderson Rocha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24267
Pdf URL: https://arxiv.org/pdf/2503.24267
Copy Paste: [[2503.24267]] FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics(https://arxiv.org/abs/2503.24267)
Keywords: generation, generative
Abstract: The rapid and unrestrained advancement of generative artificial intelligence (AI) presents a double-edged sword: while enabling unprecedented creativity, it also facilitates the generation of highly convincing deceptive content, undermining societal trust. As image generation techniques become increasingly sophisticated, detecting synthetic images is no longer just a binary task: it necessitates interpretable, context-aware methodologies that enhance trustworthiness and transparency. However, existing detection models primarily focus on classification, offering limited explanatory insights into image authenticity. In this work, we propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics, which not only identifies AI-synthetic images with high accuracy but also provides rich, interpretable, and query-driven forensic insights. We first construct the FakeChain dataset, which contains linguistic authenticity reasoning based on visual trace evidence, developed through a novel human-machine collaborative framework. Building upon it, we further present FakeInstruct, the largest multimodal instruction tuning dataset containing 2 million visual instructions tailored to enhance forensic awareness in LMMs. FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios. It can distinguish synthetic images with high accuracy while offering coherent and insightful explanations, free-form discussions on fine-grained forgery attributes, and actionable enhancement strategies. Notably, despite being trained exclusively on qualitative hard labels, FakeScope demonstrates remarkable zero-shot quantitative capability on detection, enabled by our proposed token-based probability estimation strategy. Furthermore, FakeScope exhibits strong generalization and in-the-wild ability, ensuring its applicability in real-world scenarios.
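
Note on token-based probability estimation: the abstract does not spell out this strategy. One plausible reading, sketched below purely as an assumption, is to take the model's logits for the candidate answer tokens at the answer position and renormalize them with a softmax, so that a model trained on hard labels still yields a continuous score. The token strings "fake" and "real", the function name, and the logit values are hypothetical and are not taken from the paper.

```python
import math


def fake_probability(answer_logits, fake_token="fake", real_token="real"):
    """Softmax restricted to two candidate answer tokens (hypothetical names).

    answer_logits maps token strings to the logits a language model assigns
    at the position where it would answer "real" or "fake".
    """
    z_fake = answer_logits[fake_token]
    z_real = answer_logits[real_token]
    m = max(z_fake, z_real)          # subtract the max for numerical stability
    e_fake = math.exp(z_fake - m)
    e_real = math.exp(z_real - m)
    return e_fake / (e_fake + e_real)


# Hypothetical logits; all other vocabulary entries are ignored.
logits = {"fake": 1.2, "real": -0.3, "the": 0.5}
print(round(fake_probability(logits), 3))
```

Restricting the softmax to the two answer tokens keeps the score independent of how much probability mass the model places on unrelated vocabulary entries.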

Title: Visual Acoustic Fields

Authors: Yuelei Li, Hyunjin Kim, Fangneng Zhan, Ri-Zhao Qiu, Mazeyu Ji, Xiaojun Shan, Xueyan Zou, Paul Liang, Hanspeter Pfister, Xiaolong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.24270
Pdf URL: https://arxiv.org/pdf/2503.24270
Copy Paste: [[2503.24270]] Visual Acoustic Fields(https://arxiv.org/abs/2503.24270)
Keywords: generation
Abstract: Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources.

Title: Learning Velocity and Acceleration: Self-Supervised Motion Consistency for Pedestrian Trajectory Prediction

Authors: Yizhou Huang, Yihua Cheng, Kezhi Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.24272
Pdf URL: https://arxiv.org/pdf/2503.24272
Copy Paste: [[2503.24272]] Learning Velocity and Acceleration: Self-Supervised Motion Consistency for Pedestrian Trajectory Prediction(https://arxiv.org/abs/2503.24272)
Keywords: generation
Abstract: Understanding human motion is crucial for accurate pedestrian trajectory prediction. Conventional methods typically rely on supervised learning, where ground-truth labels are directly optimized against predicted trajectories. This amplifies the limitations caused by long-tailed data distributions, making it difficult for the model to capture abnormal behaviors. In this work, we propose a self-supervised pedestrian trajectory prediction framework that explicitly models position, velocity, and acceleration. We leverage velocity and acceleration information to enhance position prediction through feature injection and a self-supervised motion consistency mechanism. Our model hierarchically injects velocity features into the position stream. Acceleration features are injected into the velocity stream. This enables the model to predict position, velocity, and acceleration jointly. From the predicted position, we compute corresponding pseudo velocity and acceleration, allowing the model to learn from data-generated pseudo labels and thus achieve self-supervised learning. We further design a motion consistency evaluation strategy grounded in physical principles; it selects the most reasonable predicted motion trend by comparing it with historical dynamics and uses this trend to guide and constrain trajectory generation. We conduct experiments on the ETH-UCY and Stanford Drone datasets, demonstrating that our method achieves state-of-the-art performance on both datasets.
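
Note on pseudo labels: the abstract states that pseudo velocity and acceleration are computed from the predicted positions. A minimal sketch of one way to do this, assuming first-order forward differences over a fixed time step, is shown below. The function names, the default `dt`, and the toy trajectory are assumptions and may differ from the authors' implementation.

```python
def finite_difference(seq, dt=1.0):
    """First-order forward difference of a sequence of (x, y) tuples."""
    return [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(seq, seq[1:])]


def pseudo_labels(positions, dt=1.0):
    """Derive pseudo velocity and acceleration from predicted positions."""
    velocity = finite_difference(positions, dt)      # one entry shorter than positions
    acceleration = finite_difference(velocity, dt)   # two entries shorter than positions
    return velocity, acceleration


# Toy predicted trajectory (hypothetical values).
predicted = [(0.0, 0.0), (1.0, 0.5), (3.0, 1.0), (6.0, 1.5)]
vel, acc = pseudo_labels(predicted)
print(vel)
print(acc)
```

Because these quantities are derived from the model's own position output, they can be compared with the separately predicted velocity and acceleration streams without extra annotation.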

Title: Style Quantization for Data-Efficient GAN Training

Authors: Jian Wang, Xin Lan, Jizhe Zhou, Yuxin Tian, Jiancheng Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24282
Pdf URL: https://arxiv.org/pdf/2503.24282
Copy Paste: [[2503.24282]] Style Quantization for Data-Efficient GAN Training(https://arxiv.org/abs/2503.24282)
Keywords: generation
Abstract: In limited-data settings, GANs often struggle to navigate and effectively exploit the input latent space. Consequently, images generated from adjacent variables in a sparse input latent space may exhibit significant discrepancies in realism, leading to suboptimal consistency regularization (CR) outcomes. To address this, we propose SQ-GAN, a novel approach that enhances CR by introducing a style space quantization scheme. This method transforms the sparse, continuous input latent space into a compact, structured discrete proxy space, allowing each element to correspond to a specific real data point, thereby improving CR performance. Instead of direct quantization, we first map the input latent variables into a less entangled "style" space and apply quantization using a learnable codebook. This enables each quantized code to control distinct factors of variation. Additionally, we optimize the optimal transport distance to align the codebook codes with features extracted from the training data by a foundation model, embedding external knowledge into the codebook and establishing a semantically rich vocabulary that properly describes the training dataset. Extensive experiments demonstrate significant improvements in both discriminator robustness and generation quality with our method.
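
Note on codebook quantization: the lookup step of quantizing a style vector against a codebook can be sketched as a nearest-neighbor search. The example below covers only that lookup under squared Euclidean distance. How the codebook is learned and how it is aligned with foundation-model features via optimal transport are omitted. The function name, the distance choice, and the toy codebook are assumptions, not the authors' code.

```python
def quantize(style_vector, codebook):
    """Return (index, code) of the nearest codebook entry (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda k: sq_dist(style_vector, codebook[k]))
    return index, codebook[index]


# Toy 2-D codebook with three entries (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
idx, code = quantize((0.9, 1.2), codebook)
print(idx, code)
```

Every continuous style vector is thus replaced by one of a finite set of codes, which is what turns the sparse continuous space into a compact discrete proxy space.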

Title: PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks

Authors: Fang Yan, Jianfeng Wu, Jiawen Li, Wei Wang, Jiaxuan Lu, Wen Chen, Zizhao Gao, Jianan Li, Hong Yan, Jiabo Ma, Minda Chen, Yang Lu, Qing Chen, Yizhi Wang, Xitong Ling, Xuenian Wang, Zihan Wang, Qiang Huang, Shengyi Hua, Mianxin Liu, Lei Ma, Tian Shen, Xiaofan Zhang, Yonghong He, Hao Chen, Shaoting Zhang, Zhe Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24345
Pdf URL: https://arxiv.org/pdf/2503.24345
Copy Paste: [[2503.24345]] PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks(https://arxiv.org/abs/2503.24345)
Keywords: generation
Abstract: The complexity and variability inherent in high-resolution pathological images present significant challenges in computational pathology. While pathology foundation models leveraging AI have catalyzed transformative advancements, their development demands large-scale datasets, considerable storage capacity, and substantial computational resources. Furthermore, ensuring their clinical applicability and generalizability requires rigorous validation across a broad spectrum of clinical tasks. Here, we present PathOrchestra, a versatile pathology foundation model trained via self-supervised learning on a dataset comprising 300K pathological slides from 20 tissue and organ types across multiple centers. The model was rigorously evaluated on 112 clinical tasks using a combination of 61 private and 51 public datasets. These tasks encompass digital slide preprocessing, pan-cancer classification, lesion identification, multi-cancer subtype classification, biomarker assessment, gene expression prediction, and the generation of structured reports. PathOrchestra demonstrated exceptional performance across 27,755 WSIs and 9,415,729 ROIs, achieving over 0.950 accuracy in 47 tasks, including pan-cancer classification across various organs, lymphoma subtype diagnosis, and bladder cancer screening. Notably, it is the first model to generate structured reports for high-incidence colorectal cancer and diagnostically complex lymphoma, areas that are infrequently addressed by foundational models but hold immense clinical potential. Overall, PathOrchestra exemplifies the feasibility and efficacy of a large-scale, self-supervised pathology foundation model, validated across a broad range of clinical-grade tasks. Its high accuracy and reduced reliance on extensive data annotation underline its potential for clinical integration, offering a pathway toward more efficient and high-quality medical services.
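Summary: The complexity and variability inherent in high-resolution pathology images pose major challenges for computational pathology. Although AI-based pathology foundation models have driven transformative progress, developing them requires large-scale datasets, considerable storage capacity, and substantial computational resources. Ensuring their clinical applicability and generalizability also requires rigorous validation across a wide range of clinical tasks. Here we present PathOrchestra, a versatile pathology foundation model trained with self-supervised learning on a dataset of 300K pathology slides covering 20 tissue and organ types from multiple centers. The model was rigorously evaluated on 112 clinical tasks using a combination of 61 private and 51 public datasets. These tasks include digital slide preprocessing, pan-cancer classification, lesion identification, multi-cancer subtype classification, biomarker assessment, gene expression prediction, and structured report generation. PathOrchestra performed exceptionally well across 27,755 WSIs and 9,415,729 ROIs, exceeding 0.950 accuracy on 47 tasks, including pan-cancer classification across various organs, lymphoma subtype diagnosis, and bladder cancer screening. Notably, it is the first model to generate structured reports for high-incidence colorectal cancer and diagnostically complex lymphoma, areas that foundation models rarely address but that hold great clinical potential. Overall, PathOrchestra demonstrates the feasibility and efficacy of a large-scale, self-supervised pathology foundation model validated across a broad range of clinical-grade tasks. Its high accuracy and reduced reliance on extensive data annotation underline its potential for clinical integration, offering a path toward more efficient, higher-quality medical services.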

Title: ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion

Authors: Rana Muhammad Shahroz Khan, Dongwen Tang, Pingzhi Li, Kai Wang, Tianlong Chen
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.24354
Pdf URL: https://arxiv.org/pdf/2503.24354
Copy Paste: [[2503.24354]] ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion(https://arxiv.org/abs/2503.24354)
Keywords: generation
Abstract: Parameter generation has emerged as a novel paradigm for neural network development, offering an alternative to traditional neural network training by synthesizing high-quality model weights directly. In the context of Low-Rank Adaptation (LoRA) for evolving ($\textit{i.e.}$, constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. However, existing methods face critical limitations in simultaneously achieving scalability and controllability. In this paper, we introduce $\texttt{ORAL}$, a novel $\textbf{conditional recurrent diffusion}$ framework that addresses these challenges. $\texttt{ORAL}$ incorporates a novel conditioning mechanism that integrates model architecture and textual task specifications, enabling the generation of task-specific LoRA parameters that can seamlessly transfer across evolving foundation models. Our approach successfully scales to billions-of-parameter LLMs and maintains controllability. Through extensive experiments across seven language tasks, four vision tasks, and three multimodal tasks using five pre-trained LLMs, we demonstrate that $\texttt{ORAL}$ generates high-quality LoRA parameters that achieve comparable or superior performance to vanilla trained counterparts.
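Summary: Parameter generation has emerged as a new paradigm for neural network development, offering an alternative to traditional training by directly synthesizing high-quality model weights. In the context of Low-Rank Adaptation (LoRA) for evolving ($\textit{i.e.}$, constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. However, existing methods face critical limitations in achieving scalability and controllability at the same time. In this paper, we introduce $\texttt{ORAL}$, a novel $\textbf{conditional recurrent diffusion}$ framework that addresses these challenges. $\texttt{ORAL}$ incorporates a novel conditioning mechanism that integrates model architecture and textual task specifications, enabling the generation of task-specific LoRA parameters that transfer seamlessly across evolving foundation models. Our approach scales successfully to LLMs with billions of parameters while maintaining controllability. Through extensive experiments on seven language tasks, four vision tasks, and three multimodal tasks using five pre-trained LLMs, we show that $\texttt{ORAL}$ generates high-quality LoRA parameters whose performance is comparable or superior to that of conventionally trained counterparts.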

Title: InstructRestore: Region-Customized Image Restoration with Human Instructions

Authors: Shuaizheng Liu, Jianqi Ma, Lingchen Sun, Xiangtao Kong, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24357
Pdf URL: https://arxiv.org/pdf/2503.24357
Copy Paste: [[2503.24357]] InstructRestore: Region-Customized Image Restoration with Human Instructions(https://arxiv.org/abs/2503.24357)
Keywords: restoration, generation
Abstract: Despite the significant progress in diffusion prior-based image restoration, most existing methods apply uniform processing to the entire image, lacking the capability to perform region-customized image restoration according to user instructions. In this work, we propose a new framework, namely InstructRestore, to perform region-adjustable image restoration following human instructions. To achieve this, we first develop a data generation engine to produce training triplets, each consisting of a high-quality image, the target region description, and the corresponding region mask. With this engine and careful data screening, we construct a comprehensive dataset comprising 536,945 triplets to support the training and evaluation of this task. We then examine how to integrate the low-quality image features under the ControlNet architecture to adjust the degree of image details enhancement. Consequently, we develop a ControlNet-like model to identify the target region and allocate different integration scales to the target and surrounding regions, enabling region-customized image restoration that aligns with user instructions. Experimental results demonstrate that our proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement. Our work advances the investigation of interactive image restoration and enhancement techniques. Data, code, and models will be found at this https URL.
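Summary: Despite significant progress in diffusion-prior-based image restoration, most existing methods process the entire image uniformly and cannot perform region-customized restoration according to user instructions. In this work, we propose a new framework, InstructRestore, for region-adjustable image restoration that follows human instructions. To achieve this, we first develop a data generation engine that produces training triplets, each consisting of a high-quality image, a target region description, and the corresponding region mask. With this engine and careful data screening, we construct a comprehensive dataset of 536,945 triplets to support training and evaluation for this task. We then examine how to integrate low-quality image features under the ControlNet architecture to adjust the degree of detail enhancement. Accordingly, we develop a ControlNet-like model that identifies the target region and assigns different integration scales to the target and surrounding regions, enabling region-customized image restoration aligned with user instructions. Experimental results show that the proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement. Our work advances research on interactive image restoration and enhancement techniques. Data, code, and models will be found at this https URL.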

Title: Effectively Controlling Reasoning Models through Thinking Intervention

Authors: Tong Wu, Chong Xiang, Jiachen T. Wang, Prateek Mittal
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.24370
Pdf URL: https://arxiv.org/pdf/2503.24370
Copy Paste: [[2503.24370]] Effectively Controlling Reasoning Models through Thinking Intervention(https://arxiv.org/abs/2503.24370)
Keywords: generation
Abstract: Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning processes of LLMs by strategically inserting or revising specific thinking tokens. We conduct comprehensive evaluations across multiple tasks, including instruction following on IFEval, instruction hierarchy on SEP, and safety alignment on XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, 15.4% improvements in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.
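Summary: Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps before producing a final answer, which helps them excel at complex problem solving. In this paper, we show that this emerging generation framework offers a unique opportunity for finer-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning process of LLMs by strategically inserting or revising specific thinking tokens. We conduct comprehensive evaluations across multiple tasks, including instruction following on IFEval, instruction hierarchy on SEP, and safety alignment on XSTest and SORRY-Bench. Our results show that Thinking Intervention significantly outperforms baseline prompting approaches, achieving accuracy gains of up to 6.7% in instruction-following scenarios, a 15.4% improvement in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.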
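
The Thinking Intervention entry above describes steering a reasoning model by inserting or revising thinking tokens. The following is only a minimal sketch of that general idea, not the authors' implementation: the `<think>` delimiter, the function names `insert_intervention` and `revise_reasoning`, and the example strings are all assumptions made for illustration, and no model call is shown.

```python
# Hypothetical illustration of inserting or revising "thinking" text.
# The delimiter and all helper names are assumptions, not taken from the paper.
THINK_OPEN = "<think>"  # assumed marker that opens the model's reasoning block


def insert_intervention(prompt: str, intervention: str,
                        think_open: str = THINK_OPEN) -> str:
    """Build text for a model to continue: the user prompt, followed by an
    opened reasoning block that already begins with the intervention."""
    return f"{prompt}\n{think_open}\n{intervention.strip()}\n"


def revise_reasoning(partial_reasoning: str, old: str, new: str) -> str:
    """Replace the first occurrence of a fragment in partially generated
    reasoning, leaving the rest of the text unchanged."""
    return partial_reasoning.replace(old, new, 1)


if __name__ == "__main__":
    seeded = insert_intervention(
        "Summarize the report in exactly two sentences.",
        "I must follow the instruction to write exactly two sentences.",
    )
    print(seeded)
    print(revise_reasoning("I can ignore the length limit.",
                           "ignore", "respect"))
```

In this sketch the seeded string would be passed to a generation API as a prefix so that the model continues reasoning from the inserted sentence; how a given model exposes its reasoning block varies and is not specified in the abstract.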

Title: Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.24376
Pdf URL: https://arxiv.org/pdf/2503.24376
Copy Paste: [[2503.24376]] Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1(https://arxiv.org/abs/2503.24376)
Keywords: generation
Abstract: Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.
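Summary: Recent advances in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks that require both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in multiple-choice format, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy of in-distribution, cross-environment, and cross-environment-task scenarios, and comes with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as the base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, and it even outperforms SFT on general video understanding benchmarks such as LongVideoBench. Our detailed analysis shows that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations, such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness to noisy signals.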

Title: Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

Authors: Shengqiong Wu, Weicai Ye, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Shuicheng Yan, Hao Fei, Tat-Seng Chua
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.24379
Pdf URL: https://arxiv.org/pdf/2503.24379
Copy Paste: [[2503.24379]] Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation(https://arxiv.org/abs/2503.24379)
Keywords: generation
Abstract: To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs--text, images, videos, and specialized cues such as region, motion, and camera poses--into dense, structured captions that offer backbone video generators with better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models. Project Page: this https URL
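Summary: To address the bottleneck of accurately interpreting user intent in the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple the various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions that give backbone video generators better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations show that our system significantly improves controllability and video quality across various aspects of existing video generation models. Project page: this https URL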

Title: Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views

Authors: Chong Bao, Xiyu Zhang, Zehao Yu, Jiale Shi, Guofeng Zhang, Songyou Peng, Zhaopeng Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24382
Pdf URL: https://arxiv.org/pdf/2503.24382
Copy Paste: [[2503.24382]] Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views(https://arxiv.org/abs/2503.24382)
Keywords: generation
Abstract: Neural rendering has demonstrated remarkable success in high-quality 3D neural reconstruction and novel view synthesis with dense input views and accurate poses. However, applying it to extremely sparse, unposed views in unbounded 360° scenes remains a challenging problem. In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation to effectively model the scene with distinct spatial layers. By employing a dense stereo reconstruction model to recover coarse geometry, we introduce a layer-specific bootstrap optimization to refine the noise and fill occluded regions in the reconstruction. Furthermore, we propose an iterative fusion of reconstruction and generation alongside an uncertainty-aware training approach to facilitate mutual conditioning and enhancement between these two processes. Comprehensive experiments show that our approach outperforms existing state-of-the-art methods in terms of rendering quality and surface reconstruction accuracy. Project page: this https URL
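Summary: Neural rendering has achieved remarkable success in high-quality 3D neural reconstruction and novel view synthesis when given dense input views and accurate poses. However, applying it to extremely sparse, unposed views in unbounded 360° scenes remains a challenging problem. In this paper, we propose a novel neural rendering framework for unposed, extremely sparse-view 3D reconstruction in unbounded 360° scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation that models the scene effectively with distinct spatial layers. Using a dense stereo reconstruction model to recover coarse geometry, we introduce a layer-specific bootstrap optimization to refine the noise and fill occluded regions in the reconstruction. Furthermore, we propose an iterative fusion of reconstruction and generation, together with an uncertainty-aware training approach, to facilitate mutual conditioning and enhancement between the two processes. Comprehensive experiments show that our approach outperforms existing state-of-the-art methods in rendering quality and surface reconstruction accuracy. Project page: this https URL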

Title: Consistent Subject Generation via Contrastive Instantiated Concepts

Authors: Lee Hsin-Ying, Kelvin C.K. Chan, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24387
Pdf URL: https://arxiv.org/pdf/2503.24387
Copy Paste: [[2503.24387]] Consistent Subject Generation via Contrastive Instantiated Concepts(https://arxiv.org/abs/2503.24387)
Keywords: generation, generative
Abstract: While text-to-image generative models can synthesize diverse and faithful contents, subject variation across multiple creations limits the application in long content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network, which transforms input latent codes into pseudo-words associated with certain instances of concepts. Users can generate consistent subjects with the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to differentiate the combination of prompts and latent codes. Extensive evaluations of human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining higher flexibility. We also demonstrate the potential of extending CoCoIns to multiple subjects and other object categories.
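Summary: While text-to-image generative models can synthesize diverse and faithful content, variation in subjects across multiple creations limits their use in long content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network that transforms input latent codes into pseudo-words associated with particular instances of concepts. Users can generate consistent subjects by using the same latent codes. To build these associations, we propose a contrastive learning approach that trains the network to differentiate combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining higher flexibility. We also demonstrate the potential of extending CoCoIns to multiple subjects and other object categories.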