2025-03-04

Title: Streaming Looking Ahead with Token-level Self-reward

Authors: Hongming Zhang, Ruixin Hong, Dong Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00029
Pdf URL: https://arxiv.org/pdf/2503.00029
Copy Paste: [[2503.00029]] Streaming Looking Ahead with Token-level Self-reward(https://arxiv.org/abs/2503.00029)
Keywords: generation
Abstract: Autoregressive decoding algorithms that use only past information often cannot guarantee the best performance. Recent work has found that looking-ahead algorithms such as Monte Carlo Tree Search (MCTS) with external reward models (RMs) can significantly improve models' output by allowing them to think ahead and leverage future outputs and associated rewards to guide the current generation. Such techniques can help the reinforcement fine-tuning phase by sampling better trajectories and the inference phase by selecting the better output. However, their high computational cost limits their applications, especially in streaming scenarios. To address this issue, we propose equipping the policy model with token-level self-reward modeling (TRM) capability to eliminate the need for external models and extra communication. We name the new architecture Reward Transformer. In addition, we propose a streaming-looking-ahead (SLA) algorithm to further boost search efficiency with better parallelization. Experiments show that SLA achieves an overall win rate of 79.7% against the baseline greedy decoding algorithm on three general-domain datasets with a frozen policy model while maintaining streaming efficiency. If we combine SLA with reinforcement fine-tuning techniques such as DPO, SLA achieves an overall win rate of 89.4%.

Title: Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation

Authors: Keqiang Yan, Xiner Li, Hongyi Ling, Kenna Ashen, Carl Edwards, Raymundo Arróyave, Marinka Zitnik, Heng Ji, Xiaofeng Qian, Xiaoning Qian, Shuiwang Ji
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2503.00152
Pdf URL: https://arxiv.org/pdf/2503.00152
Copy Paste: [[2503.00152]] Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation(https://arxiv.org/abs/2503.00152)
Keywords: generation
Abstract: We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods.

Title: PaliGemma-CXR: A Multi-task Multimodal Model for TB Chest X-ray Interpretation

Authors: Denis Musinguzi, Andrew Katumba, Sudi Murindanyi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00171
Pdf URL: https://arxiv.org/pdf/2503.00171
Copy Paste: [[2503.00171]] PaliGemma-CXR: A Multi-task Multimodal Model for TB Chest X-ray Interpretation(https://arxiv.org/abs/2503.00171)
Keywords: generation
Abstract: Tuberculosis (TB) is an infectious global health challenge. Chest X-rays are a standard method for TB screening, yet many countries face a critical shortage of radiologists capable of interpreting these images. Machine learning offers an alternative, as it can automate tasks such as disease diagnosis and report generation. However, traditional approaches rely on task-specific models, which cannot utilize the interdependence between tasks. Building a multi-task model capable of performing multiple tasks poses additional challenges such as scarcity of multimodal data, dataset imbalance, and negative transfer. To address these challenges, we propose PaliGemma-CXR, a multi-task multimodal model capable of performing TB diagnosis, object detection, segmentation, report generation, and VQA. Starting with a dataset of chest X-ray images annotated with TB diagnosis labels and segmentation masks, we curated a multimodal dataset to support additional tasks. By finetuning PaliGemma on this dataset and sampling data using ratios of the inverse of the size of task datasets, we achieved the following results across all tasks: 90.32% accuracy on TB diagnosis and 98.95% on close-ended VQA, 41.3 BLEU score on report generation, and a mAP of 19.4 and 16.0 on object detection and segmentation, respectively. These results demonstrate that PaliGemma-CXR effectively leverages the interdependence between multiple image interpretation tasks to enhance performance.
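Note (illustrative, not from the paper): the abstract mentions sampling data "using ratios of the inverse of the size of task datasets". One plausible reading is that each task is drawn with probability proportional to 1/size, so that small tasks are not drowned out by large ones. The sketch below shows that reading only; the task names and sizes are hypothetical, and the authors' actual sampler may differ.

```python
import random

def inverse_size_weights(task_sizes):
    """Return per-task sampling probabilities proportional to 1 / dataset size."""
    inverse = {task: 1.0 / size for task, size in task_sizes.items()}
    total = sum(inverse.values())
    return {task: value / total for task, value in inverse.items()}

# Hypothetical dataset sizes, chosen only for illustration.
task_sizes = {"diagnosis": 4000, "segmentation": 1000, "vqa": 2000, "report": 500}
weights = inverse_size_weights(task_sizes)

# Draw the task for each of 8 training steps according to those weights.
rng = random.Random(0)
tasks = list(weights)
schedule = rng.choices(tasks, weights=[weights[t] for t in tasks], k=8)
```

Under this scheme the smallest task ("report" here) is picked most often per step, which is one common way to counter dataset imbalance in multi-task training.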

Title: PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion

Authors: Amar Kumar, Anita Kriz, Mohammad Havaei, Tal Arbel
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00196
Pdf URL: https://arxiv.org/pdf/2503.00196
Copy Paste: [[2503.00196]] PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion(https://arxiv.org/abs/2503.00196)
Keywords: generation
Abstract: Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures robust to the unique complexities posed by medical imaging data. The rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at this https URL.

Title: AnalogGenie: A Generative Engine for Automatic Discovery of Analog Circuit Topologies

Authors: Jian Gao, Weidong Cao, Junyi Yang, Xuan Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00205
Pdf URL: https://arxiv.org/pdf/2503.00205
Copy Paste: [[2503.00205]] AnalogGenie: A Generative Engine for Automatic Discovery of Analog Circuit Topologies(https://arxiv.org/abs/2503.00205)
Keywords: generation, generative
Abstract: The massive and large-scale design of foundational semiconductor integrated circuits (ICs) is crucial to sustaining the advancement of many emerging and future technologies, such as generative AI, 5G/6G, and quantum computing. Excitingly, recent studies have shown the great capabilities of foundational models in expediting the design of digital ICs. Yet, applying generative AI techniques to accelerate the design of analog ICs remains a significant challenge due to critical domain-specific issues, such as the lack of a comprehensive dataset and effective representation methods for analog circuits. This paper proposes AnalogGenie, a Generative engine for automatic design/discovery of Analog circuit topologies, the most challenging and creative task in the conventional manual design flow of analog ICs. AnalogGenie addresses two key gaps in the field: building a foundational comprehensive dataset of analog circuit topology and developing a scalable sequence-based graph representation universal to analog circuits. Experimental results show the remarkable generation performance of AnalogGenie in broadening the variety of analog ICs, increasing the number of devices within a single design, and discovering unseen circuit topologies far beyond any prior art. Our work paves the way to transform the longstanding, time-consuming manual design flow of analog ICs into an automatic and massive one powered by generative AI. Our source code is available at this https URL.

Title: Flow Matching for Medical Image Synthesis: Bridging the Gap Between Speed and Quality

Authors: Milad Yazdani, Yasamin Medghalchi, Pooria Ashrafian, Ilker Hacihaliloglu, Dena Shahriari
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.00266
Pdf URL: https://arxiv.org/pdf/2503.00266
Copy Paste: [[2503.00266]] Flow Matching for Medical Image Synthesis: Bridging the Gap Between Speed and Quality(https://arxiv.org/abs/2503.00266)
Keywords: generation, generative
Abstract: Deep learning models have emerged as a powerful tool for various medical applications. However, their success depends on large, high-quality datasets that are challenging to obtain due to privacy concerns and costly annotation. Generative models, such as diffusion models, offer a potential solution by synthesizing medical images, but their practical adoption is hindered by long inference times. In this paper, we propose the use of an optimal transport flow matching approach to accelerate image generation. By introducing a straighter mapping between the source and target distribution, our method significantly reduces inference time while preserving and further enhancing the quality of the outputs. Furthermore, this approach is highly adaptable, supporting various medical imaging modalities, conditioning mechanisms (such as class labels and masks), and different spatial dimensions, including 2D and 3D. Beyond image generation, it can also be applied to related tasks such as image enhancement. Our results demonstrate the efficiency and versatility of this framework, making it a promising advancement for medical imaging applications. Code with checkpoints and a synthetic dataset (beneficial for classification and segmentation) is now available on: this https URL.
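Note (illustrative, not from the paper): the "straighter mapping between the source and target distribution" can be pictured with the generic flow matching setup, where a sample moves along the straight line x_t = (1 - t) * x0 + t * x1 and the regression target is the constant velocity x1 - x0. The sketch below assumes that textbook formulation with an oracle velocity in place of a trained network; all function names are hypothetical.

```python
def straight_path_point(x0, x1, t):
    """Point at time t in [0, 1] on the straight line from x0 (noise) to x1 (data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity along the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def euler_integrate(x_start, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x_start)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

x0 = [0.0, 2.0]    # stand-in for a noise sample
x1 = [1.0, -1.0]   # stand-in for a data sample
midpoint = straight_path_point(x0, x1, 0.5)
# Oracle velocity: a trained model would approximate this from (x, t).
one_step = euler_integrate(x0, lambda x, t: target_velocity(x0, x1), steps=1)
```

Because the velocity is constant along a straight path, a single Euler step already reaches x1 here. This is the intuition behind fewer sampling steps; a learned velocity field is only approximately straight, so real samplers typically still use several steps.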

Title: Learning to Animate Images from A Few Videos to Portray Delicate Human Actions

Authors: Haoxin Li, Yingchen Yu, Qilong Wu, Hanwang Zhang, Boyang Li, Song Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00276
Pdf URL: https://arxiv.org/pdf/2503.00276
Copy Paste: [[2503.00276]] Learning to Animate Images from A Few Videos to Portray Delicate Human Actions(https://arxiv.org/abs/2503.00276)
Keywords: generative
Abstract: Despite recent progress, video generative models still struggle to animate human actions from static images, particularly when handling uncommon actions whose training data are limited. In this paper, we investigate the task of learning to animate human actions from a small number of videos -- 16 or fewer -- which is highly valuable in real-world applications like video and movie production. Few-shot learning of generalizable motion patterns while ensuring smooth transitions from the initial reference image is exceedingly challenging. We propose FLASH (Few-shot Learning to Animate and Steer Humans), which improves motion generalization by aligning motion features and inter-frame correspondence relations between videos that share the same motion but have different appearances. This approach minimizes overfitting to visual appearances in the limited training data and enhances the generalization of learned motion patterns. Additionally, FLASH extends the decoder with additional layers to compensate for details lost in the latent space, fostering smooth transitions from the initial reference image. Experiments demonstrate that FLASH effectively animates images with unseen human or scene appearances into specified actions while maintaining smooth transitions from the reference image.

Title: Remasking Discrete Diffusion Models with Inference-Time Scaling

Authors: Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, Volodymyr Kuleshov
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.00307
Pdf URL: https://arxiv.org/pdf/2503.00307
Copy Paste: [[2503.00307]] Remasking Discrete Diffusion Models with Inference-Time Scaling(https://arxiv.org/abs/2503.00307)
Keywords: generation
Abstract: Part of the success of diffusion models stems from their ability to perform iterative refinement, i.e., repeatedly correcting outputs during generation. However, modern masked discrete diffusion lacks this capability: when a token is generated, it cannot be updated again, even when it introduces an error. Here, we address this limitation by introducing the remasking diffusion model (ReMDM) sampler, a method that can be applied to pretrained masked diffusion models in a principled way and that is derived from a discrete diffusion model with a custom remasking backward process. Most interestingly, ReMDM endows discrete diffusion with a form of inference-time compute scaling. By increasing the number of sampling steps, ReMDM generates natural language outputs that approach the quality of autoregressive models, whereas when the computation budget is limited, ReMDM better maintains quality. ReMDM also improves sample quality of masked diffusion models for discretized images, and in scientific domains such as molecule design, ReMDM facilitates diffusion guidance and pushes the Pareto frontier of controllability relative to classical masking and uniform noise diffusion. We provide the code along with a blog post on the project page: this https URL.

Title: Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding

Authors: Wei Suo, Lijun Zhang, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00361
Pdf URL: https://arxiv.org/pdf/2503.00361
Copy Paste: [[2503.00361]] Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding(https://arxiv.org/abs/2503.00361)
Keywords: generative
Abstract: Large Vision-Language Models (LVLMs) have obtained impressive performance in visual content understanding and multi-modal reasoning. Unfortunately, these large models suffer from serious hallucination problems and tend to generate fabricated responses. Recently, several Contrastive Decoding (CD) strategies have been proposed to alleviate hallucination by introducing disturbed inputs. Although great progress has been made, these CD strategies mostly apply a one-size-fits-all approach for all input conditions. In this paper, we revisit this process through extensive experiments. Related results show that hallucination causes are hybrid and each generative step faces a unique hallucination challenge. Leveraging these meaningful insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. Our Octopus framework not only outperforms existing methods across four benchmarks but also demonstrates excellent deployability and expansibility. Code is available at this https URL.
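Note (illustrative, not from the paper): the abstract does not spell out the Contrastive Decoding rule. A form commonly used in the CD literature contrasts logits from the clean input with logits from a disturbed input, (1 + alpha) * clean - alpha * disturbed, so that tokens the model would emit regardless of the image are penalized. The sketch below shows that fixed, one-size-fits-all rule, which is the baseline the abstract argues should be made dynamic; the logit values are made up.

```python
def contrastive_logits(clean, disturbed, alpha=1.0):
    """Amplify what the clean input supports and subtract what the disturbed input supports."""
    return [(1.0 + alpha) * c - alpha * d for c, d in zip(clean, disturbed)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up next-token logits over a 3-token vocabulary.
clean = [2.0, 1.9, 0.1]      # model sees the real image
disturbed = [2.0, 0.5, 0.1]  # model sees a corrupted image

greedy_choice = argmax(clean)
contrastive_choice = argmax(contrastive_logits(clean, disturbed))
```

Token 0 scores equally high with and without a usable image, which suggests it is driven by language priors rather than visual evidence; the contrast therefore shifts the choice to token 1, whose score depends on the image.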

Title: Jointly Understand Your Command and Intention: Reciprocal Co-Evolution between Scene-Aware 3D Human Motion Synthesis and Analysis

Authors: Xuehao Gao, Yang Yang, Shaoyi Du, Guo-Jun Qi, Junwei Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00371
Pdf URL: https://arxiv.org/pdf/2503.00371
Copy Paste: [[2503.00371]] Jointly Understand Your Command and Intention: Reciprocal Co-Evolution between Scene-Aware 3D Human Motion Synthesis and Analysis(https://arxiv.org/abs/2503.00371)
Keywords: generation
Abstract: As two intimate reciprocal tasks, scene-aware human motion synthesis and analysis require a joint understanding between multiple modalities, including 3D body motions, 3D scenes, and textual descriptions. In this paper, we integrate these two paired processes into a Co-Evolving Synthesis-Analysis (CESA) pipeline and mutually benefit their learning. Specifically, scene-aware text-to-human synthesis generates diverse indoor motion samples from the same textual description to enrich human-scene interaction intra-class diversity, thus significantly benefiting training a robust human motion analysis system. Reciprocally, human motion analysis would enforce semantic scrutiny on each synthesized motion sample to ensure its semantic consistency with the given textual description, thus improving realistic motion synthesis. Considering that real-world indoor human motions are goal-oriented and path-guided, we propose a cascaded generation strategy that factorizes text-driven scene-specific human motion generation into three stages: goal inferring, path planning, and pose synthesizing. Coupling CESA with this powerful cascaded motion synthesis model, we jointly improve realistic human motion synthesis and robust human motion analysis in 3D scenes.

Title: EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning

Authors: Xuehao Gao, Yang Yang, Shaoyi Du, Yang Wu, Yebin Liu, Guo-Jun Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00382
Pdf URL: https://arxiv.org/pdf/2503.00382
Copy Paste: [[2503.00382]] EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning(https://arxiv.org/abs/2503.00382)
Keywords: generation
Abstract: This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction. Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D body motions, which may encounter a performance bottleneck because of the huge cross-modality gap. In this paper, we observe that HOI samples with the same interaction intention toward different targets, e.g., "lift a chair" and "lift a cup", always encapsulate similar action-specific body motion patterns while characterizing different object-specific interaction styles. Thus, learning effective action-specific motion priors and object-specific interaction priors is crucial for a text-to-HOI model and dominates its performance on text-HOI semantic consistency and body-object interaction realism. In light of this, we propose a novel body pose generation strategy for the text-to-HOI task: infer object-agnostic canonical body action first and then enrich object-specific interaction styles. Specifically, the first canonical body action inference stage focuses on learning intra-class shareable body motion priors and mapping given text-based semantics to action-specific canonical 3D body motions. Then, in the object-specific interaction inference stage, we focus on object affordance learning and enrich object-specific interaction styles on an inferred action-specific body motion basis. Extensive experiments verify that our proposed text-to-HOI synthesis system significantly outperforms other SOTA methods on three large-scale datasets, with better semantic consistency and interaction realism.

Title: CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering

Authors: Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, Liang He
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00413
Pdf URL: https://arxiv.org/pdf/2503.00413
Copy Paste: [[2503.00413]] CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering(https://arxiv.org/abs/2503.00413)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) have garnered widespread attention from researchers due to their remarkable understanding and generation capabilities in visual language tasks (e.g., visual question answering). However, the rapid pace of knowledge updates in the real world makes offline training of MLLMs costly, and when faced with non-stationary data streams, MLLMs suffer from catastrophic forgetting during learning. In this paper, we propose an MLLMs-based dual momentum Mixture-of-Experts (CL-MoE) framework for continual visual question answering (VQA). We integrate MLLMs with continual learning to utilize the rich commonsense knowledge in LLMs. We introduce a Dual-Router MoE (RMoE) strategy to select the global and local experts using task-level and instance-level routers, to robustly assign weights to the experts most appropriate for the task. Then, we design a dynamic Momentum MoE (MMoE) to update the parameters of experts dynamically based on the relationships between the experts and tasks/instances, so that the model can absorb new knowledge while maintaining existing knowledge. The extensive experimental results indicate that our method achieves state-of-the-art performance on 10 VQA tasks, proving the effectiveness of our approach.

Title: Auto-encoding Molecules: Graph-Matching Capabilities Matter

Authors: Magnus Cunow, Gerrit Großmann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00426
Pdf URL: https://arxiv.org/pdf/2503.00426
Copy Paste: [[2503.00426]] Auto-encoding Molecules: Graph-Matching Capabilities Matter(https://arxiv.org/abs/2503.00426)
Keywords: generation, generative
Abstract: Autoencoders are effective deep learning models that can function as generative models and learn latent representations for downstream tasks. The use of graph autoencoders - with both encoder and decoder implemented as message passing networks - is intriguing due to their ability to generate permutation-invariant graph representations. However, this approach faces difficulties because decoding a graph structure from a single vector is challenging, and comparing input and output graphs requires an effective permutation-invariant similarity measure. As a result, many studies rely on approximate methods. In this work, we explore the effect of graph matching precision on the training behavior and generation capabilities of a Variational Autoencoder (VAE). Our contribution is two-fold: (1) we propose a transformer-based message passing graph decoder as an alternative to a graph neural network decoder, which is more robust and expressive because it leverages global attention mechanisms; (2) we show that the precision of graph matching has a significant impact on training behavior and is essential for effective de novo (molecular) graph generation. Code is available at this https URL.
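Note (illustrative, not from the paper): the need for a "permutation-invariant similarity measure" can be seen on a tiny example. Two adjacency matrices may describe the same graph with different node orderings, so a position-by-position comparison reports errors that an exact matching over all node permutations does not. The brute-force search below is a toy stand-in for exact graph matching and is feasible only for very small graphs; it is not the matching method used by the authors.

```python
from itertools import permutations

def edge_mismatch(adj_a, adj_b, perm):
    """Count node pairs whose edge status differs when node i of A maps to perm[i] of B."""
    n = len(adj_a)
    return sum(
        adj_a[i][j] != adj_b[perm[i]][perm[j]]
        for i in range(n)
        for j in range(i + 1, n)
    )

def exact_graph_distance(adj_a, adj_b):
    """Minimum edge mismatch over all node permutations (exact, factorial cost)."""
    n = len(adj_a)
    return min(edge_mismatch(adj_a, adj_b, p) for p in permutations(range(n)))

# Path graph 0-1-2, and the same path with the middle node listed first.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
relabeled = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]

naive = edge_mismatch(path, relabeled, (0, 1, 2))  # compares positions as given
exact = exact_graph_distance(path, relabeled)      # searches over node orderings
```

The naive comparison counts two mismatched pairs even though the graphs are isomorphic, while the exact distance is zero. An imprecise matching can therefore penalize a decoder for a correct reconstruction, which is consistent with the abstract's claim that matching precision affects training.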

Title: DashCop: Automated E-ticket Generation for Two-Wheeler Traffic Violations Using Dashcam Videos

Authors: Deepti Rawat, Keshav Gupta, Aryamaan Basu Roy, Ravi Kiran Sarvadevabhatla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00428
Pdf URL: https://arxiv.org/pdf/2503.00428
Copy Paste: [[2503.00428]] DashCop: Automated E-ticket Generation for Two-Wheeler Traffic Violations Using Dashcam Videos(https://arxiv.org/abs/2503.00428)
Keywords: generation
Abstract: Motorized two-wheelers are a prevalent and economical means of transportation, particularly in the Asia-Pacific region. However, hazardous driving practices such as triple riding and non-compliance with helmet regulations contribute significantly to accident rates. Addressing these violations through automated enforcement mechanisms can enhance traffic safety. In this paper, we propose DashCop, an end-to-end system for automated E-ticket generation. The system processes vehicle-mounted dashcam videos to detect two-wheeler traffic violations. Our contributions include: (1) a novel Segmentation and Cross-Association (SAC) module to accurately associate riders with their motorcycles, (2) a robust cross-association-based tracking algorithm optimized for the simultaneous presence of riders and motorcycles, and (3) the RideSafe-400 dataset, a comprehensive annotated dashcam video dataset for triple riding and helmet rule violations. Our system demonstrates significant improvements in violation detection, validated through extensive evaluations on the RideSafe-400 dataset.

Title: Using Machine Learning for move sequence visualization and generation in climbing

Authors: Thomas Rimbot, Martin Jaggi, Luis Barba
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00458
Pdf URL: https://arxiv.org/pdf/2503.00458
Copy Paste: [[2503.00458]] Using Machine Learning for move sequence visualization and generation in climbing(https://arxiv.org/abs/2503.00458)
Keywords: generation
Abstract: In this work, we investigate the application of Machine Learning techniques to sport climbing. Expanding upon previous projects, we develop a visualization tool for move sequence evaluation on a given boulder. Then, we look into move sequence prediction from simple holds sequence information using three different Transformer models. While the results are not conclusive, they are a first step in this kind of approach and lay the ground for future work.

Title: Periodic Materials Generation using Text-Guided Joint Diffusion Model

Authors: Kishalay Das, Subhojyoti Khastagir, Pawan Goyal, Seung-Cheol Lee, Satadeep Bhattacharjee, Niloy Ganguly
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2503.00522
Pdf URL: https://arxiv.org/pdf/2503.00522
Copy Paste: [[2503.00522]] Periodic Materials Generation using Text-Guided Joint Diffusion Model(https://arxiv.org/abs/2503.00522)
Keywords: generation, generative
Abstract: Equivariant diffusion models have emerged as the prevailing approach for generating novel crystal materials due to their ability to leverage the physical symmetries of periodic material structures. However, current models do not effectively learn the joint distribution of atom types, fractional coordinates, and lattice structure of the crystal material in a cohesive end-to-end diffusion framework. Also, none of these models work under realistic setups, where users specify the desired characteristics that the generated structures must match. In this work, we introduce TGDMat, a novel text-guided diffusion model designed for 3D periodic material generation. Our approach integrates global structural knowledge through textual descriptions at each denoising step while jointly generating atom coordinates, types, and lattice structure using a periodic-E(3)-equivariant graph neural network (GNN). Extensive experiments using popular datasets on benchmark tasks reveal that TGDMat outperforms existing baseline methods by a good margin. Notably, for the structure prediction task, with just one generated sample, TGDMat outperforms all baseline models, highlighting the importance of text-guided diffusion. Further, in the generation task, TGDMat surpasses all baselines and their text-fusion variants, showcasing the effectiveness of the joint diffusion paradigm. Additionally, incorporating textual knowledge reduces overall training and sampling computational overhead while enhancing generative performance when utilizing real-world textual prompts from experts.

Title: GaussianSeal: Rooting Adaptive Watermarks for 3D Gaussian Generation Model

Authors: Runyi Li, Xuanyu Zhang, Chuhan Tong, Zhipei Xu, Jian Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.00531
Pdf URL: https://arxiv.org/pdf/2503.00531
Copy Paste: [[2503.00531]] GaussianSeal: Rooting Adaptive Watermarks for 3D Gaussian Generation Model(https://arxiv.org/abs/2503.00531)
Keywords: generation, generative
Abstract: With the advancement of AIGC technologies, the modalities generated by models have expanded from images and videos to 3D objects, leading to an increasing number of works focused on 3D Gaussian Splatting (3DGS) generative models. Existing research on copyright protection for generative models has primarily concentrated on watermarking in image and text modalities, with little exploration into the copyright protection of 3D object generative models. In this paper, we propose the first bit watermarking framework for 3DGS generative models, named GaussianSeal, to enable the decoding of bits as copyright identifiers from the rendered outputs of generated 3DGS. By incorporating adaptive bit modulation modules into the generative model and embedding them into the network blocks in an adaptive way, we achieve high-precision bit decoding with minimal training overhead while maintaining the fidelity of the model's outputs. Experiments demonstrate that our method outperforms post-processing watermarking approaches for 3DGS objects, achieving superior performance of watermark decoding accuracy and preserving the quality of the generated results.

Title: What Makes a Good Diffusion Planner for Decision Making?

Authors: Haofei Lu, Dongqi Han, Yifei Shen, Dongsheng Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00535
Pdf URL: https://arxiv.org/pdf/2503.00535
Copy Paste: [[2503.00535]] What Makes a Good Diffusion Planner for Decision Making?(https://arxiv.org/abs/2503.00535)
Keywords: generation
Abstract: Diffusion models have recently shown significant potential in solving decision-making problems, particularly in generating behavior plans -- also known as diffusion planning. While numerous studies have demonstrated the impressive performance of diffusion planning, the mechanisms behind the key components of a good diffusion planner remain unclear and the design choices are highly inconsistent in existing studies. In this work, we address this issue through systematic empirical experiments on diffusion planning in an offline reinforcement learning (RL) setting, providing practical insights into the essential components of diffusion planning. We trained and evaluated over 6,000 diffusion models, identifying the critical components such as guided sampling, network architecture, action generation and planning strategy. We revealed that some design choices opposite to the common practice in previous work in diffusion planning actually lead to better performance, e.g., unconditional sampling with selection can be better than guided sampling and Transformer outperforms U-Net as denoising network. Based on these insights, we suggest a simple yet strong diffusion planning baseline that achieves state-of-the-art results on standard offline RL benchmarks.

Title: Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing

Authors: Yanjun Li, Zhaoyang Li, Honghui Chen, Lizhi Xu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.00548
Pdf URL: https://arxiv.org/pdf/2503.00548
Copy Paste: [[2503.00548]] Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing(https://arxiv.org/abs/2503.00548)
Keywords: generation
Abstract: Video Scene Graph Generation (VidSGG) aims to capture dynamic relationships among entities by sequentially analyzing video frames and integrating visual and semantic information. However, VidSGG is challenged by significant biases that skew predictions. To mitigate these biases, we propose a VIsual and Semantic Awareness (VISA) framework for unbiased VidSGG. VISA addresses visual bias through memory-enhanced temporal integration that enhances object representations and concurrently reduces semantic bias by iteratively integrating object features with comprehensive semantic information derived from triplet relationships. This visual-semantics dual debiasing approach results in more unbiased representations of complex scene dynamics. Extensive experiments demonstrate the effectiveness of our method, where VISA outperforms existing unbiased VidSGG approaches by a substantial margin (e.g., +13.1% improvement in mR@20 and mR@50 for the SGCLS task under Semi Constraint).

Title: AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models

Authors: Sohan Patnaik, Rishabh Jain, Balaji Krishnamurthy, Mausoom Sarkar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00591
Pdf URL: https://arxiv.org/pdf/2503.00591
Copy Paste: [[2503.00591]] AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models(https://arxiv.org/abs/2503.00591)
Keywords: generation, generative
Abstract: Visual layouts are essential in graphic design fields such as advertising, posters, and web interfaces. The application of generative models for content-aware layout generation has recently gained traction. However, these models fail to understand the contextual aesthetic requirements of layout design and do not align with human-like preferences, primarily treating it as a prediction task without considering the final rendered output. To overcome these problems, we offer Aesthetic-Aware Preference Alignment (AAPA), a novel technique to train a Multi-modal Large Language Model (MLLM) for layout prediction that uses MLLM's aesthetic preferences for Direct Preference Optimization over graphic layouts. We propose a data filtering protocol utilizing our layout-quality heuristics for AAPA to ensure training happens on high-quality layouts. Additionally, we introduce a novel evaluation metric that uses another MLLM to compute the win rate of the generated layout against the ground-truth layout based on aesthetics criteria. We also demonstrate the applicability of AAPA for MLLMs of varying scales (1B to 8B parameters) and LLM families (Qwen, Phi, InternLM). By conducting thorough qualitative and quantitative analyses, we verify the efficacy of our approach on two challenging benchmarks, Crello and Webui, showcasing 17% and 16% improvement over current state-of-the-art methods, thereby highlighting the potential of MLLMs in aesthetic-aware layout generation.
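
Note: the abstract above relies on Direct Preference Optimization (DPO) over preferred and rejected layouts. As a rough illustration only, the standard DPO loss for a single preference pair can be sketched as below. The function name, its arguments, and the default `beta` are assumptions made for this sketch; nothing here is taken from the AAPA implementation.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities.

    logp_* come from the policy being trained, ref_logp_* from a frozen
    reference model. beta=0.1 is an illustrative default, not a reported value.
    """
    # How much more the policy favours the chosen output than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred layout.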

Title: SolidMark: Evaluating Image Memorization in Generative Models

Authors: Nicky Kriplani, Minh Pham, Gowthami Somepalli, Chinmay Hegde, Niv Cohen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00592
Pdf URL: https://arxiv.org/pdf/2503.00592
Copy Paste: [[2503.00592]] SolidMark: Evaluating Image Memorization in Generative Models(https://arxiv.org/abs/2503.00592)
Keywords: generation, generative
Abstract: Recent works have shown that diffusion models are able to memorize training images and emit them at generation time. However, the metrics used to evaluate memorization and its mitigation techniques suffer from dataset-dependent biases and struggle to detect whether a given specific image has been memorized or not. This paper begins with a comprehensive exploration of issues surrounding memorization metrics in diffusion models. Then, to mitigate these issues, we introduce SolidMark, a novel evaluation method that provides a per-image memorization score. We then re-evaluate existing memorization mitigation techniques. We also show that SolidMark is capable of evaluating fine-grained pixel-level memorization. Finally, we release a variety of models based on SolidMark to facilitate further research for understanding memorization phenomena in generative models. All of our code is available at this https URL.

Title: Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning

Authors: Zijian Li, Shunxing Fan, Yujia Zheng, Ignavier Ng, Shaoan Xie, Guangyi Chen, Xinshuai Dong, Ruichu Cai, Kun Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.00639
Pdf URL: https://arxiv.org/pdf/2503.00639
Copy Paste: [[2503.00639]] Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning(https://arxiv.org/abs/2503.00639)
Keywords: generative
Abstract: Disentangled representation learning aims to uncover latent variables underlying the observed data, and generally speaking, rather strong assumptions are needed to ensure identifiability. Some approaches rely on sufficient changes on the distribution of latent variables indicated by auxiliary variables such as domain indices, but acquiring enough domains is often challenging. Alternative approaches exploit structural sparsity assumptions on the mixing procedure, but such constraints are usually (partially) violated in practice. Interestingly, we find that these two seemingly unrelated assumptions can actually complement each other to achieve identifiability. Specifically, when conditioned on auxiliary variables, the sparse mixing procedure assumption provides structural constraints on the mapping from estimated to true latent variables and hence compensates for potentially insufficient distribution changes. Building on this insight, we propose an identifiability theory with less restrictive constraints regarding distribution changes and the sparse mixing procedure, enhancing applicability to real-world scenarios. Additionally, we develop an estimation framework incorporating a domain encoding network and a sparse mixing constraint and provide two implementations based on variational autoencoders and generative adversarial networks, respectively. Experiment results on synthetic and real-world datasets support our theoretical results.

Title: Development of an Unpaired Deep Neural Network for Synthesizing X-ray Fluoroscopic Images from Digitally Reconstructed Tomography in Image Guided Radiotherapy

Authors: Chisako Hayashi, Shinichiro Mori, Yasukuni Mori, Lim Taehyeung, Hiroki Suyari, Hitoshi Ishikawa
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2503.00665
Pdf URL: https://arxiv.org/pdf/2503.00665
Copy Paste: [[2503.00665]] Development of an Unpaired Deep Neural Network for Synthesizing X-ray Fluoroscopic Images from Digitally Reconstructed Tomography in Image Guided Radiotherapy(https://arxiv.org/abs/2503.00665)
Keywords: generation
Abstract: Purpose: The purpose of this study was to develop and evaluate a deep neural network (DNN) capable of generating flat-panel detector (FPD) images from digitally reconstructed radiography (DRR) images in lung cancer treatment, with the aim of improving clinical workflows in image-guided radiotherapy. Methods: A modified CycleGAN architecture was trained on paired DRR-FPD image data obtained from patients with lung tumors. The training dataset consisted of over 400 DRR-FPD image pairs, and the final model was evaluated on an independent set of 100 FPD images. Mean absolute error (MAE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Kernel Inception Distance (KID) were used to quantify the similarity between synthetic and ground-truth FPD images. Computation time for generating synthetic images was also measured. Results: Despite some positional mismatches in the DRR-FPD pairs, the synthetic FPD images closely resembled the ground-truth FPD images. The proposed DNN achieved notable improvements over both input DRR images and a U-Net-based method in terms of MAE, PSNR, SSIM, and KID. The average image generation time was on the order of milliseconds per image, indicating its potential for real-time application. Qualitative evaluations showed that the DNN successfully reproduced image noise patterns akin to real FPD images, reducing the need for manual noise adjustments. Conclusions: The proposed DNN effectively converted DRR images into realistic FPD images for thoracic cases, offering a fast and practical method that could streamline patient setup verification and enhance overall clinical workflow. Future work should validate the model across different imaging systems and address remaining challenges in marker visualization, thereby fostering broader clinical adoption.
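
Note: two of the similarity metrics named in the abstract above, MAE and PSNR, have simple closed forms. The sketch below computes them for images given as flat lists of pixel intensities. The function names and the 8-bit `max_value` default are assumptions for illustration; the paper's own evaluation code and intensity range are not given here.

```python
import math


def mae(pred, target):
    """Mean absolute error between two equally sized flat pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value is the assumed peak intensity."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Lower MAE and higher PSNR both indicate a synthetic image closer to the ground truth.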

Title: Dur360BEV: A Real-world Single 360-degree Camera Dataset and Benchmark for Bird-Eye View Mapping in Autonomous Driving

Authors: Wenke E, Chao Yuan, Li Li, Yixin Sun, Yona Falinie A. Gaus, Amir Atapour-Abarghouei, Toby P. Breckon
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.00675
Pdf URL: https://arxiv.org/pdf/2503.00675
Copy Paste: [[2503.00675]] Dur360BEV: A Real-world Single 360-degree Camera Dataset and Benchmark for Bird-Eye View Mapping in Autonomous Driving(https://arxiv.org/abs/2503.00675)
Keywords: generation
Abstract: We present Dur360BEV, a novel spherical camera autonomous driving dataset equipped with a high-resolution 128-channel 3D LiDAR and an RTK-refined GNSS/INS system, along with a benchmark architecture designed to generate Bird-Eye-View (BEV) maps using only a single spherical camera. This dataset and benchmark address the challenges of BEV generation in autonomous driving, particularly by reducing hardware complexity through the use of a single 360-degree camera instead of multiple perspective cameras. Within our benchmark architecture, we propose a novel spherical-image-to-BEV (SI2BEV) module that leverages spherical imagery and a refined sampling strategy to project features from 2D to 3D. Our approach also includes an innovative application of Focal Loss, specifically adapted to address the extreme class imbalance often encountered in BEV segmentation tasks. Through extensive experiments, we demonstrate that this application of Focal Loss significantly improves segmentation performance on the Dur360BEV dataset. The results show that our benchmark not only simplifies the sensor setup but also achieves competitive performance.
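
Note: the abstract above uses Focal Loss to counter class imbalance in BEV segmentation. The sketch below shows the standard binary focal loss for a single pixel, which down-weights pixels that are already classified well. It is the textbook form, not the paper's adapted variant, and the `gamma` and `alpha` defaults are common choices assumed here for illustration.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Standard binary focal loss for one prediction.

    p is the predicted foreground probability, y is the label (0 or 1).
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t)^gamma shrinks the loss of easy, well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the expression reduces to an alpha-weighted cross-entropy, which makes the effect of the focusing term easy to isolate.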

Title: Proteina: Scaling Flow-based Protein Structure Generative Models

Authors: Tomas Geffner, Kieran Didi, Zuobai Zhang, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, Karsten Kreis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00710
Pdf URL: https://arxiv.org/pdf/2503.00710
Copy Paste: [[2503.00710]] Proteina: Scaling Flow-based Protein Structure Generative Models(https://arxiv.org/abs/2503.00710)
Keywords: generation, generative
Abstract: Recently, diffusion- and flow-based generative models of protein structures have emerged as a powerful tool for de novo protein design. Here, we develop Proteina, a new large-scale flow-based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to 5x as many parameters as previous models. To meaningfully quantify performance, we introduce a new set of metrics that directly measure the distributional similarity of generated proteins with reference sets, complementing existing metrics. We further explore scaling training data to millions of synthetic protein structures and explore improved training and sampling recipes adapted to protein backbone generation. This includes fine-tuning strategies like LoRA for protein backbones, new guidance methods like classifier-free guidance and autoguidance for protein backbones, and new adjusted training objectives. Proteina achieves state-of-the-art performance on de novo protein backbone design and produces diverse and designable proteins at unprecedented length, up to 800 residues. The hierarchical conditioning offers novel control, enabling high-level secondary-structure guidance as well as low-level fold-specific generation.
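
Note: the abstract above mentions classifier-free guidance for protein backbones. In its common form, guidance blends a conditional and an unconditional model prediction at every sampling step. The sketch below shows that blend on plain lists. The function name and the weight parametrisation are assumptions for illustration, and some papers write the same idea as `(1 + w) * cond - w * uncond`.

```python
def cfg_combine(uncond, cond, guidance_weight):
    """Blend unconditional and conditional predictions element-wise.

    guidance_weight = 0 returns the unconditional prediction, 1 returns the
    conditional one, and values above 1 extrapolate past the conditional one.
    """
    return [u + guidance_weight * (c - u) for u, c in zip(uncond, cond)]
```

In a flow-based model such as the one described, the two inputs would be predicted velocity fields; in a diffusion model they would be noise or score predictions.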

Title: OpenECG: Benchmarking ECG Foundation Models with Public 1.2 Million Records

Authors: Zhijiang Wan, Qianhao Yu, Jia Mao, Wenfeng Duan, Cheng Ding
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00711
Pdf URL: https://arxiv.org/pdf/2503.00711
Copy Paste: [[2503.00711]] OpenECG: Benchmarking ECG Foundation Models with Public 1.2 Million Records(https://arxiv.org/abs/2503.00711)
Keywords: generative
Abstract: This study introduces OpenECG, a large-scale benchmark of 1.2 million 12-lead ECG recordings from nine centers, to evaluate ECG foundation models (ECG-FMs) trained on public datasets. We investigate three self-supervised learning methods (SimCLR, BYOL, MAE) with ResNet-50 and Vision Transformer architectures, assessing model generalization through leave-one-dataset-out experiments and data scaling analysis. Results show that pre-training on diverse datasets significantly improves generalization, with BYOL and MAE outperforming SimCLR, highlighting the efficacy of feature-consistency and generative learning over contrastive approaches. Data scaling experiments reveal that performance saturates at 60-70% of total data for BYOL and MAE, while SimCLR requires more data. These findings demonstrate that publicly available ECG data can match or surpass proprietary datasets in training robust ECG-FMs, paving the way for scalable, clinically meaningful AI-driven ECG analysis.
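
Note: the leave-one-dataset-out protocol in the abstract above holds out one source dataset for testing and trains on all the others, once per dataset. A minimal sketch of the split logic follows. The function name and the dictionary layout (dataset name mapped to a list of records) are assumptions for illustration.

```python
def leave_one_dataset_out(datasets):
    """Yield (held_out_name, train_records, test_records) once per dataset.

    datasets maps a dataset name to its list of records.
    """
    for held_out in datasets:
        # Train on every record that does not belong to the held-out dataset.
        train = [
            record
            for name, records in datasets.items()
            if name != held_out
            for record in records
        ]
        yield held_out, train, list(datasets[held_out])
```

Evaluating on a centre the model never saw in training is what lets the protocol measure cross-site generalization rather than in-distribution accuracy.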

Title: Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

Authors: Dilxat Muhtar, Enzhuo Zhang, Zhenshi Li, Feng Gu, Yanglangxing He, Pengfeng Xiao, Xueliang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00743
Pdf URL: https://arxiv.org/pdf/2503.00743
Copy Paste: [[2503.00743]] Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models(https://arxiv.org/abs/2503.00743)
Keywords: quality assessment
Abstract: Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantic understanding. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike natural images, RS lacks large-scale interleaved image-text pairs from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior interpretation accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks.
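
Note: the abstract above uses a learned score in two ways: keeping the top 30% of training pairs, and picking the best of N candidate outputs at test time. Both reduce to ranking by a scoring function, as sketched below. The function names are hypothetical, and `score_fn` stands in for the paper's learned score model, which is not reproduced here.

```python
def select_top_fraction(samples, score_fn, fraction=0.3):
    """Keep the highest-scoring fraction of samples (at least one)."""
    keep = max(1, round(len(samples) * fraction))
    return sorted(samples, key=score_fn, reverse=True)[:keep]


def best_of_n(candidates, score_fn):
    """Return the single candidate with the highest score."""
    return max(candidates, key=score_fn)
```

For example, with ten samples and the default fraction, `select_top_fraction` keeps the three highest-scoring items.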

Title: MR-EIT: Multi-Resolution Reconstruction for Electrical Impedance Tomography via Data-Driven and Unsupervised Dual-Mode Neural Networks

Authors: Fangming Shi, Jinzhen Liu, Xiangqian Meng, Yapeng Zhou, Hui Xiong
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.00762
Pdf URL: https://arxiv.org/pdf/2503.00762
Copy Paste: [[2503.00762]] MR-EIT: Multi-Resolution Reconstruction for Electrical Impedance Tomography via Data-Driven and Unsupervised Dual-Mode Neural Networks(https://arxiv.org/abs/2503.00762)
Keywords: super-resolution
Abstract: This paper presents a multi-resolution reconstruction method for Electrical Impedance Tomography (EIT), referred to as MR-EIT, which is capable of operating in both supervised and unsupervised learning modes. MR-EIT integrates an ordered feature extraction module and an unordered coordinate feature expression module. The former achieves the mapping from voltage to two-dimensional conductivity features through pre-training, while the latter realizes multi-resolution reconstruction independent of the order and size of the input sequence by utilizing symmetric functions and local feature extraction mechanisms. In the data-driven mode, MR-EIT reconstructs high-resolution images from low-resolution data of finite element meshes through two stages of pre-training and joint training, and demonstrates excellent performance in simulation experiments. In the unsupervised learning mode, MR-EIT does not require pre-training data and performs iterative optimization solely based on measured voltages to rapidly achieve image reconstruction from low to high resolution. It shows robustness to noise and efficient super-resolution reconstruction capabilities in both simulation and real water tank experiments. Experimental results indicate that MR-EIT outperforms the comparison methods in terms of Structural Similarity (SSIM) and Relative Image Error (RIE), especially in the unsupervised learning mode, where it can significantly reduce the number of iterations and improve image reconstruction quality.
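
Note: the abstract above attributes independence from the order and size of the input sequence to symmetric functions. The toy sketch below illustrates the general idea: encode each coordinate separately, then aggregate with a symmetric operation such as an element-wise maximum. The hand-written `encode_point` features are a hypothetical stand-in for a learned encoder and are not MR-EIT's actual module.

```python
def encode_point(point):
    """Toy per-coordinate feature vector (stand-in for a learned encoder)."""
    x, y = point
    return [x, y, x * y, x * x + y * y]


def pooled_feature(points):
    """Aggregate per-point features with an element-wise max.

    Max is a symmetric function, so the result does not depend on the order
    of the points, and its length does not depend on how many there are.
    """
    features = [encode_point(p) for p in points]
    return [max(column) for column in zip(*features)]
```

Because the pooled vector has a fixed length regardless of how many coordinates are queried, the same network can be evaluated on meshes of different resolutions.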

Title: Evaluating and Predicting Distorted Human Body Parts for Generated Images

Authors: Lu Ma, Kaibo Cao, Hao Liang, Jiaxin Lin, Zhuang Li, Yuhong Liu, Jihong Zhang, Wentao Zhang, Bin Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00811
Pdf URL: https://arxiv.org/pdf/2503.00811
Copy Paste: [[2503.00811]] Evaluating and Predicting Distorted Human Body Parts for Generated Images(https://arxiv.org/abs/2503.00811)
Keywords: quality assessment
Abstract: Recent advancements in text-to-image (T2I) models enable high-quality image synthesis, yet generating anatomically accurate human figures remains challenging. AI-generated images frequently exhibit distortions such as proliferated limbs, missing fingers, deformed extremities, or fused body parts. Existing evaluation metrics like Inception Score (IS) and Fréchet Inception Distance (FID) lack the granularity to detect these distortions, while human preference-based metrics focus on abstract quality assessments rather than anatomical fidelity. To address this gap, we establish the first standards for identifying human body distortions in AI-generated images and introduce Distortion-5K, a comprehensive dataset comprising 4,700 annotated images of normal and malformed human figures across diverse styles and distortion types. Based on this dataset, we propose ViT-HD, a Vision Transformer-based model tailored for detecting human body distortions in AI-generated images, which outperforms state-of-the-art segmentation models and visual language models, achieving an F1 score of 0.899 and IoU of 0.831 on distortion localization. Additionally, we construct the Human Distortion Benchmark with 500 human-centric prompts to evaluate four popular T2I models using trained ViT-HD, revealing that nearly 50% of generated images contain distortions. This work pioneers a systematic approach to evaluating anatomical accuracy in AI-generated humans, offering tools to advance the fidelity of T2I models and their real-world applicability. The Distortion-5K dataset and the trained ViT-HD model will soon be released in our GitHub repository: this https URL.
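
Note: the abstract above reports F1 and IoU for distortion localization. For binary masks, both follow from pixel-level counts of true positives, false positives, and false negatives, as sketched below. The function name and the pixel-level protocol are assumptions; the abstract does not state exactly how the paper aggregates these scores.

```python
def mask_iou_and_f1(pred, target):
    """Pixel-level IoU and F1 for two binary masks given as nested lists."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # distorted pixel correctly flagged
            elif p and not t:
                fp += 1  # normal pixel wrongly flagged
            elif t and not p:
                fn += 1  # distorted pixel missed
    union = tp + fp + fn
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as a perfect match
    return tp / union, 2 * tp / (2 * tp + fp + fn)
```

F1 is never lower than IoU on the same counts, which is consistent with the reported 0.899 F1 sitting above the 0.831 IoU.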

Title: Foundation Models Secretly Understand Neural Network Weights: Enhancing Hypernetwork Architectures with Foundation Models

Authors: Jeffrey Gu, Serena Yeung-Levy
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.00838
Pdf URL: https://arxiv.org/pdf/2503.00838
Copy Paste: [[2503.00838]] Foundation Models Secretly Understand Neural Network Weights: Enhancing Hypernetwork Architectures with Foundation Models(https://arxiv.org/abs/2503.00838)
Keywords: generation
Abstract: Large pre-trained models, or foundation models, have shown impressive performance when adapted to a variety of downstream tasks, often out-performing specialized models. Hypernetworks, neural networks that generate some or all of the parameters of another neural network, have become an increasingly important technique for conditioning and generalizing implicit neural representations (INRs), which represent signals or objects such as audio or 3D shapes using a neural network. However, despite the potential benefits of incorporating foundation models in hypernetwork methods, this research direction has not been investigated, likely due to the dissimilarity of the weight generation task with other visual tasks. To address this gap, we (1) show how foundation models can improve hypernetworks with Transformer-based architectures, (2) provide an empirical analysis of the benefits of foundation models for hypernetworks through the lens of the generalizable INR task, showing that leveraging foundation models improves performance, generalizability, and data efficiency across a variety of algorithms and modalities. We also provide further analysis in examining the design space of foundation model-based hypernetworks, including examining the choice of foundation models, algorithms, and the effect of scaling foundation models.

Title: PSRGS:Progressive Spectral Residual of 3D Gaussian for High-Frequency Recovery

Authors: BoCheng Li, WenJuan Zhang, Bing Zhang, YiLing Yao, YaNing Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00848
Pdf URL: https://arxiv.org/pdf/2503.00848
Copy Paste: [[2503.00848]] PSRGS:Progressive Spectral Residual of 3D Gaussian for High-Frequency Recovery(https://arxiv.org/abs/2503.00848)
Keywords: restoration, generation
Abstract: 3D Gaussian Splatting (3D GS) achieves impressive results in novel view synthesis for small, single-object scenes through Gaussian ellipsoid initialization and adaptive density control. However, when applied to large-scale remote sensing scenes, 3D GS faces challenges: the point clouds generated by Structure-from-Motion (SfM) are often sparse, and the inherent smoothing behavior of 3D GS leads to over-reconstruction in high-frequency regions, which have detailed textures and color variations. This results in the generation of large, opaque Gaussian ellipsoids that cause gradient artifacts. Moreover, the simultaneous optimization of both geometry and texture may lead to densification of Gaussian ellipsoids at incorrect geometric locations, resulting in artifacts in other views. To address these issues, we propose PSRGS, a progressive optimization scheme based on spectral residual maps. Specifically, we create a spectral residual significance map to separate low-frequency and high-frequency regions. In the low-frequency region, we apply depth-aware and depth-smooth losses to initialize the scene geometry with a low threshold. For the high-frequency region, we use gradient features with a higher threshold to split and clone ellipsoids, refining the scene. The sampling rate is determined by feature responses and gradient loss. Finally, we introduce a pre-trained network that jointly computes perceptual loss from multiple views, ensuring accurate restoration of high-frequency details in both the geometry and color of the Gaussian ellipsoids. We conduct experiments on multiple datasets to assess the effectiveness of our method, which demonstrates competitive rendering quality, especially in recovering texture details in high-frequency regions.
摘要：3D高斯脱落（3D GS）通过高斯椭圆形初始化和自适应密度控制，在小型单对象场景的新型视图合成中取得了令人印象深刻的结果。但是，当应用于大规模遥感场景时，3D GS面对挑战：由结构 - 运动中（SFM）产生的点云通常很少，并且3D GS的固有平滑行为会导致高频区域中的过度重构，在这里具有详细的文本和色彩变化。这导致产生大型不透明的高斯椭圆形，引起梯度伪像。此外，几何和纹理的同时优化可能会导致在不正确的几何位置在高斯椭圆形的致密化，从而导致其他视图中的伪影。为了解决这些问题，我们提出了PSRG，这是一种基于光谱残差图的进行性优化方案。具体而言，我们创建一个光谱残差显着性图，以分离低频和高频区域。在低频区域，我们应用深度感知和深度平滑损失，以低阈值初始化场景几何形状。对于高频区域，我们使用具有较高阈值的梯度特征来分裂和克隆椭圆形，从而完善了场景。采样率由特征响应和梯度损失确定。最后，我们引入了一个预先训练的网络，该网络共同从多种视图中计算感知损失，以确保在高斯椭圆形几何形状和颜色中准确恢复高频细节。我们在多个数据集上进行实验，以评估我们方法的有效性，这表明了竞争性渲染质量，尤其是在恢复高频区域的纹理细节方面。

Title: Zero-Shot Head Swapping in Real-World Scenarios

Authors: Sohyun Jeong, Taewoong Kang, Hyojin Jang, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00861
Pdf URL: https://arxiv.org/pdf/2503.00861
Copy Paste: [[2503.00861]] Zero-Shot Head Swapping in Real-World Scenarios(https://arxiv.org/abs/2503.00861)
Keywords: generation
Abstract: With growing demand in media and social networks for personalized images, the need for advanced head-swapping techniques, integrating an entire head from the head image with the body from the body image, has increased. However, traditional head swapping methods heavily rely on face-centered cropped data with primarily frontal facing views, which limits their effectiveness in real-world applications. Additionally, their masking methods, designed to indicate regions requiring editing, are optimized for these types of datasets but struggle to achieve seamless blending in complex situations, such as when the original data includes features like long hair extending beyond the masked area. To overcome these limitations and enhance adaptability in diverse and complex scenarios, we propose a novel head swapping method, HID, that is robust to images including the full head and the upper body, and handles views ranging from frontal to side, while automatically generating context-aware masks. For automatic mask generation, we introduce the IOMask, which enables seamless blending of the head and body, effectively addressing integration challenges. We further introduce the hair injection module to capture hair details with greater precision. Our experiments demonstrate that the proposed approach achieves state-of-the-art performance in head swapping, providing visually consistent and realistic results across a wide range of challenging conditions.
摘要：随着对个性化图像的媒体和社交网络的需求不断增长，对先进的头部交换技术的需求增加了，将整个头部图像与身体形象的身体整合在一起。但是，传统的头部交换方法在很大程度上依赖于以面部为中心的裁剪数据，主要是面向面的视图，这限制了它们在现实世界中的有效性。此外，它们旨在指示需要编辑区域的掩蔽方法针对这些类型的数据集进行了优化，但很难在复杂的情况下实现无缝混合，例如原始数据包括诸如长发之类的功能，例如延伸到掩盖区域之外。为了克服这些局限性并增强了各种且复杂的场景中的适应性，我们提出了一种新颖的头交换方法，HID，对包括完整头部和上半身在内的图像以及从额叶到侧视图的手柄，同时自动生成上下文意识到的掩码。对于自动蒙版生成，我们引入了IOMASK，该IOMASK可以使头部和身体无缝混合，从而有效地应对整合挑战。我们进一步介绍了头发注射模块，以更精确地捕获头发细节。我们的实验表明，所提出的方法在交换中实现了最先进的表现，从而在各种挑战性的条件下提供了视觉一致和现实的结果。

Title: From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization

Authors: Chao Yuan, Guiwei Zhang, Changxiao Ma, Tianyi Zhang, Guanglin Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00938
Pdf URL: https://arxiv.org/pdf/2503.00938
Copy Paste: [[2503.00938]] From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization(https://arxiv.org/abs/2503.00938)
Keywords: generation, generative
Abstract: Person re-identification (ReID) aims to extract accurate identity representation features. However, during feature extraction, individual samples are inevitably affected by noise (background, occlusions, and model limitations). Considering that features from the same identity follow a normal distribution around identity centers after training, we propose a Training-Free Feature Centralization ReID framework (Pose2ID) by aggregating the same identity features to reduce individual noise and enhance the stability of identity representation, which preserves the feature's original distribution for subsequent strategies such as re-ranking. Specifically, to obtain samples of the same identity, we introduce two components: Identity-Guided Pedestrian Generation: by leveraging identity features to guide the generation process, we obtain high-quality images with diverse poses, ensuring identity consistency even in complex scenarios such as infrared, and this http URL Feature Centralization: it explores each sample's potential positive samples from its neighborhood. Experiments demonstrate that our generative model exhibits strong generalization capabilities and maintains high identity consistency. With the Feature Centralization framework, we achieve impressive performance even with an ImageNet pre-trained model without ReID training, reaching mAP/Rank-1 of 52.81/78.92 on Market1501. Moreover, our method sets new state-of-the-art results across standard, cross-modality, and occluded ReID tasks, showcasing strong adaptability.
摘要：人重新识别（REID）旨在提取准确的身份表示特征。但是，在特征提取过程中，单个样本不可避免地会受到噪声（背景，遮挡和模型限制）的影响。考虑到训练后同一身份的特征遵循身份中心周围的正态分布，我们提出了一个无训练的特征集中式REID框架（pose2ID），通过汇总相同的身份特征以降低单个噪声并增强身份表示的稳定性，该特征可以保留以下策略的特征原始分布，例如重新分配。具体来说，要获取具有相同身份的样本，我们介绍了两个组成部分：身份引导的行人生成：利用身份特征来指导生成过程，我们获得具有多样化姿势的高质量图像，即使在诸如Infrared等复杂的方案（例如Infrared）和此HTTP url集中式中的复杂场景中，也可以确保身份的一致性：它探索了每个样品的阳性样品，从而探索了每个样品的阳性样品。实验表明，我们的生成模型具有强大的概括能力并保持高标识一致性。借助功能集中式框架，即使没有REID训练的ImageNet预培训模型，我们也可以实现令人印象深刻的性能，在Market1501上达到了MAP/RANK-1的52.81/78.92。此外，我们的方法在标准，跨模式和遮挡的REID任务中设定了新的最新结果，展示了强大的适应性。
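
Illustrative note (not from the paper): the feature centralization idea in the abstract above, aggregating a sample's feature with features of likely same-identity neighbors to average out per-sample noise, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the cosine-similarity neighbor search, the plain summation, and the parameter `k` are all hypothetical choices made for illustration and are not claimed to match Pose2ID's actual aggregation rule.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def centralize(features, k=2):
    """Hypothetical centralization: add each feature to its k most similar
    neighbors (treated as potential positives) and re-normalize the sum."""
    out = []
    for i, f in enumerate(features):
        # Rank all other features by similarity to the current one.
        sims = sorted(
            ((cosine(f, g), j) for j, g in enumerate(features) if j != i),
            reverse=True,
        )
        neighbors = [features[j] for _, j in sims[:k]]
        # Element-wise sum of the feature and its selected neighbors.
        summed = [sum(vals) for vals in zip(f, *neighbors)]
        out.append(l2_normalize(summed))
    return out
```

In this toy setting, two noisy views of the same direction cancel each other's perturbations once summed, which is the intuition behind aggregating same-identity features.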

Title: Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think

Authors: Jie Tian, Xiaoye Qu, Zhenyi Lu, Wei Wei, Sichen Liu, Yu Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00948
Pdf URL: https://arxiv.org/pdf/2503.00948
Copy Paste: [[2503.00948]] Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think(https://arxiv.org/abs/2503.00948)
Keywords: generation
Abstract: Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. However, current I2V diffusion models (I2V-DMs) often produce videos with limited motion degrees or exhibit uncontrollable motion that conflicts with the textual condition. To address these limitations, we propose a novel Extrapolating and Decoupling framework, which introduces model merging techniques to the I2V domain for the first time. Specifically, our framework consists of three separate stages: (1) Starting with a base I2V-DM, we explicitly inject the textual condition into the temporal module using a lightweight, learnable adapter and fine-tune the integrated model to improve motion controllability. (2) We introduce a training-free extrapolation strategy to amplify the dynamic range of the motion, effectively reversing the fine-tuning process to enhance the motion degree significantly. (3) With the above two-stage models excelling in motion controllability and degree, we decouple the relevant parameters associated with each type of motion ability and inject them into the base I2V-DM. Since the I2V-DM handles different levels of motion controllability and dynamics at various denoising time steps, we adjust the motion-aware parameters accordingly over time. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of our framework over existing methods.
摘要：图像到视频（I2V）生成旨在根据给定的图像和条件（例如文本）合成视频剪辑。该任务的主要挑战在于同时产生自然动作，同时保留图像的原始外观。但是，当前的I2V扩散模型（I2V-DMS）通常会产生具有有限运动学位的视频，或表现出与文本条件相抵触的无法控制的运动。为了解决这些局限性，我们提出了一个新颖的推断和去耦框架，该框架首次将模型合并到I2V域。具体而言，我们的框架由三个单独的阶段组成：（1）从基础I2V-DM开始，我们使用轻质，可学习的适应器将文本条件显式地将文本条件注入时间模块，并微调集成模型以提高运动可控性。（2）我们引入了一种无训练的外推策略，以扩大运动的动态范围，从而有效地扭转了微调过程，以显着增强运动程度。（3）在上述两阶段模型中，我们在运动可控性和程度上都表现出色，我们将与每种类型的运动能力相关的相关参数解散，并将其注入基础I2V-DM。由于I2V-DM在各种脱索时间步骤下处理不同级别的运动可控性和动力学，因此我们会随着时间的推移相应地调整运动吸引参数。已经进行了广泛的定性和定量实验，以证明我们的框架优于现有方法。

Title: Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models

Authors: Xingzhuo Guo, Yu Zhang, Baixu Chen, Haoran Xu, Jianmin Wang, Mingsheng Long
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.00951
Pdf URL: https://arxiv.org/pdf/2503.00951
Copy Paste: [[2503.00951]] Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models(https://arxiv.org/abs/2503.00951)
Keywords: generation, generative
Abstract: Diffusion models have emerged as powerful generative frameworks by progressively adding noise to data through a forward process and then reversing this process to generate realistic samples. While these models have achieved strong performance across various tasks and modalities, their application to temporal predictive learning remains underexplored. Existing approaches treat predictive learning as a conditional generation problem, but often fail to fully exploit the temporal dynamics inherent in the data, leading to challenges in generating temporally coherent sequences. To address this, we introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Dynamical Diffusion explicitly models temporal transitions at each diffusion step, establishing dependencies on preceding states to better capture temporal dynamics. Through the reparameterization trick, Dynamical Diffusion achieves efficient training and inference similar to any standard diffusion model. Extensive experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks, filling a crucial gap in existing methodologies. Code is available at this repository: this https URL.
摘要：扩散模型通过通过远期过程逐步向数据添加噪声，然后逆转此过程以生成逼真的样本，从而成为强大的生成框架。尽管这些模型在各种任务和方式上都取得了强大的性能，但它们在时间预测学习中的应用仍未得到充实。现有方法将预测性学习视为有条件的生成问题，但通常无法完全利用数据中固有的时间动态，从而导致产生时间相干序列的挑战。为了解决这个问题，我们引入了动力扩散（Dydiff），这是一个理论上合理的框架，结合了时间上意识到的前进和反向过程。动力扩散在每个扩散步骤上明确模拟了时间跃迁，从而建立了对先前状态的依赖性，以更好地捕获时间动力学。通过重新聚集技巧，动态扩散可以实现有效的训练，并推断类似于任何标准扩散模型。跨科学时空预测，视频预测和时间序列的广泛实验表明，动态扩散始终提高时间预测任务中的性能，从而填补了现有方法中的重要空白。代码可在此存储库中找到：此HTTPS URL。

Title: Using Synthetic Images to Augment Small Medical Image Datasets

Authors: Minh H. Vu, Lorenzo Tronchin, Tufve Nyholm, Tommy Löfstedt
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00962
Pdf URL: https://arxiv.org/pdf/2503.00962
Copy Paste: [[2503.00962]] Using Synthetic Images to Augment Small Medical Image Datasets(https://arxiv.org/abs/2503.00962)
Keywords: generative
Abstract: Recent years have witnessed a growing academic and industrial interest in deep learning (DL) for medical imaging. To perform well, DL models require very large labeled datasets. However, most medical imaging datasets are small, with a limited number of annotated samples. They are usually small because delineating medical images is time-consuming and demanding for oncologists. There are various techniques that can be used to augment a dataset, for example, applying affine transformations or elastic transformations to available images, or adding synthetic images generated by a Generative Adversarial Network (GAN). In this work, we have developed a novel conditional variant of a current GAN method, the StyleGAN2, to generate multi-modal high-resolution medical images with the purpose of augmenting small medical imaging datasets with these synthetic images. We use the synthetic and real images from six datasets to train models for the downstream task of semantic segmentation. The quality of the generated medical images and the effect of this augmentation on the segmentation performance were evaluated afterward. Finally, the results indicate that the downstream segmentation models did not benefit from the generated images. Further work and analyses are required to establish how this augmentation affects the segmentation performance.
摘要：近年来，对医学成像的深度学习（DL）的学术和工业兴趣越来越大。为了表现良好，DL型号需要非常大的标记数据集。但是，大多数医学成像数据集很小，带注释的样本数量有限。它们很小的原因通常是因为描述的医学图像对肿瘤学家来说是耗时的和要求的。有多种技术可用于增强数据集，例如，将仿射转换或弹性转换应用于可用图像，或添加由生成对抗网络（GAN）生成的合成图像。在这项工作中，我们开发了一种新型的有条件变体，用于当前的GAN方法The stylegan2，以生成多模式的高分辨率医学图像，目的是用这些合成图像增强小型医学成像数据集。我们使用来自六个数据集的合成图像和真实图像来训练模型，以完成语义分割的下游任务。随后评估了生成的医学图像的质量以及这种增强对分割性能的影响。最后，结果表明下游分割模型并未受益于生成的图像。需要进一步的工作和分析来确定这种增强如何影响细分性能。

Title: Molecule Generation for Target Protein Binding with Hierarchical Consistency Diffusion Model

Authors: Guanlue Li, Chenran Jiang, Ziqi Gao, Yu Liu, Chenyang Liu, Jiean Chen, Yong Huang, Jia Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00975
Pdf URL: https://arxiv.org/pdf/2503.00975
Copy Paste: [[2503.00975]] Molecule Generation for Target Protein Binding with Hierarchical Consistency Diffusion Model(https://arxiv.org/abs/2503.00975)
Keywords: generation
Abstract: Effective generation of molecular structures, or new chemical entities, that bind to target proteins is crucial for lead identification and optimization in drug discovery. Despite advancements in atom- and motif-wise deep learning models for 3D molecular generation, current methods often struggle with validity and reliability. To address these issues, we develop the Atom-Motif Consistency Diffusion Model (AMDiff), utilizing a joint-training paradigm for multi-view learning. This model features a hierarchical diffusion architecture that integrates both atom- and motif-level views of molecules, allowing for comprehensive exploration of complementary information. By leveraging classifier-free guidance and incorporating binding site features as conditional inputs, AMDiff ensures robust molecule generation across diverse targets. Compared to existing approaches, AMDiff exhibits superior validity and novelty in generating molecules tailored to fit various protein pockets. Case studies targeting protein kinases, including Anaplastic Lymphoma Kinase (ALK) and Cyclin-dependent kinase 4 (CDK4), demonstrate the model's capability in structure-based de novo drug design. Overall, AMDiff bridges the gap between atom-view and motif-view drug discovery and speeds up the process of target-aware molecular generation.
摘要：与靶蛋白结合的有效产生的分子结构或新化学实体对于药物发现中的铅鉴定和优化至关重要。尽管在3D分子生成的原子和图案深度学习模型方面取得了进步，但当前方法通常在有效性和可靠性方面遇到困难。为了解决这些问题，我们利用联合培训范式来开发原子-MOTIF一致性扩散模型（AMDIFF）进行多视图学习。该模型具有分层扩散结构，该结构既集成了分子的原子和基序级别的视图，从而可以全面探索互补信息。通过利用无分类器的指导并将结合位点特征作为条件输入，AMDIFF确保分子在不同靶标之间产生强大的分子。与现有方法相比，AMDIFF在生成适合各种蛋白质口袋的分子方面具有出色的有效性和新颖性。靶向蛋白激酶的案例研究，包括变性淋巴瘤激酶（ALK）和细胞周期蛋白依赖性激酶4（CDK4），证明了该模型在基于结构的从头开始药物设计中的能力。总体而言，AMDIFF桥接了原子视图与图案视图发现之间的差距，并加快了目标感知分子产生的过程。

Title: Underdamped Diffusion Bridges with Applications to Sampling

Authors: Denis Blessing, Julius Berner, Lorenz Richter, Gerhard Neumann
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.01006
Pdf URL: https://arxiv.org/pdf/2503.01006
Copy Paste: [[2503.01006]] Underdamped Diffusion Bridges with Applications to Sampling(https://arxiv.org/abs/2503.01006)
Keywords: generative
Abstract: We provide a general framework for learning diffusion bridges that transport prior to target distributions. It includes existing diffusion models for generative modeling, but also underdamped versions with degenerate diffusion matrices, where the noise only acts in certain dimensions. Extending previous findings, our framework allows us to rigorously show that score matching in the underdamped case is indeed equivalent to maximizing a lower bound on the likelihood. Motivated by superior convergence properties and compatibility with sophisticated numerical integration schemes of underdamped stochastic processes, we propose underdamped diffusion bridges, where a general density evolution is learned rather than prescribed by a fixed noising process. We apply our method to the challenging task of sampling from unnormalized densities without access to samples from the target distribution. Across a diverse range of sampling problems, our approach demonstrates state-of-the-art performance, notably outperforming alternative methods, while requiring significantly fewer discretization steps and no hyperparameter tuning.
摘要：我们提供了一个通用框架，用于学习在目标分布之前运输的扩散桥。它包括用于生成建模的现有扩散模型，但还包括具有退化扩散矩阵的失业版本，其中噪声仅在某些维度上起作用。为了扩展以前的发现，我们的框架可以严格地表明，在不足的情况下，得分匹配确实相当于最大程度地提高可能性上的下限。我们提出了\ emph {不足的扩散桥}的上等收敛性能和与不足的随机过程的复杂数值整合方案的兼容性，在这些\ emph {不足的扩散桥}中，在其中学习了一般密度演化，而不是通过固定的no缩合过程进行规定。我们将我们的方法应用于来自非均衡密度采样的具有挑战性的任务，而无需从目标分布中访问样品。在各种采样问题中，我们的方法表明了最先进的性能，尤其优于替代方法，同时需要更少的离散步骤，并且没有超参数调整。

Title: MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations

Authors: Ziyang Zhang, Yang Yu, Yucheng Chen, Xulei Yang, Si Yong Yeo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01019
Pdf URL: https://arxiv.org/pdf/2503.01019
Copy Paste: [[2503.01019]] MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations(https://arxiv.org/abs/2503.01019)
Keywords: generation
Abstract: Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This gap hinders the model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learning. In this work, we propose MedUnifier, a unified VLP framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies, including image-text contrastive alignment, image-text matching and image-grounded text generation. Unlike traditional methods that rely on continuous visual representations, our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality by effectively leveraging discrete representations. Our framework's effectiveness is evidenced by the experiments on established benchmarks, including uni-modal tasks (supervised fine-tuning), cross-modal tasks (image-text retrieval and zero-shot image classification), and multi-modal tasks (medical report generation, image synthesis), where it achieves state-of-the-art performance across various tasks. MedUnifier also offers a highly adaptable tool for a wide range of language and vision tasks in healthcare, marking an advancement toward the development of a generalizable AI model for medical applications.
摘要：尽管视觉预训练（VLP）取得了重大进展，但当前的方法主要强调特征提取和跨模式理解，而对产生或转化视觉内容的关注有限。这个差距阻碍了该模型从文本提示中综合相干和新颖的视觉表示的能力，从而降低了多模式学习的有效性。在这项工作中，我们提出了Medunifier，这是一个针对医疗数据量身定制的统一VLP框架。 Medunifier无缝将文本接地的图像生成功能与多模式学习策略（包括图像文本对比度对齐，图像文本匹配和图像接地文本生成生成）相结合。与连续视觉表示的传统方法不同，我们的方法采用了视觉矢量量化，这不仅促进了更加凝聚力的学习策略，以实现跨模式理解，而且通过有效利用离散表示形式来提高多模式生成质量。对既定基准的实验，包括单i-模式任务（监督微调），跨模式任务（图像 - 文本检索和零照片分类）以及多模式任务（医疗报告生成，图像综合），它可以实现各种任务。 Medunifier还为医疗保健中的各种语言和视觉任务提供了高度适应性的工具，这标志着开发用于医疗应用的可推广的AI模型的进步。

Title: All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Authors: Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01067
Pdf URL: https://arxiv.org/pdf/2503.01067
Copy Paste: [[2503.01067]] All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning(https://arxiv.org/abs/2503.01067)
Keywords: generation
Abstract: From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g. human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on the dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, the combination of the ease of learning the relatively simple RM (verifier) from the preference data, coupled with the ability of the downstream RL procedure to then filter its search space to the subset of policies (generators) that are optimal for relatively simple verifiers is what leads to the superior performance of online FT.
摘要：从第一原理的角度来看，基础模型微调（FT）的最强结果是通过相对复杂的两阶段训练程序实现的。具体而言，在使用它作为下游强化学习（RL）过程的一部分之前，首先在某些数据集（例如人类偏好）上训练奖励模型（RM）（例如，人类偏好），而不是通过Offline最大值估计来直接优化数据集中的策略参数。实际上，从信息理论的角度来看，我们只能通过通过奖励模型来丢失信息，并且不能通过政策采样创建任何新信息。为了解释这一差异，我们通过理论和经验镜头仔细检查了关于RL在FT中的价值的几个假设。 Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, the combination of the ease of learning the relatively simple RM (verifier) from the preference data, coupled with the ability of the downstream RL procedure to then filter its search space to the subset of policies (generators) that are optimal for relatively simple verifiers is what leads to the superior performance of online FT.

Title: Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator

Authors: Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01103
Pdf URL: https://arxiv.org/pdf/2503.01103
Copy Paste: [[2503.01103]] Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator(https://arxiv.org/abs/2503.01103)
Keywords: generation, generative
Abstract: While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that bridges likelihood-based generative training and the GAN objective to bypass this fundamental constraint. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58 to new records of 1.30/0.97 on CIFAR-10/ImageNet-64 datasets, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256×256.
摘要：尽管基于可能性的生成模型，尤其是扩散和自回归模型，但在视觉生成方面取得了显着的保真度，但最大似然估计（MLE）目标固有地遭受了模式覆盖趋势，该趋势将生成质量限制在有限的模型容量下。在这项工作中，我们建议直接判别优化（DDO）作为一个统一的框架，该框架桥接了基于可能性的生成训练，并绕过了这一基本约束的GAN目标。我们的关键见解是使用可学习的目标模型和固定参考模型之间的似然比参数化歧视器，并与直接优先优化理念（DPO）绘制相似之处。与GAN不同，该参数化消除了生成器和歧视网络的联合训练的需求，从而使训练有素的模型直接，有效，有效地进行了训练模型，以超出MLE的限制。 DDO可以以自我播放方式进行迭代进行渐进模型的细化，每轮需要不到1％的训练训练时期。我们的实验通过显着推进先前的SOTA扩散模型EDM来证明DDO的有效性，从而将FID得分从1.79/1.58降低到CIFAR-10/ImagEnet-64数据集中的1.30/0.97的新记录，并始终如一地提高了无指南和CFG增强自动型号的256 $ 256 $ 256 $ 256 $ 256。
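
Illustrative note (not from the paper): the abstract above describes a discriminator that is parameterized implicitly by the likelihood ratio between a learnable target model and a fixed reference model. The sketch below shows one way such a construction can look, assuming the discriminator logit is the log-likelihood difference and assuming a standard binary cross-entropy GAN discriminator loss. The function names and the exact loss form are assumptions for illustration; the paper's actual objective may differ in weighting or scaling.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_logit(logp_target, logp_ref):
    """Assumed implicit discriminator logit: log p_theta(x) - log p_ref(x)."""
    return logp_target - logp_ref


def gan_style_loss(real_pairs, fake_pairs):
    """Binary cross-entropy discriminator loss on the implicit logit.

    Each pair is (log p_theta(x), log p_ref(x)). `real_pairs` come from data,
    `fake_pairs` from samples of the fixed reference model (an assumption).
    """
    # Real samples should receive a high logit: -log sigmoid(logit).
    real_term = -sum(
        log_sigmoid(implicit_logit(a, b)) for a, b in real_pairs
    ) / len(real_pairs)
    # Fake samples should receive a low logit: -log(1 - sigmoid(logit)).
    fake_term = -sum(
        log_sigmoid(-implicit_logit(a, b)) for a, b in fake_pairs
    ) / len(fake_pairs)
    return real_term + fake_term
```

Because only the target model's log-likelihood carries trainable parameters, minimizing this loss updates the generative model directly, with no separate discriminator network. When the target equals the reference, every logit is zero and the loss equals 2·log 2.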

Title: VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors

Authors: Juil Koo, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Minhyuk Sung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01107
Pdf URL: https://arxiv.org/pdf/2503.01107
Copy Paste: [[2503.01107]] VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors(https://arxiv.org/abs/2503.01107)
Keywords: generative
Abstract: Generative methods for image and video editing use generative models as priors to perform edits despite incomplete information, such as changing the composition of 3D objects shown in a single image. Recent methods have shown promising composition editing results in the image setting, but in the video setting, editing methods have focused on editing an object's appearance and motion, or camera motion, and as a result, methods to edit object composition in videos are still missing. We propose VideoHandles as a method for editing 3D object compositions in videos of static scenes with camera motion. Our approach allows editing the 3D position of a 3D object across all frames of a video in a temporally consistent manner. This is achieved by lifting intermediate features of a generative model to a 3D reconstruction that is shared between all frames, editing the reconstruction, and projecting the features on the edited reconstruction back to each frame. To the best of our knowledge, this is the first generative approach to edit object compositions in videos. Our approach is simple and training-free, while outperforming state-of-the-art image editing baselines.
摘要：图像和视频编辑的生成方法使用生成模型作为先验，尽管信息不完整，例如更改单个图像中显示的3D对象的组成。最近的方法显示了图像设置中有希望的构图编辑结果，但是在视频设置中，编辑方法集中在编辑对象的外观和运动或相机运动，因此，仍然缺少视频中编辑对象组成的方法。我们将\名称作为一种用相机运动的静态场景视频中编辑3D对象组成的方法。我们的方法允许以时间一致的方式在视频的所有帧中编辑3D对象的3D位置。这是通过将生成模型的中间特征提升到所有帧之间共享，编辑重建并将其在编辑的重建上投影回到每个帧上的3D重建的中间特征来实现的。据我们所知，这是在视频中编辑对象组成的第一种生成方法。我们的方法简单且无训练，同时表现不佳。

Title: WeGen: A Unified Model for Interactive Multimodal Generation as We Chat

Authors: Zhipeng Huang, Shaobin Zhuang, Canmiao Fu, Binxin Yang, Ying Zhang, Chong Sun, Zhizheng Zhang, Yali Wang, Chen Li, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01115
Pdf URL: https://arxiv.org/pdf/2503.01115
Copy Paste: [[2503.01115]] WeGen: A Unified Model for Interactive Multimodal Generation as We Chat(https://arxiv.org/abs/2503.01115)
Keywords: generation, generative
Abstract: Existing multimodal generative models fall short as qualified design copilots, as they often struggle to generate imaginative outputs once instructions are less detailed or lack the ability to maintain consistency with the provided references. In this work, we introduce WeGen, a model that unifies multimodal generation and understanding, and promotes their interplay in iterative generation. It can generate diverse results with high creativity for less detailed instructions. It can also progressively refine prior generation results or integrate specific contents from references, following the instructions in its chat with users. During this process, it is capable of preserving consistency in the parts that the user is already satisfied with. To this end, we curate a large-scale dataset, extracted from Internet videos, containing rich object dynamics and dynamics descriptions auto-labeled by advanced foundation models. These two kinds of information are interleaved into a single sequence to enable WeGen to learn consistency-aware generation, where the specified dynamics are generated in line with the instructions while the consistency of unspecified content is preserved. Besides, we introduce a prompt self-rewriting mechanism to enhance generation diversity. Extensive experiments demonstrate the effectiveness of unifying multimodal understanding and generation in WeGen and show it achieves state-of-the-art performance across various visual generation benchmarks. These also demonstrate the potential of WeGen as a user-friendly design copilot as desired. The code and models will be available at this https URL.
摘要：现有的多模式生成模型作为合格的设计副驾驶不足，因为一旦指令较少详细或缺乏与所提供的参考保持一致性的能力，它们通常很难产生想象力的输出。在这项工作中，我们介绍了Wegen，该模型统一了多模式的生成和理解，并促进了它们在迭代生成中的相互作用。它可以产生具有较高创造力的多样化结果，以较少详细的说明。它可以逐步完善上一代结果，或按照与用户聊天的说明中的参考文献中整合特定内容。在此过程中，它能够保留用户已经满足的零件的一致性。为此，我们策划了一个从Internet视频中提取的大规模数据集，该数据集包含丰富的对象动态和迄今为止先进的基础模型的自动标记的动力学描述。这两个信息交织成单个序列，以使Wegen能够学习一致性感知的生成，其中生成了指定的动力学，而未指定内容的一致性则与指令保持一致。此外，我们引入了一种迅速的自我屈服机制，以增强产生多样性。广泛的实验表明，统一韦格（Wegen）的多模式理解和产生的有效性，并表明它在各种视觉生成基准中实现了最先进的性能。这些还表明了Wegen作为所需的用户友好设计的副标士的潜力。代码和模型将在此HTTPS URL上可用。

Title: ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization

Authors: Shizhan Liu, Hao Zheng, Hang Yu, Jianguo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01122
Pdf URL: https://arxiv.org/pdf/2503.01122
Copy Paste: [[2503.01122]] ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization(https://arxiv.org/abs/2503.01122)
Keywords: generation
Abstract: Image personalization has garnered attention for its ability to customize Text-to-Image generation using only a few reference images. However, a key challenge in image personalization is the issue of conceptual coupling, where the limited number of reference images leads the model to form unwanted associations between the personalization target and other concepts. Current methods attempt to tackle this issue indirectly, leading to a suboptimal balance between text control and personalization fidelity. In this paper, we take a direct approach to the concept coupling problem through statistical analysis, revealing that it stems from two distinct sources of dependence discrepancies. We therefore propose two complementary plug-and-play loss functions: Denoising Decouple Loss and Prior Decouple Loss, each designed to minimize one type of dependence discrepancy. Extensive experiments demonstrate that our approach achieves a superior trade-off between text control and personalization fidelity.
摘要：图像个性化由于仅使用几个参考图像自定义文本到图像生成的能力而引起了人们的关注。但是，图像个性化的一个关键挑战是概念耦合的问题，其中有限的参考图像导致模型在个性化目标和其他概念之间形成不必要的关联。当前的方法试图间接解决此问题，从而导致文本控制和个性化保真度之间的次优平衡。在本文中，我们通过统计分析对概念耦合问题采取了直接方法，表明它源于依赖差异的两个不同的来源。因此，我们提出了两个互补的插件损失函数：将脱致损失和先前的切换损失造成，每种损失旨在最大程度地减少一种类型的依赖性差异。广泛的实验表明，我们的方法在文本控制和个性化忠诚度之间取得了较高的权衡。

Title: CoInD: Enabling Logical Compositions in Diffusion Models

Authors: Sachit Gaudi, Gautam Sreekumar, Vishnu Boddeti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01145
Pdf URL: https://arxiv.org/pdf/2503.01145
Copy Paste: [[2503.01145]] CoInD: Enabling Logical Compositions in Diffusion Models(https://arxiv.org/abs/2503.01145)
Keywords: generation, generative
Abstract: How can we learn generative models to sample data with arbitrary logical compositions of statistically independent attributes? The prevailing solution is to sample from distributions expressed as a composition of attributes' conditional marginal distributions under the assumption that they are statistically independent. This paper shows that standard conditional diffusion models violate this assumption, even when all attribute compositions are observed during training. Moreover, this violation is significantly more severe when only a subset of the compositions is observed. We propose CoInD to address this problem. It explicitly enforces statistical independence between the conditional marginal distributions by minimizing Fisher's divergence between the joint and marginal distributions. The theoretical advantages of CoInD are reflected in both qualitative and quantitative experiments, demonstrating a significantly more faithful and controlled generation of samples for arbitrary logical compositions of attributes. The benefit is more pronounced in the scenarios that current solutions, which rely on the assumption of conditionally independent marginals, struggle with: logical compositions involving the NOT operation, and training in which only a subset of compositions is observed.
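Note on the CoInD abstract: the composition rule it refers to can be written out explicitly. The following is a sketch in our own notation, not taken from the paper. It assumes two attributes $c_1$ and $c_2$ that are conditionally independent given the sample $x$, i.e. $p(c_1, c_2 \mid x) = p(c_1 \mid x)\, p(c_2 \mid x)$. Applying Bayes' rule to each factor gives

$$
p(x \mid c_1, c_2) \;=\; \frac{p(x)\, p(c_1 \mid x)\, p(c_2 \mid x)}{p(c_1, c_2)} \;\propto\; \frac{p(x \mid c_1)\, p(x \mid c_2)}{p(x)},
$$

and, taking the gradient of the logarithm with respect to $x$ (the normalizing constants drop out),

$$
\nabla_x \log p(x \mid c_1, c_2) \;=\; \nabla_x \log p(x \mid c_1) \;+\; \nabla_x \log p(x \mid c_2) \;-\; \nabla_x \log p(x).
$$

Under this assumption, a joint conditional score can be assembled from per-attribute conditional scores and the unconditional score. If the learned model does not satisfy the independence assumption, the right-hand side no longer equals the true joint score, which is the failure mode the abstract describes.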

Title: Split Gibbs Discrete Diffusion Posterior Sampling

Authors: Wenda Chu, Yang Song, Yisong Yue
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.01161
Pdf URL: https://arxiv.org/pdf/2503.01161
Copy Paste: [[2503.01161]] Split Gibbs Discrete Diffusion Posterior Sampling(https://arxiv.org/abs/2503.01161)
Keywords: generation
Abstract: We study the problem of posterior sampling in discrete-state spaces using discrete diffusion models. While posterior sampling methods for continuous diffusion models have achieved remarkable progress, analogous methods for discrete diffusion models remain challenging. In this work, we introduce a principled plug-and-play discrete diffusion posterior sampling algorithm based on split Gibbs sampling, which we call SG-DPS. Our algorithm enables reward-guided generation and solving inverse problems in discrete-state spaces. We demonstrate that SG-DPS converges to the true posterior distribution on synthetic benchmarks, and enjoys state-of-the-art posterior sampling performance on a range of benchmarks for discrete data, achieving up to 2x improved performance compared to existing baselines.

Title: Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data

Authors: Haoxin Li, Boyang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01167
Pdf URL: https://arxiv.org/pdf/2503.01167
Copy Paste: [[2503.01167]] Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data(https://arxiv.org/abs/2503.01167)
Keywords: generation, generative
Abstract: Despite impressive advancements in various multimodal tasks, vision-language models (VLMs) still struggle with compositional understanding due to limited exposure to training samples that contain subtle variations within paired examples. With advances in multimodal generative models, a natural solution is to generate synthetic samples with subtle variations for training VLMs. However, generating and training on synthetic samples with subtle variations presents two challenges: difficulty in accurately creating precise variations and inconsistency in cross-modal alignment quality. To address these challenges, we propose SVD-GT (Subtle Variation Data Generation and Training), which integrates image feature injection into a text-to-image generative model to enhance the quality of synthetic variations and employs an adaptive margin loss to differentiate samples using adaptive margins, which help filter out potentially incorrect synthetic samples and focus the learning on informative hard samples. Evaluations on four compositional understanding benchmarks demonstrate that SVD-GT significantly improves the compositionality of VLMs, boosting the average accuracy of CLIP by over 8% across all benchmarks and outperforming state-of-the-art methods by 2% on three benchmarks.

Title: HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation

Authors: Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, Yanwei Fu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.01175
Pdf URL: https://arxiv.org/pdf/2503.01175
Copy Paste: [[2503.01175]] HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation(https://arxiv.org/abs/2503.01175)
Keywords: generation
Abstract: Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication, which have attracted increasing attention in multimodal research. While the existing methods have made strides in gesture accuracy, challenges remain in generating diverse and coherent gestures, as most approaches assume independence among multimodal inputs and lack explicit modeling of their interactions. In this work, we propose a novel multimodal learning method named HOP for co-speech gesture generation that captures the heterogeneous entanglement between gesture motion, audio rhythm, and text semantics, enabling the generation of coordinated gestures. By leveraging spatiotemporal graph modeling, we achieve the alignment of audio and action. Moreover, to enhance modality coherence, we build the audio-text semantic representation based on a reprogramming module, which is beneficial for cross-modality adaptation. Our approach enables the trimodal system to learn each other's features and represent them in the form of topological entanglement. Extensive experiments demonstrate that HOP achieves state-of-the-art performance, offering more natural and expressive co-speech gesture generation. More information, codes, and demos are available here: this https URL

Title: DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution

Authors: Xingyuan Li, Zirui Wang, Yang Zou, Zhixin Chen, Jun Ma, Zhiying Jiang, Long Ma, Jinyuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01187
Pdf URL: https://arxiv.org/pdf/2503.01187
Copy Paste: [[2503.01187]] DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution(https://arxiv.org/abs/2503.01187)
Keywords: super-resolution
Abstract: Infrared imaging is essential for autonomous driving and robotic operations as a supportive modality due to its reliable performance in challenging environments. Despite its popularity, the limitations of infrared cameras, such as low spatial resolution and complex degradations, consistently challenge imaging quality and subsequent visual tasks. Hence, infrared image super-resolution (IISR) has been developed to address this challenge. While recent developments in diffusion models have greatly advanced this field, current methods either ignore the unique modal characteristics of infrared imaging or overlook the machine perception requirements. To bridge these gaps, we propose DifIISR, an infrared image super-resolution diffusion model optimized for visual quality and perceptual performance. Our approach achieves task-based guidance for diffusion by injecting gradients derived from visual and perceptual priors into the noise during the reverse process. Specifically, we introduce an infrared thermal spectrum distribution regulation to preserve visual fidelity, ensuring that the reconstructed infrared images closely align with high-resolution images by matching their frequency components. Subsequently, we incorporate various visual foundational models as the perceptual guidance for downstream visual tasks, infusing generalizable perceptual features beneficial for detection and segmentation. As a result, our approach gains superior visual results while attaining state-of-the-art downstream task performance. Code is available at this https URL

Title: Enhancing Retinal Vessel Segmentation Generalization via Layout-Aware Generative Modelling

Authors: Jonathan Fhima, Jan Van Eijgen, Lennert Beeckmans, Thomas Jacobs, Moti Freiman, Luis Filipe Nakayama, Ingeborg Stalmans, Chaim Baskin, Joachim A. Behar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01190
Pdf URL: https://arxiv.org/pdf/2503.01190
Copy Paste: [[2503.01190]] Enhancing Retinal Vessel Segmentation Generalization via Layout-Aware Generative Modelling(https://arxiv.org/abs/2503.01190)
Keywords: generation, generative
Abstract: Generalization in medical segmentation models is challenging due to limited annotated datasets and imaging variability. To address this, we propose Retinal Layout-Aware Diffusion (RLAD), a novel diffusion-based framework for generating controllable layout-aware images. RLAD conditions image generation on multiple key layout components extracted from real images, ensuring high structural fidelity while enabling diversity in other components. Applied to retinal fundus imaging, we augmented the training datasets by synthesizing paired retinal images and vessel segmentations conditioned on extracted blood vessels from real images, while varying other layout components such as lesions and the optic disc. Experiments demonstrated that RLAD-generated data improved generalization in retinal vessel segmentation by up to 8.1%. Furthermore, we present REYIA, a comprehensive dataset comprising 586 manually segmented retinal images. To foster reproducibility and drive innovation, both our code and dataset will be made publicly accessible.

Title: A Multi-Sensor Fusion Approach for Rapid Orthoimage Generation in Large-Scale UAV Mapping

Authors: Jialei He, Zhihao Zhan, Zhituo Tu, Xiang Zhu, Jie Yuan
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2503.01202
Pdf URL: https://arxiv.org/pdf/2503.01202
Copy Paste: [[2503.01202]] A Multi-Sensor Fusion Approach for Rapid Orthoimage Generation in Large-Scale UAV Mapping(https://arxiv.org/abs/2503.01202)
Keywords: generation
Abstract: Rapid generation of large-scale orthoimages from Unmanned Aerial Vehicles (UAVs) has been a long-standing focus of research in the field of aerial mapping. A multi-sensor UAV system, integrating the Global Positioning System (GPS), Inertial Measurement Unit (IMU), 4D millimeter-wave radar and camera, can provide an effective solution to this problem. In this paper, we utilize multi-sensor data to overcome the limitations of conventional orthoimage generation methods in terms of temporal performance, system robustness, and geographic reference accuracy. A prior-pose-optimized feature matching method is introduced to enhance matching speed and accuracy, reducing the number of required features and providing precise references for the Structure from Motion (SfM) process. The proposed method exhibits robustness in low-texture scenes like farmlands, where feature matching is difficult. Experiments show that our approach achieves accurate feature matching and orthoimage generation in a short time. The proposed drone system effectively aids in farmland detection and management.

Title: Architectural and Inferential Inductive Biases For Exchangeable Sequence Modeling

Authors: Daksh Mittal, Ang Li, Tzu-Ching Yen, Daniel Guetta, Hongseok Namkoong
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.01215
Pdf URL: https://arxiv.org/pdf/2503.01215
Copy Paste: [[2503.01215]] Architectural and Inferential Inductive Biases For Exchangeable Sequence Modeling(https://arxiv.org/abs/2503.01215)
Keywords: generation
Abstract: Autoregressive models have emerged as a powerful framework for modeling exchangeable sequences - i.i.d. observations when conditioned on some latent factor - enabling direct modeling of uncertainty from missing data (rather than a latent). Motivated by the critical role posterior inference plays as a subroutine in decision-making (e.g., active learning, bandits), we study the inferential and architectural inductive biases that are most effective for exchangeable sequence modeling. For the inference stage, we highlight a fundamental limitation of the prevalent single-step generation approach: inability to distinguish between epistemic and aleatoric uncertainty. Instead, a long line of works in Bayesian statistics advocates for multi-step autoregressive generation; we demonstrate this "correct approach" enables superior uncertainty quantification that translates into better performance on downstream decision-making tasks. This naturally leads to the next question: which architectures are best suited for multi-step inference? We identify a subtle yet important gap between recently proposed Transformer architectures for exchangeable sequences (Muller et al., 2022; Nguyen & Grover, 2022; Ye & Namkoong, 2024), and prove that they in fact cannot guarantee exchangeability despite introducing significant computational overhead. We illustrate our findings using controlled synthetic settings, demonstrating how custom architectures can significantly underperform standard causal masks, underscoring the need for new architectural innovations.

Title: Tera-MIND: Tera-scale mouse brain simulation via spatial mRNA-guided diffusion

Authors: Jiqing Wu, Ingrid Berg, Yawei Li, Ender Konukoglu, Viktor H. Koelzer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01220
Pdf URL: https://arxiv.org/pdf/2503.01220
Copy Paste: [[2503.01220]] Tera-MIND: Tera-scale mouse brain simulation via spatial mRNA-guided diffusion(https://arxiv.org/abs/2503.01220)
Keywords: generative
Abstract: Holistic 3D modeling of molecularly defined brain structures is crucial for understanding complex brain functions. Emerging tissue profiling technologies enable the construction of a comprehensive atlas of the mammalian brain with sub-cellular resolution and spatially resolved gene expression data. However, such tera-scale volumetric datasets present significant computational challenges in understanding complex brain functions within their native 3D spatial context. Here, we propose the novel generative approach Tera-MIND, which can simulate Tera-scale Mouse braINs in 3D using a patch-based and boundary-aware Diffusion model. Taking spatial transcriptomic data as the conditional input, we generate virtual mouse brains with comprehensive cellular morphological detail at teravoxel scale. Through the lens of 3D gene-gene self-attention, we identify spatial molecular interactions for key transcriptomic pathways in the murine brain, exemplified by glutamatergic and dopaminergic neuronal systems. Importantly, these in-silico biological findings are consistent and reproducible across three tera-scale virtual mouse brains. Therefore, Tera-MIND showcases a promising path toward efficient and generative simulations of whole organ systems for biomedical research. Project website: this http URL

Title: Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG

Authors: Wenbin Wang, Yongcheng Jing, Liang Ding, Yingjie Wang, Li Shen, Yong Luo, Bo Du, Dacheng Tao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.01222
Pdf URL: https://arxiv.org/pdf/2503.01222
Copy Paste: [[2503.01222]] Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG(https://arxiv.org/abs/2503.01222)
Keywords: generation
Abstract: High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To overcome the limitations of existing methods, this paper shifts away from prior dedicated heuristic approaches and revisits the most fundamental idea to HR perception by enhancing the long-context capability of MLLMs, driven by recent advances in long-context techniques like retrieval-augmented generation (RAG) for general LLMs. Towards this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on $V^*$ Bench and 19% on HR-Bench.

Title: Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text

Authors: Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Junteng Zhao, Yunming Ye, Kola Ye, Yao He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01261
Pdf URL: https://arxiv.org/pdf/2503.01261
Copy Paste: [[2503.01261]] Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text(https://arxiv.org/abs/2503.01261)
Keywords: generation
Abstract: Image quantization is a crucial technique in image generation, aimed at learning a codebook that encodes an image into a discrete token sequence. Recently, researchers have explored learning multi-modal codebooks (i.e., text-aligned codebooks) by utilizing image caption semantics, aiming to enhance codebook performance in cross-modal tasks. However, existing image-text paired datasets exhibit a notable flaw in that the text descriptions tend to be overly concise, failing to adequately describe the images and provide sufficient semantic knowledge, resulting in limited alignment of text and codebook at a fine-grained level. In this paper, we propose a novel Text-Augmented Codebook Learning framework, named TA-VQ, which generates longer text for each image using a visual-language model for improved text-aligned codebook learning. However, the long text presents two key challenges: how to encode text and how to align codebook and text. To tackle these two challenges, we propose to split the long text into multiple granularities for encoding, i.e., word, phrase, and sentence, so that the long text can be fully encoded without losing any key semantic knowledge. Following this, a hierarchical encoder and a novel sampling-based alignment strategy are designed to achieve fine-grained codebook-text alignment. Additionally, our method can be seamlessly integrated into existing VQ models. Extensive experiments in reconstruction and various downstream tasks demonstrate its effectiveness compared to previous state-of-the-art approaches.

Title: Reconciling Stochastic and Deterministic Strategies for Zero-shot Image Restoration using Diffusion Model in Dual

Authors: Chong Wang, Lanqing Guo, Zixuan Fu, Siyuan Yang, Hao Cheng, Alex C. Kot, Bihan Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01288
Pdf URL: https://arxiv.org/pdf/2503.01288
Copy Paste: [[2503.01288]] Reconciling Stochastic and Deterministic Strategies for Zero-shot Image Restoration using Diffusion Model in Dual(https://arxiv.org/abs/2503.01288)
Keywords: restoration, generative
Abstract: Plug-and-play (PnP) methods offer an iterative strategy for solving image restoration (IR) problems in a zero-shot manner, using a learned discriminative denoiser as the implicit prior. More recently, a sampling-based variant of this approach, which utilizes a pre-trained generative diffusion model, has gained great popularity for solving IR problems through stochastic sampling. The IR results using PnP with a pre-trained diffusion model demonstrate distinct advantages compared to those using discriminative denoisers, i.e., improved perceptual quality while sacrificing the data fidelity. The unsatisfactory results are due to the lack of integration of these strategies in the IR tasks. In this work, we propose a novel zero-shot IR scheme, dubbed Reconciling Diffusion Model in Dual (RDMD), which leverages only a single pre-trained diffusion model to construct two complementary regularizers. Specifically, the diffusion model in RDMD will iteratively perform deterministic denoising and stochastic sampling, aiming to achieve high-fidelity image restoration with appealing perceptual quality. RDMD also allows users to customize the distortion-perception tradeoff with a single hyperparameter, enhancing the adaptability of the restoration process in different practical scenarios. Extensive experiments on several IR tasks demonstrate that our proposed method could achieve superior results compared to existing approaches on both the FFHQ and ImageNet datasets.

Title: SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance

Authors: Peishan Cong, Ziyi Wang, Yuexin Ma, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01291
Pdf URL: https://arxiv.org/pdf/2503.01291
Copy Paste: [[2503.01291]] SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance(https://arxiv.org/abs/2503.01291)
Keywords: generation, generative
Abstract: Generating reasonable and high-quality human interactive motions in a given dynamic environment is crucial for understanding, modeling, transferring, and applying human behaviors to both virtual and physical robots. In this paper, we introduce an effective method, SemGeoMo, for dynamic contextual human motion generation, which fully leverages the text-affordance-joint multi-level semantic and geometric guidance in the generation process, improving the semantic rationality and geometric correctness of generative motions. Our method achieves state-of-the-art performance on three datasets and demonstrates superior generalization capability for diverse interaction scenarios. The project page and code can be found at this https URL.

Title: Fine-Grained Controllable Apparel Showcase Image Generation via Garment-Centric Outpainting

Authors: Rong Zhang, Jingnan Wang, Zhiwen Zuo, Jianfeng Dong, Wei Li, Chi Wang, Weiwei Xu, Xun Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01294
Pdf URL: https://arxiv.org/pdf/2503.01294
Copy Paste: [[2503.01294]] Fine-Grained Controllable Apparel Showcase Image Generation via Garment-Centric Outpainting(https://arxiv.org/abs/2503.01294)
Keywords: generation
Abstract: In this paper, we propose a novel garment-centric outpainting (GCO) framework based on the latent diffusion model (LDM) for fine-grained controllable apparel showcase image generation. The proposed framework aims at customizing a fashion model wearing a given garment via text prompts and facial images. Different from existing methods, our framework takes a garment image segmented from a dressed mannequin or a person as the input, eliminating the need for learning cloth deformation and ensuring faithful preservation of garment details. The proposed framework consists of two stages. In the first stage, we introduce a garment-adaptive pose prediction model that generates diverse poses given the garment. Then, in the next stage, we generate apparel showcase images, conditioned on the garment and the predicted poses, along with specified text prompts and facial images. Notably, a multi-scale appearance customization module (MS-ACM) is designed to allow both overall and fine-grained text-based control over the generated model's appearance. Moreover, we leverage a lightweight feature fusion operation without introducing any extra encoders or modules to integrate multiple conditions, which is more efficient. Extensive experiments validate the superior performance of our framework compared to state-of-the-art methods.

Title: MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation

Authors: Yi Wang, Mushui Liu, Wanggui He, Longxiang Zhang, Ziwei Huang, Guanghao Zhang, Fangxun Shu, Zhong Tao, Dong She, Zhelun Yu, Haoyuan Li, Weilong Dai, Mingli Song, Jie Song, Hao Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01298
Pdf URL: https://arxiv.org/pdf/2503.01298
Copy Paste: [[2503.01298]] MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation(https://arxiv.org/abs/2503.01298)
Keywords: generation, generative
Abstract: Unified generative models have demonstrated extraordinary performance in both text and image generation. However, they tend to underperform when generating intricate images with various interwoven conditions, a setting in which straightforward text-to-image generation alone is hard to rely on. In response to this challenge, we introduce MINT, an innovative unified generative model that is, for the first time, empowered with a native multimodal chain of thought (MCoT) for enhanced image generation. Firstly, we design Mixture of Transformer Experts (MTXpert), an expert-parallel structure that effectively supports both natural language generation (NLG) and visual capabilities, while avoiding potential modality conflicts that could hinder the full potential of each modality. Building on this, we propose an innovative MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection specifically designed to enhance image generation. This paradigm equips MINT with nuanced, element-wise decoupled alignment and a comprehensive understanding of textual and visual components. Furthermore, it fosters advanced multimodal reasoning and self-reflection, enabling the construction of images that are firmly grounded in the logical relationships between these elements. Notably, MINT has been validated to exhibit superior performance across multiple benchmarks for text-to-image (T2I) and image-to-text (I2T) tasks.

Title: CacheQuant: Comprehensively Accelerated Diffusion Models

Authors: Xuewen Liu, Zhikai Li, Qingyi Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01323
Pdf URL: https://arxiv.org/pdf/2503.01323
Copy Paste: [[2503.01323]] CacheQuant: Comprehensively Accelerated Diffusion Models(https://arxiv.org/abs/2503.01323)
Keywords: generative
Abstract: Diffusion models have gradually gained prominence in the field of image synthesis, showcasing remarkable generative capabilities. Nevertheless, the slow inference and complex networks, resulting from redundancy at both temporal and structural levels, hinder their low-latency applications in real-world scenarios. Current acceleration methods for diffusion models focus separately on temporal and structural levels. However, independent optimization at each level to further push the acceleration limits results in significant performance degradation. On the other hand, integrating optimizations at both levels can compound the acceleration effects. Unfortunately, we find that the optimizations at these two levels are not entirely orthogonal. Performing separate optimizations and then simply integrating them results in unsatisfactory performance. To tackle this issue, we propose CacheQuant, a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques. Specifically, we employ a dynamic programming approach to determine the optimal cache schedule, in which the properties of caching and quantization are carefully considered to minimize errors. Additionally, we propose decoupled error correction to further mitigate the coupled and accumulated errors step by step. Experimental results show that CacheQuant achieves a 5.18× speedup and 4× compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in CLIP score. Our code is open-sourced: this https URL.
摘要：扩散模型在图像合成领域逐渐获得了突出，展示了显着的生成能力。然而，由于时间和结构层面的冗余，推理缓慢和复杂的网络阻碍了其在现实情况下的低延迟应用。扩散模型的当前加速方法分别集中在时间和结构水平上。但是，在每个级别上的独立优化以进一步推动加速度限制了限制的绩效降低。另一方面，在两个级别上集成优化可以使加速度效应加剧。不幸的是，我们发现这两个级别的优化不是完全正交的。进行单独的优化然后简单地整合它们会导致性能不令人满意。为了解决这个问题，我们提出了一种新型的无训练范式Cachequant，它通过共同优化模型缓存和量化技术来全面加速扩散模型。具体而言，我们采用动态编程方法来确定最佳缓存计划，其中仔细考虑了缓存和量化的属性以最大程度地减少错误。此外，我们提出了解耦的误差校正，以进一步减轻耦合和累积的误差。实验结果表明，Cachequant在MS-Coco上稳定扩散可实现5.18加速和4压缩，夹得分仅为0.02。我们的代码是开源的：此HTTPS URL。

Title: Group Relative Policy Optimization for Image Captioning

Authors: Xu Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01333
Pdf URL: https://arxiv.org/pdf/2503.01333
Copy Paste: [[2503.01333]] Group Relative Policy Optimization for Image Captioning(https://arxiv.org/abs/2503.01333)
Keywords: generation
Abstract: Image captioning tasks usually use two-stage training to complete model optimization. The first stage uses cross-entropy as the loss function for optimization, and the second stage uses self-critical sequence training (SCST) for reinforcement learning optimization. However, the SCST algorithm has certain defects. SCST relies only on a single greedy decoding result as a baseline. If the model itself is not stable enough, the greedy decoding result may be relatively poor, which leads to high variance in advantage estimation and, in turn, to unstable policy updates. In addition, SCST only compares one sampling result with the greedy decoding result, so generation diversity is limited and training may fall into a local optimum. In this paper, we propose using the latest Group Relative Policy Optimization (GRPO) reinforcement learning algorithm as an optimization solution for the second stage. GRPO generates multiple candidate captions for the input image and then continuously optimizes the model through intragroup comparison. By constraining the amplitude of policy updates and the KL divergence, the stability of the model during training is largely guaranteed. In addition, compared to SCST, which only samples one answer, GRPO samples and generates multiple answers. Multiple candidate answers in the group cover a wider solution space. Combined with KL divergence constraints, GRPO can improve diversity while ensuring model stability. The code for this article is available at this https URL.
摘要：图像字幕任务通常使用两阶段训练来完成模型优化。第一阶段使用跨肠道作为优化的损失函数，第二阶段使用自我批评序列训练（SCST）进行增强学习优化。但是，SCST算法具有某些缺陷。 SCST仅依靠单个贪婪的解码结果作为基线。如果模型本身不够稳定，则贪婪的解码结果可能相对较差，这将导致优势估计的较高差异，从而进一步导致不稳定的策略更新。此外，SCST仅将一个采样结果与贪婪的解码结果进行比较，并且产生多样性受到限制，这可能落入局部最佳效果。在本文中，我们建议使用最新的组相对策略优化（GRPO）增强算法作为第二阶段的优化解决方案。 GRPO为输入图像生成多个候选字幕，然后通过组内比较连续优化模型。通过限制政策更新和KL差异的幅度，可以保证培训期间模型的稳定性。此外，与SCST相比，SCST仅示例一个答案，grpo样本并生成多个答案。小组中的多个候选答案涵盖了更广泛的解决方案空间。结合KL差异约束，GRPO可以改善多样性，同时确保模型稳定性。本文的代码可在此HTTPS URL上找到。
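
Note: the intragroup comparison described in the abstract above can be illustrated with the group-normalized advantage that is commonly used in GRPO. This is only a minimal sketch: the function name, the `eps` constant, and the example reward values are assumptions for illustration, and the abstract does not state which caption reward or which normalization the paper actually uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled caption relative to its own group.

    `rewards` holds one scalar reward per candidate caption sampled for
    the same image. The advantage is the reward standardized by the
    group mean and (population) standard deviation. `eps` is an assumed
    small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four captions sampled for one image.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because the baseline is the group mean rather than a single greedy decode, one unusually poor sample shifts the baseline only slightly, which is the variance argument the abstract makes against SCST.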

Title: Wavelet-Enhanced Desnowing: A Novel Single Image Restoration Approach for Traffic Surveillance under Adverse Weather Conditions

Authors: Zihan Shen, Yu Xuan, Qingyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01339
Pdf URL: https://arxiv.org/pdf/2503.01339
Copy Paste: [[2503.01339]] Wavelet-Enhanced Desnowing: A Novel Single Image Restoration Approach for Traffic Surveillance under Adverse Weather Conditions(https://arxiv.org/abs/2503.01339)
Keywords: restoration
Abstract: Image restoration under adverse weather conditions refers to the process of removing degradation caused by weather particles while improving visual quality. Most existing deweathering methods rely on increasing the network scale and data volume to achieve better performance, which requires more expensive computing power. Also, many methods lack generalization for specific applications. In traffic surveillance scenarios, the main challenges are snow removal and veil effect elimination. In this paper, we propose a wavelet-enhanced snow removal method that uses a Dual-Tree Complex Wavelet Transform feature enhancement module and a dynamic convolution acceleration module to address snow degradation in surveillance images. We also use a residual learning restoration module to remove veil effects caused by rain, snow, and fog. The proposed architecture extracts and analyzes information from snow-covered regions, significantly improving snow removal performance. In addition, the residual learning restoration module removes veiling effects in images, enhancing clarity and detail. Experiments show that it performs better than some popular desnowing methods. Our approach also demonstrates effectiveness and accuracy when applied to real traffic surveillance images.
摘要：在不利天气条件下的图像恢复是指去除由天气颗粒引起的降解的过程，同时改善视觉质量。大多数现有的脱气方法都依赖于增加网络量表和数据量以实现更好的性能，这需要更昂贵的计算能力。同样，许多方法对特定应用缺乏概括。在交通监视筛选器中，主要挑战是清除降雪和消除面纱效应。在本文中，我们提出了一种使用双树复合小波变换功能增强模块和动态卷积加速模块来解决监视图像中的降雪降解的方法。我们还使用剩余的学习恢复模块来消除由雨，雪和雾引起的面纱效应。拟议的建筑提取和分析来自雪地覆盖区域的信息，从而显着改善了降雪的性能。剩余的学习恢复模块消除了图像中的蔬菜效果，从而提高了清晰度和细节。实验表明，它的性能比某些流行的终止方法更好。当应用于实际交通监视图像时，我们的方法还表明了有效性和准确性。

Title: Learning to Generate Long-term Future Narrations Describing Activities of Daily Living

Authors: Ramanathan Rajendiran, Debaditya Roy, Basura Fernando
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01416
Pdf URL: https://arxiv.org/pdf/2503.01416
Copy Paste: [[2503.01416]] Learning to Generate Long-term Future Narrations Describing Activities of Daily Living(https://arxiv.org/abs/2503.01416)
Keywords: generation
Abstract: Anticipating future events is crucial for various application domains such as healthcare, smart home technology, and surveillance. Narrative event descriptions provide context-rich information, enhancing a system's future planning and decision-making capabilities. We propose a novel task: $\textit{long-term future narration generation}$, which extends beyond traditional action anticipation by generating detailed narrations of future daily activities. We introduce a visual-language model, ViNa, specifically designed to address this challenging task. ViNa integrates long-term videos and corresponding narrations to generate a sequence of future narrations that predict subsequent events and actions over extended time horizons. ViNa extends existing multimodal models that perform only short-term predictions or describe observed videos by generating long-term future narrations for a broader range of daily activities. We also present a novel downstream application that leverages the generated narrations called future video retrieval to help users improve planning for a task by visualizing the future. We evaluate future narration generation on the largest egocentric dataset Ego4D.
摘要：预期未来的事件对于医疗保健，智能家庭技术和监视等各种应用领域至关重要。叙事事件描述提供了上下文丰富的信息，增强了系统的未来计划和决策能力。我们提出了一项新颖的任务：$ \ textit {长期未来叙事生成} $，它通过生成对未来日常活动的详细叙述，超越了传统的行动预期。我们介绍了一个视觉语言模型Vina，该模型专为解决这一挑战性任务而设计。维娜（Vina）整合了长期视频和相应的叙述，以生成一系列未来的叙述，这些叙述可以预测延长时间范围内随后的事件和行动。 Vina扩展了现有的多模型模型，这些模型仅执行短期预测或描述观察到的视频，通过为更广泛的日常活动生成长期的未来叙述。我们还提出了一个新颖的下游应用程序，该应用程序利用了称为“未来视频检索”的生成的叙述，以帮助用户通过可视化未来来改善任务的计划。我们在最大的中心数据集EGO4D上评估未来的叙述产生。

Title: DLF: Extreme Image Compression with Dual-generative Latent Fusion

Authors: Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, Yan Lu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.01428
Pdf URL: https://arxiv.org/pdf/2503.01428
Copy Paste: [[2503.01428]] DLF: Extreme Image Compression with Dual-generative Latent Fusion(https://arxiv.org/abs/2503.01428)
Keywords: generative
Abstract: Recent studies in extreme image compression have achieved remarkable performance by compressing the tokens from generative tokenizers. However, these methods often prioritize clustering common semantics within the dataset, while overlooking the diverse details of individual objects. Consequently, this results in suboptimal reconstruction fidelity, especially at low bitrates. To address this issue, we introduce a Dual-generative Latent Fusion (DLF) paradigm. DLF decomposes the latent into semantic and detail elements, compressing them through two distinct branches. The semantic branch clusters high-level information into compact tokens, while the detail branch encodes perceptually critical details to enhance the overall fidelity. Additionally, we propose a cross-branch interactive design to reduce redundancy between the two branches, thereby minimizing the overall bit cost. Experimental results demonstrate the impressive reconstruction quality of DLF even below 0.01 bits per pixel (bpp). On the CLIC2020 test set, our method achieves bitrate savings of up to 27.93% on LPIPS and 53.55% on DISTS compared to MS-ILLM. Furthermore, DLF surpasses recent diffusion-based codecs in visual fidelity while maintaining a comparable level of generative realism. Code will be available later.
摘要：极端图像压缩的最新研究通过压缩生成代币剂的代币来实现出色的性能。但是，这些方法通常优先考虑数据集中的共同语义，同时忽略各个对象的各种细节。因此，这导致了次优的重建保真度，尤其是在低比特率下。为了解决这个问题，我们引入了双基因潜在融合（DLF）范式。 DLF将潜在分解为语义和细节元素，通过两个不同的分支将它们压缩。语义分支将高级信息集中到紧凑的代币中，而细节分支则对感知关键的细节进行编码以增强整体忠诚度。此外，我们提出了跨分支交互式设计，以降低两个分支之间的冗余，从而最大程度地减少了总成本。实验结果表明，即使每个像素（BPP）低于0.01位）的DLF的重建质量令人印象深刻。在CLIC2020测试集上，与MS-ILLM相比，我们的方法可在LPIP上节省高达27.93％的比特率，而DIST的比特率节省为53.55％。此外，DLF在视觉保真度中超过了最新的基于扩散的编解码器，同时保持了可比的生成现实主义水平。代码将在以后可用。

Title: Generative Human Geometry Distribution

Authors: Xiangjun Tang, Biao Zhang, Peter Wonka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01448
Pdf URL: https://arxiv.org/pdf/2503.01448
Copy Paste: [[2503.01448]] Generative Human Geometry Distribution(https://arxiv.org/abs/2503.01448)
Keywords: generation, generative
Abstract: Realistic human geometry generation is an important yet challenging task, requiring both the preservation of fine clothing details and the accurate modeling of clothing-pose interactions. Geometry distributions, which can model the geometry of a single human as a distribution, provide a promising representation for high-fidelity synthesis. However, applying geometry distributions for human generation requires learning a dataset-level distribution over numerous individual geometry distributions. To address the resulting challenges, we propose a novel 3D human generative framework that, for the first time, models the distribution of human geometry distributions. Our framework operates in two stages: first, generating the human geometry distribution, and second, synthesizing high-fidelity humans by sampling from this distribution. We validate our method on two tasks: pose-conditioned 3D human generation and single-view-based novel pose generation. Experimental results demonstrate that our approach achieves the best quantitative results in terms of realism and geometric fidelity, outperforming state-of-the-art generative methods.
摘要：现实的人类几何形状生成是一项重要但又具有挑战性的任务，既需要保存精美的服装细节，又需要对服装置互动的准确建模。几何分布可以建模单个人的几何形状作为分布，为高保真综合提供了有希望的表示。但是，应用几何分布需要在许多单个几何分布上学习数据集级别的分布。为了应对由此产生的挑战，我们提出了一个新颖的3D人类生成框架，该框架首次建模人类几何分布的分布。我们的框架分为两个阶段：首先，产生人类的几何分布，其次通过从该分布中抽样来综合高保真人类。我们在两个任务上验证了我们的方法：姿势条件的3D人类一代和基于单视图的小说姿势产生。实验结果表明，我们的方法在现实主义和几何保真度方面取得了最佳的定量结果，表现优于最先进的生成方法。

Title: InversionGNN: A Dual Path Network for Multi-Property Molecular Optimization

Authors: Yifan Niu, Ziqi Gao, Tingyang Xu, Yang Liu, Yatao Bian, Yu Rong, Junzhou Huang, Jia Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01488
Pdf URL: https://arxiv.org/pdf/2503.01488
Copy Paste: [[2503.01488]] InversionGNN: A Dual Path Network for Multi-Property Molecular Optimization(https://arxiv.org/abs/2503.01488)
Keywords: generation
Abstract: Exploring chemical space to find novel molecules that simultaneously satisfy multiple properties is crucial in drug discovery. However, existing methods often struggle with trading off multiple properties due to the conflicting or correlated nature of chemical properties. To tackle this issue, we introduce the InversionGNN framework, an effective yet sample-efficient dual-path graph neural network (GNN) for multi-objective drug discovery. In the direct prediction path of InversionGNN, we train the model for multi-property prediction to acquire knowledge of the optimal combination of functional groups. Then the learned chemical knowledge helps the inversion generation path to generate molecules with required properties. In order to decode the complex knowledge of multiple properties in the inversion path, we propose a gradient-based Pareto search method to balance conflicting properties and generate Pareto optimal molecules. Additionally, InversionGNN is able to search the full Pareto front approximately in discrete chemical space. Comprehensive experimental evaluations show that InversionGNN is both effective and sample-efficient in various discrete multi-objective settings including drug discovery.
摘要：探索化学空间以找到同时满足多种特性的新分子对于药物发现至关重要。但是，由于化学性质的性质冲突或相关性，现有方法通常与交易多个财产的交易困难。为了解决此问题，我们引入了Inversiongnn框架，这是一种有效但有效的双路径图神经网络（GNN），用于多目标药物发现。在Inversiongnn的直接预测路径中，我们训练模型的多专业预测，以获取官能团的最佳组合知识。然后，学识渊博的化学知识有助于反转产生具有所需特性的分子。为了解码反转路径中多个属性的复杂知识，我们提出了一种基于梯度的帕累托搜索方法，以平衡相互冲突的属性并生成帕累托最佳分子。此外，Inversiongnn能够在离散的化学空间中搜索完整的帕累托前沿。全面的实验评估表明，在包括药物发现在内的各种离散的多目标环境中，反versiongnn在各种离散的多目标环境中既有效又有效。

Title: AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning

Authors: Yuheng Xu, Shijie Yang, Xin Liu, Jie Liu, Jie Tang, Gangshan Wu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.01565
Pdf URL: https://arxiv.org/pdf/2503.01565
Copy Paste: [[2503.01565]] AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning(https://arxiv.org/abs/2503.01565)
Keywords: super-resolution
Abstract: In recent years, the increasing popularity of Hi-DPI screens has driven a rising demand for high-resolution images. However, the limited computational power of edge devices poses a challenge in deploying complex super-resolution neural networks, highlighting the need for efficient methods. While prior works have made significant progress, they have not fully exploited pixel-level information. Moreover, their reliance on fixed sampling patterns limits both accuracy and the ability to capture fine details in low-resolution images. To address these challenges, we introduce two plug-and-play modules designed to capture and leverage pixel information effectively in Look-Up Table (LUT) based super-resolution networks. Our method introduces Automatic Sampling (AutoSample), a flexible LUT sampling approach where sampling weights are automatically learned during training to adapt to pixel variations and expand the receptive field without added inference cost. We also incorporate Adaptive Residual Learning (AdaRL) to enhance inter-layer connections, enabling detailed information flow and improving the network's ability to reconstruct fine details. Our method achieves significant performance improvements on both MuLUT and SPF-LUT while maintaining similar storage sizes. Specifically, for MuLUT, we achieve a PSNR improvement of approximately +0.20 dB on average across five datasets. For SPF-LUT, with more than a 50% reduction in storage space and about a 2/3 reduction in inference time, our method still maintains performance comparable to the original. The code is available at this https URL.
摘要：近年来，HI-DPI屏幕的日益普及导致对高分辨率图像的需求不断上升。但是，边缘设备的有限计算能力在部署复杂的超分辨率神经网络方面构成了挑战，强调了对有效方法的需求。尽管先前的工作取得了重大进展，但他们尚未完全利用像素级信息。此外，他们对固定采样模式的依赖限制了精度和在低分辨率图像中捕获细节的能力。为了应对这些挑战，我们介绍了两个旨在在基于查找表（LUT）的超分辨率网络中有效捕获和利用像素信息的插件模块。我们的方法引入了自动采样（AutoSample），这是一种灵活的LUT采样方法，在训练过程中自动学习采样权重以适应像素变化并扩展接受场而不增加推理成本。我们还合并了自适应残差学习（ADARL），以增强层间连接，实现详细的信息流并提高网络重建细节的能力。我们的方法在保持相似的存储尺寸的同时，可以在Mulut和SPF-LUT上取得重大的性能提高。具体而言，对于Mulut，我们在五个数据集中平均实现了大约+0.20 dB改进的PSNR改进。对于SPF-LUT，存储空间降低了50％以上，推理时间降低了约2/3，我们的方法仍然保持与原始的性能。该代码可在此HTTPS URL上找到。
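
Note: the PSNR figure quoted in the abstract above follows the standard definition PSNR = 10 * log10(MAX^2 / MSE). The sketch below computes it for two flat pixel sequences; the function name, the flat-list input format, and the default peak value of 255 are assumptions for illustration and are not taken from the paper's evaluation code.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For a single image at a fixed peak value, a gain of +0.20 dB corresponds to an MSE that is lower by a factor of 10^(-0.02), roughly 4.5%.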

Title: MRI super-resolution reconstruction using efficient diffusion probabilistic model with residual shifting

Authors: Mojtaba Safari, Shansong Wang, Zach Eidex, Qiang Li, Erik H. Middlebrooks, David S. Yu, Xiaofeng Yang
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2503.01576
Pdf URL: https://arxiv.org/pdf/2503.01576
Copy Paste: [[2503.01576]] MRI super-resolution reconstruction using efficient diffusion probabilistic model with residual shifting(https://arxiv.org/abs/2503.01576)
Keywords: restoration, super-resolution
Abstract: Objective: This study introduces a residual error-shifting mechanism that drastically reduces sampling steps while preserving critical anatomical details, thus accelerating MRI reconstruction. Approach: We propose a novel diffusion-based SR framework called Res-SRDiff, which integrates residual error shifting into the forward diffusion process. This enables efficient HR image reconstruction by aligning the degraded HR and LR this http URL evaluated Res-SRDiff on ultra-high-field brain T1 MP2RAGE maps and T2-weighted prostate images, comparing it with Bicubic, Pix2pix, CycleGAN, and a conventional denoising diffusion probabilistic model with vision transformer backbone (TM-DDPM), using quantitative metrics such as peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), gradient magnitude similarity deviation (GMSD), and learned perceptual image patch similarity (LPIPS). Main results: Res-SRDiff significantly outperformed all comparative methods in terms of PSNR, SSIM, and GMSD across both datasets, with statistically significant improvements (p-values << 0.05). The model achieved high-fidelity image restoration with only four sampling steps, drastically reducing computational time to under one second per slice, substantially faster than the conventional TM-DDPM, which takes around 20 seconds per slice. Qualitative analyses further demonstrated that Res-SRDiff effectively preserved fine anatomical details and lesion morphology in both brain and pelvic MRI images. Significance: Our findings show that Res-SRDiff is an efficient and accurate MRI SR method, markedly improving computational efficiency and image quality. Integrating residual error shifting into the diffusion process allows for rapid and robust HR image reconstruction, enhancing clinical MRI workflows and advancing medical imaging research. The source code is available at: this https URL
摘要：目的：这项研究引入了一种残留的错误转移机制，该机制大大降低了采样步骤，同时保留关键的解剖学细节，从而加速MRI重建。方法：我们提出了一个新型基于扩散的SR框架，称为RES-SRDIFF，该框架将残留误差转移到正向扩散过程中。通过使降级的HR和LR对齐该HTTP URL的降级，可以在超高场脑T1 MP2RAGE图和T2加权前列腺图像上进行评估，从而实现有效的人力资源图像重建，并将其与双子，cix2pix，Cycledan和常规的替代模型（TM）进行比较（T2），并将其比较使用定量指标，例如峰值信噪比（PSNR），结构相似性指数（SSIM），梯度幅度相似性偏差（GMSD）和学到的知觉图像贴片相似性（LPIPS）。主要结果：在两个数据集中，RES-SRDIFF在PSNR，SSIM和GMSD方面显着优于所有比较方法，并具有统计学上的显着改进（P值<< 0.05）。该模型仅使用四个采样步骤实现了高保真图像恢复，将计算时间大幅度降低到每片一秒钟以下，这比常规TM-DDPM的速度要快得多，每切约20秒。定性分析进一步表明，在大脑和骨盆MRI图像中，RES-SRDIFF有效地保留了精细的解剖细节和病变形态。意义：我们的发现表明，RES-SRDIFF是一种有效且准确的MRI SR方法，可显着提高计算效率和图像质量。将剩余误差转移到扩散过程中，可以快速且健壮的人力资源图像重建，增强临床MRI工作流程并进行医学成像研究。来源AT：此HTTPS URL

Title: Advancing vision-language models in front-end development via data synthesis

Authors: Tong Ge, Yashu Liu, Jieping Ye, Tianyi Li, Chao Wang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01619
Pdf URL: https://arxiv.org/pdf/2503.01619
Copy Paste: [[2503.01619]] Advancing vision-language models in front-end development via data synthesis(https://arxiv.org/abs/2503.01619)
Keywords: generation
Abstract: Modern front-end (FE) development, especially when leveraging the unique features of frameworks like React and Vue, presents distinctive challenges. These include managing modular architectures, ensuring synchronization between data and visual outputs for declarative rendering, and adapting reusable components to various scenarios. Such complexities make it particularly difficult for state-of-the-art large vision-language models (VLMs) to generate accurate and functional code directly from design images. To address these challenges, we propose a reflective agentic workflow that synthesizes high-quality image-text data to capture the diverse characteristics of FE development. This workflow automates the extraction of self-contained code snippets from real-world projects (a self-contained code snippet is one that encapsulates all necessary logic, styling, and dependencies, ensuring it functions independently without requiring external imports or context), renders the corresponding visual outputs, and generates detailed descriptions that link design elements to functional code. To further expand the scope and utility of the synthesis, we introduce three data synthesis strategies: Evolution-based synthesis, which enables scalable and diverse dataset expansion; Waterfall-Model-based synthesis, which generates logically coherent code derived from system requirements; and Additive Development synthesis, which iteratively increases the complexity of human-authored components. We build a large vision-language model, Flame, trained on the synthesized datasets and demonstrate its effectiveness in generating React code via the $\text{pass}@k$ metric. Our results suggest that a code VLM trained to interpret images before code generation may achieve better performance.
摘要：现代的前端（FE）开发，尤其是在利用React和Vue等框架的独特特征时提出了独特的挑战。其中包括管理模块化体系结构，确保数据和视觉输出之间的声明性渲染，以及将可重复使用的组件调整为各种情况。这样的复杂性使最先进的大视觉模型（VLM）直接从设计图像中生成准确和功能代码变得特别困难。为了应对这些挑战，我们提出了一个反思性的代理工作流，该工作流程综合了高质量的图像文本数据，以捕获FE开发的多种特征。该工作流程可自动提取独立的\脚注{a \ textbf {self-oncontained}代码段，它可以包含所有必要的逻辑，样式和依赖项，从而确保其独立起作用，从而在不需要外部导入或上下文的情况下。}代码snippets code snippets code snippets code snipp sempets condipts snippers condipts snippers of Illights desortions，构成了构建视觉上的启用详细信息，并启动了详细启动详细范围。为了进一步扩大合成的范围和效用，我们介绍了三种数据综合策略：基于进化的综合，可以扩展和多样化的数据集扩展；基于瀑布模型的合成，该合成生成从系统要求得出的逻辑上一致的代码；以及添加剂的发展合成，迭代地增加了人为作者成分的复杂性。我们构建了在合成数据集上训练的大型视觉模型，即火焰，并通过$ \ text {pass}@k $ metric演示了其在生成反应代码的有效性。我们的结果表明，在代码生成之前，经过培训可以解释图像的代码VLM可以实现更好的性能。
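
Note: the pass@k metric named in the abstract above is often computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass. The sketch below assumes that estimator; the abstract does not say which variant the authors use, and the function and argument names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated for a problem
    c: number of those samples that pass the checks
    k: budget of samples considered (1 <= k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark score is then the mean of `pass_at_k` over all problems.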

Title: DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models

Authors: Zhendong Wang, Jianmin Bao, Shuyang Gu, Dong Chen, Wengang Zhou, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01645
Pdf URL: https://arxiv.org/pdf/2503.01645
Copy Paste: [[2503.01645]] DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models(https://arxiv.org/abs/2503.01645)
Keywords: generation
Abstract: In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of generation models, resulting in style or color inconsistencies between textual and visual elements if applied to design image generation. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components like position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt, along with a character localization loss for enhanced supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.
摘要：在本文中，我们提出了DesignDiffusion，这是一个简单而有效的框架，用于从文本描述中综合设计图像的新任务。主要的挑战在于生成准确且风格的文本和视觉内容。现有的作品在视觉文本生成的相关任务中通常集中于在给定特定区域内生成文本，这限制了生成模型的创造力，如果应用于设计图像生成，则在文本和视觉元素之间导致样式或颜色不一致。为了解决这个问题，我们提出了一个端到端的，基于一个阶段扩散的框架，避免了复杂的组件，例如位置和布局建模。具体而言，提出的框架直接从用户提示中综合了文本和视觉设计元素。它利用从视觉文本得出的独特字符嵌入来增强输入提示，以及字符定位损失，以增强文本生成期间的监督。此外，我们采用自我播放的直接偏好优化微调策略来提高合成视觉文本的质量和准确性。广泛的实验表明，DesignDiffusion在设计图像生成中实现了最先进的性能。

Title: ToLo: A Two-Stage, Training-Free Layout-To-Image Generation Framework For High-Overlap Layouts

Authors: Linhao Huang, Jing Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01667
Pdf URL: https://arxiv.org/pdf/2503.01667
Copy Paste: [[2503.01667]] ToLo: A Two-Stage, Training-Free Layout-To-Image Generation Framework For High-Overlap Layouts(https://arxiv.org/abs/2503.01667)
Keywords: generation
Abstract: Recent training-free layout-to-image diffusion models have demonstrated remarkable performance in generating high-quality images with controllable layouts. These models follow a one-stage framework: encouraging the model to focus the attention map of each concept on its corresponding region by defining attention map-based losses. However, these models still struggle to accurately follow layouts with significant overlap, often leading to issues like attribute leakage and missing entities. In this paper, we propose ToLo, a two-stage, training-free layout-to-image generation framework for high-overlap layouts. Our framework consists of two stages: the aggregation stage and the separation stage, each with its own loss function based on the attention map. To provide a more effective evaluation, we partition the HRS dataset based on the Intersection over Union (IoU) of the input layouts, creating a new dataset for layout-to-image generation with varying levels of overlap. Through extensive experiments on this dataset, we demonstrate that ToLo significantly enhances the performance of existing methods when dealing with high-overlap layouts. Our code and dataset are available here: this https URL.
摘要：最近无训练的布局到图像扩散模型在生成具有可控布局的高质量图像方面表现出色。这些模型遵循一个单阶段的框架：鼓励模型通过定义基于注意图的损失，将每个概念的注意力图集中在其相应区域。但是，这些模型仍然难以准确遵循具有重大重叠的布局，通常会导致属性泄漏和缺失实体等问题。在本文中，我们提出了托洛（Tolo），这是一个两阶段的，无训练的布局至图像生成框架，用于高空布局。我们的框架由两个阶段组成：聚合阶段和分离阶段，每个阶段都基于注意力图。为了提供更有效的评估，我们根据输入布局的联合（IOU）相交的相互作用对HRS数据集进行了分区，从而为布局到图像生成的新数据集具有不同的重叠级别。通过该数据集的大量实验，我们证明了Tolo在处理高度叠层布局时会显着提高现有方法的性能。我们的代码和数据集可在此处提供：此HTTPS URL。
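
Note: the Intersection over Union (IoU) used in the abstract above to grade layout overlap can be computed as below for axis-aligned boxes. This is a minimal sketch: the `(x1, y1, x2, y2)` box format and the `max_pairwise_iou` helper are assumptions for illustration, and the abstract does not state how per-layout overlap is aggregated or which thresholds define the dataset partitions.

```python
from itertools import combinations


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def max_pairwise_iou(layout):
    """Hypothetical overlap summary: largest IoU over all box pairs in a layout."""
    return max((iou(a, b) for a, b in combinations(layout, 2)), default=0.0)
```

A layout could then be assigned to a low-, medium-, or high-overlap bucket by comparing this summary value against chosen cutoffs.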

Title: Using (Not so) Large Language Models for Generating Simulation Models in a Formal DSL -- A Study on Reaction Networks

Authors: Justin N. Kreikemeyer, Miłosz Jankowski, Pia Wilsdorf, Adelinde M. Uhrmacher
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.01675
Pdf URL: https://arxiv.org/pdf/2503.01675
Copy Paste: [[2503.01675]] Using (Not so) Large Language Models for Generating Simulation Models in a Formal DSL -- A Study on Reaction Networks(https://arxiv.org/abs/2503.01675)
Keywords: generation
Abstract: Formal languages are an integral part of modeling and simulation. They allow the distillation of knowledge into concise simulation models amenable to automatic execution, interpretation, and analysis. However, the arguably most humanly accessible means of expressing models is through natural language, which is not easily interpretable by computers. Here, we evaluate how a Large Language Model (LLM) might be used for formalizing natural language into simulation models. Existing studies have explored only very large LLMs, like the commercial GPT models, without fine-tuning model weights. To close this gap, we show how an open-weights, 7B-parameter Mistral model can be fine-tuned to translate natural language descriptions to reaction network models in a domain-specific language, offering a self-hostable, compute- and memory-efficient alternative. To this end, we develop a synthetic data generator to serve as the basis for fine-tuning and evaluation. Our quantitative evaluation shows that our fine-tuned Mistral model can recover the ground truth simulation model in up to 84.5% of cases. In addition, our small-scale user study demonstrates the model's practical potential for one-time generation as well as interactive modeling in various domains. While promising, in its current form, the fine-tuned small LLM cannot catch up with large LLMs. We conclude that higher-quality training data are required, and expect future small and open-source LLMs to offer new opportunities.
摘要：形式语言是建模和仿真的组成部分。它们允许将知识蒸馏成简明的模拟模型，可自动执行，解释和分析。但是，可以说最容易获得模型的方法是通过自然语言，这是计算机不容易解释的。在这里，我们评估了如何将大型语言模型（LLM）用于将自然语言形式化为模拟模型。现有研究仅使用非常大的LLM（例如商业GPT模型）进行探索，而无需微调模型权重。为了缩小这一差距，我们展示了如何对7B参数Mistral模型进行微调，以将自然语言描述转化为特定领域的语言中的反应网络模型，从而提供自我托管，计算和记忆有效的替代方案。为此，我们开发了一个合成数据生成器，以作为微调和评估的基础。我们的定量评估表明，我们的微调Mistral模型可以在多达84.5％的病例中恢复地面真相模拟模型。此外，我们的小规模用户研究展示了该模型在各个领域中一次性以及交互式建模的实际潜力。虽然有希望，但以目前的形式，微调的小LLM无法追赶大型LLM。我们得出的结论是，需要更高质量的培训数据，并期望将来的小型和开源的LLM提供新的机会。

Title: SAGE: A Framework of Precise Retrieval for RAG

Authors: Jintao Zhang, Guoliang Li, Jinyang Su
Subjects: cs.LG, cs.AI, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2503.01713
Pdf URL: https://arxiv.org/pdf/2503.01713
Copy Paste: [[2503.01713]] SAGE: A Framework of Precise Retrieval for RAG(https://arxiv.org/abs/2503.01713)
Keywords: generation
Abstract: Retrieval-augmented generation (RAG) has demonstrated significant proficiency in conducting question-answering (QA) tasks within a specified corpus. Nonetheless, numerous failure instances of RAG in QA still exist. These failures are not solely attributable to the limitations of Large Language Models (LLMs); instead, they predominantly arise from the retrieval of inaccurate information for LLMs due to two limitations: (1) Current RAG methods segment the corpus without considering semantics, making it difficult to find relevant context due to impaired correlation between questions and the segments. (2) There is a trade-off between missing essential context when less context is retrieved and getting irrelevant context when more context is retrieved. In this paper, we introduce a RAG framework (SAGE) to overcome these limitations. First, to address the issue of segmentation that ignores semantics, we propose to train a semantic segmentation model. This model is trained to segment the corpus into semantically complete chunks. Second, to ensure that only the most relevant chunks are retrieved while the irrelevant ones are ignored, we design a chunk selection algorithm to dynamically select chunks based on the decreasing speed of the relevance score, leading to a more relevant selection. Third, to further ensure the precision of the retrieved chunks, we propose letting LLMs assess whether retrieved chunks are excessive or lacking and then adjust the amount of context accordingly. Experiments show that SAGE outperforms baselines by 61.25% in the quality of QA on average. Moreover, by avoiding retrieving noisy context, SAGE lowers the cost of the tokens consumed in LLM inference and achieves a 49.41% enhancement in cost efficiency on average. Additionally, our work offers valuable insights for boosting RAG.

Title: KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation

Authors: Antoni Bigata, Michał Stypułkowski, Rodrigo Mira, Stella Bounareli, Konstantinos Vougioukas, Zoe Landgraf, Nikita Drobyshev, Maciej Zieba, Stavros Petridis, Maja Pantic
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01715
Pdf URL: https://arxiv.org/pdf/2503.01715
Copy Paste: [[2503.01715]] KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation(https://arxiv.org/abs/2503.01715)
Keywords: generation
Abstract: Current audio-driven facial animation methods achieve impressive results for short videos but suffer from error accumulation and identity drift when extended to longer durations. Existing methods attempt to mitigate this through external spatial control, increasing long-term consistency but compromising the naturalness of motion. We propose KeyFace, a novel two-stage diffusion-based framework, to address these issues. In the first stage, keyframes are generated at a low frame rate, conditioned on audio input and an identity frame, to capture essential facial expressions and movements over extended periods of time. In the second stage, an interpolation model fills in the gaps between keyframes, ensuring smooth transitions and temporal coherence. To further enhance realism, we incorporate continuous emotion representations and handle a wide range of non-speech vocalizations (NSVs), such as laughter and sighs. We also introduce two new evaluation metrics for assessing lip synchronization and NSV generation. Experimental results show that KeyFace outperforms state-of-the-art methods in generating natural, coherent facial animations over extended durations, successfully encompassing NSVs and continuous emotions.

Title: Quality Measures for Dynamic Graph Generative Models

Authors: Ryien Hosseini, Filippo Simini, Venkatram Vishwanath, Rebecca Willett, Henry Hoffmann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01720
Pdf URL: https://arxiv.org/pdf/2503.01720
Copy Paste: [[2503.01720]] Quality Measures for Dynamic Graph Generative Models(https://arxiv.org/abs/2503.01720)
Keywords: generative
Abstract: Deep generative models have recently achieved significant success in modeling graph data, including dynamic graphs, where topology and features evolve over time. However, unlike in vision and natural language domains, evaluating generative models for dynamic graphs is challenging due to the difficulty of visualizing their output, making quantitative metrics essential. In this work, we develop a new quality metric for evaluating generative models of dynamic graphs. Current metrics for dynamic graphs typically involve discretizing the continuous evolution of graphs into static snapshots and then applying conventional graph similarity measures. This approach has several limitations: (a) it models temporally related events as i.i.d. samples, failing to capture the non-uniform evolution of dynamic graphs; (b) it lacks a unified measure that is sensitive to both features and topology; (c) it fails to provide a scalar metric, requiring multiple metrics without clear superiority; and (d) it requires explicitly instantiating each static snapshot, leading to impractical runtime demands that hinder evaluation at scale. We propose a novel metric based on the Johnson-Lindenstrauss lemma, applying random projections directly to dynamic graph data. This results in an expressive, scalar, and application-agnostic measure of dynamic graph similarity that overcomes the limitations of traditional methods. We also provide a comprehensive empirical evaluation of metrics for continuous-time dynamic graphs, demonstrating the effectiveness of our approach compared to existing methods. Our implementation is available at this https URL.
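The Johnson-Lindenstrauss lemma underlying the proposed metric says that a random linear map into a much lower dimension approximately preserves pairwise Euclidean distances. The sketch below shows only that generic property on plain vectors, using a Gaussian projection; it does not reproduce the paper's metric or its handling of dynamic graph data, and all function names are illustrative.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """Return a k x d Gaussian matrix scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Apply the projection: one dot product per output coordinate."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def distance_ratio(d=500, k=200, seed=0):
    """Ratio of projected distance to original distance for two random vectors."""
    rng = random.Random(seed + 1)
    x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    y = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    matrix = random_projection_matrix(d, k, seed)
    return euclidean(project(matrix, x), project(matrix, y)) / euclidean(x, y)
```

Calling `distance_ratio()` should give a value close to 1, and the spread around 1 shrinks as the target dimension `k` grows.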

Title: VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Authors: Wenhao Wang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01739
Pdf URL: https://arxiv.org/pdf/2503.01739
Copy Paste: [[2503.01739]] VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation(https://arxiv.org/abs/2503.01739)
Keywords: generation, generative
Abstract: Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. Beyond this, VideoUFO also features: (1) minimal (0.29%) overlap with existing video datasets, and (2) videos searched exclusively via YouTube's official API under the Creative Commons license. These two attributes provide future researchers with greater freedom to broaden their training sources. VideoUFO comprises over 1.09 million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify 1,291 user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. Then, we use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip. After verifying the clips with specified topics, we are left with about 1.09 million video clips. Our experiments reveal that (1) 16 current text-to-video models do not achieve consistent performance across all user-focused topics; and (2) a simple model trained on VideoUFO outperforms others on worst-performing topics. The dataset is publicly available at this https URL under the CC BY 4.0 License.

Title: Enhancing Multi-hop Reasoning in Vision-Language Models via Self-Distillation with Multi-Prompt Ensembling

Authors: Guande Wu, Huan Song, Yawei Wang, Qiaojing Yan, Yijun Tian, Lin Lee Cheong, Panpan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01754
Pdf URL: https://arxiv.org/pdf/2503.01754
Copy Paste: [[2503.01754]] Enhancing Multi-hop Reasoning in Vision-Language Models via Self-Distillation with Multi-Prompt Ensembling(https://arxiv.org/abs/2503.01754)
Keywords: generation
Abstract: Multi-modal large language models have seen rapid advancement alongside large language models. However, while language models can effectively leverage chain-of-thought prompting for zero or few-shot learning, similar prompting strategies are less effective for multi-modal LLMs due to modality gaps and task complexity. To address this challenge, we explore two prompting approaches: a dual-query method that separates multi-modal input analysis and answer generation into two prompting steps, and an ensemble prompting method that combines multiple prompt variations to arrive at the final answer. Although these approaches enhance the model's reasoning capabilities without fine-tuning, they introduce significant inference overhead. Therefore, building on top of these two prompting techniques, we propose a self-distillation framework such that the model can improve itself without any annotated data. Our self-distillation framework learns representation intervention modules from the reasoning traces collected from ensembled dual-query prompts, in the form of hidden representations. The lightweight intervention modules operate in parallel with the frozen original model, which makes it possible to maintain computational efficiency while significantly improving model capability. We evaluate our method on five widely-used VQA benchmarks, demonstrating its effectiveness in performing multi-hop reasoning for complex tasks.
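The ensemble prompting step (several prompt variations combined into one final answer) can be sketched as follows. The abstract does not say how the answers are combined, so majority voting is an assumption here, and `ask_model` is a hypothetical callable standing in for whatever model interface is used.

```python
from collections import Counter


def ensemble_answer(question, prompt_templates, ask_model):
    """Ask the same question through several prompt variations and vote.

    prompt_templates: strings containing a "{question}" placeholder.
    ask_model: hypothetical callable mapping a prompt string to an answer string.
    Majority voting is an assumed combination rule, not taken from the paper.
    """
    answers = [
        ask_model(template.format(question=question)).strip().lower()
        for template in prompt_templates
    ]
    # most_common(1) returns the most frequent normalized answer;
    # ties resolve to the answer seen first.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

Normalizing answers (stripping whitespace and lowercasing) before voting keeps superficial formatting differences from splitting the vote.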

Title: Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation

Authors: Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You
Subjects: cs.LG, cs.AI, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2503.01776
Pdf URL: https://arxiv.org/pdf/2503.01776
Copy Paste: [[2503.01776]] Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation(https://arxiv.org/abs/2503.01776)
Keywords: generative
Abstract: Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradations at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed, often by large margins, while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount. Code is available at this https URL.
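A "high-dimensional but selectively activated" code can be pictured as a wide linear encoding in which only a few entries are kept. The sketch below assumes a ReLU encoder followed by top-k selection; the abstract does not state that CSR uses this exact rule, and `sparse_code` and its parameters are illustrative names.

```python
def sparse_code(embedding, encoder, k):
    """Map a dense embedding to a sparse code with at most k active features.

    embedding: list of floats (the pre-trained dense embedding).
    encoder: list of weight rows, one row per sparse feature (assumed to be
             learned elsewhere, e.g. by a lightweight autoencoder).
    k: number of activations to keep; varying k gives different sparsity levels.
    Returns a dict {feature_index: activation}.
    """
    # Linear map followed by ReLU.
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, embedding)))
        for row in encoder
    ]
    # Keep the k largest activations and drop any that are zero.
    top = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)[:k]
    return {i: hidden[i] for i in top if hidden[i] > 0.0}
```

Because only the active indices and values are stored, retrieval can compare codes over a handful of features instead of the full dense vector.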