2025-04-15

Title: Reward Generation via Large Vision-Language Model in Offline Reinforcement Learning

Authors: Younghwan Lee, Tung M. Luu, Donghoon Lee, Chang D. Yoo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08772
Pdf URL: https://arxiv.org/pdf/2504.08772
Copy Paste: [[2504.08772]] Reward Generation via Large Vision-Language Model in Offline Reinforcement Learning(https://arxiv.org/abs/2504.08772)
Keywords: generation
Abstract: In offline reinforcement learning (RL), learning from fixed datasets presents a promising solution for domains where real-time interaction with the environment is expensive or risky. However, designing dense reward signals for offline datasets requires significant human effort and domain expertise. Reinforcement learning with human feedback (RLHF) has emerged as an alternative, but it remains costly due to the human-in-the-loop process, prompting interest in automated reward generation models. To address this, we propose Reward Generation via Large Vision-Language Models (RG-VLM), which leverages the reasoning capabilities of LVLMs to generate rewards from offline data without human involvement. RG-VLM improves generalization in long-horizon tasks and can be seamlessly integrated with sparse reward signals to enhance task performance, demonstrating its potential as an auxiliary reward signal.
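Sketch: The abstract says the generated reward can be integrated with sparse reward signals as an auxiliary signal. A minimal illustration of one way to do that is a weighted sum per time step. The function name `combine_rewards`, the `weight` parameter, and the numbers are assumptions made for this example; the abstract does not state how RG-VLM combines the two signals.

```python
from typing import List, Sequence


def combine_rewards(sparse: Sequence[float],
                    auxiliary: Sequence[float],
                    weight: float = 0.5) -> List[float]:
    """Return r_t = sparse_t + weight * auxiliary_t for each time step.

    `weight` is a hypothetical mixing coefficient, not a value from the paper.
    """
    if len(sparse) != len(auxiliary):
        raise ValueError("reward sequences must have the same length")
    return [s + weight * a for s, a in zip(sparse, auxiliary)]


# Toy trajectory: the task reward only fires on the final step,
# while the auxiliary (model-generated) reward is dense.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
auxiliary_rewards = [0.2, 0.4, 0.6, 0.8]
combined = combine_rewards(sparse_rewards, auxiliary_rewards, weight=0.5)
print(combined)
```

With a weight of zero the combined signal reduces to the original sparse reward, which makes the auxiliary term easy to ablate.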

Title: Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models

Authors: Lucas Beerens, Desmond J. Higham
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2504.08782
Pdf URL: https://arxiv.org/pdf/2504.08782
Copy Paste: [[2504.08782]] Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models(https://arxiv.org/abs/2504.08782)
Keywords: generation, generative
Abstract: We introduce a new attack paradigm that embeds hidden adversarial capabilities directly into diffusion models via fine-tuning, without altering their observable behavior or requiring modifications during inference. Unlike prior approaches that target specific images or adjust the generation process to produce adversarial outputs, our method integrates adversarial functionality into the model itself. The resulting tampered model generates high-quality images indistinguishable from those of the original, yet these images cause misclassification in downstream classifiers at a high rate. The misclassification can be targeted to specific output classes. Users can employ this compromised model unaware of its embedded adversarial nature, as it functions identically to a standard diffusion model. We demonstrate the effectiveness and stealthiness of our approach, uncovering a covert attack vector that raises new security concerns. These findings expose a risk arising from the use of externally-supplied models and highlight the urgent need for robust model verification and defense mechanisms against hidden threats in generative models. The code is available at this https URL.

Title: PriM: Principle-Inspired Material Discovery through Multi-Agent Collaboration

Authors: Zheyuan Lai, Yingming Pu
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08810
Pdf URL: https://arxiv.org/pdf/2504.08810
Copy Paste: [[2504.08810]] PriM: Principle-Inspired Material Discovery through Multi-Agent Collaboration(https://arxiv.org/abs/2504.08810)
Keywords: generation
Abstract: Complex chemical spaces and a limited, biased scope of knowledge pose immense challenges for human scientists, and equally for automated materials discovery. Existing intelligent methods rely more on numerical computation, leading to inefficient exploration and results that are hard to interpret. To bridge this gap, we introduce a principles-guided material discovery system powered by a language-inferential multi-agent system (MAS), namely PriM. Our framework integrates automated hypothesis generation with experimental validation in a roundtable system of MAS, enabling systematic exploration while maintaining scientific rigor. Based on our framework, a case study on nano helices demonstrates a higher materials exploration rate and property value while providing transparent reasoning pathways. This approach develops an automated and transparent paradigm for material discovery, with broad implications for the rational design of functional materials. Code is publicly available on GitHub at this https URL.

Title: Analogical Learning for Cross-Scenario Generalization: Framework and Application to Intelligent Localization

Authors: Zirui Chen, Zhaoyang Zhang, Ziqing Xing, Ridong Li, Zhaohui Yang, Richeng Jin, Chongwen Huang, Yuzhi Yang, Mérouane Debbah
Subjects: cs.LG, cs.CE, eess.SP
Abstract URL: https://arxiv.org/abs/2504.08811
Pdf URL: https://arxiv.org/pdf/2504.08811
Copy Paste: [[2504.08811]] Analogical Learning for Cross-Scenario Generalization: Framework and Application to Intelligent Localization(https://arxiv.org/abs/2504.08811)
Keywords: generation
Abstract: Existing learning models often exhibit poor generalization when deployed across diverse scenarios. This is mainly because the underlying reference frame of the data varies with the deployment environment and settings. However, although the data of each scenario has its own distinct reference frame, its generation generally follows the same underlying physical rule. Based on these findings, this article proposes a brand-new universal deep learning framework named analogical learning (AL), which provides a highly efficient way to implicitly retrieve the reference frame information associated with a scenario and then make accurate predictions by relative analogy across scenarios. Specifically, an elegant bipartite neural network architecture called Mateformer is designed, the first part of which calculates the relativity within multiple feature spaces between the input data and a small amount of embedded data from the current scenario, while the second part uses these relativities to guide the nonlinear analogy. We apply AL to the typical multi-scenario learning problem of intelligent wireless localization in cellular networks. Extensive experiments show that AL achieves state-of-the-art accuracy, stable transferability and robust adaptation to new scenarios without any tuning, outperforming conventional methods with a precision improvement of nearly two orders of magnitude. All data and code are available at this https URL.

Title: Datum-wise Transformer for Synthetic Tabular Data Detection in the Wild

Authors: G. Charbel N. Kindji (IRISA, MALT), Elisa Fromont (MALT, IRISA), Lina Maria Rojas-Barahona, Tanguy Urvoy
Subjects: cs.LG, cs.AI, cs.DB, cs.NE
Abstract URL: https://arxiv.org/abs/2504.08829
Pdf URL: https://arxiv.org/pdf/2504.08829
Copy Paste: [[2504.08829]] Datum-wise Transformer for Synthetic Tabular Data Detection in the Wild(https://arxiv.org/abs/2504.08829)
Keywords: generative
Abstract: The growing power of generative models raises major concerns about the authenticity of published content. To address this problem, several synthetic content detection methods have been proposed for uniformly structured media such as image or text. However, little work has been done on the detection of synthetic tabular data, despite its importance in industry and government. This form of data is complex to handle due to the diversity of its structures: the number and types of the columns may vary wildly from one table to another. We tackle the tough problem of detecting synthetic tabular data "in the wild", i.e. when the model is deployed on table structures it has never seen before. We introduce a novel datum-wise transformer architecture and show that it outperforms existing models. Furthermore, we investigate the application of domain adaptation techniques to enhance the effectiveness of our model, thereby providing a more robust data-forgery detection solution.

Title: ML For Hardware Design Interpretability: Challenges and Opportunities

Authors: Raymond Baartmans, Andrew Ensinger, Victor Agostinelli, Lizhong Chen
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2504.08852
Pdf URL: https://arxiv.org/pdf/2504.08852
Copy Paste: [[2504.08852]] ML For Hardware Design Interpretability: Challenges and Opportunities(https://arxiv.org/abs/2504.08852)
Keywords: generation
Abstract: The increasing size and complexity of machine learning (ML) models have driven the growing need for custom hardware accelerators capable of efficiently supporting ML workloads. However, the design of such accelerators remains a time-consuming process, heavily relying on engineers to manually ensure design interpretability through clear documentation and effective communication. Recent advances in large language models (LLMs) offer a promising opportunity to automate these design interpretability tasks, particularly the generation of natural language descriptions for register-transfer level (RTL) code, what we refer to as "RTL-to-NL tasks." In this paper, we examine how design interpretability, particularly in RTL-to-NL tasks, influences the efficiency of the hardware design process. We review existing work adapting LLMs for these tasks, highlight key challenges that remain unaddressed, including those related to data, computation, and model development, and identify opportunities to address them. By doing so, we aim to guide future research in leveraging ML to automate RTL-to-NL tasks and improve hardware design interpretability, thereby accelerating the hardware design process and meeting the increasing demand for custom hardware accelerators in machine learning and beyond.

Title: Knowledge Graph-extended Retrieval Augmented Generation for Question Answering

Authors: Jasper Linders, Jakub M. Tomczak
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08893
Pdf URL: https://arxiv.org/pdf/2504.08893
Copy Paste: [[2504.08893]] Knowledge Graph-extended Retrieval Augmented Generation for Question Answering(https://arxiv.org/abs/2504.08893)
Keywords: generation
Abstract: Large Language Models (LLMs) and Knowledge Graphs (KGs) offer a promising approach to robust and explainable Question Answering (QA). While LLMs excel at natural language understanding, they suffer from knowledge gaps and hallucinations. KGs provide structured knowledge but lack natural language interaction. Ideally, an AI system should be both robust to missing facts and easy to communicate with. This paper proposes such a system that integrates LLMs and KGs without requiring training, ensuring adaptability across different KGs with minimal human effort. The resulting approach can be classified as a specific form of Retrieval Augmented Generation (RAG) with a KG; thus, it is dubbed Knowledge Graph-extended Retrieval Augmented Generation (KG-RAG). It includes a question decomposition module to enhance multi-hop information retrieval and answer explainability. Using In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting, it generates explicit reasoning chains processed separately to improve truthfulness. Experiments on the MetaQA benchmark show increased accuracy for multi-hop questions, though with a slight trade-off in single-hop performance compared to LLM with KG baselines. These findings demonstrate KG-RAG's potential to improve transparency in QA by bridging unstructured language understanding with structured knowledge retrieval.
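Sketch: The multi-hop retrieval mentioned in the abstract can be pictured as following a chain of relations through a knowledge graph, one hop per sub-question. The toy graph, entity names, relation names, and the `multi_hop` helper below are all hypothetical and only illustrate the general idea; they are not the paper's decomposition module or the MetaQA data.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy knowledge graph: (head entity, relation) -> tail entities.
KnowledgeGraph = Dict[Tuple[str, str], List[str]]

toy_kg: KnowledgeGraph = {
    ("Actor X", "starred_in"): ["Film A", "Film B"],
    ("Film A", "directed_by"): ["Director D"],
    ("Film B", "directed_by"): ["Director E"],
}


def multi_hop(kg: KnowledgeGraph, start: str, relations: Sequence[str]) -> List[str]:
    """Follow `relations` in order from `start`, returning the reachable entities."""
    frontier = {start}
    for relation in relations:
        # Expand every entity in the current frontier by one hop.
        frontier = {tail
                    for head in frontier
                    for tail in kg.get((head, relation), [])}
    return sorted(frontier)


# "Who directed the films that Actor X starred in?" decomposed into two hops.
answers = multi_hop(toy_kg, "Actor X", ["starred_in", "directed_by"])
print(answers)
```

Keeping each hop as an explicit step is what makes the retrieved chain easy to show to a user as an explanation.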

Title: Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries

Authors: Neil He, Jiahong Liu, Buze Zhang, Ngoc Bui, Ali Maatouk, Menglin Yang, Irwin King, Melanie Weber, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08896
Pdf URL: https://arxiv.org/pdf/2504.08896
Copy Paste: [[2504.08896]] Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries(https://arxiv.org/abs/2504.08896)
Keywords: generation
Abstract: In the era of foundation models and Large Language Models (LLMs), Euclidean space has been the de facto geometric setting for machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. At a large scale, real-world data often exhibit inherently non-Euclidean structures, such as multi-way relationships, hierarchies, symmetries, and non-isotropic scaling, in a variety of domains, such as languages, vision, and the natural sciences. It is challenging to effectively capture these structures within the constraints of Euclidean spaces. This position paper argues that moving beyond Euclidean geometry is not merely an optional enhancement but a necessity to maintain the scaling law for the next generation of foundation models. By adopting these geometries, foundation models could more efficiently leverage the aforementioned structures. Task-aware adaptability that dynamically reconfigures embeddings to match the geometry of downstream applications could further enhance efficiency and expressivity. Our position is supported by a series of theoretical and empirical investigations of prevalent foundation models. Finally, we outline a roadmap for integrating non-Euclidean geometries into foundation models, including strategies for building geometric foundation models via fine-tuning, training from scratch, and hybrid approaches.

Title: LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping

Authors: Pascal Chang, Sergio Sancho, Jingwei Tang, Markus Gross, Vinicius C. Azevedo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08902
Pdf URL: https://arxiv.org/pdf/2504.08902
Copy Paste: [[2504.08902]] LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping(https://arxiv.org/abs/2504.08902)
Keywords: generative
Abstract: Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be through some catadioptric device like a mirror or a lens. While the construction of these mathematical devices can be traced back to as early as the 17th century, they are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce Laplacian Pyramid Warping, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams (arXiv:2311.17919) to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.
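Sketch: The abstract builds on the Laplacian pyramid, which splits a signal into band-pass detail levels plus a coarse residual and can be inverted exactly. The example below is a simplified 1D version that uses pair averaging and sample repetition in place of Gaussian filtering; those choices, and all function names, are assumptions for illustration. It does not implement the paper's warping step.

```python
from typing import List


def downsample(x: List[float]) -> List[float]:
    """Halve the resolution by averaging adjacent pairs (assumes even length)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def upsample(x: List[float]) -> List[float]:
    """Double the resolution by repeating each sample."""
    return [value for value in x for _ in range(2)]


def build_pyramid(signal: List[float], levels: int) -> List[List[float]]:
    """Return [detail_0, ..., detail_{levels-1}, coarse_residual]."""
    pyramid = []
    current = list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        # Detail = what the coarse level cannot represent at this resolution.
        detail = [c - u for c, u in zip(current, upsample(coarse))]
        pyramid.append(detail)
        current = coarse
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid: List[List[float]]) -> List[float]:
    """Invert build_pyramid by adding details back from coarse to fine."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = [u + d for u, d in zip(upsample(current), detail)]
    return current


signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
pyramid = build_pyramid(signal, levels=2)
restored = reconstruct(pyramid)
print(pyramid[-1], restored)
```

Because each frequency band is stored separately, an operation such as a warp can in principle be applied per level before reconstruction, which is the kind of frequency-aware processing the abstract refers to.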

Title: An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline

Authors: Junkyum Kim, Divya Mahajan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08930
Pdf URL: https://arxiv.org/pdf/2504.08930
Copy Paste: [[2504.08930]] An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline(https://arxiv.org/abs/2504.08930)
Keywords: generation
Abstract: Retrieval Augmented Generation (RAG) systems enhance response quality by integrating Large Language Models (LLMs) with vector databases, enabling external knowledge retrieval to support language model reasoning. While RAG enables efficient question answering with smaller LLMs, existing optimizations for vector search and LLM serving have largely been developed in isolation. As a result, their integration often leads to suboptimal end-to-end performance. ... This paper introduces VectorLiteRAG, an optimized vector index partitioning mechanism designed for RAG systems that enhances the responsiveness of the system by jointly optimizing vector search and LLM serving across CPU and GPU systems. A key challenge is to determine which indices and how much of the vector index should reside on the GPU, and to adjust LLM batch sizes to balance the pipeline for lower Time-To-First-Token (TTFT) while meeting user-defined Service-Level Objectives (SLOs). To address this, we leverage the insight that cluster access in vector databases exhibits access skew, where a subset of clusters is queried significantly more frequently than others. VectorLiteRAG exploits this property through an optimized memory distribution strategy, dynamically allocating the minimum number of vector indices corresponding to frequently accessed clusters onto the GPU HBM to ensure a balanced pipeline with the LLM for high responsiveness. This adaptive partitioning scheme is guided by a statistical model that informs memory allocation and workload distribution. Our evaluation demonstrates that VectorLiteRAG improves vector search responsiveness by 2x and significantly reduces end-to-end TTFT in RAG systems by intelligently balancing memory resources between vector search and LLM execution.
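Sketch: The access-skew insight in the abstract suggests that a small set of hot clusters can absorb most queries. The greedy selection below picks the fewest clusters whose combined access count reaches a target share of traffic. The function `select_hot_clusters`, the `target_coverage` parameter, and the counts are assumptions for illustration; the paper describes a statistical model for this decision, which this sketch does not reproduce.

```python
from typing import Dict, List


def select_hot_clusters(access_counts: Dict[str, int],
                        target_coverage: float) -> List[str]:
    """Return the smallest prefix of clusters, ordered by access count,
    whose accesses cover at least `target_coverage` of all accesses."""
    total = sum(access_counts.values())
    if total == 0:
        return []
    chosen: List[str] = []
    covered = 0
    ranked = sorted(access_counts.items(), key=lambda item: item[1], reverse=True)
    for cluster_id, count in ranked:
        if covered / total >= target_coverage:
            break
        chosen.append(cluster_id)
        covered += count
    return chosen


# Hypothetical skewed access histogram over five index clusters.
counts = {"c0": 500, "c1": 300, "c2": 100, "c3": 60, "c4": 40}
gpu_resident = select_hot_clusters(counts, target_coverage=0.8)
print(gpu_resident)
```

In this toy histogram two of five clusters already serve 80% of accesses, so only those would need to occupy GPU memory and the rest could stay on the CPU.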

Title: MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer

Authors: Yilin Wang, Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Xinxin Zuo, Juwei Lu, Hai Jiang, Li Cheng
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.08959
Pdf URL: https://arxiv.org/pdf/2504.08959
Copy Paste: [[2504.08959]] MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer(https://arxiv.org/abs/2504.08959)
Keywords: generation, generative
Abstract: Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn internal motion patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, a sliding window local attention is introduced in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms the state-of-the-art methods, which are typically GAN- or diffusion-based, in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, crowd animation, and beat-aligned dance generation, all using a single reference motion. Visit our project page: this https URL
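Sketch: Sliding window local attention restricts each token to a neighborhood of nearby tokens. The mask below marks which positions may attend to which; the symmetric window and the name `sliding_window_mask` are assumptions for illustration, since the abstract does not give the window shape or size.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j,
    i.e. when the two positions are at most `window` steps apart."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]


mask = sliding_window_mask(seq_len=5, window=1)
for row in mask:
    # Print a compact view: 1 = attention allowed, 0 = masked out.
    print("".join("1" if allowed else "0" for allowed in row))
```

Each row has at most `2 * window + 1` allowed positions, so the attention cost grows linearly with sequence length instead of quadratically.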

Title: AGENT: An Aerial Vehicle Generation and Design Tool Using Large Language Models

Authors: Colin Samplawski, Adam D. Cobb, Susmit Jha
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08981
Pdf URL: https://arxiv.org/pdf/2504.08981
Copy Paste: [[2504.08981]] AGENT: An Aerial Vehicle Generation and Design Tool Using Large Language Models(https://arxiv.org/abs/2504.08981)
Keywords: generation
Abstract: Computer-aided design (CAD) is a promising application area for emerging artificial intelligence methods. Traditional workflows for cyberphysical systems create detailed digital models which can be evaluated by physics simulators in order to narrow the search space before creating physical prototypes. A major bottleneck of this approach is that the simulators are often computationally expensive and slow. Recent advancements in AI methods offer the possibility to accelerate these pipelines. We use the recently released AircraftVerse dataset, which is especially suited for developing and evaluating large language models for designs. AircraftVerse contains a diverse set of UAV designs represented via textual design trees together with detailed physics simulation results. Following the recent success of large language models (LLMs), we propose AGENT (Aircraft GENeraTor). AGENT is a comprehensive design tool built on the CodeT5+ LLM which learns powerful representations of aircraft textual designs directly from JSON files. We develop a curriculum of training tasks which imbues a single model with a suite of useful features. AGENT is able to generate designs conditioned on properties of flight dynamics (hover time, maximum speed, etc.). Additionally, AGENT can issue evaluations of designs allowing it to act as a surrogate model of the physics simulation that underlies the AircraftVerse dataset. We present a series of experiments which demonstrate our system's abilities. We are able to achieve strong performance using the smallest member of the CodeT5+ family (220M parameters). This allows for a flexible and powerful system which can be executed on a single GPU enabling a clear path toward future deployment.

Title: Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization

Authors: Gen Li, Yang Xiao, Jie Ji, Kaiyuan Deng, Bo Hui, Linke Guo, Xiaolong Ma
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09039
Pdf URL: https://arxiv.org/pdf/2504.09039
Copy Paste: [[2504.09039]] Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization(https://arxiv.org/abs/2504.09039)
Keywords: generation, generative
Abstract: Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images from textual prompts. However, their ability to store vast amounts of knowledge raises concerns in scenarios where selective forgetting is necessary, such as removing copyrighted content, reducing biases, or eliminating harmful concepts. While existing unlearning methods can remove certain concepts, they struggle with multi-concept forgetting due to instability, residual knowledge persistence, and generation quality degradation. To address these challenges, we propose Dynamic Mask coupled with Concept-Aware Loss, a novel unlearning framework designed for multi-concept forgetting in diffusion models. Our Dynamic Mask mechanism adaptively updates gradient masks based on current optimization states, allowing selective weight modifications that prevent interference with unrelated knowledge. Additionally, our Concept-Aware Loss explicitly guides the unlearning process by enforcing semantic consistency through superclass alignment, while a regularization loss based on knowledge distillation ensures that previously unlearned concepts remain forgotten during sequential unlearning. We conduct extensive experiments to evaluate our approach. Results demonstrate that our method outperforms existing unlearning techniques in forgetting effectiveness, output fidelity, and semantic coherence, particularly in multi-concept scenarios. Our work provides a principled and flexible framework for stable and high-fidelity unlearning in generative models. The code will be released publicly.

Title: UniFlowRestore: A General Video Restoration Framework via Flow Matching and Prompt Guidance

Authors: Shuning Sun, Yu Zhang, Chen Wu, Dianjie Lu, Guijuan Zhan, Yang Weng, Zhuoran Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09069
Pdf URL: https://arxiv.org/pdf/2504.09069
Copy Paste: [[2504.09069]] UniFlowRestore: A General Video Restoration Framework via Flow Matching and Prompt Guidance(https://arxiv.org/abs/2504.09069)
Keywords: restoration
Abstract: Video imaging is often affected by complex degradations such as blur, noise, and compression artifacts. Traditional restoration methods follow a "single-task single-model" paradigm, resulting in poor generalization and high computational cost, limiting their applicability in real-world scenarios with diverse degradation types. We propose UniFlowRestore, a general video restoration framework that models restoration as a time-continuous evolution under a prompt-guided and physics-informed vector field. A physics-aware backbone PhysicsUNet encodes degradation priors as potential energy, while PromptGenerator produces task-relevant prompts as momentum. These components define a Hamiltonian system whose vector field integrates inertial dynamics, decaying physical gradients, and prompt-based guidance. The system is optimized via a fixed-step ODE solver to achieve efficient and unified restoration across tasks. Experiments show that UniFlowRestore delivers state-of-the-art performance with strong generalization and efficiency. Quantitative results demonstrate that UniFlowRestore achieves state-of-the-art performance, attaining the highest PSNR (33.89 dB) and SSIM (0.97) on the video denoising task, while maintaining top or second-best scores across all evaluated tasks.
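Sketch: The abstract reports restoration quality as PSNR in dB. For reference, PSNR is defined as 10 * log10(MAX^2 / MSE), where MAX is the peak pixel value and MSE is the mean squared error between the reference and the restored signal. The pixel values below are made up for illustration and are unrelated to the paper's results.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; returns infinity for identical inputs."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of samples")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 8-bit pixel values for a four-pixel patch.
clean = [0.0, 128.0, 255.0, 64.0]
denoised = [0.0, 130.0, 250.0, 64.0]
score = psnr(clean, denoised)
print(f"{score:.2f} dB")
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by roughly 3 dB.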

Title: Synthetic Aircraft Trajectory Generation Using Time-Based VQ-VAE

Authors: Abdulmajid Murad, Massimiliano Ruocco
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09101
Pdf URL: https://arxiv.org/pdf/2504.09101
Copy Paste: [[2504.09101]] Synthetic Aircraft Trajectory Generation Using Time-Based VQ-VAE(https://arxiv.org/abs/2504.09101)
Keywords: generation
Abstract: In modern air traffic management, generating synthetic flight trajectories has emerged as a promising solution for addressing data scarcity, protecting sensitive information, and supporting large-scale analyses. In this paper, we propose a novel method for trajectory synthesis by adapting the Time-Based Vector Quantized Variational Autoencoder (TimeVQVAE). Our approach leverages time-frequency domain processing, vector quantization, and transformer-based priors to capture both global and local dynamics in flight data. By discretizing the latent space and integrating transformer priors, the model learns long-range spatiotemporal dependencies and preserves coherence across entire flight paths. We evaluate the adapted TimeVQVAE using an extensive suite of quality, statistical, and distributional metrics, as well as a flyability assessment conducted in an open-source air traffic simulator. Results indicate that TimeVQVAE outperforms a temporal convolutional VAE baseline, generating synthetic trajectories that mirror real flight data in terms of spatial accuracy, temporal consistency, and statistical properties. Furthermore, the simulator-based assessment shows that most generated trajectories maintain operational feasibility, although occasional outliers underscore the potential need for additional domain-specific constraints. Overall, our findings underscore the importance of multi-scale representation learning for capturing complex flight behaviors and demonstrate the promise of TimeVQVAE in producing representative synthetic trajectories for downstream tasks such as model training, airspace design, and air traffic forecasting.
摘要：在现代空中交通管理中，生成合成的飞行轨迹已成为解决数据稀缺，保护敏感信息和支持大规模分析的有希望的解决方案。在本文中，我们通过调整基于时间的矢量量化变异自动编码器（TimeVQVAE）提出了一种新颖的轨迹合成方法。我们的方法利用时频域处理，矢量量化和基于变压器的先验来捕获飞行数据中的全球和局部动态。通过离散潜在空间并整合变压器先验，该模型可以学习远程时空依赖性，并保持整个飞行路径的连贯性。我们使用广泛的质量，统计和分配指标以及在开源空中交通模拟器中进行的可飞行评估来评估改编的TimeVQVAE。结果表明，TimeVQVAE的表现优于时间卷积VAE基线，从而产生合成轨迹，这些轨迹以空间精度，时间一致性和统计特性来反映实际飞行数据。此外，基于模拟器的评估表明，大多数生成的轨迹保持可行性，尽管偶尔异常值强调了对其他特定域特异性约束的潜在需求。总体而言，我们的发现强调了多尺度表示学习对于捕获复杂的飞行行为的重要性，并证明了TimeVQVAE在生产代表性的合成轨迹方面的承诺，用于下游任务，例如模型培训，空域设计和空中交通预测。
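
Illustrative note on the vector quantization step mentioned in the abstract above: a minimal sketch, assuming a tiny hand-written codebook and 2-D latent vectors (both hypothetical, not taken from the paper). It only shows the generic nearest-codebook lookup that turns continuous latents into discrete tokens; the time-frequency processing and transformer prior of TimeVQVAE are not modeled.

```python
def quantize(vectors, codebook):
    """Map each latent vector to the index of its nearest codebook entry
    (squared Euclidean distance)."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


# Hypothetical codebook and encoder outputs, for illustration only.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4)]

tokens = quantize(latents, codebook)          # discrete token ids
reconstructed = [codebook[i] for i in tokens]  # quantized latents
```

A prior model (a transformer, in the paper's description) would then be trained over sequences of such token ids rather than over the continuous latents.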

Title: Multi-modal and Multi-view Fundus Image Fusion for Retinopathy Diagnosis via Multi-scale Cross-attention and Shifted Window Self-attention

Authors: Yonghao Huang, Leiting Chen, Chuan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09106
Pdf URL: https://arxiv.org/pdf/2504.09106
Copy Paste: [[2504.09106]] Multi-modal and Multi-view Fundus Image Fusion for Retinopathy Diagnosis via Multi-scale Cross-attention and Shifted Window Self-attention(https://arxiv.org/abs/2504.09106)
Keywords: generation
Abstract: The joint interpretation of multi-modal and multi-view fundus images is critical for retinopathy prevention, as different views can show the complete 3D eyeball field and different modalities can provide complementary lesion areas. Compared with single images, the sequence relationships in multi-modal and multi-view fundus images contain long-range dependencies in lesion features. By modeling the long-range dependencies in these sequences, lesion areas can be more comprehensively mined, and modality-specific lesions can be detected. To learn the long-range dependency relationship and fuse complementary multi-scale lesion features between different fundus modalities, we design a multi-modal fundus image fusion method based on multi-scale cross-attention, which solves the static receptive field problem in previous attention-based multi-modal medical fusion methods. To capture multi-view relative positional relationships and fuse comprehensive lesion features between different views, we design a multi-view fundus image fusion method based on shifted window self-attention, which also addresses the problem that the computational complexity of self-attention-based multi-view fundus fusion is quadratic in the size and number of multi-view fundus images. Finally, we design a multi-task retinopathy diagnosis framework that combines the two proposed fusion methods to help ophthalmologists reduce workload and improve diagnostic accuracy. The experimental results on retinopathy classification and report generation tasks indicate our method's potential to improve the efficiency and reliability of retinopathy diagnosis in clinical practice, achieving a classification accuracy of 82.53% and a report generation BLEU-1 of 0.543.
摘要：多模式和多视图底面图像的联合解释对于预防视网膜病至关重要，因为不同的观点可以显示完整的3D眼球场，并且不同的方式可以提供互补的病变区域。与单个图像相比，多模式和多视图底部图像中的序列关系包含病变特征中的远程依赖性。通过对这些序列中的远距离依赖性进行建模，可以更全面地开采病变区域，并可以检测到特定于模态特异性病变。为了学习不同的底面模态之间的远距离依赖关系和融合互补的多尺度病变特征，我们设计了一种基于多尺度跨注意的多模式底面图像融合方法，该方法在以前的多模式医学融合方法中解决了基于注意的多模式医学融合方法中的静态接受场问题。为了捕获不同视图和融合不同视图之间的多视图之间的相对位置关系，我们设计了一种基于移动窗口自我发挥的多视图眼底图像融合方法，该方法还解决了基于自我的多视文化基础融合方法的计算复杂性，对自我的多次融合，对多视频的大小和多个视频遗产图像的数量是Quadratic的。最后，我们设计了一个多任务视网膜病变诊断框架，以通过结合提出的两种融合方法来帮助眼科医生减少工作量并提高诊断准确性。视网膜病变分类和报告生成任务的实验结果表明，我们的方法在临床实践中提高视网膜病变诊断的效率和可靠性的潜力，达到82.53 \％的分类准确性和报告的BLEU-1 0.543。
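
Illustrative note on the quadratic-cost remark in the abstract above: a rough sketch that counts query-key pairs for global self-attention versus attention restricted to non-overlapping windows. The token count and window size are hypothetical, and the window shifting between layers used by shifted window self-attention is deliberately left out, since it does not change the pair count.

```python
def attention_pairs(num_tokens, window=None):
    """Count query-key pairs scored by one attention layer.

    window=None  -> global self-attention: every token attends to every token.
    window=w     -> tokens are split into non-overlapping windows of size w,
                    and attention is computed only inside each window.
    """
    if window is None:
        return num_tokens * num_tokens
    if num_tokens % window:
        raise ValueError("num_tokens must be divisible by window")
    return (num_tokens // window) * window * window


# Hypothetical sizes, for illustration only.
global_cost = attention_pairs(64)              # 64 * 64
windowed_cost = attention_pairs(64, window=16)  # 4 windows * 16 * 16
```

With a fixed window size the windowed count grows linearly in the number of tokens, which is the usual motivation for window-based attention.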

Title: MASH: Masked Anchored SpHerical Distances for 3D Shape Representation and Generation

Authors: Changhao Li, Yu Xin, Xiaowei Zhou, Ariel Shamir, Hao Zhang, Ligang Liu, Ruizhen Hu
Subjects: cs.CV, cs.CG
Abstract URL: https://arxiv.org/abs/2504.09149
Pdf URL: https://arxiv.org/pdf/2504.09149
Copy Paste: [[2504.09149]] MASH: Masked Anchored SpHerical Distances for 3D Shape Representation and Generation(https://arxiv.org/abs/2504.09149)
Keywords: generation
Abstract: We introduce Masked Anchored SpHerical Distances (MASH), a novel multi-view and parametrized representation of 3D shapes. Inspired by multi-view geometry and motivated by the importance of perceptual shape understanding for learning 3D shapes, MASH represents a 3D shape as a collection of observable local surface patches, each defined by a spherical distance function emanating from an anchor point. We further leverage the compactness of spherical harmonics to encode the MASH functions, combined with a generalized view cone with a parameterized base that masks the spatial extent of the spherical function to attain locality. We develop a differentiable optimization algorithm capable of converting any point cloud into a MASH representation accurately approximating ground-truth surfaces with arbitrary geometry and topology. Extensive experiments demonstrate that MASH is versatile for multiple applications including surface reconstruction, shape generation, completion, and blending, achieving superior performance thanks to its unique representation encompassing both implicit and explicit features.
摘要：我们引入了蒙版锚定球形距离（MASH），这是3D形状的新型多视图和参数化表示。由多视图几何形状启发，并受到感知形状理解对学习3D形状的重要性的启发，MASH代表3D形状作为可观察的局部表面贴片的集合，每个形状由从锚点发出的球形距离函数定义。我们进一步利用球形谐波的紧凑性来编码MASH函数，并与广义视图锥和参数化底座相结合，掩盖了球形函数的空间范围以达到位置。我们开发了一种可区分的优化算法，能够将任何点云转换为MASH表示形式，可以准确地近似具有任意的几何形状和拓扑结构的地面真相表面。广泛的实验表明，MASH对于多种应用，包括表面重建，形状产生，完成和混合的多功能性，由于其独特的代表性涵盖了隐式和明确的特征，因此实现了卓越的性能。

Title: MatWheel: Addressing Data Scarcity in Materials Science Through Synthetic Data

Authors: Wentao Li, Yizhe Chen, Jiangjie Qiu, Xiaonan Wang
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2504.09152
Pdf URL: https://arxiv.org/pdf/2504.09152
Copy Paste: [[2504.09152]] MatWheel: Addressing Data Scarcity in Materials Science Through Synthetic Data(https://arxiv.org/abs/2504.09152)
Keywords: generation, generative
Abstract: Data scarcity and the high cost of annotation have long been persistent challenges in the field of materials science. Inspired by its potential in other fields like computer vision, we propose the MatWheel framework, which trains the material property prediction model using the synthetic data generated by the conditional generative model. We explore two scenarios: fully-supervised and semi-supervised learning. Using CGCNN for property prediction and Con-CDVAE as the conditional generative model, experiments on two data-scarce material property datasets from the Matminer database are conducted. Results show that synthetic data has potential in extreme data-scarce scenarios, achieving performance close to or exceeding that of real samples in both tasks. We also find that pseudo-labels have little impact on generated data quality. Future work will integrate advanced models and optimize generation conditions to boost the effectiveness of the materials data flywheel.
摘要：在材料科学领域，长期以来，数据稀缺和高昂的注释成本一直是持续的挑战。受到其在计算机视觉等其他领域的潜力的启发，我们提出了Matwheel框架，该框架框架使用条件生成模型生成的合成数据训练材料属性预测模型。我们探索两种情况：完全监督和半监督的学习。使用CGCNN作为条件生成模型进行属性预测和con-cdvae，从Matminer数据库上进行了两个数据筛选材料属性数据集的实验。结果表明，合成数据在极端的数据筛选方案中具有潜力，在所有两个任务中实现接近或超过实际样本的性能。我们还发现，伪标签对生成的数据质量的影响很小。未来的工作将整合高级模型并优化发电条件，以提高材料数据飞轮的有效性。

Title: Type-Constrained Code Generation with Language Models

Authors: Niels Mündler, Jingxuan He, Hao Wang, Koushik Sen, Dawn Song, Martin Vechev
Subjects: cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/2504.09246
Pdf URL: https://arxiv.org/pdf/2504.09246
Copy Paste: [[2504.09246]] Type-Constrained Code Generation with Language Models(https://arxiv.org/abs/2504.09246)
Keywords: generation
Abstract: Large language models (LLMs) have achieved notable success in code generation. However, they still frequently produce uncompilable output because their next-token inference procedure does not model formal aspects of code. Although constrained decoding is a promising approach to alleviate this issue, it has only been applied to handle either domain-specific languages or syntactic language features. This leaves typing errors, which are beyond the domain of syntax and generally hard to adequately constrain. To address this challenge, we introduce a type-constrained decoding approach that leverages type systems to guide code generation. We develop novel prefix automata for this purpose and introduce a sound approach to enforce well-typedness based on type inference and a search over inhabitable types. We formalize our approach on a simply-typed language and extend it to TypeScript to demonstrate practicality. Our evaluation on HumanEval shows that our approach reduces compilation errors by more than half and increases functional correctness in code synthesis, translation, and repair tasks across LLMs of various sizes and model families, including SOTA open-weight models with more than 30B parameters.
摘要：大型语言模型（LLM）在代码生成方面取得了显着的成功。但是，它们仍然经常产生不可译出的输出，因为他们的下一言论过程没有建模代码的正式方面。尽管受限的解码是减轻此问题的一种有前途的方法，但它仅应用于处理特定于领域的语言或句法语言特征。这会留下键入错误，这些错误超出了语法的领域，并且通常难以充分地约束。为了应对这一挑战，我们引入了一种利用类型系统来指导代码生成的类型约束解码方法。为此，我们开发了新颖的前缀自动机，并引入了一种基于类型推理和对可居住类型的搜索来实施良好型的方法。我们将我们的方法正式使用简单的语言，并将其扩展到打字稿以证明实用性。我们对HumaneVal的评估表明，我们的方法将编译误差减少了一半以上，并提高了各种大小和模型家族的LLM的代码合成，翻译和维修任务的功能正确性，包括SOTA开放式模型，具有超过30b参数。
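
Illustrative note on the idea of type-guided decoding described in the abstract above: a deliberately simplified sketch, not the paper's prefix automata or type-inhabitation search. It assumes a hypothetical typing environment `ENV`, whole-token candidates with made-up scores, and a single expected type for the expression being filled in; all names here are invented for illustration.

```python
# Hypothetical typing environment: variable name -> TypeScript-like type.
ENV = {"count": "number", "name": "string", "flags": "boolean"}


def token_type(token):
    """Return the type of a candidate token, or None if it is not a known
    variable or a simple literal."""
    if token in ENV:
        return ENV[token]
    if token in ("true", "false"):
        return "boolean"
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return "string"
    try:
        float(token)
        return "number"
    except ValueError:
        return None


def constrained_pick(scored_tokens, expected_type):
    """Pick the highest-scoring candidate whose type matches expected_type,
    or None if no candidate is admissible."""
    allowed = [(score, tok) for tok, score in scored_tokens
               if token_type(tok) == expected_type]
    return max(allowed)[1] if allowed else None


# Made-up model scores for completing: let total: number = <?>
candidates = [('"hello"', 0.6), ("name", 0.2), ("count", 0.15), ("42", 0.05)]
unconstrained = max(candidates, key=lambda c: c[1])[0]
constrained = constrained_pick(candidates, "number")
```

Here the unconstrained choice would be the string literal, which does not type-check, while the constrained choice falls back to the best-scoring `number` candidate. A real system has to make this decision on partial programs, token by token, which is what the prefix automata in the paper are for.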

Title: FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment

Authors: Sijing Wu, Yunhao Li, Ziwen Xu, Yixuan Gao, Huiyu Duan, Wei Sun, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09255
Pdf URL: https://arxiv.org/pdf/2504.09255
Copy Paste: [[2504.09255]] FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment(https://arxiv.org/abs/2504.09255)
Keywords: quality assessment
Abstract: Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and the human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face video, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we elaborately extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.
摘要：除了一般视频质量评估（VQA）之外，还应探索面部视频质量评估（FVQA），因为面部视频是社交媒体平台上的主要内容和人类视觉系统（HVS）对人的面孔特别敏感。但是，由于缺乏大规模的FVQA数据集，很少探索FVQA。为了填补这一空白，我们介绍了第一个大型野外FVQA数据集FVQ-20K，其中包含20,000个野外面部视频，以及相应的平均意见评分（MOS）注释。与FVQ-20K数据集一起，我们进一步提出了一种名为FVQ-Rater的专业FVQA方法，以实现人体样视频的评分和评分，这是探索FVQA任务的大型多模型（LMMS）潜力的首次尝试。具体而言，我们精心提取多维功能，包括空间特征，时间特征和特定面部特征（即肖像特征和面部嵌入），以提供全面的视觉信息，并利用基于Lora的指导调谐技术，以实现质量特异性的微调，在这两个fvQ-20k和CFVQ-20k和CFVVQA上都表现出优异的性能。广泛的实验和综合分析表明，FVQ-20K数据集和FVQ-RATER方法在促进FVQA的发展方面具有重要潜力。

Title: Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling

Authors: Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Danping Zou, Weiyao Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09261
Pdf URL: https://arxiv.org/pdf/2504.09261
Copy Paste: [[2504.09261]] Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling(https://arxiv.org/abs/2504.09261)
Keywords: generation
Abstract: Visual Autoregressive (VAR) models have emerged as a powerful approach for multi-modal content creation, offering high efficiency and quality across diverse multimedia applications. However, they face significant memory bottlenecks due to extensive KV cache accumulation during inference. Existing KV cache compression techniques for large language models are suboptimal for VAR models due to, as we identify in this paper, two distinct categories of attention heads in VAR models: Structural Heads, which preserve spatial coherence through diagonal attention patterns, and Contextual Heads, which maintain semantic consistency through vertical attention patterns. These differences render single-strategy KV compression techniques ineffective for VAR models. To address this, we propose HACK, a training-free Head-Aware Compression method for KV cache. HACK allocates asymmetric cache budgets and employs pattern-specific compression strategies tailored to the essential characteristics of each head category. Experiments on Infinity-2B, Infinity-8B, and VAR-d30 demonstrate its effectiveness in text-to-image and class-conditional generation tasks. HACK can hack down up to 50% and 70% of cache with minimal performance degradation for VAR-d30 and Infinity-8B, respectively. Even with 70% and 90% KV cache compression in VAR-d30 and Infinity-8B, HACK still maintains high-quality generation while reducing memory usage by 44.2% and 58.9%, respectively.
摘要：视觉自我回归（VAR）模型已成为多模式内容创建的强大方法，可在各种多媒体应用程序中提供高效率和质量。但是，由于推断期间的KV缓存积累，它们面临着大量的记忆瓶颈。大型语言模型的现有KV缓存压缩技术对于VAR模型而言是次优的，因为我们在本文中确定了VAR模型中的两个不同类别的注意力头：结构头，通过对角度注意力模式保持空间连贯性，并通过上下文的头部通过垂直注意模式保持语义一致性。这些差异使单策略KV压缩技术无效。为了解决这个问题，我们提出了Hack，这是一种用于KV缓存的无训练的头感应压缩方法。黑客分配了不对称的缓存预算，并采用了针对每个头部类别的基本特征量身定制的模式特定压缩策略。关于Infinity-2B，Infinity-8B和VAR-D30的实验证明了其在文本对象和课堂条件生成任务中的有效性。骇客可以分别减少50 \％和70 \％的缓存，而VAR-D30和Infinity-8B的性能降低最小。即使在VAR-D30和Infinity-8B中使用70 \％和90 \％KV缓存压缩，Hack仍然保持高质量的生成，而分别将记忆使用量分别降低了44.2 \％和58.9 \％。
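
Illustrative note on per-head cache budgets as described in the abstract above: a toy sketch under loud assumptions, not the HACK algorithm. The two eviction policies (keep the most recent positions for "structural" heads, keep the highest-scoring positions for "contextual" heads), the budgets, and the attention scores are all hypothetical choices made for this example.

```python
def compress_head(scores, kind, budget):
    """Return the sorted cache positions kept for one attention head.

    scores[i] is a made-up accumulated attention weight on cached position i.
    kind is "structural" or "contextual" (hypothetical policies, see above).
    """
    n = len(scores)
    budget = min(budget, n)
    if kind == "structural":
        # Assumed policy: local/diagonal pattern, keep the most recent entries.
        keep = range(n - budget, n)
    elif kind == "contextual":
        # Assumed policy: keep the entries that received the most attention.
        keep = sorted(range(n), key=lambda i: scores[i], reverse=True)[:budget]
    else:
        raise ValueError(f"unknown head kind: {kind}")
    return sorted(keep)


# Made-up scores over six cached positions; asymmetric budgets per head type.
scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3]
kept_structural = compress_head(scores, "structural", budget=2)
kept_contextual = compress_head(scores, "contextual", budget=3)
```

The point of the sketch is only that different heads can receive different budgets and different selection rules; how the paper classifies heads and sets budgets is not reproduced here.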

Title: Towards Explainable Partial-AIGC Image Quality Assessment

Authors: Jiaying Qian, Ziheng Jia, Zicheng Zhang, Zeyu Zhang, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2504.09291
Pdf URL: https://arxiv.org/pdf/2504.09291
Copy Paste: [[2504.09291]] Towards Explainable Partial-AIGC Image Quality Assessment(https://arxiv.org/abs/2504.09291)
Keywords: generation, quality assessment
Abstract: The rapid advancement of AI-driven visual generation technologies has catalyzed significant breakthroughs in image manipulation, particularly in achieving photorealistic localized editing effects on natural scene images (NSIs). Despite extensive research on image quality assessment (IQA) for AI-generated images (AGIs), most studies focus on fully AI-generated outputs (e.g., text-to-image generation), leaving the quality assessment of partial-AIGC images (PAIs), i.e., images with localized AI-driven edits, an almost unexplored field. Motivated by this gap, we construct the first large-scale PAI dataset towards explainable partial-AIGC image quality assessment (EPAIQA), the EPAIQA-15K, which includes 15K images with localized AI manipulation in different regions and over 300K multi-dimensional human ratings. Based on this, we leverage large multi-modal models (LMMs) and propose a three-stage model training paradigm. This paradigm progressively trains the LMM for editing region grounding, quantitative quality scoring, and quality explanation. Finally, we develop the EPAIQA series models, which possess explainable quality feedback capabilities. Our work represents a pioneering effort in the perceptual IQA field for comprehensive PAI quality assessment.
摘要：AI驱动的视觉生成技术的快速发展促进了图像操纵的重大突破，尤其是在实现了对自然场景图像（NSIS）的影像现实主义局部编辑效果方面。尽管针对AI生成的图像（AGIS）进行了大量研究（IQA），但大多数研究都集中于完全AI生成的输出（例如，文本到图像生成），使部分AI-AIGC图像（PAIS）图像的质量评估几乎是局部AI驱动的编辑，几乎是无预定的。在此差距的推动下，我们构建了第一个大规模的PAI数据集，该数据集朝着可解释的部分AIGC图像质量评估（EPAIQA），EPAIQA-15K，其中包括15K在不同地区和超过300K多维人类评级的局部AI操纵的图像。基于此，我们利用大型多模式模型（LMM）提出了三阶段的模型训练范式。该范式逐步训练LMM，以编辑区域接地，定量质量评分和质量解释。最后，我们开发了Epaiqa系列模型，该模型具有可解释的质量反馈功能。我们的工作代表了感知IQA领域的开创性努力，以进行全面的PAI质量评估。

Title: MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions

Authors: Tyler Spears, Shen Zhu, Yinzhu Jin, Aman Shrivastava, P. Thomas Fletcher
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09322
Pdf URL: https://arxiv.org/pdf/2504.09322
Copy Paste: [[2504.09322]] MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions(https://arxiv.org/abs/2504.09322)
Keywords: generation, generative
Abstract: In this work, we introduce MedIL, a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions for image generation. Medical images are often large and heterogeneous, where fine details are of vital clinical importance. Image properties change drastically when considering acquisition equipment, patient demographics, and pathology, making realistic medical image generation challenging. Recent work in latent diffusion models (LDMs) has shown success in generating images resampled to a fixed size. However, this is a narrow subset of the resolutions native to image acquisition, and resampling discards fine anatomical details. MedIL utilizes implicit neural representations to treat images as continuous signals, where encoding and decoding can be performed at arbitrary resolutions without prior resampling. We quantitatively and qualitatively show how MedIL compresses and preserves clinically-relevant features over large multi-site, multi-resolution datasets of both T1w brain MRIs and lung CTs. We further demonstrate how MedIL can influence the quality of images generated with a diffusion model, and discuss how MedIL can enhance generative models to resemble raw clinical acquisitions.
摘要：在这项工作中，我们介绍了Medil，这是一种旨在编码具有异质尺寸和图像生成分辨率的医学图像的首个自动编码器。医学图像通常很大且异质性，精细的细节至关重要。在考虑采集设备，患者人口统计和病理学时，图像属性会发生巨大变化，从而使医学图像产生具有挑战性。潜在扩散模型（LDMS）的最新工作显示在生成重新采样到固定大小的图像方面取得了成功。但是，这是原生习惯的分辨率的一个狭窄子集，并重新采样丢弃了精细的解剖细节。 Medil利用隐式神经表示作为连续信号将图像视为连续信号，在这种信号中，可以在不事先重新采样的情况下以任意分辨率进行编码和解码。我们对Medil的大型多站点多分辨率数据集的T1W Brain MRI和肺CTS的大型多站点数据集进行定量和定性地表明Medil如何压缩和保存与临床相关的特征。我们进一步证明了Medil如何影响扩散模型产生的图像的质量，并讨论Medil如何增强生成模型以类似于原始的临床采集。

Title: Text To 3D Object Generation For Scalable Room Assembly

Authors: Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09328
Pdf URL: https://arxiv.org/pdf/2504.09328
Copy Paste: [[2504.09328]] Text To 3D Object Generation For Scalable Room Assembly(https://arxiv.org/abs/2504.09328)
Keywords: generation
Abstract: Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for synthetic data generation for scalable, high-quality, and customizable 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, this system generates high-fidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of currently available data, which is generally manually crafted by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.
摘要：现代的机器学习模型用于场景理解，例如深度估计和对象跟踪，依赖于模仿现实世界部署场景的大型高质量数据集。为了解决数据稀缺性，我们为可扩展，高质量和可自定义的3D室内场景的合成数据生成提供了端到端系统。通过将文本对图像和多视图扩散模型与基于神经辐射场的网格融合在一起，该系统从文本提示中生成了HighFidelity 3D对象资产，并使用渲染工具将它们合并到预定的平面图中。通过将新颖的损失功能和培训策略引入现有方法，该系统支持按需场景生成，旨在减轻当前可用数据的稀缺性，通常由艺术家手动制作。该系统推进了综合数据在解决机器学习训练限制中的作用，从而为现实世界应用提供了更健壮和可推广的模型。

Title: REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis

Authors: Duy-Cat Can, Quang-Huy Tang, Huong Ha, Binh T. Nguyen, Oliver Y. Chén
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2504.09354
Pdf URL: https://arxiv.org/pdf/2504.09354
Copy Paste: [[2504.09354]] REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis(https://arxiv.org/abs/2504.09354)
Keywords: generative
Abstract: Timely and accurate diagnosis of neurodegenerative disorders, such as Alzheimer's disease, is central to disease management. Existing deep learning models require large-scale annotated datasets and often function as "black boxes". Additionally, datasets in clinical practice are frequently small or unlabeled, restricting the full potential of deep learning methods. Here, we introduce REMEMBER -- Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning -- a new machine learning framework that facilitates zero- and few-shot Alzheimer's diagnosis using brain MRI scans through a reference-based reasoning process. Specifically, REMEMBER first trains a contrastively aligned vision-text model using expert-annotated reference data and extends pseudo-text modalities that encode abnormality types, diagnosis labels, and composite clinical descriptions. Then, at inference time, REMEMBER retrieves similar, human-validated cases from a curated dataset and integrates their contextual information through a dedicated evidence encoding module and attention-based inference head. Such an evidence-guided design enables REMEMBER to imitate the real-world clinical decision-making process by grounding predictions in retrieved imaging and textual context. Specifically, REMEMBER outputs diagnostic predictions alongside an interpretable report, including reference images and explanations aligned with clinical workflows. Experimental results demonstrate that REMEMBER achieves robust zero- and few-shot performance and offers a powerful and explainable framework for neuroimaging-based diagnosis in the real world, especially under limited data.
摘要：及时，准确地诊断神经退行性疾病，例如阿尔茨海默氏病，是疾病管理的核心。 Existing deep learning models require large-scale annotated datasets and often function as "black boxes".此外，临床实践中的数据集通常是小的或未标记的，限制了深度学习方法的全部潜力。在这里，我们介绍了记住 - 基于检索的可解释的多模式循证指导的大脑评估和推理建模 - 一种新的机器学习框架，通过基于参考的推理过程，使用脑部MRI扫描来促进零和少数阿尔茨海默氏症的诊断。具体而言，请记住首先使用专家注册参考数据对比对齐的视觉文本模型，并扩展了编码异常类型，诊断标签和综合临床描述的伪文本模式。然后，在推理时，请记住从策划数据集中检索类似的人类验证案例，并通过编码模块和基于注意力的推理头的专用证据来整合其上下文信息。这种循证指导的设计使得记住可以通过在检索成像和文本上下文中进行预测来模仿现实世界中的临床决策过程。具体来说，请记住输出诊断预测以及可解释的报告，包括参考图像和与临床工作流程一致的解释。实验结果表明，请记住实现稳健的零和少量性能，并为现实世界中的基于神经成像的诊断提供了一个强大而可解释的框架，尤其是在有限的数据下。

Title: Beyond Degradation Conditions: All-in-One Image Restoration via HOG Transformers

Authors: Jiawei Wu, Zhifei Yang, Zhe Wang, Zhi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09377
Pdf URL: https://arxiv.org/pdf/2504.09377
Copy Paste: [[2504.09377]] Beyond Degradation Conditions: All-in-One Image Restoration via HOG Transformers(https://arxiv.org/abs/2504.09377)
Keywords: restoration
Abstract: All-in-one image restoration, which aims to address diverse degradations within a unified framework, is critical for practical applications. However, existing methods rely on predicting and integrating degradation conditions, which can misactivate degradation-specific features in complex scenarios, limiting their restoration performance. To address this issue, we propose a novel all-in-one image restoration framework guided by Histograms of Oriented Gradients (HOG), named HOGformer. By leveraging the degradation-discriminative capability of HOG descriptors, HOGformer employs a dynamic self-attention mechanism that adaptively attends to long-range spatial dependencies based on degradation-aware HOG cues. To enhance the degradation sensitivity of attention inputs, we design a HOG-guided local dynamic-range convolution module that captures long-range degradation similarities while maintaining awareness of global structural information. Furthermore, we propose a dynamic interaction feed-forward module, efficiently increasing the model capacity to adapt to different degradations through channel-spatial interactions. Extensive experiments across diverse benchmarks, including adverse weather and natural degradations, demonstrate that HOGformer achieves state-of-the-art performance and generalizes effectively to complex real-world degradations. Code is available at this https URL.
摘要：旨在解决统一框架内各种降解的多合一图像恢复对于实际应用至关重要。但是，现有方法依赖于预测和整合降解条件，这些条件可能会在复杂的场景中误导特定于降解的特征，从而限制其恢复性能。为了解决这个问题，我们提出了一个新型的多合一图像恢复框架，该框架由定向梯度（HOG）的直方图，名为Hogformer。通过利用猪描述符的降解歧义能力，Hogformer采用了动态的自我注意力发项机制，该机制可适应基于降解感知的猪提示，以适应远程空间依赖性。为了增强注意力输入的降解敏感性，我们设计了一个猪引导的局部动力范围卷积模块，该模块捕获了远程降解相似性，同时保持了对全球结构信息的认识。 Furthermore, we propose a dynamic interaction feed-forward module, efficiently increasing the model capacity to adapt to different degradations through channel-spatial interactions.跨不同基准测试的广泛实验，包括不利的天气和自然降解，表明，Hogformer实现了最新的性能，并有效地推广到复杂的现实世界中的降级。代码可在此HTTPS URL上找到。
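
Illustrative note on the Histogram of Oriented Gradients descriptor referenced in the abstract above: a simplified single-cell sketch, assuming a small grayscale patch given as nested lists, central-difference gradients, unsigned orientations in [0, 180), and hard binning. Standard HOG pipelines add bin interpolation and block normalization, which are omitted here; this is not HOGformer's module.

```python
import math


def hog_cell(patch, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for the
    interior pixels of one grayscale patch (list of equal-length rows)."""
    height, width = len(patch), len(patch[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]  # horizontal central difference
            gy = patch[y + 1][x] - patch[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # "% bins" guards against an angle that rounds up to exactly 180.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Hypothetical 4x4 patch containing a vertical edge (dark left, bright right).
patch = [[0, 0, 1, 1] for _ in range(4)]
hist = hog_cell(patch)
```

For this patch all gradient energy lands in the first orientation bin, because every interior gradient points horizontally. Different degradations (blur, noise, rain streaks) change such histograms in different ways, which is the property the abstract relies on.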

Title: Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance

Authors: Jiahua Xu, Dawei Zhou, Lei Hu, Zaiyi Liu, Nannan Wang, Xinbo Gao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.09441
Pdf URL: https://arxiv.org/pdf/2504.09441
Copy Paste: [[2504.09441]] Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance(https://arxiv.org/abs/2504.09441)
Keywords: generation
Abstract: Multimodal medical images play a crucial role in precise and comprehensive clinical diagnosis. Diffusion models are a powerful strategy for synthesizing the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency for enhancing global low-frequency features and effective high-frequency details as well as suppressing high-frequency noise. To further overcome the challenges posed by the large differences between different medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features, to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.
摘要：多模式医学图像在精确和全面的临床诊断中起着至关重要的作用。扩散模型是合成所需医学图像的有力策略。但是，由于高频信息的过度拟合和低频信息的削弱，现有方法仍然遭受解剖结构失真的问题。因此，我们提出了一种基于动态频率平衡和知识指导的新方法。具体而言，我们首先使用小波变换分解模型的关键特征来提取低频和高频组件。然后，动态频率平衡模块被设计为适应性调整频率，以增强全局低频功能和有效的高频细节以及抑制高频噪声。为了进一步克服不同医学模式之间的巨大差异所带来的挑战，我们构建了一种知识引导的机制，该机制将视觉语言模型带有视觉特征的先前临床知识融合在一起，以促进准确的解剖结构的产生。多个数据集的实验评估表明，所提出的方法在定性和定量评估方面取得了重大改进，从而验证其有效性和优势。
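
Illustrative note on splitting a signal into low- and high-frequency components with a wavelet transform, as mentioned in the abstract above: a minimal sketch, assuming a one-level, unnormalized (averaging) Haar wavelet on a 1-D list. The abstract does not say which wavelet is used, so Haar is an assumption made only because it is the simplest case.

```python
def haar_1d(signal):
    """One-level averaging Haar transform.

    Returns (low, high): pairwise averages (low-frequency part) and pairwise
    half-differences (high-frequency part)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low, high


def inverse_haar_1d(low, high):
    """Rebuild the original signal from its low/high components."""
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out


# Hypothetical feature row, for illustration only.
signal = [4, 2, 5, 5]
low, high = haar_1d(signal)
restored = inverse_haar_1d(low, high)
```

Because the transform is invertible, the two bands can be reweighted separately (for example, damping `high` to suppress noise) before reconstruction, which is the kind of frequency balancing the abstract describes at a high level.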

Title: FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks

Authors: Tianyi Wang, Harry Cheng, Ming-Hui Liu, Mohan Kankanhalli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09451
Pdf URL: https://arxiv.org/pdf/2504.09451
Copy Paste: [[2504.09451]] FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks(https://arxiv.org/abs/2504.09451)
Keywords: generation
Abstract: Proactive Deepfake detection via robust watermarks has been raised ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the detection performance accordingly. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and conducts one-way encryption regarding the parameters selected. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Meanwhile, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.
摘要：自从被动深层检测器遇到识别高质量合成图像时，通过强大的水印进行了积极的深层检测。但是，在证明合理的检测性能的同时，它们缺乏定位功能和检测结果中的解释性。此外，水印的不稳定鲁棒性可以显着影响检测性能。在这项研究中，我们提出了用于主动的深层检测和定位的新型分形水印，即分形福镜。从分形的特征中受益，我们设计了一个由参数驱动的水印生成管道，该管道得出基于分形的水印，并就所选参数进行单向加密。随后，我们为水印嵌入和恢复提出了一个半碎片的水印框架，在黑色盒子设置中面对深层操作时，接受了良好的良性图像处理操作和脆弱的训练。同时，我们引入了一种进入点键入策略，该策略将水印矩阵条目隐式嵌入到相应位置的图像贴片中，从而实现了深击操作的定位。广泛的实验表明，我们的方法令人满意地对抗常见的图像处理操作和深层操作，表现优于最先进的半污染水印算法和用于深膜检测的被动探测器。此外，通过强调操纵的区域，我们的方法为主动的深层检测结果提供了解释性。

Title: D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation

Authors: Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, Zhendong Mao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09454
Pdf URL: https://arxiv.org/pdf/2504.09454
Copy Paste: [[2504.09454]] D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation(https://arxiv.org/abs/2504.09454)
Keywords: generation
Abstract: Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) in the first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D$^2$iT) in the second stage generates images by predicting multi-grained noise, consisting of coarse-grained (fewer latent codes in smooth regions) and fine-grained (more latent codes in detailed regions), through a novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough prediction of noise with detailed regions correction achieves a unification of global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at this https URL.
摘要：扩散模型因其产生高保真图像的能力而被广泛认可。尽管扩散变压器（DIT）结构具有出色的性能和可扩展性，但在扩散过程中，它在不同的图像区域进行了固定的压缩，无视这些区域中存在的自然变化信息密度。但是，大型压缩会导致本地现实主义有限，而小压缩会增加计算复杂性并损害全球一致性，最终影响产生的图像的质量。 To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for扩散过程。（2）在第二阶段的动态扩散变压器（D $^2 $ IT）通过预测多透明的噪声来生成图像，该噪声由粗粒（平滑区域中的潜在代码较少）和细粒度（详细区域中的更潜在代码）组成，通过动态晶粒变压器的新型组合和动态含量变压器组成。将噪声与详细区域进行粗略预测的策略校正实现了全球一致性和当地现实主义的统一。对各种一代任务的全面实验验证了我们方法的有效性。代码将在此HTTPS URL上发布。

Title: Comorbidity-Informed Transfer Learning for Neuro-developmental Disorder Diagnosis

Authors: Xin Wen, Shijie Guo, Wenbo Ning, Rui Cao, Jie Xiang, Xiaobo Liu, Jintai Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09463
Pdf URL: https://arxiv.org/pdf/2504.09463
Copy Paste: [[2504.09463]] Comorbidity-Informed Transfer Learning for Neuro-developmental Disorder Diagnosis(https://arxiv.org/abs/2504.09463)
Keywords: generation
Abstract: Neuro-developmental disorders are manifested as dysfunctions in cognition, communication, behaviour and adaptability, and deep learning-based computer-aided diagnosis (CAD) can alleviate the increasingly strained healthcare resources on neuroimaging. However, neuroimaging such as fMRI contains complex spatio-temporal features, which makes the corresponding representations susceptible to a variety of distractions and thus makes CAD less effective. For the first time, we present a Comorbidity-Informed Transfer Learning (CITL) framework for diagnosing neuro-developmental disorders using fMRI. In CITL, a new reinforced representation generation network is proposed, which first combines transfer learning with pseudo-labelling to remove interfering patterns from the temporal domain of fMRI and then generates new representations using an encoder-decoder architecture. The new representations are then used to train an architecturally simple classification network to obtain the CAD model. In particular, the framework fully considers the comorbidity mechanisms of neuro-developmental disorders and effectively integrates them with semi-supervised learning and transfer learning, providing a new interdisciplinary perspective. Experimental results demonstrate that CITL achieves competitive accuracies of 76.32% and 73.15% for detecting autism spectrum disorder and attention deficit hyperactivity disorder, respectively, outperforming existing related transfer learning work by 7.2% and 0.5%, respectively.

Title: CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models

Authors: Pooja Guhan, Divya Kothandaraman, Tsung-Wei Huang, Guan-Ming Su, Dinesh Manocha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09472
Pdf URL: https://arxiv.org/pdf/2504.09472
Copy Paste: [[2504.09472]] CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models(https://arxiv.org/abs/2504.09472)
Keywords: generation
Abstract: We introduce CamMimic, an innovative algorithm tailored for dynamic video editing needs. It is designed to seamlessly transfer the camera motion observed in a given reference video onto any scene of the user's choice in a zero-shot manner without requiring any additional data. Our algorithm achieves this using a two-phase strategy by leveraging a text-to-video diffusion model. In the first phase, we develop a multi-concept learning method using a combination of LoRA layers and an orthogonality loss to capture and understand the underlying spatial-temporal characteristics of the reference video as well as the spatial features of the user's desired scene. The second phase proposes a unique homography-based refinement strategy to enhance the temporal and spatial alignment of the generated video. We demonstrate the efficacy of our method through experiments conducted on a dataset containing combinations of diverse scenes and reference videos containing a variety of camera motions. In the absence of an established metric for assessing camera motion transfer between unrelated scenes, we propose CameraScore, a novel metric that utilizes homography representations to measure camera motion similarity between the reference and generated videos. Extensive quantitative and qualitative evaluations demonstrate that our approach generates high-quality, motion-enhanced videos. Additionally, a user study reveals that 70.31% of participants preferred our method for scene preservation, while 90.45% favored it for motion transfer. We hope this work lays the foundation for future advancements in camera motion transfer across different scenes.

Title: GenEDA: Unleashing Generative Reasoning on Netlist via Multimodal Encoder-Decoder Aligned Foundation Model

Authors: Wenji Fang, Jing Wang, Yao Lu, Shang Liu, Zhiyao Xie
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2504.09485
Pdf URL: https://arxiv.org/pdf/2504.09485
Copy Paste: [[2504.09485]] GenEDA: Unleashing Generative Reasoning on Netlist via Multimodal Encoder-Decoder Aligned Foundation Model(https://arxiv.org/abs/2504.09485)
Keywords: generation, generative
Abstract: The success of foundation AI has motivated research on circuit foundation models, which are customized to assist the integrated circuit (IC) design process. However, existing pre-trained circuit models are typically limited to standalone encoders for predictive tasks or decoders for generative tasks. These two model types are developed independently, operate on different circuit modalities, and reside in separate latent spaces, which restricts their ability to complement each other for more advanced applications. In this work, we present GenEDA, the first framework that aligns circuit encoders with decoders within a shared latent space. GenEDA bridges the gap between graph-based circuit representations and text-based large language models (LLMs), enabling communication between their respective latent spaces. To achieve the alignment, we propose two paradigms that support both open-source trainable LLMs and commercial frozen LLMs. Built on this aligned architecture, GenEDA enables three unprecedented generative reasoning tasks over netlists, where the model reversely generates the high-level functionality from low-level netlists at different granularities. These tasks extend traditional gate-type prediction to direct generation of full-circuit functionality. Experiments demonstrate that GenEDA significantly boosts the performance of advanced LLMs (e.g., GPT-4o and DeepSeek-V3) in all tasks.

Title: PCM-SAR: Physics-Driven Contrastive Mutual Learning for SAR Classification

Authors: Pengfei Wang, Hao Zheng, Zhigang Hu, Aikun Xu, Meiguang Zheng, Liu Yang (School of Computer Science and Engineering, Central South University, Changsha, China)
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09502
Pdf URL: https://arxiv.org/pdf/2504.09502
Copy Paste: [[2504.09502]] PCM-SAR: Physics-Driven Contrastive Mutual Learning for SAR Classification(https://arxiv.org/abs/2504.09502)
Keywords: generation
Abstract: Existing SAR image classification methods based on Contrastive Learning often rely on sample generation strategies designed for optical images, failing to capture the distinct semantic and physical characteristics of SAR data. To address this, we propose Physics-Driven Contrastive Mutual Learning for SAR Classification (PCM-SAR), which incorporates domain-specific physical insights to improve sample generation and feature extraction. PCM-SAR utilizes the gray-level co-occurrence matrix (GLCM) to simulate realistic noise patterns and applies semantic detection for unsupervised local sampling, ensuring generated samples accurately reflect SAR imaging properties. Additionally, a multi-level feature fusion mechanism based on mutual learning enables collaborative refinement of feature representations. Notably, PCM-SAR significantly enhances smaller models by refining SAR feature representations, compensating for their limited capacity. Experimental results show that PCM-SAR consistently outperforms SOTA methods across diverse datasets and SAR classification tasks.
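The PCM-SAR abstract names the gray-level co-occurrence matrix (GLCM) but does not say how it is turned into noise patterns. The sketch below only illustrates the standard GLCM definition and one common texture statistic derived from it (contrast); the function names `glcm` and `glcm_contrast`, the offset convention, and the toy image are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0, normalize=True):
    """Count how often gray level i co-occurs with level j at offset (dy, dx).

    `image` is a 2-D array of integers in [0, levels). Hypothetical helper.
    """
    image = np.asarray(image)
    h, w = image.shape
    # Restrict to positions where both the pixel and its neighbor are in bounds.
    r0, r1 = max(0, -dy), min(h, h - dy)
    c0, c1 = max(0, -dx), min(w, w - dx)
    src = image[r0:r1, c0:c1]
    dst = image[r0 + dy:r1 + dy, c0 + dx:c1 + dx]
    m = np.zeros((levels, levels), dtype=np.float64)
    # np.add.at accumulates repeated (i, j) index pairs correctly.
    np.add.at(m, (src.ravel(), dst.ravel()), 1.0)
    if normalize and m.sum() > 0:
        m /= m.sum()
    return m

def glcm_contrast(p):
    """Contrast statistic: sum over (i, j) of p[i, j] * (i - j)^2."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

if __name__ == "__main__":
    toy = np.array([[0, 0, 1],
                    [1, 2, 2],
                    [2, 2, 0]])
    p = glcm(toy, levels=3)           # horizontal neighbor, one pixel to the right
    print(p)
    print(glcm_contrast(p))
```

A smooth texture concentrates mass near the diagonal of the matrix and gives low contrast, while speckle-like texture spreads mass away from it, which is presumably why such statistics are useful for describing SAR imagery.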

Title: DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion

Authors: Puyu Han, Jiaju Kang, Yuhang Pan, Erting Pan, Zeyu Zhang, Qunchao Jin, Juntao Jiang, Zhichen Liu, Luqi Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09513
Pdf URL: https://arxiv.org/pdf/2504.09513
Copy Paste: [[2504.09513]] DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion(https://arxiv.org/abs/2504.09513)
Keywords: restoration, generation
Abstract: Large-scale pre-trained diffusion models have produced excellent results in the field of conditional image generation. However, restoration of ancient murals, as an important downstream task in this field, poses significant challenges to diffusion model-based restoration methods due to its large defective area and scarce training samples. Conditional restoration tasks are more concerned with whether the restored part meets the aesthetic standards of mural restoration in terms of overall style and seam detail, and such metrics for evaluating heuristic image complements are lacking in current research. We therefore propose DiffuMural, a combined Multi-scale convergence and Collaborative Diffusion mechanism with ControlNet and cyclic consistency loss to optimise the matching between the generated images and the conditional control. DiffuMural demonstrates outstanding capabilities in mural restoration, leveraging training data from 23 large-scale Dunhuang murals that exhibit consistent visual aesthetics. The model excels in restoring intricate details, achieving a coherent overall appearance, and addressing the unique challenges posed by incomplete murals lacking factual grounding. Our evaluation framework incorporates four key metrics to quantitatively assess incomplete murals: factual accuracy, textural detail, contextual semantics, and holistic visual coherence. Furthermore, we integrate humanistic value assessments to ensure the restored murals retain their cultural and artistic significance. Extensive experiments validate that our method outperforms state-of-the-art (SOTA) approaches in both qualitative and quantitative metrics.

Title: 3D CoCa: Contrastive Learners are 3D Captioners

Authors: Ting Huang, Zeyu Zhang, Yemin Wang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09518
Pdf URL: https://arxiv.org/pdf/2504.09518
Copy Paste: [[2504.09518]] 3D CoCa: Contrastive Learners are 3D Captioners(https://arxiv.org/abs/2504.09518)
Keywords: generation
Abstract: 3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and weak cross-modal alignment in existing methods. To address these challenges, we propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation in a single architecture. Our approach leverages a frozen CLIP vision-language backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to capture geometric context, and a multi-modal decoder to generate descriptive captions. Unlike prior two-stage methods that rely on explicit object proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a shared feature space, eliminating the need for external detectors or handcrafted proposals. This joint training paradigm yields stronger spatial reasoning and richer semantic grounding by aligning 3D and textual representations. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that 3D CoCa significantly outperforms the current state of the art by 10.2% and 5.76% in CIDEr at 0.5 IoU, respectively. Code will be available at this https URL.
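The 3D CoCa abstract says the contrastive and captioning objectives are optimized jointly but gives no formula. A generic form of such a joint objective, written under assumed notation (scene embeddings $s_i$, text embeddings $t_i$, temperature $\tau$, caption tokens $y_{i,k}$, and weights $\lambda_{\mathrm{con}}, \lambda_{\mathrm{cap}}$ are all illustrative symbols, not taken from the paper), might look like:

```latex
\begin{align}
\mathcal{L}_{\mathrm{con}} &= -\frac{1}{N}\sum_{i=1}^{N}
  \log \frac{\exp\!\left(s_i^{\top} t_i / \tau\right)}
            {\sum_{j=1}^{N} \exp\!\left(s_i^{\top} t_j / \tau\right)}, \\
\mathcal{L}_{\mathrm{cap}} &= -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K_i}
  \log p_{\theta}\!\left(y_{i,k} \mid y_{i,<k},\, s_i\right), \\
\mathcal{L} &= \lambda_{\mathrm{con}}\,\mathcal{L}_{\mathrm{con}}
             + \lambda_{\mathrm{cap}}\,\mathcal{L}_{\mathrm{cap}}.
\end{align}
```

Only the scene-to-text direction of the contrastive term is shown; symmetric variants add the text-to-scene direction as well. Whether 3D CoCa uses this exact weighting is not stated in the abstract.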

Title: AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

Authors: Xing Zi, Tengjun Ni, Xianjing Fan, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2504.09528
Pdf URL: https://arxiv.org/pdf/2504.09528
Copy Paste: [[2504.09528]] AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions(https://arxiv.org/abs/2504.09528)
Keywords: generation
Abstract: Accurate and automated captioning of aerial imagery is crucial for applications like environmental monitoring, urban planning, and disaster management. However, this task remains challenging due to complex spatial semantics and domain variability. To address these issues, we introduce AeroLite, a lightweight, tag-guided captioning framework designed to equip small-scale language models (1-3B parameters) with robust and interpretable captioning capabilities specifically for remote sensing images. AeroLite leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset by integrating multiple remote sensing benchmarks, including DLRSD, iSAID, LoveDA, WHU, and RSSCN7. To explicitly capture key semantic elements such as orientation and land-use types, AeroLite employs natural language processing techniques to extract relevant semantic tags. These tags are then learned by a dedicated multi-label CLIP encoder, ensuring precise semantic predictions. To effectively fuse visual and semantic information, we propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings while maintaining minimal computational overhead. AeroLite's flexible design also enables seamless integration with various pretrained large language models. We adopt a two-stage LoRA-based training approach: the initial stage leverages our pseudo-caption dataset to capture broad remote sensing semantics, followed by fine-tuning on smaller, curated datasets like UCM and Sydney Captions to refine domain-specific alignment. Experimental evaluations demonstrate that AeroLite surpasses significantly larger models (e.g., 13B parameters) in standard captioning metrics, including BLEU and METEOR, while maintaining substantially lower computational costs.

Title: Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders

Authors: Shuchao Duan, Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09530
Pdf URL: https://arxiv.org/pdf/2504.09530
Copy Paste: [[2504.09530]] Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders(https://arxiv.org/abs/2504.09530)
Keywords: quality assessment
Abstract: Automated facial expression quality assessment (FEQA) in neurological disorders is critical for enhancing diagnostic accuracy and improving patient care, yet effectively capturing the subtle motions and nuances of facial muscle movements remains a challenge. We propose to analyse facial landmark trajectories, a compact yet informative representation that encodes these subtle motions from a high-level structural perspective. Hence, we introduce Trajectory-guided Motion Perception Transformer (TraMP-Former), a novel FEQA framework that fuses landmark trajectory features for fine-grained motion capture with visual semantic cues from RGB frames, ultimately regressing the combined features into a quality score. Extensive experiments demonstrate that TraMP-Former achieves new state-of-the-art performance on benchmark datasets with neurological disorders, including PFED5 (up by 6.51%) and an augmented Toronto NeuroFace (up by 7.62%). Our ablation studies further validate the efficiency and effectiveness of landmark trajectories in FEQA. Our code is available at this https URL.

Title: FastRSR: Efficient and Accurate Road Surface Reconstruction from Bird's Eye View

Authors: Yuting Zhao, Yuheng Ji, Xiaoshuai Hao, Shuxiao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09535
Pdf URL: https://arxiv.org/pdf/2504.09535
Copy Paste: [[2504.09535]] FastRSR: Efficient and Accurate Road Surface Reconstruction from Bird's Eye View(https://arxiv.org/abs/2504.09535)
Keywords: generation
Abstract: Road Surface Reconstruction (RSR) is crucial for autonomous driving, enabling the understanding of road surface conditions. Recently, RSR from the Bird's Eye View (BEV) has gained attention for its potential to enhance performance. However, existing methods for transforming perspective views to BEV face challenges such as information loss and representation sparsity. Moreover, stereo matching in BEV is limited by the need to balance accuracy with inference speed. To address these challenges, we propose two efficient and accurate BEV-based RSR models: FastRSR-mono and FastRSR-stereo. Specifically, we first introduce Depth-Aware Projection (DAP), an efficient view transformation strategy designed to mitigate information loss and sparsity by querying depth and image features to aggregate BEV data within specific road surface regions using a pre-computed look-up table. To optimize accuracy and speed in stereo matching, we design the Spatial Attention Enhancement (SAE) and Confidence Attention Generation (CAG) modules. SAE adaptively highlights important regions, while CAG focuses on high-confidence predictions and filters out irrelevant information. FastRSR achieves state-of-the-art performance, exceeding monocular competitors by over 6.0% in elevation absolute error and providing at least a 3.0x speedup by stereo methods on the RSRD dataset. The source code will be released.

Title: SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Authors: Xiang Hu, Pingping Zhang, Yuhao Wang, Bin Yan, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09549
Pdf URL: https://arxiv.org/pdf/2504.09549
Copy Paste: [[2504.09549]] SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification(https://arxiv.org/abs/2504.09549)
Keywords: generative
Abstract: Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative ReID models to maintain identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust network is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's capability to represent persons. To address these issues, we propose a novel two-stage feature learning framework named SD-ReID for AG-ReID, which takes advantage of the powerful understanding capacity of generative models, e.g., Stable Diffusion (SD), to generate view-specific features between different viewpoints. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. Then, in the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions. Furthermore, we propose the View-Refine Decoder (VRD) to obtain additional controllable conditions to generate missing cross-view features. Finally, we use the coarse-grained representations and all-view features generated by SD to retrieve target persons. Extensive experiments on the AG-ReID benchmarks demonstrate the effectiveness of our proposed SD-ReID. The source code will be available upon acceptance.

Title: Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark

Authors: Jinhao Li, Zijian Chen, Runze Dong, Tingzhu Chen, Changbo Wang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09555
Pdf URL: https://arxiv.org/pdf/2504.09555
Copy Paste: [[2504.09555]] Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark(https://arxiv.org/abs/2504.09555)
Keywords: generation, generative
Abstract: The oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, the existing OBI datasets suffer from a long-tail distribution problem, leading to biased performance of OBI recognition models across majority and minority classes. With recent advancements in generative models, OBI synthesis-based data augmentation has become a promising avenue to expand the sample size of minority classes. Unfortunately, current OBI datasets lack large-scale structure-aligned image pairs for generative model training. To address these problems, we first present the Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo OBI generator, called OBIDiff, to achieve realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it can effectively transfer the noise style of the original rubbing to the glyph image. Extensive experiments on OBI downstream tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff can accurately preserve inherent glyph structures while transferring authentic rubbing styles effectively.

Title: DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning

Authors: Yining Zhao, Ali Braytee, Mukesh Prasad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09598
Pdf URL: https://arxiv.org/pdf/2504.09598
Copy Paste: [[2504.09598]] DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning(https://arxiv.org/abs/2504.09598)
Keywords: generation
Abstract: Medical image captioning via vision-language models has shown promising potential for clinical diagnosis assistance. However, generating contextually relevant descriptions with accurate modality recognition remains challenging. We present DualPrompt-MedCap, a novel dual-prompt enhancement framework that augments Large Vision-Language Models (LVLMs) through two specialized components: (1) a modality-aware prompt derived from a semi-supervised classification model pretrained on medical question-answer pairs, and (2) a question-guided prompt leveraging biomedical language model embeddings. To address the lack of captioning ground truth, we also propose an evaluation framework that jointly considers spatial-semantic relevance and medical narrative quality. Experiments on multiple medical datasets demonstrate that DualPrompt-MedCap outperforms the baseline BLIP-3 by achieving a 22% improvement in modality recognition accuracy while generating more comprehensive and question-aligned descriptions. Our method enables the generation of clinically accurate reports that can serve as medical experts' prior knowledge and automatic annotations for downstream vision-language tasks.

Title: Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

Authors: Lexington Whalen, Zhenbang Du, Haoran You, Chaojian Li, Sixu Li, Yingyan (Celine) Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09606
Pdf URL: https://arxiv.org/pdf/2504.09606
Copy Paste: [[2504.09606]] Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training(https://arxiv.org/abs/2504.09606)
Keywords: generation, generative
Abstract: Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets -- sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, drawing on insights from the varying importance of different timestep regions. These tickets adapt their sparsity levels according to the importance of the corresponding timestep regions, allowing for aggressive sparsity in non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them during inference for image generation. Extensive experiments validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. This approach can significantly reduce training time both spatially and temporally -- achieving 2.9$\times$ to 5.8$\times$ speedups over training unpruned dense models, and up to 10.3$\times$ faster training compared to standard train-prune-finetune pipelines -- without compromising generative quality. Our code is available at this https URL.
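The EB-Diff-Train abstract describes early-bird tickets as sparse subnetworks that emerge early in training, but it does not say how they are detected. One criterion used in the earlier early-bird ticket literature is to prune at successive checkpoints and stop once the pruning mask stops changing. The sketch below illustrates that idea with unstructured magnitude pruning; the helper names, the use of per-weight rather than per-channel masks, and the toy weights are all assumptions, not the paper's procedure.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Boolean mask that keeps the largest-magnitude (1 - sparsity) fraction.

    Hypothetical helper: True means the weight is kept, False means pruned.
    """
    flat = np.abs(np.asarray(weights, dtype=np.float64)).ravel()
    n_prune = int(round(sparsity * flat.size))
    mask = np.ones(flat.size, dtype=bool)
    if n_prune > 0:
        # Indices of the n_prune smallest magnitudes.
        prune_idx = np.argsort(flat, kind="stable")[:n_prune]
        mask[prune_idx] = False
    return mask.reshape(np.shape(weights))

def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance: fraction of positions where masks differ."""
    return float(np.mean(np.asarray(mask_a) != np.asarray(mask_b)))

if __name__ == "__main__":
    # Two toy "checkpoints" of the same four weights.
    w_early = np.array([0.1, -2.0, 0.3, 4.0])
    w_later = np.array([0.1, -2.0, 3.0, 0.2])
    m_early = magnitude_mask(w_early, sparsity=0.5)
    m_later = magnitude_mask(w_later, sparsity=0.5)
    print(mask_distance(m_early, m_later))
```

In a training loop, one would compute the mask after each epoch and treat the subnetwork as an early-bird candidate once the distance to recent masks falls below a small threshold; the threshold and window size are tuning choices not specified here.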

Title: KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Authors: Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, Emad Barsoum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09656
Pdf URL: https://arxiv.org/pdf/2504.09656
Copy Paste: [[2504.09656]] KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation(https://arxiv.org/abs/2504.09656)
Keywords: generation
Abstract: Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in this https URL.
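The three stages named in the KeyVID abstract (localize keyframe time steps from the audio, generate keyframes, interpolate the frames in between) can be pictured with a toy sketch. This is only an illustration under stated assumptions: the function names `localize_keyframes` and `fill_between` are hypothetical, the audio is reduced to a per-frame energy curve, keyframes are stand-in scalars rather than images, and plain linear interpolation replaces the learned motion interpolator.

```python
def localize_keyframes(energy, k):
    """Return the indices of the k strongest local maxima of a per-frame
    audio energy curve, in time order (hypothetical stand-in for keyframe
    localization)."""
    peaks = [
        i for i in range(1, len(energy) - 1)
        if energy[i] > energy[i - 1] and energy[i] >= energy[i + 1]
    ]
    strongest = sorted(peaks, key=lambda i: energy[i], reverse=True)[:k]
    return sorted(strongest)


def fill_between(key_idx, key_vals, num_frames):
    """Linearly interpolate scalar keyframe values over all frames
    (a crude stand-in for a learned motion interpolator)."""
    frames = []
    for t in range(num_frames):
        if t <= key_idx[0]:
            frames.append(key_vals[0])
        elif t >= key_idx[-1]:
            frames.append(key_vals[-1])
        else:
            # First keyframe at or after t, and the one before it.
            j = next(j for j in range(1, len(key_idx)) if key_idx[j] >= t)
            a, b = key_idx[j - 1], key_idx[j]
            w = (t - a) / (b - a)
            frames.append((1 - w) * key_vals[j - 1] + w * key_vals[j])
    return frames


energy = [0, 1, 0, 0, 5, 0, 2, 0]          # toy audio energy per frame
keys = localize_keyframes(energy, k=2)      # frames with the loudest events
video = fill_between(keys, [10.0, 20.0], len(energy))
```

Compared with uniform sampling, the keyframe budget here is spent on the frames where the audio changes most, which is the intuition the abstract gives for handling dramatic motions at low frame rates.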

Title: Computer-Aided Layout Generation for Building Design: A Review

Authors: Jiachen Liu, Yuan Xue, Haomiao Ni, Rui Yu, Zihan Zhou, Sharon X. Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09694
Pdf URL: https://arxiv.org/pdf/2504.09694
Copy Paste: [[2504.09694]] Computer-Aided Layout Generation for Building Design: A Review(https://arxiv.org/abs/2504.09694)
Keywords: generation, generative
Abstract: Generating realistic building layouts for automatic building design has been studied in both the computer vision and architecture domains. Traditional approaches from the architecture domain, which are based on optimization techniques or heuristic design guidelines, can synthesize desirable layouts, but usually require post-processing and involve human interaction in the design pipeline, making them costly and time-consuming. The advent of deep generative models has significantly improved the fidelity and diversity of the generated architecture layouts, reducing the workload for designers and making the process much more efficient. In this paper, we conduct a comprehensive review of three major research topics of architecture layout design and generation: floorplan layout generation, scene layout synthesis, and generation of some other formats of building layouts. For each topic, we present an overview of the leading paradigms, categorized either by research domains (architecture or machine learning) or by user input conditions or constraints. We then introduce the commonly adopted benchmark datasets that are used to verify the effectiveness of the methods, as well as the corresponding evaluation metrics. Finally, we identify the well-solved problems and limitations of existing approaches, then propose new perspectives as promising directions for future research in this important research area. A project associated with this survey to maintain the resources is available at awesome-building-layout-generation.

Title: Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis

Authors: Shuai Jiang, Saeed Hassanpour
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09704
Pdf URL: https://arxiv.org/pdf/2504.09704
Copy Paste: [[2504.09704]] Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis(https://arxiv.org/abs/2504.09704)
Keywords: restoration
Abstract: Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.

Title: Dynamical symmetries in the fluctuation-driven regime: an application of Noether's theorem to noisy dynamical systems

Authors: John J. Vastola
Subjects: cs.LG, cond-mat.stat-mech
Abstract URL: https://arxiv.org/abs/2504.09761
Pdf URL: https://arxiv.org/pdf/2504.09761
Copy Paste: [[2504.09761]] Dynamical symmetries in the fluctuation-driven regime: an application of Noether's theorem to noisy dynamical systems(https://arxiv.org/abs/2504.09761)
Keywords: generative
Abstract: Noether's theorem provides a powerful link between continuous symmetries and conserved quantities for systems governed by some variational principle. Perhaps unfortunately, most dynamical systems of interest in neuroscience and artificial intelligence cannot be described by any such principle. On the other hand, nonequilibrium physics provides a variational principle that describes how fairly generic noisy dynamical systems are most likely to transition between two states; in this work, we exploit this principle to apply Noether's theorem, and hence learn about how the continuous symmetries of dynamical systems constrain their most likely trajectories. We identify analogues of the conservation of energy, momentum, and angular momentum, and briefly discuss examples of each in the context of models of decision-making, recurrent neural networks, and diffusion generative models.
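For readers who want the classical statement that the abstract above builds on, here is the textbook form of Noether's theorem for a Lagrangian system. The notation (generalized coordinates $q_i$, symmetry generator $\xi_i$, small parameter $\epsilon$) is introduced here for illustration and is not taken from the paper; the paper's fluctuation-driven variational principle is not reproduced.

```latex
% If the Lagrangian L(q, \dot q) is invariant under the continuous
% transformation q_i -> q_i + \epsilon\,\xi_i(q), then along solutions of
% the Euler--Lagrange equations the quantity
\[
  Q \;=\; \sum_i \frac{\partial L}{\partial \dot q_i}\,\xi_i(q)
  \qquad\text{satisfies}\qquad
  \frac{dQ}{dt} = 0 .
\]
% Translation invariance (\xi constant) gives conservation of momentum,
% rotation invariance gives angular momentum, and invariance under time
% translation gives conservation of the energy
\[
  E \;=\; \sum_i \dot q_i\,\frac{\partial L}{\partial \dot q_i} \;-\; L .
\]
```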

Title: EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise

Authors: Chao Liu, Arash Vahdat
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09789
Pdf URL: https://arxiv.org/pdf/2504.09789
Copy Paste: [[2504.09789]] EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise(https://arxiv.org/abs/2504.09789)
Keywords: generation
Abstract: Temporally consistent video-to-video generation is essential for applications of video diffusion models in areas such as sim-to-real, style-transfer, video upsampling, etc. In this paper, we propose a video diffusion framework that leverages temporally consistent noise to generate coherent video frames without specialized modules or additional constraints. We show that the standard training objective of diffusion models, when applied with temporally consistent noise, encourages the model to be equivariant to spatial transformations in input video and noise. This enables our model to better follow motion patterns from the input video, producing aligned motion and high-fidelity frames. Furthermore, we extend our approach to 3D-consistent video generation by attaching noise as textures on 3D meshes, ensuring 3D consistency in sim-to-real applications. Experimental results demonstrate that our method surpasses state-of-the-art baselines in motion alignment, 3D consistency, and video quality while requiring only a few sampling steps in practice.

Title: ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments

Authors: Lu Yue, Dongliang Zhou, Liang Xie, Erwei Yin, Feitian Zhang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.09843
Pdf URL: https://arxiv.org/pdf/2504.09843
Copy Paste: [[2504.09843]] ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments(https://arxiv.org/abs/2504.09843)
Keywords: generation
Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate unknown, continuous spaces based on natural language instructions. Compared to discrete settings, VLN-CE poses two core perception challenges. First, the absence of predefined observation points leads to heterogeneous visual memories and weakened global spatial correlations. Second, cumulative reconstruction errors in three-dimensional scenes introduce structural noise, impairing local feature perception. To address these challenges, this paper proposes ST-Booster, an iterative spatiotemporal booster that enhances navigation performance through multi-granularity perception and instruction-aware reasoning. ST-Booster consists of three key modules: Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and Value-Guided Waypoint Generation (VGWG). HSTE encodes long-term global memory using topological graphs and captures short-term local details via grid maps. MGAF aligns these dual-map representations with instructions through geometry-aware knowledge fusion. The resulting representations are iteratively refined through pretraining tasks. During reasoning, VGWG generates Guided Attention Heatmaps (GAHs) to explicitly model environment-instruction relevance and optimize waypoint selection. Extensive comparative experiments and performance analyses demonstrate that ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments.

Title: Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition

Authors: Changwei Wang, Shunpeng Chen, Yukun Song, Rongtao Xu, Zherui Zhang, Jiguang Zhang, Haoran Yang, Yu Zhang, Kexue Fu, Shide Du, Zhiwei Xu, Longxiang Gao, Li Guo, Shibiao Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09881
Pdf URL: https://arxiv.org/pdf/2504.09881
Copy Paste: [[2504.09881]] Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition(https://arxiv.org/abs/2504.09881)
Keywords: generation
Abstract: Visual Place Recognition (VPR) aims to predict the location of a query image by referencing a database of geotagged images. In VPR, a small number of discriminative local regions in an image often produce the important effects, while mundane background regions contribute nothing or even cause perceptual aliasing because they overlap easily. However, existing methods lack precise modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach, which improves image retrieval and re-ranking in VPR simultaneously by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of ground-truth local correspondences for the VPR task. Third, we propose an efficient and precise re-ranking pipeline guided by discriminative regions. Finally, experimental results show that FoL achieves state-of-the-art results on multiple VPR benchmarks in both the image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at this https URL.

Title: Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution

Authors: Yiwen Wang, Ying Liang, Yuxuan Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Rong Xie, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09887
Pdf URL: https://arxiv.org/pdf/2504.09887
Copy Paste: [[2504.09887]] Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution(https://arxiv.org/abs/2504.09887)
Keywords: super-resolution, generation
Abstract: Due to the disparity between real-world degradations in user-generated content (UGC) images and synthetic degradations, traditional super-resolution methods struggle to generalize effectively, necessitating a more robust approach to model real-world distortions. In this paper, we propose a novel approach to UGC image super-resolution by integrating semantic guidance into a diffusion framework. Our method addresses the inconsistency between degradations in wild and synthetic datasets by separately simulating the degradation processes on the LSDIR dataset and combining them with the official paired training set. Furthermore, we enhance degradation removal and detail generation by incorporating a pretrained semantic extraction model (SAM2) and fine-tuning key hyperparameters for improved perceptual fidelity. Extensive experiments demonstrate the superiority of our approach against state-of-the-art methods. Additionally, the proposed model won second place in the CVPR NTIRE 2025 Short-form UGC Image Super-Resolution Challenge, further validating its effectiveness. The code is available at https://github.com/Moonsofang/NTIRE-2025-SRlab.

Title: KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference

Authors: Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.09936
Pdf URL: https://arxiv.org/pdf/2504.09936
Copy Paste: [[2504.09936]] KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference(https://arxiv.org/abs/2504.09936)
Keywords: generation
Abstract: Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries based on attention scores or position heuristics, which leads to information loss and hallucinations. Recently, merging-based strategies have been explored to retain more information by merging KV pairs that would be discarded; however, these existing approaches inevitably introduce inconsistencies in attention distributions before and after merging, causing output perturbation and degraded generation quality. To overcome this challenge, we propose KeepKV, a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints. KeepKV introduces the Electoral Votes mechanism that records merging history and adaptively adjusts attention scores. Moreover, it further leverages a novel Zero Inference-Perturbation Merging method, keeping attention consistency and compensating for attention loss resulting from cache merging. KeepKV successfully retains essential context information within a significantly compressed cache. Extensive experiments on various benchmarks and LLM architectures demonstrate that KeepKV substantially reduces memory usage, enhances inference throughput by more than 2x, and maintains superior generation quality even with 10% KV cache budgets.
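The contrast the abstract draws between evicting KV entries and merging them can be made concrete with a toy single-query example. The sketch below is not KeepKV's algorithm: it assumes scalar values, attention scores that are already known for one query, and hypothetical helper names. It only shows the underlying arithmetic, namely that dropping an entry shifts the softmax-weighted output, while a merge that preserves the combined softmax mass and uses the mass-weighted average value leaves the output for that query unchanged.

```python
import math


def attention_output(scores, values):
    """Softmax-weighted average of scalar values for one query."""
    weights = [math.exp(s) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def evict(scores, values, j):
    """Drop cache entry j entirely."""
    return scores[:j] + scores[j + 1:], values[:j] + values[j + 1:]


def merge(scores, values, i, j):
    """Merge entries i and j into one entry that keeps their combined
    softmax mass and their mass-weighted average value (illustrative
    assumption, valid only for the query the scores were computed for)."""
    wi, wj = math.exp(scores[i]), math.exp(scores[j])
    merged_score = math.log(wi + wj)
    merged_value = (wi * values[i] + wj * values[j]) / (wi + wj)
    kept = [k for k in range(len(scores)) if k not in (i, j)]
    return ([scores[k] for k in kept] + [merged_score],
            [values[k] for k in kept] + [merged_value])


scores, values = [2.0, 1.0, 0.5], [1.0, 3.0, -2.0]
full = attention_output(scores, values)
after_evict = attention_output(*evict(scores, values, 2))
after_merge = attention_output(*merge(scores, values, 1, 2))
```

In a real cache the merge has to be performed on keys before future queries are known, which is where the consistency problem the abstract describes comes from; the toy above sidesteps that by working directly on scores.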

Title: Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes

Authors: Huijie Liu, Bingcan Wang, Jie Hu, Xiaoming Wei, Guoliang Kang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.09948
Pdf URL: https://arxiv.org/pdf/2504.09948
Copy Paste: [[2504.09948]] Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes(https://arxiv.org/abs/2504.09948)
Keywords: generation
Abstract: Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline, building the largest dish dataset to date. Additionally, we introduce a recaption strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model's capability for dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.

Title: Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration

Authors: Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09973
Pdf URL: https://arxiv.org/pdf/2504.09973
Copy Paste: [[2504.09973]] Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration(https://arxiv.org/abs/2504.09973)
Keywords: restoration
Abstract: All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but may discard critical visual information for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a novel framework that fundamentally enhances prompt-task alignment through two complementary innovations: a Sparse Prompt Module (SPM) that efficiently captures degradation-specific features while minimizing redundancy, and a Contrastive Prompt Regularization (CPR) that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL optimizes the critical interaction between prompts and the restoration model itself. Extensive experiments across five comprehensive benchmarks demonstrate that CPL consistently enhances state-of-the-art all-in-one restoration models, achieving significant improvements in both standard multi-task scenarios and challenging composite degradation settings. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.

Title: Metric-Guided Synthesis of Class Activation Mapping

Authors: Alejandro Luque-Cerpa, Elizabeth Polgreen, Ajitha Rajan, Hazem Torfah
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09998
Pdf URL: https://arxiv.org/pdf/2504.09998
Copy Paste: [[2504.09998]] Metric-Guided Synthesis of Class Activation Mapping(https://arxiv.org/abs/2504.09998)
Keywords: generation
Abstract: Class activation mapping (CAM) is a widely adopted class of saliency methods used to explain the behavior of convolutional neural networks (CNNs). These methods generate heatmaps that highlight the parts of the input most relevant to the CNN output. Various CAM methods have been proposed, each distinguished by the expressions used to derive heatmaps. In general, users look for heatmaps with specific properties that reflect different aspects of CNN functionality. These may include similarity to ground truth, robustness, equivariance, and more. Although existing CAM methods implicitly encode some of these properties in their expressions, they do not allow for variability in heatmap generation following the user's intent or domain knowledge. In this paper, we address this limitation by introducing SyCAM, a metric-based approach for synthesizing CAM expressions. Given a predefined evaluation metric for saliency maps, SyCAM automatically generates CAM expressions optimized for that metric. We specifically explore a syntax-guided synthesis instantiation of SyCAM, where CAM expressions are derived based on predefined syntactic constraints and the given metric. Using several established evaluation metrics, we demonstrate the efficacy and flexibility of our approach in generating targeted heatmaps. We compare SyCAM with other well-known CAM methods on three prominent models: ResNet50, VGG16, and VGG19.
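To make "the expressions used to derive heatmaps" in the abstract above concrete, here is a minimal sketch of the classic CAM-style expression: a ReLU over a weighted sum of activation maps. It is not SyCAM and not any synthesized expression from the paper; the function name `cam_heatmap`, the hand-set weights, and the tiny 2x2 maps are assumptions for illustration.

```python
def cam_heatmap(activations, weights):
    """Compute max(0, sum_k weights[k] * activations[k][i][j]) per pixel.

    activations: list of K feature maps, each an H x W list of floats.
    weights:     list of K per-map weights (hand-set here).
    """
    height, width = len(activations[0]), len(activations[0][0])
    heatmap = []
    for i in range(height):
        row = []
        for j in range(width):
            total = sum(w * a[i][j] for w, a in zip(weights, activations))
            row.append(max(0.0, total))  # ReLU keeps positive evidence only
        heatmap.append(row)
    return heatmap


maps = [
    [[1.0, 2.0], [3.0, 4.0]],   # feature map 0
    [[1.0, 0.0], [0.0, 1.0]],   # feature map 1
]
heat = cam_heatmap(maps, weights=[1.0, -3.0])
```

Different CAM methods differ mainly in how the per-map weights (and sometimes the combining expression itself) are chosen, which is the degree of freedom a metric-guided synthesis approach can search over.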

Title: GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting

Authors: Junlin Hao, Peiheng Wang, Haoyang Wang, Xinggong Zhang, Zongming Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10001
Pdf URL: https://arxiv.org/pdf/2504.10001
Copy Paste: [[2504.10001]] GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting(https://arxiv.org/abs/2504.10001)
Keywords: generation, generative
Abstract: Single-image 3D scene reconstruction presents significant challenges due to its inherently ill-posed nature and limited input constraints. Recent advances have explored two promising directions: multiview generative models that train on 3D consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks that suffer from cross-view inconsistency and suboptimal error handling, as they depend exclusively on depth data or 3D smoothness, which ultimately degrades output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation, integrating their strengths through two key innovations: (1) A progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence. (2) A 3D Gaussian Splatting consistency mask to guide the video diffusion with 3D consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.

Title: Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics

Authors: Nikolai Röhrich (XITASO GmbH and LMU Munich), Alwin Hoffmann (XITASO GmbH), Richard Nordsieck (XITASO GmbH), Emilio Zarbali (XITASO GmbH), Alireza Javanmardi (LMU Munich and Munich Center for Machine Learning)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10021
Pdf URL: https://arxiv.org/pdf/2504.10021
Copy Paste: [[2504.10021]] Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics(https://arxiv.org/abs/2504.10021)
Keywords: generation
Abstract: Whereas in general computer vision, transformer-based architectures have quickly become the gold standard, microelectronics defect detection still heavily relies on convolutional neural networks (CNNs). We hypothesize that this is due to the fact that a) transformers have an increased need for data and b) labelled image generation procedures for microelectronics are costly, and labelled data is therefore sparse. Whereas in other domains, pre-training on large natural image datasets can mitigate this problem, in microelectronics transfer learning is hindered due to the dissimilarity of domain data and natural images. Therefore, we evaluate self pre-training, where models are pre-trained on the target dataset, rather than another dataset. We propose a vision transformer (ViT) pre-training framework for defect detection in microelectronics based on masked autoencoders (MAE). In MAE, a large share of image patches is masked and reconstructed by the model during pre-training. We perform pre-training and defect detection using a dataset of fewer than 10,000 scanning acoustic microscopy (SAM) images labelled using transient thermal analysis (TTA). Our experimental results show that our approach leads to substantial performance gains compared to a) supervised ViT, b) ViT pre-trained on natural image datasets, and c) state-of-the-art CNN-based defect detection models used in the literature. Additionally, interpretability analysis reveals that our self pre-trained models, in comparison to ViT baselines, correctly focus on defect-relevant features such as cracks in the solder material. This demonstrates that our approach yields fault-specific feature representations, making our self pre-trained models viable for real-world defect detection in microelectronics.
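The masking step the abstract describes ("a large share of image patches is masked and reconstructed") can be sketched as a random split of patch indices. This is a minimal illustration, not the authors' code: the function name `random_patch_mask` is hypothetical, and the 0.75 mask ratio and 196-patch grid are assumed values, since the abstract does not state them.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) at random.

    Only the visible patches would be fed to the encoder; the masked ones
    are the reconstruction targets during pre-training.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Example: a 14 x 14 patch grid (assumed), 75% of patches hidden.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
```

Because the pretext task needs no labels, the same target-domain images can be reused for pre-training and for the later supervised defect-detection stage, which is the point of self pre-training on a small dataset.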

Title: Aligning Anime Video Generation with Human Feedback

Authors: Bingwen Zhu, Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Yidi Wu, Huyang Sun, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10044
Pdf URL: https://arxiv.org/pdf/2504.10044
Copy Paste: [[2504.10044]] Aligning Anime Video Generation with Human Feedback(https://arxiv.org/abs/2504.10044)
Keywords: generation
Abstract: Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns, leading to issues such as motion distortion and flickering artifacts, which result in misalignment with human preferences. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. In this work, we propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment. Specifically, we construct the first multi-dimensional reward dataset for anime videos, comprising 30k human-annotated samples that incorporate human preferences for both visual appearance and visual consistency. Based on this, we develop AnimeReward, a powerful reward model that employs specialized vision-language models for different evaluation dimensions to guide preference alignment. Furthermore, we introduce Gap-Aware Preference Optimization (GAPO), a novel training method that explicitly incorporates preference gaps into the optimization process, enhancing alignment performance and efficiency. Extensive experimental results show that AnimeReward outperforms existing reward models, and the inclusion of GAPO leads to superior alignment in both quantitative benchmarks and human evaluations, demonstrating the effectiveness of our pipeline in enhancing anime video quality. Our dataset and code will be publicly available.

Title: Global and Local Mamba Network for Multi-Modality Medical Image Super-Resolution

Authors: Zexin Ji, Beiji Zou, Xiaoyan Kui, Sebastien Thureau, Su Ruan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10105
Pdf URL: https://arxiv.org/pdf/2504.10105
Copy Paste: [[2504.10105]] Global and Local Mamba Network for Multi-Modality Medical Image Super-Resolution(https://arxiv.org/abs/2504.10105)
Keywords: super-resolution
Abstract: Convolutional neural networks and Transformers have made significant progress in multi-modality medical image super-resolution. However, these methods either have a fixed receptive field for local learning or impose significant computational burdens for global learning, limiting super-resolution performance. To solve this problem, State Space Models, notably Mamba, have been introduced to efficiently model long-range dependencies in images with linear computational complexity. Building on Mamba and on the observation that low-resolution images rely on global information to compensate for missing details, while high-resolution reference images need to provide more local details for accurate super-resolution, we propose a global and local Mamba network (GLMamba) for multi-modality medical image super-resolution. Specifically, GLMamba is a two-branch network equipped with a global Mamba branch and a local Mamba branch. The global Mamba branch captures long-range relationships in low-resolution inputs, and the local Mamba branch focuses more on short-range details in high-resolution reference images. We also use a deform block to adaptively extract features from both branches to enhance the representation ability. A modulator is designed to further enhance deformable features in both global and local Mamba blocks. To fully integrate the reference image for low-resolution image super-resolution, we further develop a multi-modality feature fusion block that adaptively fuses features by considering similarities, differences, and complementary aspects between modalities. In addition, a contrastive edge loss (CELoss) is developed to sufficiently enhance edge textures and contrast in medical images.

Title: SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding

Authors: Marc Gutiérrez-Pérez, Antonio Agudo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10106
Pdf URL: https://arxiv.org/pdf/2504.10106
Copy Paste: [[2504.10106]] SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding(https://arxiv.org/abs/2504.10106)
Keywords: generation
Abstract: Sports video analysis is a key domain in computer vision, enabling detailed spatial understanding through multi-view correspondences. In this work, we introduce SoccerNet-v3D and ISSIA-3D, two enhanced and scalable datasets designed for 3D scene understanding in soccer broadcast analysis. These datasets extend SoccerNet-v3 and ISSIA by incorporating field-line-based camera calibration and multi-view synchronization, enabling 3D object localization through triangulation. We propose a monocular 3D ball localization task built upon the triangulation of ground-truth 2D ball annotations, along with several calibration and reprojection metrics to assess annotation quality on demand. Additionally, we present a single-image 3D ball localization method as a baseline, leveraging camera calibration and ball size priors to estimate the ball's position from a monocular viewpoint. To further refine 2D annotations, we introduce a bounding box optimization technique that ensures alignment with the 3D scene representation. Our proposed datasets establish new benchmarks for 3D soccer scene understanding, enhancing both spatial and temporal analysis in sports analytics. Finally, we provide code to facilitate access to our annotations and the generation pipelines for the datasets.
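
The single-image baseline mentioned in the abstract relies on camera calibration and a ball size prior. A minimal sketch of that idea under a plain pinhole camera model is given below. The function name `locate_ball`, its parameters, and the 0.22 m default diameter (an approximate regulation football) are assumptions for illustration only, and the paper's actual method may differ. Depth follows from similar triangles: Z = f * D / d, where f is the focal length in pixels, D the real ball diameter, and d the observed diameter in pixels.

```python
from typing import Tuple


def locate_ball(
    u: float,
    v: float,
    pixel_diameter: float,
    focal_px: float,
    cx: float,
    cy: float,
    ball_diameter_m: float = 0.22,  # assumed approximate football diameter
) -> Tuple[float, float, float]:
    """Estimate the ball centre in camera coordinates (metres).

    (u, v) is the ball centre in pixels, (cx, cy) the principal point.
    Assumes an undistorted pinhole camera with equal focal length on both axes.
    """
    if pixel_diameter <= 0:
        raise ValueError("pixel_diameter must be positive")
    # Similar triangles: apparent size shrinks linearly with depth.
    z = focal_px * ball_diameter_m / pixel_diameter
    # Back-project the pixel offset from the principal point at depth z.
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return x, y, z


x, y, z = locate_ball(u=1060, v=540, pixel_diameter=22, focal_px=1000, cx=960, cy=540)
```

The result is expressed in the camera frame; mapping it to pitch coordinates would additionally require the extrinsic calibration that the datasets are described as providing.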

Title: The Impact of Model Zoo Size and Composition on Weight Space Learning

Authors: Damian Falk, Konstantin Schürholt, Damian Borth
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10141
Pdf URL: https://arxiv.org/pdf/2504.10141
Copy Paste: [[2504.10141]] The Impact of Model Zoo Size and Composition on Weight Space Learning(https://arxiv.org/abs/2504.10141)
Keywords: generation
Abstract: Re-using trained neural network models is a common strategy to reduce training cost and transfer knowledge. Weight space learning - using the weights of trained models as a data modality - is a promising new field to re-use populations of pre-trained models for future tasks. Approaches in this field have demonstrated high performance on both model analysis and weight generation tasks. However, until now their learning setup has required homogeneous model zoos where all models share the exact same architecture, limiting their capability to generalize beyond the population of models they saw during training. In this work, we remove this constraint and propose a modification to a common weight space learning method to accommodate training on heterogeneous populations of models. We further investigate the resulting impact of model diversity on generating unseen neural network model weights for zero-shot knowledge transfer. Our extensive experimental evaluation shows that including models with varying underlying image datasets has a high impact on performance and generalization, for both in- and out-of-distribution settings. Code is available on this http URL.

Title: GeoUni: A Unified Model for Generating Geometry Diagrams, Problems and Problem Solutions

Authors: Jo-Ku Cheng, Zeren Zhang, Ran Chen, Jingyang Deng, Ziran Qin, Jinwen Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10146
Pdf URL: https://arxiv.org/pdf/2504.10146
Copy Paste: [[2504.10146]] GeoUni: A Unified Model for Generating Geometry Diagrams, Problems and Problem Solutions(https://arxiv.org/abs/2504.10146)
Keywords: generation
Abstract: We propose GeoUni, the first unified geometry expert model capable of generating problem solutions and diagrams within a single framework in a way that enables the creation of unique and individualized geometry problems. Traditionally, solving geometry problems and generating diagrams have been treated as separate tasks in machine learning, with no models successfully integrating both to support problem creation. However, we believe that mastery in geometry requires frictionless integration of all of these skills, from solving problems to visualizing geometric relationships, and finally, crafting tailored problems. Our extensive experiments demonstrate that GeoUni, with only 1.5B parameters, achieves performance comparable to larger models such as DeepSeek-R1 with 671B parameters in geometric reasoning tasks. GeoUni also excels in generating precise geometric diagrams, surpassing both text-to-image models and unified models, including the GPT-4o image generation. Most importantly, GeoUni is the only model capable of successfully generating textual problems with matching diagrams based on specific knowledge points, thus offering a wider range of capabilities that extend beyond current models.

Title: Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers

Authors: Chunyang Zhang, Zhenhong Sun, Zhicheng Zhang, Junyan Wang, Yu Zhang, Dong Gong, Huadong Mo, Daoyi Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10148
Pdf URL: https://arxiv.org/pdf/2504.10148
Copy Paste: [[2504.10148]] Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers(https://arxiv.org/abs/2504.10148)
Keywords: generation
Abstract: Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS), where they must accurately depict multiple distinct instances in a single image based on complex prompts detailing individual features. Traditional MIS control methods for UNet architectures like SD v1.5/SDXL fail to adapt to DiT-based models like FLUX and SD v3.5, which rely on integrated attention between image and text tokens rather than text-image cross-attention. To enhance MIS in DiT, we first analyze the mixed attention mechanism in DiT. Our token-wise and layer-wise analysis of attention maps reveals a hierarchical response structure: instance tokens dominate early layers, background tokens in middle layers, and attribute tokens in later layers. Building on this observation, we propose a training-free approach for enhancing MIS in DiT-based models with hierarchical and step-layer-wise attention specialty tuning (AST). AST amplifies key regions while suppressing irrelevant areas in distinct attention maps across layers and steps, guided by the hierarchical structure. This optimizes multimodal interactions by hierarchically decoupling the complex prompts with instance-based sketches. We evaluate our approach using upgraded sketch-based layouts for the T2I-CompBench and customized complex scenes. Both quantitative and qualitative results confirm our method enhances complex layout generation, ensuring precise instance placement and attribute representation in MIS.

Title: Efficient Generative Model Training via Embedded Representation Warmup

Authors: Deyuan Liu, Peng Sun, Xufeng Li, Tao Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10188
Pdf URL: https://arxiv.org/pdf/2504.10188
Copy Paste: [[2504.10188]] Efficient Generative Model Training via Embedded Representation Warmup(https://arxiv.org/abs/2504.10188)
Keywords: generation, generative
Abstract: Diffusion models excel at generating high-dimensional data but fall short in training efficiency and representation quality compared to self-supervised methods. We identify a key bottleneck: the underutilization of high-quality, semantically rich representations during training notably slows down convergence. Our systematic analysis reveals a critical representation processing region -- primarily in the early layers -- where semantic and structural pattern learning takes place before generation can occur. To address this, we propose Embedded Representation Warmup (ERW), a plug-and-play framework in whose first stage the ERW module serves as a warmup that initializes the early layers of the diffusion model with high-quality, pretrained representations. This warmup minimizes the burden of learning representations from scratch, thereby accelerating convergence and boosting performance. Our theoretical analysis demonstrates that ERW's efficacy depends on its precise integration into specific neural network layers -- termed the representation processing region -- where the model primarily processes and transforms feature representations for later generation. We further establish that ERW not only accelerates training convergence but also enhances representation quality: empirically, our method achieves a 40$\times$ acceleration in training speed compared to REPA, the current state-of-the-art method. Code is available at this https URL.

Title: VibrantLeaves: A principled parametric image generator for training deep restoration models

Authors: Raphael Achddou, Yann Gousseau, Saïd Ladjal, Sabine Süsstrunk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10201
Pdf URL: https://arxiv.org/pdf/2504.10201
Copy Paste: [[2504.10201]] VibrantLeaves: A principled parametric image generator for training deep restoration models(https://arxiv.org/abs/2504.10201)
Keywords: restoration, super-resolution
Abstract: Even though Deep Neural Networks are extremely powerful for image restoration tasks, they have several limitations. They are poorly understood and suffer from strong biases inherited from the training sets. One way to address these shortcomings is to have better control over the training sets, in particular by using synthetic sets. In this paper, we propose a synthetic image generator relying on a few simple principles. In particular, we focus on geometric modeling, textures, and a simple modeling of image acquisition. These properties, integrated in a classical Dead Leaves model, enable the creation of efficient training sets. Standard image denoising and super-resolution networks can be trained on such datasets, reaching performance almost on par with training on natural image datasets. As a first step towards explainability, we provide a careful analysis of the considered principles, identifying which image properties are necessary to obtain good performance. In addition, such training also yields better robustness to various geometric and radiometric perturbations of the test sets.

Title: ROSFD: Robust Online Streaming Fraud Detection with Resilience to Concept Drift in Data Streams

Authors: Vivek Yelleti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10229
Pdf URL: https://arxiv.org/pdf/2504.10229
Copy Paste: [[2504.10229]] ROSFD: Robust Online Streaming Fraud Detection with Resilience to Concept Drift in Data Streams(https://arxiv.org/abs/2504.10229)
Keywords: generation
Abstract: Continuous generation of streaming data from diverse sources, such as online transactions and digital interactions, necessitates timely fraud detection. Traditional batch processing methods often struggle to capture the rapidly evolving patterns of fraudulent activities. This paper highlights the critical importance of processing streaming data for effective fraud detection. To address the inherent challenges of latency, scalability, and concept drift in streaming environments, we propose a robust online streaming fraud detection (ROSFD) framework. Our proposed framework comprises two key stages: (i) Stage One: Offline Model Initialization. In this initial stage, a model is built in offline settings using incremental learning principles to overcome the "cold-start" problem. (ii) Stage Two: Real-time Model Adaptation. In this dynamic stage, drift detection algorithms (viz., DDM, EDDM, and ADWIN) are employed to identify concept drift in the incoming data stream and incrementally train the model accordingly. This "train-only-when-required" strategy drastically reduces the number of retrains needed without significantly impacting the area under the receiver operating characteristic curve (AUC). Overall, ROSFD utilizing ADWIN as the drift detection method demonstrated the best performance among the employed methods. In terms of model efficacy, Adaptive Random Forest consistently outperformed other models, achieving the highest AUC in four out of five datasets.
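
The "train-only-when-required" idea can be illustrated with a simplified version of DDM, one of the drift detectors the abstract names. The sketch below is not the ROSFD implementation. The class `SimpleDDM`, the helper `find_drifts`, and the default thresholds (30 warm-up samples, warning at 2 and drift at 3 standard deviations) are assumptions chosen to mirror the commonly described DDM rule, with strict inequalities added to avoid spurious signals when the error variance is zero.

```python
import math
from typing import Iterable, List


class SimpleDDM:
    """Simplified Drift Detection Method over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples: int = 30, warn_k: float = 2.0, drift_k: float = 3.0):
        self.min_samples = min_samples
        self.warn_k = warn_k
        self.drift_k = drift_k
        self.reset()

    def reset(self) -> None:
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, is_error: bool) -> str:
        """Consume one outcome and return 'stable', 'warning', or 'drift'."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n                # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)   # its binomial standard deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:     # remember the best level seen so far
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_k * self.s_min:
            self.reset()                        # start fresh statistics after a drift
            return "drift"
        if p + s > self.p_min + self.warn_k * self.s_min:
            return "warning"
        return "stable"


def find_drifts(error_stream: Iterable[bool]) -> List[int]:
    """Return stream positions where a retrain would be triggered."""
    detector = SimpleDDM()
    return [i for i, e in enumerate(error_stream) if detector.update(e) == "drift"]


# Synthetic stream: 10% errors for 200 samples, then every prediction is wrong.
stream = [i % 10 == 0 for i in range(200)] + [True] * 100
drift_positions = find_drifts(stream)
```

In a streaming pipeline of this kind, the model would be incrementally retrained only at the positions in `drift_positions`, rather than after every batch.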

Title: A Model Zoo of Vision Transformers

Authors: Damian Falk, Léo Meynent, Florence Pfammatter, Konstantin Schürholt, Damian Borth
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10231
Pdf URL: https://arxiv.org/pdf/2504.10231
Copy Paste: [[2504.10231]] A Model Zoo of Vision Transformers(https://arxiv.org/abs/2504.10231)
Keywords: generation, generative
Abstract: The availability of large, structured populations of neural networks - called 'model zoos' - has led to the development of a multitude of downstream tasks ranging from model analysis, to representation learning on model weights or generative modeling of neural network parameters. However, existing model zoos are limited in size and architecture and neglect the transformer, which is among the currently most successful neural network architectures. We address this gap by introducing the first model zoo of vision transformers (ViT). To better represent recent training approaches, we develop a new blueprint for model zoo generation that encompasses both pre-training and fine-tuning steps, and publish 250 unique models. They are carefully generated with a large span of generating factors, and their diversity is validated using a thorough choice of weight-space and behavioral metrics. To further motivate the utility of our proposed dataset, we suggest multiple possible applications grounded in both extensive exploratory experiments and a number of examples from the existing literature. By extending previous lines of similar work, our model zoo allows researchers to push their model population-based methods from the small model regime to state-of-the-art architectures. We make our model zoo available at this http URL.

Title: XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark

Authors: Shuai Liu, Youmeng Li, Jizeng Wei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2504.10258
Pdf URL: https://arxiv.org/pdf/2504.10258
Copy Paste: [[2504.10258]] XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark(https://arxiv.org/abs/2504.10258)
Keywords: generation
Abstract: Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts (e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.
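
For context, the traditional XY-Cut technique that the abstract compares against can be sketched as a recursive projection-gap split over layout bounding boxes. The code below illustrates only that classic baseline, not XY-Cut++ (no pre-mask processing, multi-granularity segmentation, or cross-modal matching). The names `xy_cut` and `_split_by_gaps`, and the choice to try column splits before row splits, are assumptions made for this example.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def _split_by_gaps(indices: Sequence[int], boxes: Sequence[Box], axis: int) -> List[List[int]]:
    """Group boxes whose projections on the given axis overlap (0 = x, 1 = y)."""
    lo, hi = axis, axis + 2
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    end = boxes[ordered[0]][hi]
    for i in ordered[1:]:
        if boxes[i][lo] > end:          # whitespace gap: start a new group
            groups.append([i])
            end = boxes[i][hi]
        else:                           # overlapping projection: same group
            groups[-1].append(i)
            end = max(end, boxes[i][hi])
    return groups


def xy_cut(boxes: Sequence[Box], indices: Optional[Sequence[int]] = None) -> List[int]:
    """Return box indices in reading order using recursive XY-Cut."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    for axis in (0, 1):                 # try column cuts first, then row cuts
        groups = _split_by_gaps(indices, boxes, axis)
        if len(groups) > 1:
            order: List[int] = []
            for group in groups:
                order.extend(xy_cut(boxes, group))
            return order
    # No clean cut exists: fall back to top-to-bottom, left-to-right.
    return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))


# A full-width title above two columns of two paragraphs each, given out of order.
page = [
    (55, 50, 100, 90),  # 0: right column, second paragraph
    (0, 0, 100, 10),    # 1: title
    (0, 60, 45, 90),    # 2: left column, second paragraph
    (55, 20, 100, 40),  # 3: right column, first paragraph
    (0, 20, 45, 50),    # 4: left column, first paragraph
]
reading_order = xy_cut(page)
```

The fallback branch shows where the classic method becomes fragile: when boxes interlock so that no full-width or full-height gap exists, the order degrades to a simple sort, which is the kind of complex layout the abstract says existing methods struggle with.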

Title: $α$-Flow: A Unified Framework for Continuous-State Discrete Flow Matching Models

Authors: Chaoran Cheng, Jiahan Li, Jiajun Fan, Ge Liu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.10283
Pdf URL: https://arxiv.org/pdf/2504.10283
Copy Paste: [[2504.10283]] $α$-Flow: A Unified Framework for Continuous-State Discrete Flow Matching Models(https://arxiv.org/abs/2504.10283)
Keywords: generation, generative
Abstract: Recent efforts have extended the flow-matching framework to discrete generative modeling. One strand of models directly works with the continuous probabilities instead of discrete tokens, which we colloquially refer to as Continuous-State Discrete Flow Matching (CS-DFM). Existing CS-DFM models differ significantly in their representations and geometric assumptions. This work presents a unified framework for CS-DFM models, under which the existing variants can be understood as operating on different $\alpha$-representations of probabilities. Building upon the theory of information geometry, we introduce $\alpha$-Flow, a family of CS-DFM models that adheres to the canonical $\alpha$-geometry of the statistical manifold, and demonstrate its optimality in minimizing the generalized kinetic energy. Theoretically, we show that the flow matching loss for $\alpha$-flow establishes a unified variational bound for the discrete negative log-likelihood. We comprehensively evaluate different instantiations of $\alpha$-flow on various discrete generation domains to demonstrate their effectiveness in discrete generative modeling, including intermediate values whose geometries have never been explored before. $\alpha$-flow significantly outperforms its discrete-state counterpart in image and protein sequence generation and better captures the entropy in language modeling.
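
For readers unfamiliar with the term, the $\alpha$-representation of a probability is a standard construction in information geometry (Amari's convention). It is reproduced here as background only; the paper's own normalization and parameterization may differ.

```latex
\ell^{(\alpha)}(p) =
\begin{cases}
  \dfrac{2}{1-\alpha}\, p^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\[6pt]
  \log p, & \alpha = 1.
\end{cases}
```

Under this convention, $\alpha = -1$ gives the mixture representation $\ell^{(-1)}(p) = p$, $\alpha = 0$ gives the square-root (spherical) representation $2\sqrt{p}$, and $\alpha = 1$ gives the exponential (logit-like) representation $\log p$, which is one way to read the abstract's statement that existing variants operate on different $\alpha$-representations.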

Title: ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting

Authors: Huiqi Wu, Jianbo Mei, Yingjie Huang, Yining Xu, Jingjiao You, Yilong Liu, Li Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10316
Pdf URL: https://arxiv.org/pdf/2504.10316
Copy Paste: [[2504.10316]] ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting(https://arxiv.org/abs/2504.10316)
Keywords: generation
Abstract: In recent years, significant advancements have been made in text-driven 3D content generation. However, several challenges remain. In practical applications, users often provide extremely simple text inputs while expecting high-quality 3D content. Generating optimal results from such minimal text is a difficult task due to the strong dependency of text-to-3D models on the quality of input prompts. Moreover, the generation process exhibits high variability, making it difficult to control. Consequently, multiple iterations are typically required to produce content that meets user expectations, reducing generation efficiency. To address this issue, we propose GPT-4V for self-optimization, which significantly enhances the efficiency of generating satisfactory content in a single attempt. Furthermore, the controllability of text-to-3D generation methods has not been fully explored. Our approach enables users to not only provide textual descriptions but also specify additional conditions, such as style, edges, scribbles, poses, or combinations of multiple conditions, allowing for more precise control over the generated 3D content. Additionally, during training, we effectively integrate multi-view information, including multi-view depth, masks, features, and images, to address the common Janus problem in 3D content generation. Extensive experiments demonstrate that our method achieves robust generalization, facilitating the efficient and controllable generation of high-quality 3D content.

Title: SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model

Authors: Zongcan Ding, Haodong Zhang, Peng Wu, Guansong Pang, Zhiwei Yang, Peng Wang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10320
Pdf URL: https://arxiv.org/pdf/2504.10320
Copy Paste: [[2504.10320]] SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model(https://arxiv.org/abs/2504.10320)
Keywords: generation
Abstract: Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. While semi-supervised methods trained on only normal samples have gained traction, they often suffer from high false alarm rates and poor interpretability. Recently, vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for explainable anomaly detection. However, their high computational cost and lack of domain adaptation hinder real-time deployment and reliability. Inspired by dual complementary pathways in human visual perception, we propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector (namely a retrieval augmented generation (RAG) enhanced VLM), to address these limitations. Specifically, the fast detector first provides coarse anomaly confidence scores, and only a small subset of ambiguous segments, rather than the entire video, is further analyzed by the slower yet more interpretable VLM for elaborate detection and reasoning. Furthermore, to adapt VLMs to domain-specific VAD scenarios, we construct a knowledge base including normal patterns based on few normal samples and abnormal patterns inferred by VLMs. During inference, relevant patterns are retrieved and used to augment prompts for anomaly reasoning. Finally, we smoothly fuse the anomaly confidence of fast and slow detectors to enhance robustness of anomaly detection. Extensive experiments on four benchmarks demonstrate that SlowFastVAD effectively combines the strengths of both fast and slow detectors, and achieves remarkable detection accuracy and interpretability with significantly reduced computational overhead, making it well-suited for real-world VAD applications with high reliability requirements.
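
The routing described above, where only ambiguous segments are passed to the slow detector and the two confidences are then fused, can be sketched as follows. This is a hypothetical illustration: the function `slow_fast_scores`, the ambiguity band `[low, high]`, and the fixed-weight linear fusion are assumptions, since the abstract does not specify how ambiguity is defined or how the fusion is smoothed.

```python
from typing import Callable, List, Sequence


def slow_fast_scores(
    fast_scores: Sequence[float],
    slow_scorer: Callable[[int], float],
    low: float = 0.3,
    high: float = 0.7,
    weight: float = 0.5,
) -> List[float]:
    """Fuse fast and slow anomaly scores, calling the slow scorer sparingly.

    fast_scores: per-segment anomaly confidence in [0, 1] from the fast detector.
    slow_scorer: stand-in for the expensive VLM-based detector, called with the
                 segment index and returning a confidence in [0, 1].
    """
    fused = []
    for idx, fast in enumerate(fast_scores):
        if low <= fast <= high:
            # Ambiguous segment: pay for the slow detector and blend both scores.
            slow = slow_scorer(idx)
            fused.append(weight * fast + (1.0 - weight) * slow)
        else:
            # Confident segment: keep the cheap score unchanged.
            fused.append(fast)
    return fused


calls: List[int] = []


def fake_vlm(idx: int) -> float:
    """Dummy slow detector that records which segments it was asked about."""
    calls.append(idx)
    return 0.9


fused = slow_fast_scores([0.05, 0.5, 0.95], fake_vlm)
```

In this toy run only the middle segment reaches the slow detector, which is the source of the reduced computational overhead the abstract reports.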

Title: InstructEngine: Instruction-driven Text-to-Image Alignment

Authors: Xingyu Lu, Yuhang Hu, YiFan Zhang, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Jinpeng Wang, Bin Wen, Chun Yuan, Fan Yang, Tingting Gao, Di Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10329
Pdf URL: https://arxiv.org/pdf/2504.10329
Copy Paste: [[2504.10329]] InstructEngine: Instruction-driven Text-to-Image Alignment(https://arxiv.org/abs/2504.10329)
Keywords: generation
Abstract: Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face certain limitations in terms of both data and algorithm. For training data, most approaches rely on manually annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. However, the high annotation cost makes them difficult to scale up, and the reward model consumes extra computation and cannot guarantee accuracy. From an algorithmic perspective, most methods neglect the value of text and only take the image feedback as a comparative signal, which is inefficient and sparse. To alleviate these drawbacks, we propose the InstructEngine framework. Regarding annotation cost, we first construct a taxonomy for text-to-image generation, then develop an automated data construction pipeline based on it. Leveraging advanced large multimodal models and human-defined rules, we generate 25K text-image preference pairs. Finally, we introduce a cross-validation alignment method, which improves data efficiency by organizing semantically analogous samples into mutually comparable pairs. Evaluations on DrawBench demonstrate that InstructEngine improves the performance of SD v1.5 and SDXL by 10.53% and 5.30%, respectively, outperforming state-of-the-art baselines, with an ablation study confirming the benefits of all of InstructEngine's components. A win rate of over 50% in human reviews also indicates that InstructEngine better aligns with human preferences.
摘要：从人/AI反馈（RLHF/RLAIF）中学习的强化学习已被广泛用于文本对图像模型的偏好对齐。现有方法在数据和算法方面都面临着某些局限性。对于培训数据，大多数方法都依赖手动注释的偏好数据，无论是直接对发电机进行微调或通过培训奖励模型提供培训信号。但是，高注释成本使它们难以扩大，奖励模型会消耗额外的计算，无法保证准确性。从算法的角度来看，大多数方法忽略了文本的价值，而仅将图像反馈作为比较信号，这是效率低下且稀疏的比较信号。为了减轻这些缺点，我们提出了指令框架。关于注释成本，我们首先构建了文本到图像生成的分类法，然后根据其开发自动数据构建管道。利用先进的大型多模型模型和人为定义的规则，我们生成了25K文本图像偏好对。最后，我们引入了交叉验证比对方法，该方法通过将语义类似样品组织到相互比较的对中来完善数据效率。对Drawbench的评估表明，指令EngendEngine将SD V1.5和SDXL的性能提高了10.53％和5.30％，表现优于最先进的基线，并进行了消融研究，证实了FenerchEngine所有组件的好处。人类评论中的胜利率超过50％也证明，指导者可以更好地与人类的偏好保持一致。

Title: FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

Authors: Rui Chen, Lei Sun, Jing Tang, Geng Li, Xiangxiang Chu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10358
Pdf URL: https://arxiv.org/pdf/2504.10358
Copy Paste: [[2504.10358]] FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos(https://arxiv.org/abs/2504.10358)
Keywords: generation
Abstract: Recent advances in video generation have posed great challenges in the assessment of AI-generated content, particularly with the emergence of increasingly sophisticated models. The various inconsistencies and defects observed in such videos are inherently complex, making overall scoring notoriously difficult. In this paper, we emphasize the critical importance of integrating fine-grained reasoning into video evaluation, and we propose $\textbf{F}$ing$\textbf{ER}$, a novel entity-level reasoning evaluation framework that first automatically generates $\textbf{F}$ine-grained $\textbf{E}$ntity-level questions, and then answers those questions with a $\textbf{R}$easoning model that assigns scores, which can subsequently be combined by a weighted sum into an overall score for different applications. Specifically, we leverage LLMs to derive entity-level questions across five distinct perspectives, which (i) often focus on some specific entities of the content, thereby making answering or scoring much easier for MLLMs, and (ii) are more interpretable. Then we construct a FingER dataset, consisting of approximately 3.3k videos and corresponding 60k fine-grained QA annotations, each with detailed reasons. Based on that, we further investigate various training protocols to best incentivize the reasoning capability of MLLMs for correct answer prediction. Extensive experiments demonstrate that a reasoning model trained using Group Relative Policy Optimization (GRPO) with a cold-start strategy achieves the best performance. Notably, our model surpasses existing methods by a relative margin of $11.8\%$ on GenAI-Bench and $5.5\%$ on MonetBench with only 3.3k training videos, which is at most one-tenth of the training samples utilized by other methods. Our code and dataset will be released soon.
摘要：视频生成的最新进展在评估AI生成的内容方面面临着巨大的挑战，尤其是随着越来越复杂的模型的出现。在此类视频中观察到的各种矛盾和缺陷本质上是复杂的，这使得众所周知的总分很难。在本文中，我们强调将细粒度的推理整合到视频评估中的关键重要性，我们提出了$ \ textbf {f} $ \ textbf {er textbf {er} $，这是一个新颖的实体级别的推理评估框架，该框架首先自动生成$ \ textbf {f textbf {f textbf {f textbf inte $ \ textbf nity-- $ \ textbf {r} $带有分数的$调压模型，随后可以将加权求和到不同应用程序的总分。具体而言，我们利用LLM来从五个不同的角度来得出实体级问题，（i）通常将重点放在内容的某些特定实体上，从而使MLLMS的回答或得分更加容易，并且（ii）更容易解释。然后，我们构建一个手指数据集，该数据集由大约3.3k的视频和相应的60k细粒QA注释组成，每个注释都有详细的原因。基于此，我们进一步研究了各种培训方案，以最好地激励MLLM的推理能力以进行正确的答案预测。广泛的实验表明，使用小组相对政策优化（GRPO）训练的推理模型以冷启动策略实现了最佳性能。值得注意的是，我们的模型在Genai-Bench上的相对利润率为$ 11.8 \％$，而Monetbench的$ 5.5 \％$只有3.3k培训视频，这是其他方法使用的十分之一。我们的代码和数据集将很快发布。
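
The FingER abstract says that per-question scores "can be subsequently weighted summed to an overall score". A minimal sketch of that aggregation step follows, assuming each question has one numeric score and one non-negative weight. The function name `overall_score` and the normalization by the total weight are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def overall_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-question scores into one score by a normalized weighted sum.

    Hypothetical helper: dividing by the total weight is an assumption made
    here so that the result stays on the same scale as the input scores.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


if __name__ == "__main__":
    # Three entity-level questions; the first one counts twice as much.
    print(overall_score([1.0, 0.0, 1.0], [2.0, 1.0, 1.0]))
```

Different applications could reuse the same per-question scores with different weight vectors, which is one plausible reading of "for different applications".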

Title: PG-DPIR: An efficient plug-and-play method for high-count Poisson-Gaussian inverse problems

Authors: Maud Biquard, Marie Chabert, Florence Genin, Christophe Latry, Thomas Oberlin
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2504.10375
Pdf URL: https://arxiv.org/pdf/2504.10375
Copy Paste: [[2504.10375]] PG-DPIR: An efficient plug-and-play method for high-count Poisson-Gaussian inverse problems(https://arxiv.org/abs/2504.10375)
Keywords: restoration, super-resolution
Abstract: Poisson-Gaussian noise describes the noise of various imaging systems, hence the need for efficient algorithms for Poisson-Gaussian image restoration. Deep learning methods offer state-of-the-art performance but often require sensor-specific training when used in a supervised setting. A promising alternative is given by plug-and-play (PnP) methods, which learn only a regularization through a denoiser, allowing images from several sources to be restored with the same network. This paper introduces PG-DPIR, an efficient PnP method for high-count Poisson-Gaussian inverse problems, adapted from DPIR. While DPIR is designed for white Gaussian noise, a naive adaptation to Poisson-Gaussian noise leads to prohibitively slow algorithms due to the absence of a closed-form proximal operator. To address this, we adapt DPIR to the specificities of Poisson-Gaussian noise and propose in particular an efficient initialization of the gradient descent required for the proximal step that accelerates convergence by several orders of magnitude. Experiments are conducted on satellite image restoration and super-resolution problems. High-resolution realistic Pleiades images are simulated for the experiments, which demonstrate that PG-DPIR achieves state-of-the-art performance with improved efficiency, which seems promising for on-ground satellite processing chains.
摘要：Poisson-Gaussian噪声描述了各种成像系统的噪声，因此需要有效的算法来进行泊松高斯图像恢复。深度学习方法提供最先进的性能，但在监督环境中使用时通常需要特定于传感器的培训。插件（PNP）方法给出了一个有希望的替代方案，该方法仅通过通过DeNoiser学习一个正则化，允许从具有相同网络的多个来源恢复图像。本文介绍了PG-DPIR，这是一种适用于DPIR的高计泊托尼高斯逆问题的有效PNP方法。尽管DPIR是为白色高斯噪声而设计的，但由于没有闭合形式的近端操作员，对泊松高斯噪声的天真适应会导致慢慢算法。为了解决这个问题，我们适应了泊松高斯噪声的特异性，并特别提出了近端步骤所需的梯度下降的有效初始化，该梯度下降加速了几个数量级。实验是在卫星图像恢复和超分辨率问题上进行的。为实验模拟了高分辨率现实的pleiades图像，这表明PG-DPIR可以通过提高效率来实现最先进的性能，这对于地面卫星处理链似乎很有希望。
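
For context on the "closed-form proximal operator" mentioned in the PG-DPIR abstract, the block below gives the standard definition of a proximal operator and its closed form in the simplest white-Gaussian denoising case. The symbols $f$, $\tau$, $\sigma$, $y$, and $z$ are notation assumed here for illustration, and the identity forward operator is a simplifying assumption; none of this is the paper's exact formulation.

```latex
% Standard definition: proximal operator of a data-fidelity term f, step size \tau > 0
\[
\operatorname{prox}_{\tau f}(z) \;=\; \arg\min_{x}\; f(x) + \frac{1}{2\tau}\,\lVert x - z \rVert_2^2
\]
% White Gaussian noise, identity forward operator (assumed for illustration):
% f(x) = \frac{1}{2\sigma^2}\lVert x - y \rVert_2^2. Setting the gradient to zero,
% (x - y)/\sigma^2 + (x - z)/\tau = 0, gives the closed form
\[
\operatorname{prox}_{\tau f}(z) \;=\; \frac{\tau\, y + \sigma^2 z}{\tau + \sigma^2}
\]
```

When the noise is Poisson-Gaussian, the data-fidelity term is no longer quadratic, so no such closed form is available and the minimization has to be solved iteratively, which is the cost the abstract refers to.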

Title: HUMOTO: A 4D Dataset of Mocap Human Object Interactions

Authors: Jiaxin Lu, Chun-Hao Paul Huang, Uttaran Bhattacharya, Qixing Huang, Yi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10414
Pdf URL: https://arxiv.org/pdf/2504.10414
Copy Paste: [[2504.10414]] HUMOTO: A 4D Dataset of Mocap Human Object Interactions(https://arxiv.org/abs/2504.10414)
Keywords: generation
Abstract: We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 736 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmarks compared to other datasets. HUMOTO's comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains with practical applications in animation, robotics, and embodied AI systems. Project: this https URL .
摘要：我们用物体（Humoto）介绍了人类动作，这是一个针对运动，计算机视觉和机器人技术应用的人类对象相互作用的高保真数据集。 Humoto具有736个序列（30 fps的7,875秒），可捕获与63个精确建模的对象和72个铰接零件的相互作用。我们的创新包括场景驱动的LLM脚本管道，创建具有自然进展的完整，有目的的任务，以及一个MoCap和相机录音设置，以有效地处理遮挡。跨越从烹饪到户外野餐的各种活动，Humoto保留了身体准确性和逻辑任务流。专业艺术家严格清洁和验证每个序列，最大程度地减少脚滑和对象穿透。与其他数据集相比，我们还提供基准。 Humoto的全面全身运动和同时的多对象交互解决了关键的数据捕获挑战，并提供了机会，以推动跨研究领域的现实人类对象的相互作用建模，并在动画，机器人技术和体现的AI系统中实现了实际应用。项目：此HTTPS URL。

Title: MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Authors: Jian Liu, Wei Sun, Hui Yang, Jin Zheng, Zichen Geng, Hossein Rahmani, Ajmal Mian
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.10433
Pdf URL: https://arxiv.org/pdf/2504.10433
Copy Paste: [[2504.10433]] MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model(https://arxiv.org/abs/2504.10433)
Keywords: generation
Abstract: Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at this https URL.
摘要：对象姿势估计是机器人理解和与环境互动的核心手段。对于此任务，单眼类别级别的方法很有吸引力，因为它们仅需要一个RGB摄像头。但是，当前的方法依赖于阶级已知对象的形状先验或CAD模型。我们提出了一种基于扩散的单眼类别级9D对象姿势生成方法，monodiff9d。我们的动机是利用扩散模型的概率性质，以减轻对形状先验，CAD模型或深度传感器的需求，以进行阶级内部未知物体姿势估计。我们首先以零拍的方式从单眼图像中通过Dinov2估算了粗糙深度，然后将其转换为点云。然后，我们将点云的全局特征与输入图像融合在一起，并使用融合功能以及编码的时间步骤来调节monodiff9d。最后，我们设计了一个基于变压器的DeNoiser，以从高斯噪声中恢复对象姿势。在两个流行的基准数据集上进行的广泛实验表明，Monodiff9D实现了最先进的单眼类别级别级别9D对象姿势估计精度，而无需在任何阶段使用形状先验或CAD模型。我们的代码将在此HTTPS URL上公开。
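
The MonoDiff9D abstract describes estimating coarse depth and converting it into a point cloud. A minimal sketch of that conversion is given below, assuming a standard pinhole camera model with known intrinsics. The function name `depth_to_point_cloud`, the parameters `fx`, `fy`, `cx`, `cy`, and the rule of skipping non-positive depths are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth map into 3D camera coordinates (pinhole model).

    depth[v][u] is the depth at pixel column u, row v. Pixels with
    non-positive depth are treated as invalid and skipped (an assumption).
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            points.append((x, y, z))
    return points


if __name__ == "__main__":
    toy_depth = [[1.0, 2.0], [0.0, 4.0]]
    for p in depth_to_point_cloud(toy_depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5):
        print(p)
```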

Title: Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

Authors: Taihang Hu, Linxuan Li, Kai Wang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10434
Pdf URL: https://arxiv.org/pdf/2504.10434
Copy Paste: [[2504.10434]] Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing(https://arxiv.org/abs/2504.10434)
Keywords: generation, generative
Abstract: Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pave the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at this https URL.
摘要：文本对图像生成已经通过扩散模型看到了突破性的进步，从而实现了高保真性的综合和通过交叉注意操作进行精确的图像编辑。最近，自回归（AR）模型已重新出现为强大的替代方案，利用下一代生成来匹配扩散模型。但是，由于结构控制的根本差异，专为扩散模型设计的现有编辑技术无法直接转化为AR模型。具体而言，AR模型遭受注意图的空间贫困和图像编辑过程中结构错误的顺序积累，这破坏了对象布局和全局一致性。在这项工作中，我们引入了隐式结构锁定（ISLOCK），这是AR视觉模型的第一个无培训编辑策略。 Islock并不依赖于明确的注意操纵或微调，而是通过通过锚定令牌匹配（ATT）协议动态地对准自我发项模式来保留结构蓝图。通过隐式在潜在空间中执行结构一致性，我们的方法ISLOCK可以在保持生成性自主权的同时进行结构意识的编辑。广泛的实验表明，ISLOCK在没有额外培训的情况下实现了高质量的结构符合编辑，并且与常规编辑技术相当。我们的发现为高效且灵活的基于AR的图像编辑开辟了道路，进一步弥合了扩散和自回旋生成模型之间的性能差距。该代码将在此HTTPS URL上公开可用

Title: M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

Authors: Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10449
Pdf URL: https://arxiv.org/pdf/2504.10449
Copy Paste: [[2504.10449]] M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models(https://arxiv.org/abs/2504.10449)
Keywords: generation
Abstract: Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general-purpose inference engine, vLLM, and observe more than a 3x speedup compared to a transformer of the same size. With this throughput speedup, we are able to achieve higher accuracy than DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning.
摘要：有效的推理对于解决复杂的数学问题至关重要。最近的大型语言模型（LLM）通过长期的经过经过经过经过经过经过经过经验的推理来扩展测试时间计算，从而提高了性能。但是，由于其二次计算复杂性和线性内存要求，基于变压器的模型在扩展上下文长度方面存在固有的限制。在本文中，我们介绍了一种基于Mamba体系结构的新型混合线性RNN推理模型M1，该模型允许记忆效率推理。我们的方法利用现有推理模型的蒸馏过程，并通过RL培训进一步增强。 AIME和数学基准的实验结果表明，M1不仅胜过先前的线性RNN模型，而且还与类似规模的最新DeepSeek R1蒸馏式推理模型相匹配。我们还将我们的生成速度与高性能的通用推理引擎VLLM进行了比较，并且与相同尺寸的变压器相比，观察到超过3倍的速度。借助吞吐量加速，与DeepSeek R1蒸馏器推理模型在固定生成时间预算下，我们能够实现更高的准确性。总体而言，我们引入了混合MAMBA推理模型，并提供了一种更有效的方法，可以使用自一致性或长时间的思维推理来扩展测试时间的生成。
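
The M1 abstract relies on self-consistency voting under a fixed generation time budget: a faster model can draw more samples in the same time, and the most frequent final answer is returned. A minimal sketch of the voting step is shown below, assuming the final answers have already been extracted as strings. The function name `self_consistency_vote` and the tie-breaking rule (first-seen answer wins) are illustrative assumptions.

```python
from collections import Counter
from typing import Sequence


def self_consistency_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among sampled generations.

    Ties are broken in favor of the answer that appeared first, which
    follows from Counter preserving insertion order (Python 3.7+).
    """
    if not answers:
        raise ValueError("at least one sampled answer is required")
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Four sampled reasoning chains, reduced to their final answers.
    print(self_consistency_vote(["42", "41", "42", "7"]))
```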

Title: Art3D: Training-Free 3D Generation from Flat-Colored Illustration

Authors: Xiaoyan Cong, Jiayi Shen, Zekun Li, Rao Fu, Tao Lu, Srinath Sridhar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10466
Pdf URL: https://arxiv.org/pdf/2504.10466
Copy Paste: [[2504.10466]] Art3D: Training-Free 3D Generation from Flat-Colored Illustration(https://arxiv.org/abs/2504.10466)
Keywords: generation, generative
Abstract: Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generation. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored, like hand drawings, due to the lack of 3D illusion, even though such images are often the most user-friendly input modality in art content creation. To this end, we propose Art3D, a training-free method that can lift flat-colored 2D designs into 3D. By leveraging structural and semantic features with pre-trained 2D image generation models and a VLM-based realism evaluation, Art3D successfully enhances the three-dimensional illusion in reference images, thus simplifying the process of generating 3D from 2D, and proves adaptable to a wide range of painting styles. To benchmark the generalization performance of existing image-to-3D models on flat-colored images without 3D feeling, we collect a new dataset, Flat-2D, with over 100 samples. Experimental results demonstrate the performance and robustness of Art3D, exhibiting superior generalizable capacity and promising practical applicability. Our source code and dataset will be publicly available on our project page: this https URL .
摘要：大规模训练的图像到3D生成模型在各种形状的世代中表现出显着的功能。但是，当参考图像由于缺乏3D幻觉而纯色（如手绘）时，他们中的大多数人都难以合成合理的3D资产，这通常是艺术内容创建中最用户友好的输入方式。为此，我们提出了ART3D，这是一种无训练的方法，可以将扁平颜色的2D设计提高到3D。通过使用预先训练的2D图像生成模型和基于VLM的现实主义评估来利用结构和语义特征，ART3D成功地增强了参考图像中的三维幻觉，从而简化了从2D产生3D的过程，并证明可适应于广泛的绘画样式。为了基准在没有3D感觉的平面图像上现有的图像到3D模型的概括性能，我们收集了一个新的数据集，Flat-2D，其中有100多个样本。实验结果证明了ART3D的性能和鲁棒性，表现出卓越的可推广能力和有希望的实际适用性。我们的源代码和数据集将在我们的项目页面上公开可用：此HTTPS URL。

Title: Weight Ensembling Improves Reasoning in Language Models

Authors: Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, Aditi Raghunathan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10478
Pdf URL: https://arxiv.org/pdf/2504.10478
Copy Paste: [[2504.10478]] Weight Ensembling Improves Reasoning in Language Models(https://arxiv.org/abs/2504.10478)
Keywords: generation
Abstract: We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved through diversity-inducing decoding strategies alone, such as temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.
摘要：我们研究了在推理模型训练过程中产生的失败模式，该模型的多样性开始崩溃，导致了次优的测试时间缩放。值得注意的是，通行证@1率可靠地改善了监督填充（SFT），但通过@k迅速恶化。令人惊讶的是，简单的干预措施是用早期检查点（也称为Wise-ft）插值最新SFT检查点的权重，几乎完全恢复了Pass@k，同时也改善了Pass@1。 Wise-FT的变体可以实现更好的测试时间缩放（最佳@K，多数投票），并通过增强学习进一步调整数据，从而通过更少的数据取得了优势。最后，我们发现，明智的FT提供了互补的性能增长，仅通过诱导多样性的解码策略（例如温度缩放）才能实现。我们对Pass@K的偏见变化权衡方面正式，就测试分配而言，Pass@1的期望和差异。我们发现，明智的ft可以同时减少偏见和差异，而温度缩放在偏置和方差之间固有地进行交易。
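
The intervention described in the weight-ensembling abstract is a linear interpolation between the weights of an early checkpoint and the latest SFT checkpoint. A minimal sketch follows, assuming each checkpoint is a dictionary from parameter name to a flat list of floats. The function name `interpolate_weights`, the mixing coefficient `alpha`, and the dictionary layout are illustrative assumptions; a real setup would apply the same formula to framework tensors.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def interpolate_weights(early: Checkpoint, late: Checkpoint, alpha: float) -> Checkpoint:
    """Return (1 - alpha) * early + alpha * late, parameter by parameter.

    alpha = 0 reproduces the early checkpoint, alpha = 1 the late one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if early.keys() != late.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    merged: Checkpoint = {}
    for name in early:
        if len(early[name]) != len(late[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * e + alpha * l for e, l in zip(early[name], late[name])
        ]
    return merged


if __name__ == "__main__":
    early_ckpt = {"w": [0.0, 2.0]}
    late_ckpt = {"w": [1.0, 4.0]}
    print(interpolate_weights(early_ckpt, late_ckpt, alpha=0.5))
```

The abstract does not state which interpolation coefficient is used, so `alpha` is left as a free parameter here.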

Title: InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10479
Pdf URL: https://arxiv.org/pdf/2504.10479
Copy Paste: [[2504.10479]] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models(https://arxiv.org/abs/2504.10479)
Keywords: generation
Abstract: We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
摘要：我们介绍了InternVL3，这是Intervl系列的重大进步，该系列具有本地多模式预训练范式。 Intervl3并没有将仅文本大语模型（LLM）调整为支持视觉输入的多模式大型语言模型（MLLM），而是在单个预训练阶段中共同从多样化的多模式数据和纯文本公司中获得多模式和语言能力。这种统一的训练范式有效地解决了MLLM的常规事后培训管道中通常遇到的复杂性和一致性挑战。为了进一步提高性能和可伸缩性，InternVL3结合了可变的视觉位置编码（V2PE），以支持扩展的多模式环境，采用先进的后培训技术，例如监督的微调（SFT）（SFT）和混合偏好优化（MPO），并采用优化的培训培训策略以及优化的测试时间缩放策略。广泛的经验评估表明，Intervl3在各种多模式任务中提供了卓越的性能。特别是，Intervl3-78B在MMMU基准上取得了72.2的成绩，在开源MLLM中创造了新的最新时间。它的功能在领先的专有模型中仍然具有很高的竞争力，包括Chatgpt-4O，Claude 3.5 Sonnet和Gemini 2.5 Pro，同时还保持了强大的纯语言水平。为了追求开放科学原则，我们将公开发布培训数据和模型权重，以促进下一代MLLM的进一步研究和开发。

Title: REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Authors: Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10483
Pdf URL: https://arxiv.org/pdf/2504.10483
Copy Paste: [[2504.10483]] REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers(https://arxiv.org/abs/2504.10483)
Keywords: generation
Abstract: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the VAE and the diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself, leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art, achieving FID of 1.26 and 1.83 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at this https URL.
摘要：在本文中，我们解决了一个基本问题：“我们可以以端到端的方式训练潜在扩散模型以及变异自动编码器（VAE）令牌吗？”传统的深入学习智慧表明，在可能的情况下通常可以端对端训练。但是，对于潜在扩散变压器，可以观察到使用标准扩散损失的端到端训练VAE和扩散模型无效，甚至导致最终性能降解。我们表明，尽管扩散损失是无效的，但可以通过表示形式对准（REPA）损失来解锁端到端训练 - 允许在训练过程中共同调整VAE和扩散模型。尽管它很简单，但拟议的培训配方（REPA-E）表现出色。超过17倍和45倍的扩散模型训练分别超过17倍和45倍。有趣的是，我们观察到，通过Repa-E进行端到端的调整也可以改善VAE本身。导致改善潜在空间结构和下游生成性能。在最终表现方面，我们的方法设定了新的最先进；在ImageNet 256 x 256上实现有没有分类器的指南，达到1.26和1.83的FID。可在此HTTPS URL上获得代码。

Title: Decoupled Diffusion Sparks Adaptive Scene Generation

Authors: Yunsong Zhou, Naisheng Ye, William Ljungbergh, Tianyu Li, Jiazhi Yang, Zetong Yang, Hongzi Zhu, Christoffer Petersson, Hongyang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10485
Pdf URL: https://arxiv.org/pdf/2504.10485
Copy Paste: [[2504.10485]] Decoupled Diffusion Sparks Adaptive Scene Generation(https://arxiv.org/abs/2504.10485)
Keywords: generation
Abstract: Controllable scene generation could reduce the cost of diverse data collection substantially for autonomous driving. Prior works formulate the traffic layout generation as predictive progress, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios due to a large number of safe and ordinal driving behaviors from open datasets. To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinal and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset consisting of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-in, sudden braking, and collision. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.
摘要：可控场景的产生可以大大降低各种数据收集的成本，以实现自动驾驶。先前的工作将流量布局生成作为预测进度，通过一次或迭代预测下一帧来确定整个序列。但是，完全序列降级阻碍了在线反应，而后者的短视下一框架预测缺乏精确的目标状态指导。此外，由于开放数据集的大量安全和有序的驾驶行为，博学的模型难以生成复杂或具有挑战性的场景。为了克服这些问题，我们介绍了Nexus，这是一个脱钩的场景生成框架，通过模拟具有独立噪声状态的细粒度的代币，改善了反应性和目标调理。脱钩管道的核心是集成部分噪声掩盖训练策略和噪音吸引的时间表，以确保在整个剥离过程中及时进行环境更新。为了补充具有挑战性的场景，我们收集了一个由复杂角案例组成的数据集。它涵盖了540小时的模拟数据，包括高风险的相互作用，例如切割，突然制动和碰撞。 Nexus在保留反应性和目标取向的同时，取得了上一代的现实主义，位移误差降低了40％。我们进一步证明，Nexus通过数据扩大将闭环计划提高了20％，并展示了其在安全至关重要的数据生成中的能力。