2025-11-26

Title: Personalized Reward Modeling for Text-to-Image Generation

Authors: Jeongeun Lee, Ryang Heo, Dongha Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19458
Pdf URL: https://arxiv.org/pdf/2511.19458
Copy Paste: [[2511.19458]] Personalized Reward Modeling for Text-to-Image Generation(https://arxiv.org/abs/2511.19458)
Keywords: generation
Abstract: Recent text-to-image (T2I) models generate semantically coherent images from textual prompts, yet evaluating how well they align with individual user preferences remains an open challenge. Conventional evaluation methods, such as general reward functions or similarity-based metrics, fail to capture the diversity and complexity of personal visual tastes. In this work, we present PIGReward, a personalized reward model that dynamically generates user-conditioned evaluation dimensions and assesses images through chain-of-thought (CoT) reasoning. To address the scarcity of user data, PIGReward adopts a self-bootstrapping strategy that reasons over limited reference data to construct rich user contexts, enabling personalization without user-specific training. Beyond evaluation, PIGReward provides personalized feedback that drives user-specific prompt optimization, improving alignment between generated images and individual intent. We further introduce PIGBench, a per-user preference benchmark capturing diverse visual interpretations of shared prompts. Extensive experiments demonstrate that PIGReward surpasses existing methods in both accuracy and interpretability, establishing a scalable and reasoning-based foundation for personalized T2I evaluation and optimization. Taken together, our findings highlight PIGReward as a robust step toward individually aligned T2I generation.
摘要：最近的文本到图像（T2I）模型根据文本提示生成语义连贯的图像，但评估它们与个人用户偏好的一致性仍然是一个开放的挑战。传统的评估方法、一般奖励函数或基于相似性的指标无法捕捉个人视觉品味的多样性和复杂性。在这项工作中，我们提出了 PIGReward，这是一种个性化奖励模型，可以动态生成用户条件评估维度并通过 CoT 推理来评估图像。为了解决用户数据稀缺的问题，PIGReward 采用自引导策略，通过有限的参考数据进行推理来构建丰富的用户上下文，无需针对特定用户的培训即可实现个性化。除了评估之外，PIGReward 还提供个性化反馈，推动用户特定的提示优化，提高生成的图像与个人意图之间的一致性。我们进一步介绍了 PIGBench，这是一个针对每个用户的偏好基准，可捕获共享提示的不同视觉解释。大量实验表明，PIGReward 在准确性和可解释性方面都超越了现有方法，为个性化 T2I 评估和优化建立了可扩展且基于推理的基础。总而言之，我们的研究结果强调 PIGReward 是迈向个性化 T2I 一代的有力一步。

Title: PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer

Authors: Ruogu Ding, Xin Ning, Ulf Schlichtmann, Weikang Qian
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2511.19472
Pdf URL: https://arxiv.org/pdf/2511.19472
Copy Paste: [[2511.19472]] PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer(https://arxiv.org/abs/2511.19472)
Keywords: generation, generative
Abstract: Prefix adders are widely used in compute-intensive applications for their high speed. However, designing optimized prefix adders is challenging due to strict design rules and an exponentially large design space. We introduce PrefixGPT, a generative pre-trained Transformer (GPT) that directly generates optimized prefix adders from scratch. Our approach represents an adder's topology as a two-dimensional coordinate sequence and applies a legality mask during generation, ensuring every design is valid by construction. PrefixGPT features a customized decoder-only Transformer architecture. The model is first pre-trained on a corpus of randomly synthesized valid prefix adders to learn design rules and then fine-tuned to navigate the design space for optimized design quality. Compared with existing works, PrefixGPT not only finds a new optimal design with a 7.7% improved area-delay product (ADP) but exhibits superior exploration quality, lowering the average ADP by up to 79.1%. This demonstrates the potential of GPT-style models to first master complex hardware design principles and then apply them for more efficient design optimization.
摘要：前缀加法器因其高速度而广泛用于计算密集型应用。然而，由于严格的设计规则和指数级大的设计空间，设计优化的前缀加法器具有挑战性。我们引入了 PrefixGPT，这是一种生成式预训练 Transformer (GPT)，可以从头开始直接生成优化的前缀加法器。我们的方法将加法器的拓扑表示为二维坐标序列，并在生成过程中应用合法性掩码，确保每个设计在构造时都是有效的。 PrefixGPT 具有定制的仅解码器 Transformer 架构。该模型首先在随机合成的有效前缀添加器的语料库上进行预训练，以学习设计规则，然后进行微调以导航设计空间以优化设计质量。与现有工作相比，PrefixGPT不仅找到了新的优化设计，面积延迟积（ADP）提高了7.7%，而且表现出卓越的勘探质量，平均ADP降低了高达79.1%。这证明了 GPT 式模型首先掌握复杂的硬件设计原理，然后应用它们进行更有效的设计优化的潜力。
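
The "legality mask during generation" idea can be illustrated with a generic masked softmax. This is a minimal sketch, not PrefixGPT's implementation: the abstract does not state the actual design rules, so the `legal` flags below are hypothetical placeholders for "this next coordinate would keep the adder valid".

```python
import math

def masked_softmax(logits, legal):
    """Softmax over logits with illegal positions forced to probability 0."""
    if not any(legal):
        raise ValueError("at least one action must be legal")
    # Subtract the largest legal logit for numerical stability.
    top = max(l for l, ok in zip(logits, legal) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, legal)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: candidate 2 has the highest logit but is marked
# illegal, so it can never be sampled.
logits = [2.0, 0.5, 3.0, -1.0]
legal = [True, True, False, True]
probs = masked_softmax(logits, legal)
best = max(range(len(probs)), key=probs.__getitem__)
```

Because illegal candidates receive exactly zero probability, any sequence sampled this way satisfies the mask at every step, which is one way to read "valid by construction".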

Title: WavefrontDiffusion: Dynamic Decoding Schedule or Improved Reasoning

Authors: Haojin Yang, Rui Hu, Zequn Sun, Rui Zhou, Yujun Cai, Yiwei Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19473
Pdf URL: https://arxiv.org/pdf/2511.19473
Copy Paste: [[2511.19473]] WavefrontDiffusion: Dynamic Decoding Schedule or Improved Reasoning(https://arxiv.org/abs/2511.19473)
Keywords: generation
Abstract: Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost equal to block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.
摘要：扩散语言模型 (DLM) 在文本生成方面显示出强大的潜力，并且正在成为自回归模型的有竞争力的替代方案。去噪策略在确定输出质量方面发挥着重要作用。主流的去噪策略包括Standard Diffusion和BlockDiffusion。标准扩散在不限制更新范围的情况下执行全局去噪，通常最终确定不完整的上下文并导致过早的序列结束预测。 BlockDiffusion 按预设顺序更新固定大小的块，但其严格的结构会破坏连贯的语义单元并扰乱推理。我们提出了 WavefrontDiffusion，一种动态解码方法，可将活动令牌的波前从最终位置向外扩展。这种自适应过程遵循语义结构的自然流程，同时保持计算成本等于基于块的方法。在推理和代码生成的四个基准中，WavefrontDiffusion 实现了最先进的性能，同时产生具有更高语义保真度的输出，显示了自适应调度对于更连贯和高效的生成的价值。
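
One possible reading of "expands a wavefront of active tokens outward from finalized positions" is sketched below. The function name, the `radius` parameter, and the rule that every active token is finalized at each step are assumptions made for illustration; the abstract does not specify how the method selects or finalizes tokens.

```python
def wavefront(finalized, length, radius=1):
    """Return unfinalized positions within `radius` of any finalized position."""
    active = set()
    for p in finalized:
        for q in range(max(0, p - radius), min(length, p + radius + 1)):
            if q not in finalized:
                active.add(q)
    return sorted(active)

# Hypothetical setup: positions 0-2 are a fixed prompt, sequence length is 8.
finalized = {0, 1, 2}
steps = []
while len(finalized) < 8:
    front = wavefront(finalized, length=8, radius=2)
    steps.append(front)
    # Simplification: finalize the whole front. A real decoder would
    # presumably finalize only the tokens it is confident about.
    finalized.update(front)
```

In this toy run the active set moves through the sequence as `[3, 4]`, then `[5, 6]`, then `[7]`, so the number of tokens updated per step stays bounded, much like a fixed block size, while the boundary is defined relative to what has already been finalized.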

Title: Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Authors: Jie Li, Hongyi Cai, Mingkang Dong, Muxin Pu, Shan You, Fei Wang, Tao Huang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2511.19474
Pdf URL: https://arxiv.org/pdf/2511.19474
Copy Paste: [[2511.19474]] Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks(https://arxiv.org/abs/2511.19474)
Keywords: generation
Abstract: Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.
摘要：自动检测视频中的异常事件对于现代自主系统至关重要，但现有的视频异常检测 (VAD) 基准缺乏可靠评估现实世界性能所需的场景多样性、平衡的异常覆盖范围和时间复杂度。与此同时，社区越来越多地转向视频异常理解（VAU），这需要更深入的语义和因果推理，但由于需要大量的手动注释工作，仍然难以进行基准测试。在本文中，我们介绍了 Pistachio，这是一种完全通过受控的、基于生成的管道构建的新 VAD/VAU 基准。通过利用视频生成模型的最新进展，Pistachio 提供了对场景、异常类型和时间叙述的精确控制，有效消除了互联网收集数据集的偏差和限制。我们的流程集成了场景条件异常分配、多步骤故事情节生成以及时间一致的长格式合成策略，该策略可在最少的人工干预下生成连贯的 41 秒视频。大量的实验证明了 Pistachio 的规模、多样性和复杂性，揭示了现有方法的新挑战，并激发了未来对动态和多事件异常理解的研究。

Title: Quality analysis and evaluation prediction of RAG retrieval based on machine learning algorithms

Authors: Ruoxin Zhang, Zhizhao Wen, Chao Wang, Chenchen Tang, Puyang Xu, Yifan Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19481
Pdf URL: https://arxiv.org/pdf/2511.19481
Copy Paste: [[2511.19481]] Quality analysis and evaluation prediction of RAG retrieval based on machine learning algorithms(https://arxiv.org/abs/2511.19481)
Keywords: generation
Abstract: With the rapid evolution of large language models, retrieval-augmented generation (RAG) has been widely adopted because it integrates external knowledge to improve output accuracy. However, the performance of the system is highly dependent on the quality of the retrieval module. If the retrieval results have low relevance to user needs or contain noisy information, the generated content is directly distorted. In response to the performance bottleneck of existing models in processing tabular features, this paper proposes an XGBoost machine learning regression model based on feature engineering and particle swarm optimization. Correlation analysis shows that answer_quality is positively correlated with doc_delevance (0.66), indicating that document relevance has a significant positive effect on answer quality and that improving document relevance may enhance answer quality. Semantic similarity and redundancy are each strongly negatively correlated with diversity (-0.89 and -0.88, respectively), indicating a trade-off: as the former two increase, diversity decreases significantly. The experimental results comparing decision trees, AdaBoost, etc. show that the VMD PSO BiLSTM model is superior in all evaluation indicators, with significantly lower MSE, RMSE, MAE, and MAPE than the comparison models. The R2 value is higher, indicating that its prediction accuracy, stability, and data interpretation ability are more outstanding. This achievement provides an effective path for optimizing the retrieval quality and improving the generation effect of RAG systems, and has important value in promoting the implementation and application of related technologies.
摘要：随着大型语言模型的快速演进，检索增强生成技术因其能够整合外部知识以提高输出准确性而得到广泛应用。然而，系统的性能高度依赖于检索模块的质量。如果检索结果与用户需求相关性较低或者包含噪声信息，将直接导致生成内容的失真。针对现有模型处理表格特征的性能瓶颈，提出一种基于特征工程和粒子群优化的XGBoost机器学习回归模型。相关性分析显示answer_quality与doc_delevance呈0.66正相关，说明文档相关性对答案质量有显着的正向影响，提高文档相关性可能会提升答案质量；语义相似性、冗余性和多样性之间的强负相关性分别为-0.89和-0.88，表明语义相似性、冗余性和多样性之间存在权衡。换句话说，随着前两者的增加，多样性显着下降。对比决策树、AdaBoost等的实验结果表明，VMD PSO BiLSTM模型在各项评价指标上均优于对比模型，MSE、RMSE、MAE、MAPE均显着低于对比模型。 R2值越高，表明其预测精度、稳定性和数据解释能力越突出。该成果为优化RAG系统检索质量、提高生成效果提供了有效路径，对于推动相关技术的落地和应用具有重要价值。
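
For reference, the evaluation indicators named in the abstract (MSE, RMSE, MAE, MAPE, R2) follow their standard definitions, sketched below in plain Python. The sample values are made up for illustration and are not data from the paper.

```python
import math

def regression_metrics(y_true, y_pred):
    """Standard regression metrics; MAPE assumes no true value is zero."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape, "R2": r2}

# Illustrative values only.
y_true = [1.0, 2.0, 4.0, 5.0]
y_pred = [1.5, 2.0, 3.5, 5.0]
metrics = regression_metrics(y_true, y_pred)
```

Lower MSE, RMSE, MAE, and MAPE indicate smaller prediction error, while R2 closer to 1 indicates that the model explains more of the variance in the target.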

Title: Generative Model-Aided Continual Learning for CSI Feedback in FDD mMIMO-OFDM Systems

Authors: Guijun Liu, Yuwen Cao, Tomoaki Ohtsuki, Jiguang He, Shahid Mumtaz
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2511.19490
Pdf URL: https://arxiv.org/pdf/2511.19490
Copy Paste: [[2511.19490]] Generative Model-Aided Continual Learning for CSI Feedback in FDD mMIMO-OFDM Systems(https://arxiv.org/abs/2511.19490)
Keywords: generative
Abstract: Deep autoencoder (DAE) frameworks have demonstrated their effectiveness in reducing channel state information (CSI) feedback overhead in massive multiple-input multiple-output (mMIMO) orthogonal frequency division multiplexing (OFDM) systems. However, existing CSI feedback models struggle to adapt to dynamic environments caused by user mobility, requiring retraining when encountering new CSI distributions. Moreover, returning to previously encountered environments often leads to performance degradation due to catastrophic forgetting. Continual learning involves enabling models to incorporate new information while maintaining performance on previously learned tasks. To address these challenges, we propose a generative adversarial network (GAN)-based learning approach for CSI feedback. By using a GAN generator as a memory unit, our method preserves knowledge from past environments and ensures consistently high performance across diverse scenarios without forgetting. Simulation results show that the proposed approach enhances the generalization capability of the DAE framework while maintaining low memory overhead. Furthermore, it can be seamlessly integrated with other advanced CSI feedback models, highlighting its robustness and adaptability.
摘要：深度自动编码器 (DAE) 框架已证明其在减少大规模多输入多输出 (mMIMO) 正交频分复用 (OFDM) 系统中的信道状态信息 (CSI) 反馈开销方面的有效性。然而，现有的 CSI 反馈模型难以适应用户移动性引起的动态环境，在遇到新的 CSI 分布时需要重新训练。此外，返回以前遇到的环境通常会因灾难性遗忘而导致性能下降。持续学习包括使模型能够融入新信息，同时保持先前学习任务的性能。为了应对这些挑战，我们提出了一种基于生成对抗网络（GAN）的 CSI 反馈学习方法。通过使用 GAN 生成器作为存储单元，我们的方法可以保留过去环境中的知识，并确保在不同场景下始终保持高性能而不会忘记。仿真结果表明，所提出的方法增强了DAE框架的泛化能力，同时保持了较低的内存开销。此外，它可以与其他先进的CSI反馈模型无缝集成，凸显其鲁棒性和适应性。

Title: PeriodNet: Boosting the Potential of Attention Mechanism for Time Series Forecasting

Authors: Bowen Zhao, Huanlai Xing, Zhiwen Xiao, Jincheng Peng, Li Feng, Xinhan Wang, Rong Qu, Hui Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19497
Pdf URL: https://arxiv.org/pdf/2511.19497
Copy Paste: [[2511.19497]] PeriodNet: Boosting the Potential of Attention Mechanism for Time Series Forecasting(https://arxiv.org/abs/2511.19497)
Keywords: generative
Abstract: The attention mechanism has demonstrated remarkable potential in sequence modeling, exemplified by its successful application in natural language processing with models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT). Despite these advancements, its utilization in time series forecasting (TSF) has yet to meet expectations. Exploring a better network structure for attention in TSF holds immense significance across various domains. In this paper, we present PeriodNet with a brand new structure to forecast univariate and multivariate time series. PeriodNet incorporates period attention and sparse period attention mechanism for analyzing adjacent periods. It enhances the mining of local characteristics, periodic patterns, and global dependencies. For efficient cross-variable modeling, we introduce an iterative grouping mechanism which can directly reduce the cross-variable redundancy. To fully leverage the extracted features on the encoder side, we redesign the entire architecture of the vanilla Transformer and propose a period diffuser for precise multi-period prediction. Through comprehensive experiments conducted on eight datasets, we demonstrate that PeriodNet outperforms six state-of-the-art models in both univariate and multivariate TSF scenarios in terms of mean square error and mean absolute error. In particular, PeriodNet achieves a relative improvement of 22% when forecasting time series with a length of 720, in comparison to other models based on the conventional encoder-decoder Transformer architecture.
摘要：注意力机制在序列建模中表现出了巨大的潜力，其在自然语言处理中的成功应用就是例证，其中包括来自 Transformers 的双向编码器表示 (BERT) 和生成预训练 Transformer (GPT) 等模型。尽管取得了这些进步，但其在时间序列预测（TSF）中的利用率尚未达到预期。探索 TSF 中更好的注意力网络结构在各个领域都具有巨大的意义。在本文中，我们提出了具有全新结构的PeriodNet来预测单变量和多变量时间序列。 periodNet 结合了周期注意力和稀疏周期注意力机制来分析相邻周期。它增强了对局部特征、周期性模式和全局依赖性的挖掘。为了有效的跨变量建模，我们引入了迭代分组机制，可以直接减少跨变量冗余。为了充分利用编码器端提取的特征，我们重新设计了 vanilla Transformer 的整个架构，并提出了一种用于精确多周期预测的周期扩散器。通过对八个数据集进行的综合实验，我们证明了PeriodNet 在单变量和多变量 TSF 场景中的均方误差和平均绝对误差方面均优于六种最先进的模型。特别是，与其他基于传统编码器-解码器 Transformer 架构的模型相比，PeriodNet 在预测长度为 720 的时间序列时相对提高了 22%。

Title: Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection

Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2511.19499
Pdf URL: https://arxiv.org/pdf/2511.19499
Copy Paste: [[2511.19499]] Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection(https://arxiv.org/abs/2511.19499)
Keywords: generative
Abstract: The rapid advancement of generators (e.g., StyleGAN, Midjourney, DALL-E) has produced highly realistic synthetic images, posing significant challenges to digital media authenticity. These generators are typically based on a few core architectural families, primarily Generative Adversarial Networks (GANs) and Diffusion Models (DMs). A critical vulnerability in current forensics is the failure of detectors to achieve cross-generator generalization, especially when crossing architectural boundaries (e.g., from GANs to DMs). We hypothesize that this gap stems from fundamental differences in the artifacts produced by these distinct architectures. In this work, we provide a theoretical analysis explaining how the distinct optimization objectives of the GAN and DM architectures lead to different manifold coverage behaviors. We demonstrate that GANs permit partial coverage, often leading to boundary artifacts, while DMs enforce complete coverage, resulting in over-smoothing patterns. Motivated by this analysis, we propose the Triarchy Detector (TriDetect), a semi-supervised approach that enhances binary classification by discovering latent architectural patterns within the "fake" class. TriDetect employs balanced cluster assignment via the Sinkhorn-Knopp algorithm and a cross-view consistency mechanism, encouraging the model to learn fundamental architectural distinctions. We evaluate our approach on two standard benchmarks and three in-the-wild datasets against 13 baselines to demonstrate its generalization capability to unseen generators.
摘要：生成器（例如 StyleGAN、Midjourney、DALL-E）的快速发展产生了高度逼真的合成图像，对数字媒体的真实性提出了重大挑战。这些生成器通常基于几个核心架构系列，主要是生成对抗网络（GAN）和扩散模型（DM）。当前取证中的一个关键漏洞是检测器无法实现跨生成器泛化，特别是在跨越架构边界（例如，从 GAN 到 DM）时。我们假设这种差距源于这些不同架构产生的工件的根本差异。在这项工作中，我们提供了理论分析，解释了 GAN 和 DM 架构的不同优化目标如何导致不同的流形覆盖行为。我们证明 GAN 允许部分覆盖，通常会导致边界伪影，而 DM 则强制完全覆盖，导致过度平滑的模式。受此分析的启发，我们提出了 Triarchy Detector (TriDetect)，这是一种半监督方法，通过发现“假”类中的潜在架构模式来增强二元分类。 TriDetect 通过 Sinkhorn-Knopp 算法和跨视图一致性机制采用平衡集群分配，鼓励模型学习基本的架构差异。我们在两个标准基准和三个野外数据集（针对 13 个基线）上评估我们的方法，以证明其对未见过的生成器的泛化能力。
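
Balanced cluster assignment via the Sinkhorn-Knopp algorithm can be sketched as alternating column and row normalization of an exponentiated score matrix. This is a generic pure-Python sketch under assumed settings (the temperature `epsilon`, the iteration count, and the toy scores are all illustrative), not TriDetect's configuration.

```python
import math

def sinkhorn_knopp(scores, epsilon=0.5, n_iters=50):
    """Turn an N x K score matrix into soft assignments whose rows sum to 1
    and whose columns (clusters) each receive roughly N / K total mass."""
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: give every cluster the same total mass, 1 / K.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Row step: give every sample the same total mass, 1 / N.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n) for v in row] for i, row in enumerate(q)]
    # Rescale so each sample's assignment is a probability distribution.
    return [[v * n for v in row] for row in q]

# Toy scores: all four samples prefer cluster 0 before balancing.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
assignments = sinkhorn_knopp(scores)
cluster_mass = [sum(row[j] for row in assignments) for j in range(2)]
```

Although every sample initially favors cluster 0, the balanced solution spreads the total mass evenly across both clusters while preserving the relative ordering of the samples' preferences.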

Title: Profile Generators: A Link between the Narrative and the Binary Matrix Representation

Authors: Raoul H. Kutil, Georg Zimmermann, Barbara Strasser-Kirchweger, Christian Borgelt
Subjects: cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2511.19506
Pdf URL: https://arxiv.org/pdf/2511.19506
Copy Paste: [[2511.19506]] Profile Generators: A Link between the Narrative and the Binary Matrix Representation(https://arxiv.org/abs/2511.19506)
Keywords: generation
Abstract: Mental health disorders, particularly cognitive disorders defined by deficits in cognitive abilities, are described in detail in the DSM-5, which includes definitions and examples of signs and symptoms. A simplified, machine-actionable representation was developed to assess the similarity and separability of these disorders, but it is not suited for the most complex cases. Generating or applying a full binary matrix for similarity calculations is infeasible due to the vast number of symptom combinations. This research develops an alternative representation that links the narrative form of the DSM-5 with the binary matrix representation and enables automated generation of valid symptom combinations. Using a strict pre-defined format of lists, sets, and numbers with slight variations, complex diagnostic pathways involving numerous symptom combinations can be represented. This format, called the symptom profile generator (or simply generator), provides a readable, adaptable, and comprehensive alternative to a binary matrix while enabling easy generation of symptom combinations (profiles). Cognitive disorders, which typically involve multiple diagnostic criteria with several symptoms, can thus be expressed as lists of generators. Representing several psychotic disorders in generator form and generating all symptom combinations showed that matrix representations of complex disorders become too large to manage. The MPCS (maximum pairwise cosine similarity) algorithm cannot handle matrices of this size, prompting the development of a profile reduction method using targeted generator manipulation to find specific MPCS values between disorders. The generators allow easier creation of binary representations for large matrices and make it possible to calculate specific MPCS cases between complex disorders through conditional generators.
摘要：DSM-5 详细描述了精神健康障碍，特别是由认知能力缺陷定义的认知障碍，其中包括体征和症状的定义和示例。开发了一种简化的、机器可操作的表示来评估这些疾病的相似性和可分离性，但它不适合最复杂的病例。由于存在大量的症状组合，生成或应用完整的二进制矩阵进行相似性计算是不可行的。这项研究开发了一种替代表示法，将 DSM-5 的叙述形式与二进制矩阵表示法联系起来，并能够自动生成有效的症状组合。使用严格的预定义格式的列表、集合和数字（略有变化），可以表示涉及多种症状组合的复杂诊断路径。这种格式称为症状概况生成器（或简称生成器），为二进制矩阵提供了可读、适应性强且全面的替代方案，同时可以轻松生成症状组合（概况）。因此，认知障碍通常涉及具有多种症状的多种诊断标准，因此可以表示为生成器列表。以生成器形式表示几种精神疾病并生成所有症状组合表明，复杂疾病的矩阵表示变得太大而无法管理。 MPCS（最大成对余弦相似度）算法无法处理这种大小的矩阵，这促使开发一种轮廓缩减方法，使用目标生成器操作来查找疾病之间的特定 MPCS 值。生成器可以更轻松地创建大型矩阵的二进制表示，并可以通过条件生成器计算复杂疾病之间的特定 MPCS 案例。
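
To make the relationship between generators, profiles, and MPCS concrete, the sketch below uses a deliberately simplified generator of the form "at least `min_count` of these symptoms". This format is an assumption for illustration; the abstract only says the real format uses lists, sets, and numbers. MPCS is computed by brute force over all profile pairs, which is exactly what becomes infeasible for large disorders.

```python
from itertools import combinations
import math

def profiles(symptoms, min_count, n_symptoms):
    """All binary profiles with at least `min_count` of `symptoms` present."""
    out = []
    for r in range(min_count, len(symptoms) + 1):
        for chosen in combinations(symptoms, r):
            out.append(tuple(1 if i in chosen else 0 for i in range(n_symptoms)))
    return out

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mpcs(profiles_a, profiles_b):
    """Maximum pairwise cosine similarity between two sets of profiles."""
    return max(cosine(a, b) for a in profiles_a for b in profiles_b)

# Two hypothetical disorders over 5 symptoms that share only symptom 2.
disorder_a = profiles([0, 1, 2], min_count=2, n_symptoms=5)
disorder_b = profiles([2, 3, 4], min_count=2, n_symptoms=5)
similarity = mpcs(disorder_a, disorder_b)
```

Each toy disorder already expands to 4 profiles from a 3-symptom rule, and the pairwise comparison grows with the product of the two profile counts, which illustrates why a compact generator form is attractive.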

Title: Single Image to High-Quality 3D Object via Latent Features

Authors: Huanning Dong, Yinuo Huang, Fan Li, Ping Kuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19512
Pdf URL: https://arxiv.org/pdf/2511.19512
Copy Paste: [[2511.19512]] Single Image to High-Quality 3D Object via Latent Features(https://arxiv.org/abs/2511.19512)
Keywords: generation
Abstract: 3D assets are essential in the digital age. While automatic 3D generation, such as image-to-3D, has made significant strides in recent years, it often struggles to achieve fast, detailed, and high-fidelity generation simultaneously. In this work, we introduce LatentDreamer, a novel framework for generating 3D objects from single images. The key to our approach is a pre-trained variational autoencoder that maps 3D geometries to latent features, which greatly reduces the difficulty of 3D generation. Starting from latent features, the pipeline of LatentDreamer generates coarse geometries, refined geometries, and realistic textures sequentially. The 3D objects generated by LatentDreamer exhibit high fidelity to the input images, and the entire generation process can be completed within a short time (typically in 70 seconds). Extensive experiments show that with only a small amount of training, LatentDreamer demonstrates competitive performance compared to contemporary approaches.
摘要：3D 资产在数字时代至关重要。虽然自动 3D 生成（例如图像转 3D）近年来取得了显着进步，但它常常难以同时实现快速、详细和高保真度的生成。在这项工作中，我们介绍了 LatentDreamer，这是一种从单个图像生成 3D 对象的新颖框架。我们方法的关键是预训练的变分自动编码器，它将 3D 几何图形映射到潜在特征，这大大降低了 3D 生成的难度。 LatentDreamer 的管道从潜在特征开始，依次生成粗略几何图形、精细几何图形和逼真纹理。 LatentDreamer生成的3D物体对输入图像表现出高保真度，整个生成过程可以在短时间内（通常在70秒内）完成。大量实验表明，只需少量训练，LatentDreamer 就表现出与当代方法相比具有竞争力的性能。

Title: VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Authors: Boyu Chen, Zikang Wang, Zhengrong Yue, Kainan Yan, Chenyun Yu, Yi Huang, Zijun Liu, Yafei Wen, Xiaoxin Chen, Yang Liu, Peng Li, Yali Wang
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2511.19524
Pdf URL: https://arxiv.org/pdf/2511.19524
Copy Paste: [[2511.19524]] VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning(https://arxiv.org/abs/2511.19524)
Keywords: generation
Abstract: By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user's query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.
摘要：通过利用工具增强的多模态大语言模型 (MLLM)，多代理框架正在推动视频理解的进步。然而，它们大多数采用静态和不可学习的工具调用机制，这限制了对时间或空间复杂视频的鲁棒感知和推理所必需的各种线索的发现。为了应对这一挑战，我们提出了一种新颖的视频理解多智能体系统，即 VideoChat-M1。 VideoChat-M1 没有使用单一或固定的策略，而是采用了具有多个策略代理的独特的协作策略规划 (CPP) 范例，其中包括三个关键流程。 (1) 策略生成：每个代理根据用户的查询生成其独特的工具调用策略； (2) 策略执行：每个智能体依次调用相关工具来执行其策略并探索视频内容； (3)策略通信：在策略执行的中间阶段，代理之间相互交互以更新各自的策略。通过这个协作框架，所有代理协同工作，根据同行的上下文洞察动态地完善其首选策略，以有效地响应用户的查询。此外，我们为我们的 CPP 范式配备了简洁的多智能体强化学习（MARL）方法。因此，策略代理团队可以在最终答案奖励和中间协作过程反馈的指导下联合优化，以增强 VideoChat-M1 的性能。大量实验表明，VideoChat-M1 在涵盖四项任务的八个基准测试中实现了 SOTA 性能。值得注意的是，在 LongVideoBench 上，我们的方法比 SOTA 模型 Gemini 2.5 pro 提高了 3.6%，比 GPT-4o 提高了 15.6%。

Title: Vidi2: Large Multimodal Models for Video Understanding and Creation

Authors: Vidi Team, Celong Liu, Chia-Wen Kuo, Chuang Huang, Dawei Du, Fan Chen, Guang Chen, Haoji Zhang, Haojun Zhao, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qihang Fan, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Weiyan Tao, Wen Zhong, Xiaohui Shen, Xin Gu, Zhenfang Chen, Zuhua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19529
Pdf URL: https://arxiv.org/pdf/2511.19529
Copy Paste: [[2511.19529]] Vidi2: Large Multimodal Models for Video Understanding and Creation(https://arxiv.org/abs/2511.19529)
Keywords: generation
Abstract: Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks.
摘要：视频已成为互联网上交流和创造力的主要媒介，推动了对可扩展、高质量视频制作的强烈需求。 Vidi 模型继续向下一代视频创建发展，并在多模态时间检索 (TR) 方面实现了最先进的性能。在第二个版本中，Vidi2 通过细粒度时空基础 (STG) 推进视频理解，并将其功能扩展到视频问答 (Video QA)，从而实现全面的多模态推理。给定一个文本查询，Vidi2不仅可以识别相应的时间戳，还可以识别输出时间范围内目标对象的边界框。这种端到端的时空基础能力使得复杂编辑场景中的潜在应用成为可能，例如情节或角色理解、自动多视图切换以及智能、构图感知的重构和裁剪。为了在实际环境中对 STG 进行全面评估，我们引入了一个新的基准 VUE-STG，它比现有 STG 数据集提供了四个关键改进： 1）视频时长：从大约 10 秒到 30 分钟，支持长上下文推理； 2）查询格式：查询大部分转换为名词短语，同时保留句子级表达力； 3）标注质量：所有真实时间范围和边界框都是手动标注的，精度很高； 4）评估指标：改进的vIoU/tIoU/vIoU-Intersection方案。此外，我们将之前的VUE-TR基准测试升级为VUE-TR-V2，实现了更均衡的视频长度分布和更多用户风格的查询。值得注意的是，Vidi2 模型在 VUE-TR-V2 和 VUE-STG 上的性能均远远优于领先的专有系统，例如 Gemini 3 Pro（预览版）和 GPT-5，同时在视频 QA 基准测试中与具有类似规模的流行开源模型取得了具有竞争力的结果。
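
The STG metrics build on intersection-over-union in time and in space. The sketch below shows only the plain, textbook versions for a single time range and a single box; the refined vIoU/tIoU/vIoU-Intersection scheme mentioned in the abstract is not defined there and is not reproduced here.

```python
def temporal_iou(pred, gt):
    """IoU of two time ranges given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Illustrative values: a 10 s prediction overlapping a 10 s ground truth by 5 s.
t_iou = temporal_iou((10.0, 20.0), (15.0, 25.0))
b_iou = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

A spatio-temporal score would typically combine per-frame box IoU over the frames where prediction and ground truth overlap in time, but the exact aggregation used by VUE-STG should be taken from the paper.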

Title: Learning to Solve Weighted Maximum Satisfiability with a Co-Training Architecture

Authors: Kaidi Wan, Minghao Liu, Yong Lai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19544
Pdf URL: https://arxiv.org/pdf/2511.19544
Copy Paste: [[2511.19544]] Learning to Solve Weighted Maximum Satisfiability with a Co-Training Architecture(https://arxiv.org/abs/2511.19544)
Keywords: generation
Abstract: We propose SplitGNN, a graph neural network (GNN)-based approach that learns to solve the weighted maximum satisfiability (MaxSAT) problem. SplitGNN incorporates a co-training architecture consisting of a supervised message passing mechanism and an unsupervised solution boosting layer. A new graph representation called edge-splitting factor graph is proposed to provide more structural information for learning, which is based on spanning tree generation and edge classification. To improve the solutions on challenging and weighted instances, we implement a GPU-accelerated layer applying efficient score calculation and relaxation-based optimization. Experiments show that SplitGNN achieves 3× faster convergence and better predictions compared with other GNN-based architectures. More notably, SplitGNN successfully finds solutions that outperform modern heuristic MaxSAT solvers on much larger and harder weighted MaxSAT benchmarks, and demonstrates exceptional generalization abilities on diverse structural instances.
摘要：我们提出 SplitGNN，这是一种基于图神经网络 (GNN) 的方法，可以学习解决加权最大可满足性 (MaxSAT) 问题。 SplitGNN 采用了一个协同训练架构，该架构由监督消息传递机制和无监督解决方案增强层组成。提出了一种称为边缘分裂因子图的新图表示，它基于生成树生成和边缘分类，为学习提供更多结构信息。为了改进具有挑战性和加权实例的解决方案，我们实现了一个 GPU 加速层，应用高效的分数计算和基于松弛的优化。实验表明，与其他基于 GNN 的架构相比，SplitGNN 实现了 3 倍的更快收敛和更好的预测。更值得注意的是，SplitGNN 成功地找到了在更大、更难加权的 MaxSAT 基准上优于现代启发式 MaxSAT 求解器的解决方案，并在不同的结构实例上展示了卓越的泛化能力。
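
For readers unfamiliar with the problem itself: in weighted MaxSAT each clause carries a weight, and the goal is an assignment that maximizes the total weight of satisfied clauses. The brute-force sketch below defines the objective on a tiny made-up instance; it is unrelated to SplitGNN's learned approach and is only practical for a handful of variables.

```python
from itertools import product

def satisfied_weight(clauses, assignment):
    """Total weight of clauses satisfied by `assignment`.

    Each clause is (weight, literals); literal k means variable k is true,
    literal -k means variable k is false (DIMACS-style signed integers).
    """
    total = 0
    for weight, literals in clauses:
        if any(assignment[abs(lit)] == (lit > 0) for lit in literals):
            total += weight
    return total

def brute_force_maxsat(clauses, n_vars):
    """Try all 2**n_vars assignments and return the best (weight, assignment)."""
    best_weight, best_assignment = -1, None
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        weight = satisfied_weight(clauses, assignment)
        if weight > best_weight:
            best_weight, best_assignment = weight, assignment
    return best_weight, best_assignment

# Made-up instance: (x1 or x2):3, (not x1):2, (not x2):4, (x1 or not x2):1
clauses = [(3, [1, 2]), (2, [-1]), (4, [-2]), (1, [1, -2])]
best_weight, best_assignment = brute_force_maxsat(clauses, n_vars=2)
```

No assignment satisfies all four clauses here, so the best one (x1 true, x2 false) gives up the weight-2 clause and keeps the rest. The search space doubles with every added variable, which is why heuristic and learned solvers are used on large instances.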

Title: Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment

Authors: Ehsan Karimi, Nhut Le, Maryam Rahnemoonfar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19557
Pdf URL: https://arxiv.org/pdf/2511.19557
Copy Paste: [[2511.19557]] Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment(https://arxiv.org/abs/2511.19557)
Keywords: generative
Abstract: Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial Vehicles (UAVs), providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. To address these limitations, we propose ThiFAN-VQA, a two-stage reasoning-based framework for visual question answering (VQA) in disaster scenarios. ThiFAN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, effectively improving model performance. By integrating a custom information retrieval system, domain-specific prompting, and reasoning-guided answer selection, ThiFAN-VQA bridges the gap between zero-shot and supervised methods, combining flexibility with consistency. Experiments on FloodNet and RescueNet-VQA, UAV-based datasets from flood- and hurricane-affected regions, demonstrate that ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks.
摘要：及时、准确地评估自然灾害后的损失对于有效的应急响应和恢复至关重要。最近开发的基于人工智能的框架可以分析无人机收集的大量航空图像，快速提供可操作的见解。然而，创建和注释用于训练这些模型的数据既昂贵又耗时，导致数据集的大小和多样性受到限制。此外，大多数现有方法依赖于具有固定答案空间的传统的基于分类的框架，限制了它们在没有额外数据收集或模型再训练的情况下提供新信息的能力。使用基于情境学习 (ICL) 的预训练生成模型可以实现灵活且开放的答案空间。然而，这些模型经常产生幻觉的输出或产生缺乏特定领域相关性的通用响应。为了解决这些限制，我们提出了 ThiFAN-VQA，这是一种基于两阶段推理的框架，用于灾难场景中的视觉问答（VQA）。 ThiFAN-VQA 首先使用思维链 (CoT) 提示和 ICL 生成结构化推理轨迹，以在有限的监督下实现可解释的推理。随后的答案选择模块评估生成的响应并分配最连贯且上下文准确的答案，有效提高模型性能。通过集成自定义信息检索系统、特定领域的提示和推理引导的答案选择，ThiFAN-VQA 弥合了零样本方法和监督方法之间的差距，将灵活性与一致性结合起来。在 FloodNet 和 RescueNet-VQA（来自洪水和飓风影响地区的基于无人机的数据集）上进行的实验表明，ThiFAN-VQA 在现实世界的灾后损失评估任务中实现了卓越的准确性、可解释性和适应性。

Title: Leveraging Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach

Authors: Maria Thoma, Michalis A. Savelonas, Dimitris K. Iakovidis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19576
Pdf URL: https://arxiv.org/pdf/2511.19576
Copy Paste: [[2511.19576]] Leveraging Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach(https://arxiv.org/abs/2511.19576)
Keywords: generative
Abstract: Ischemic stroke is a time-critical medical emergency where rapid diagnosis is essential for improving patient outcomes. Non-contrast computed tomography (NCCT) serves as the frontline imaging tool, yet it often fails to reveal the subtle ischemic changes present in the early, hyperacute phase. This limitation can delay crucial interventions. To address this diagnostic challenge, we introduce a semi-supervised segmentation method using generative adversarial networks (GANs) to accurately delineate early ischemic stroke regions. The proposed method employs an adversarial framework to effectively learn from a limited number of annotated NCCT scans, while simultaneously leveraging a larger pool of unlabeled scans. By employing Dice loss, cross-entropy loss, a feature matching loss and a self-training loss, the model learns to identify and delineate early infarcts, even when they are faint or their size is small. Experiments on the publicly available Acute Ischemic Stroke Dataset (AISD) demonstrate the potential of the proposed method to enhance diagnostic capabilities, reduce the burden of manual annotation, and support more efficient clinical decision-making in stroke care.
摘要：缺血性中风是一种时间紧迫的医疗紧急情况，快速诊断对于改善患者预后至关重要。非增强计算机断层扫描 (NCCT) 是一线成像工具，但它常常无法揭示早期超急性期存在的细微缺血变化。这种限制可能会延迟关键的干预措施。为了解决这一诊断挑战，我们引入了一种使用生成对抗网络（GAN）的半监督分割方法来准确描绘早期缺血性中风区域。所提出的方法采用对抗性框架来有效地从有限数量的带注释的 NCCT 扫描中学习，同时利用更大的未标记扫描池。通过采用 Dice 损失、交叉熵损失、特征匹配损失和自训练损失，该模型学会识别和描绘早期梗塞，即使它们很微弱或尺寸很小。在公开的急性缺血性中风数据集（AISD）上进行的实验证明了所提出的方法在增强诊断能力、减轻手动注释负担并支持中风护理中更有效的临床决策方面的潜力。
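
The abstract above lists Dice loss and cross-entropy loss among the training terms. As a rough illustration only, the sketch below shows how those two supervised terms are commonly computed for a binary segmentation mask. The function names, the weights `w_dice` and `w_ce`, and the flattened-list input format are assumptions made for this example. The feature matching and self-training losses are omitted because the abstract does not specify them.

```python
import math

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for flattened per-pixel probabilities and binary labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over pixels, with probabilities clipped for stability."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

def supervised_loss(pred, target, w_dice=1.0, w_ce=1.0):
    """Weighted sum of the two supervised terms (weights are hypothetical)."""
    return w_dice * dice_loss(pred, target) + w_ce * bce_loss(pred, target)
```

Dice loss is often paired with cross-entropy for small targets because it depends on the overlap ratio rather than on the per-pixel average, so a small lesion is not swamped by the background.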

Title: Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis

Authors: Dimitrios E. Diamantis, Dimitris K. Iakovidis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19578
Pdf URL: https://arxiv.org/pdf/2511.19578
Copy Paste: [[2511.19578]] Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis(https://arxiv.org/abs/2511.19578)
Keywords: generation, generative
Abstract: Gastrointestinal (GI) imaging via Wireless Capsule Endoscopy (WCE) generates a large number of images requiring manual screening. Deep learning-based Clinical Decision Support (CDS) systems can assist screening, yet their performance relies on the existence of large, diverse medical training datasets. However, the scarcity of such data, due to privacy constraints and annotation costs, hinders CDS development. Generative machine learning offers a viable solution to combat this limitation. While current Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks and Variational Autoencoders, have been explored, they often face challenges with training stability and capturing sufficient visual diversity, especially when synthesizing abnormal findings. This work introduces a novel VAE-based methodology for medical image synthesis and presents its application for the generation of WCE images. The novel contributions of this work include a) a multiscale extension of the Vector Quantized VAE model, named Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE); b) unlike other VAE-based SDG models for WCE image generation, MSVQ-VAE is used to seamlessly introduce abnormalities into normal WCE images; c) it enables conditional generation of synthetic images, enabling the introduction of different types of abnormalities into the normal WCE images; d) it performs experiments with a variety of abnormality types, including polyps, vascular and inflammatory conditions. The utility of the generated images for CDS is assessed via image classification. Comparative experiments demonstrate that training a CDS classifier using the abnormal images generated by the proposed methodology yields results comparable to those of a classifier trained with only real data. The generality of the proposed methodology promises its applicability to various domains related to medical multimedia.
摘要：通过无线胶囊内窥镜 (WCE) 进行的胃肠 (GI) 成像会生成大量需要手动筛选的图像。基于深度学习的临床决策支持 (CDS) 系统可以协助筛查，但其性能依赖于大型、多样化的训练医疗数据集的存在。然而，由于隐私限制和注释成本，此类数据的稀缺阻碍了 CDS 的发展。生成机器学习提供了一种可行的解决方案来克服这一限制。尽管已经探索了当前的合成数据生成（SDG）方法，例如生成对抗网络和变分自动编码器，但它们经常面临训练稳定性和捕获足够的视觉多样性的挑战，特别是在综合异常结果时。这项工作介绍了一种基于 VAE 的新型医学图像合成方法，并介绍了其在生成 WCE 图像方面的应用。这项工作的新颖贡献包括：a）矢量量化 VAE 模型的多尺度扩展，称为多尺度矢量量化变分自动编码器（MSVQ-VAE）； b) 与用于生成 WCE 图像的其他基于 VAE 的 SDG 模型不同，MSVQ-VAE 用于将异常无缝地引入到正常的 WCE 图像中； c) 它能够有条件地生成合成图像，从而能够将不同类型的异常引入到正常的 WCE 图像中； d) 它对多种异常类型进行实验，包括息肉、血管和炎症状况。通过图像分类评估生成的 CDS 图像的实用性。比较实验表明，使用所提出的方法生成的异常图像来训练 CDS 分类器会产生与仅使用真实数据训练的分类器相当的结果。所提出的方法的通用性保证了其适用于与医疗多媒体相关的各个领域。
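
For readers unfamiliar with the Vector Quantized VAE that the abstract above extends, the following minimal sketch shows the generic quantization step: each latent vector is replaced by its nearest codebook entry. It illustrates the standard single-scale lookup only, not the multiscale MSVQ-VAE design. The codebook, the latents, and the function name `quantize` are made up for the example.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector` in squared L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

# Toy 2D codebook and encoder outputs (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.7]]

# Each latent is mapped to a discrete code index.
indices = [quantize(z, codebook)[0] for z in latents]
```

In a full model the decoder would receive the selected codebook vectors rather than the raw latents, and a multiscale variant would presumably repeat such a lookup at several resolutions.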

Title: Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds

Authors: Jiaxin Shi, Michalis K. Titsias
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19664
Pdf URL: https://arxiv.org/pdf/2511.19664
Copy Paste: [[2511.19664]] Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds(https://arxiv.org/abs/2511.19664)
Keywords: generative
Abstract: We derive a new theoretical interpretation of the reweighted losses that are widely used for training diffusion models. Our method is based on constructing a cascade of time-dependent variational lower bounds on the data log-likelihood, that provably improves upon the standard evidence lower bound and results in reduced data-model KL-divergences. Combining such bounds gives rise to reweighted objectives that can be applied to any generative diffusion model including both continuous Gaussian diffusion and masked (discrete) diffusion models. Then, we showcase this framework in masked diffusion and report significant improvements over previous training losses in pixel-space image modeling, approaching sample quality comparable to continuous diffusion models. Our results also provide a theoretical justification for the simple weighting scheme widely used in masked image models.
摘要：我们对广泛用于训练扩散模型的重新加权损失得出了新的理论解释。我们的方法基于在数据对数似然上构建一系列与时间相关的变分下界，这可证明改进了标准证据下界，并导致数据模型 KL 散度减少。结合这些边界产生了重新加权的目标，可以应用于任何生成扩散模型，包括连续高斯扩散和掩模（离散）扩散模型。然后，我们在掩蔽扩散中展示了该框架，并报告了相对于之前像素空间图像建模中的训练损失的显着改进，接近与连续扩散模型相当的样本质量。我们的结果还为掩模图像模型中广泛使用的简单加权方案提供了理论依据。

Title: Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

Authors: Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19741
Pdf URL: https://arxiv.org/pdf/2511.19741
Copy Paste: [[2511.19741]] Efficient Transferable Optimal Transport via Min-Sliced Transport Plans(https://arxiv.org/abs/2511.19741)
Keywords: generation, generative
Abstract: Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
摘要：最优传输 (OT) 提供了一个强大的框架，用于查找分布之间的对应关系并解决计算机视觉各个领域的匹配和对齐问题，包括形状分析、图像生成和多模态任务。然而，OT 的计算成本阻碍了其可扩展性。基于切片的运输计划最近显示出通过利用一维 OT 问题的封闭式解决方案来降低计算成本的前景。这些方法优化一维投影（切片）以获得条件运输计划，最大限度地减少周围空间的运输成本。虽然有效，但这些方法留下了一个悬而未决的问题：学习到的最优切片器是否可以在分布转移下转移到新的分布对。在数据不断变化或跨密切相关的分布进行重复 OT 计算的环境中，理解这种可转移性至关重要。在本文中，我们研究了最小切片传输计划（min-STP）框架，并研究了优化切片器的可转移性：在一个分布对上训练的切片器能否为新的、未见过的分布对生成有效的传输计划？理论上，我们表明优化的切片器在数据分布的轻微扰动下仍保持接近，从而实现跨相关任务的高效传输。为了进一步提高可扩展性，我们引入了 min-STP 的小批量公式，并为其准确性提供统计保证。根据经验，我们证明可转移的 min-STP 实现了强大的一次性匹配性能，并有助于点云对齐和基于流的生成建模的摊销训练。
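
The abstract above relies on two facts: 1D optimal transport has a closed-form solution, and a plan found on a slice can be scored by its cost in the ambient space. The sketch below illustrates both for equal-size 2D point sets with uniform weights, where the 1D solution amounts to matching sorted projections. The grid search over angles in `min_slice` is a simple stand-in chosen for this example. It is not the slicer optimization used in the paper, and all function names are hypothetical.

```python
import math

def project(points, theta):
    """Project 2D points onto the unit direction at angle `theta`."""
    c, s = math.cos(theta), math.sin(theta)
    return [x * c + y * s for x, y in points]

def sliced_plan(src, tgt, theta):
    """Pair points by the rank of their 1D projections (closed-form 1D OT for equal-size sets)."""
    ps, pt = project(src, theta), project(tgt, theta)
    order_s = sorted(range(len(src)), key=ps.__getitem__)
    order_t = sorted(range(len(tgt)), key=pt.__getitem__)
    return list(zip(order_s, order_t))

def plan_cost(src, tgt, plan):
    """Mean squared Euclidean cost of the plan, measured in the ambient 2D space."""
    total = sum((src[i][0] - tgt[j][0]) ** 2 + (src[i][1] - tgt[j][1]) ** 2
                for i, j in plan)
    return total / len(plan)

def min_slice(src, tgt, n_angles=180):
    """Grid search over slicing directions; returns (best_theta, best_cost)."""
    candidates = [math.pi * k / n_angles for k in range(n_angles)]
    scored = [(t, plan_cost(src, tgt, sliced_plan(src, tgt, t))) for t in candidates]
    return min(scored, key=lambda tc: tc[1])

# Toy example: the horizontal slice cannot tell the points apart, other slices can.
src = [(0.0, 0.0), (0.0, 1.0)]
tgt = [(1.0, 1.0), (1.0, 0.0)]
best_theta, best_cost = min_slice(src, tgt)
```

The transferability question studied in the paper would correspond to reusing `best_theta` on a perturbed pair of point sets instead of searching again.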

Title: One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

Authors: Haoyu Wu, Jingyi Xu, Qiaomu Miao, Dimitris Samaras, Hieu Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19778
Pdf URL: https://arxiv.org/pdf/2511.19778
Copy Paste: [[2511.19778]] One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer(https://arxiv.org/abs/2511.19778)
Keywords: generation
Abstract: We identify a core failure mode that occurs when using the usual linear interpolation on rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers. When tokens from different spatial grids are mixed, the attention mechanism collapses. The issue is structural. Linear coordinate remapping forces a single attention head to compare RoPE phases sampled at incompatible rates, creating phase aliasing that destabilizes the score landscape. Pretrained DiTs are especially brittle: many heads exhibit extremely sharp, periodic phase selectivity, so even tiny cross-rate inconsistencies reliably cause blur, artifacts, or full collapse. To this end, our main contribution is Cross-Resolution Phase-Aligned Attention (CRPA), a training-free drop-in fix that eliminates this failure at its source. CRPA modifies only the RoPE index map for each attention call: all Q/K positions are expressed on the query's stride so that equal physical distances always induce identical phase increments. This restores the precise phase patterns that DiTs rely on. CRPA is fully compatible with pretrained DiTs and stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation.
摘要：我们确定了在使用旋转位置嵌入 (RoPE) 上的常见线性插值通过扩散变压器进行混合分辨率去噪时发生的核心故障模式。当来自不同空间网格的令牌混合时，注意力机制就会崩溃。这个问题是结构性的。线性坐标重新映射迫使单个注意力头比较以不兼容的速率采样的 RoPE 相位，从而产生相位混叠，从而破坏分数景观的稳定性。预训练的 DiT 特别脆弱——许多磁头表现出极其尖锐的周期性相位选择性——因此，即使是微小的交叉速率不一致也会可靠地导致模糊、伪影或完全崩溃。为此，我们的主要贡献是跨分辨率相位对齐注意力（CRPA），这是一种无需训练的直接修复，可以从根源上消除这种失败。 CRPA 仅修改每个注意力调用的 RoPE 索引图：所有 Q/K 位置都在查询的步幅上表示，以便相等的物理距离始终会导致相同的相位增量。这可以恢复 DiT 所依赖的精确相位模式。 CRPA 与预训练的 DiT 完全兼容，统一稳定所有头和层。我们证明 CRPA 能够实现高保真和高效的混合分辨率生成，优于以前最先进的图像和视频生成方法。

Title: Terminal Velocity Matching

Authors: Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19797
Pdf URL: https://arxiv.org/pdf/2511.19797
Copy Paste: [[2511.19797]] Terminal Velocity Matching(https://arxiv.org/abs/2511.19797)
Keywords: generative
Abstract: We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.
摘要：我们提出了终端速度匹配（TVM），这是流匹配的一种推广，可以实现高保真一步和少步生成建模。 TVM 对任意两个扩散时间步之间的过渡进行建模，并在其终止时间而不是初始时间规范其行为。我们证明，当模型是 Lipschitz 连续时，TVM 提供了数据和模型分布之间的 $2$-Wasserstein 距离的上限。然而，由于扩散变压器缺乏这种特性，我们引入了最小的架构更改来实现稳定的单阶段训练。为了使 TVM 在实践中高效，我们开发了一个融合注意力内核，支持雅可比向量积的向后传递，该向量可以很好地与变压器架构一起扩展。在 ImageNet-256x256 上，TVM 通过单个功能评估 (NFE) 实现了 3.29 FID，通过 4 个 NFE 实现了 1.99 FID。它在 ImageNet-512x512 上同样实现了 4.32 1-NFE FID 和 2.94 4-NFE FID，代表了从头开始的单步/少步模型的最先进性能。

Title: Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization

Authors: Debin Meng, Chen Jin, Zheng Gao, Yanran Li, Ioannis Patras, Georgios Tzimiropoulos
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19811
Pdf URL: https://arxiv.org/pdf/2511.19811
Copy Paste: [[2511.19811]] Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization(https://arxiv.org/abs/2511.19811)
Keywords: generation, generative
Abstract: Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.
摘要：图像多样性仍然是文本到图像扩散模型的基本挑战。低多样性模型往往会产生重复的输出，增加采样冗余并阻碍创造性探索和下游应用。主要原因是生成通常会崩溃到学习分布中的强模式。现有的改善多样性的尝试，例如噪声重采样、即时重写或基于转向的指导，通常仍然会崩溃到主导模式或引入失真，从而降低图像质量。有鉴于此，我们提出了令牌提示嵌入空间优化（TPSO），这是一种免训练且与模型无关的模块。 TPSO 引入可学习参数来探索令牌嵌入空间中代表性不足的区域，从而减少模型从学习分布的强模式中重复生成样本的趋势。同时，提示级空间提供了全局语义约束，可以调节分布变化，防止质量下降，同时保持高保真度。对 MS-COCO 和三个扩散主干网的大量实验表明，TPSO 显着增强了生成多样性，将基线性能从 1.10 点提高到 4.18 点，而无需牺牲图像质量。代码将在接受后发布。

Title: Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models

Authors: Wentao Hu, Mingkuan Zhao, Shuangyong Song, Xiaoyan Zhu, Xin Lai, Jiayin Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19822
Pdf URL: https://arxiv.org/pdf/2511.19822
Copy Paste: [[2511.19822]] Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models(https://arxiv.org/abs/2511.19822)
Keywords: generation
Abstract: Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: a catastrophic performance degradation when the pruned model is applied to other domains, necessitating a costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured "cluster-then-select" process. This process leverages a similarity metric that captures expert performance across different task domains to functionally cluster the experts, and subsequently selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, our proposed Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model's capabilities, enabling it to handle diverse downstream tasks. Experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24% gain on general tasks and 8.92% on specialized tasks like math reasoning and code generation.
摘要：稀疏专家混合 (SMoE) 架构开辟了扩展大型语言模型 (LLM) 的新领域，通过在推理过程中仅激活总参数的一小部分来提供卓越的性能。然而，它们的实际部署受到大量静态内存开销的严重阻碍，因为所有专家都必须加载到内存中。现有的训练后剪枝方法在减小模型大小的同时，通常从单个通用语料库中得出剪枝标准。这导致了一个关键的限制：当修剪后的模型应用于其他域时，性能会发生灾难性的下降，需要对每个新域进行昂贵的重新修剪。为了解决这种泛化差距，我们引入了马赛克修剪（MoP）。 MoP 的核心思想是通过结构化的“聚类然后选择”过程构建功能全面的专家集。该过程利用相似性度量来捕获不同任务领域的专家表现，对专家进行功能聚类，然后根据我们提出的激活变异性得分从每个集群中选择最具代表性的专家。与针对单个语料库进行优化的方法不同，我们提出的马赛克剪枝可确保剪枝模型保留功能互补的专家集，就像马赛克图块共同构成了原始模型功能的完整图片，使其能够处理各种 MoE 模型上的不同下游 http URL 实验，证明了我们的方法的优越性，显着优于之前的工作，在一般任务上实现了 7.24% 的增益，在数学推理和代码生成等特殊任务上实现了 8.92% 的增益。

Title: ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

Authors: Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, Jong Chul Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19827
Pdf URL: https://arxiv.org/pdf/2511.19827
Copy Paste: [[2511.19827]] ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding(https://arxiv.org/abs/2511.19827)
Keywords: generation
Abstract: We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.
摘要：我们提出了 ReDirector，一种新颖的摄像机控制视频重拍生成方法，用于动态捕获可变长度视频。特别是，我们通过对齐输入视频和目标重拍的时空位置来纠正以前作品中常见的 RoPE 误用。此外，我们还引入了旋转相机编码 (RoCE)，这是一种相机调节的 RoPE 相移，可捕获并集成输入和目标视频内部和之间的多视图关系。通过将摄像机条件集成到 RoPE 中，我们的方法可推广到分布外的摄像机轨迹和视频长度，从而改进动态对象定位和静态背景保留。大量的实验进一步证明了相机可控性、几何一致性和不同轨迹和长度的视频质量的显着改进。

Title: Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation

Authors: Haoqing Li, Jun Shi, Xianmeng Chen, Qiwei Jia, Rui Wang, Wei Wei, Hong An, Xiaowen Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19834
Pdf URL: https://arxiv.org/pdf/2511.19834
Copy Paste: [[2511.19834]] Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation(https://arxiv.org/abs/2511.19834)
Keywords: generation
Abstract: Deep learning methods face dual challenges of limited clinical samples and low inter-class differentiation among Diffuse Cystic Lung Diseases (DCLDs) in advancing Birt-Hogg-Dube syndrome (BHD) diagnosis via Computed Tomography (CT) imaging. While Multimodal Large Language Models (MLLMs) demonstrate diagnostic potential for such rare diseases, the absence of domain-specific knowledge and referable radiological features intensifies hallucination risks. To address this problem, we propose BHD-RAG, a multimodal retrieval-augmented generation framework that integrates DCLD-specific expertise and clinical precedents with MLLMs to improve BHD diagnostic accuracy. BHD-RAG employs: (1) a specialized agent generating imaging manifestation descriptions of CT images to construct a multimodal corpus of DCLD cases, (2) a cosine similarity-based retriever pinpointing relevant image-description pairs for query images, and (3) an MLLM synthesizing retrieved evidence with imaging data for diagnosis. BHD-RAG is validated on a dataset involving four types of DCLDs, achieving superior accuracy and generating evidence-based descriptions closely aligned with expert insights.
摘要：深度学习方法在通过计算机断层扫描 (CT) 成像推进 Birt-Hogg-Dube 综合征 (BHD) 诊断方面面临着临床样本有限和弥漫性囊性肺疾病 (DCLD) 类间分化程度低的双重挑战。虽然多模态大语言模型 (MLLM) 证明了此类罕见疾病的诊断潜力，但缺乏特定领域的知识和可参考的放射学特征会加剧幻觉风险。为了解决这个问题，我们提出了 BHD-RAG，这是一种多模态检索增强生成框架，它将 DCLD 特定的专业知识和临床先例与 MLLM 相结合，以提高 BHD 诊断的准确性。 BHDRAG 采用：（1）专门的代理生成 CT 图像的成像表现描述，以构建 DCLD 病例的多模态语料库。 (2) 基于余弦相似度的检索器，精确定位查询图像的相关图像描述对，以及 (3) MLLM 将检索到的证据与图像数据合成以进行诊断。 BHD-RAG 在涉及四种 DCLD 的数据集上进行了验证，实现了卓越的准确性并生成与专家见解紧密结合的基于证据的描述。
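
Step (2) of the pipeline above is a cosine similarity-based retriever over image-description pairs. The sketch below shows that retrieval step in its simplest form. The embeddings, the placeholder descriptions, and the function names are invented for illustration. How BHD-RAG actually embeds CT images is not stated in the abstract.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, corpus, k=2):
    """Return the k (embedding, description) pairs most similar to the query embedding."""
    ranked = sorted(corpus,
                    key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    return ranked[:k]

# Toy corpus of (image embedding, description) pairs with placeholder text.
corpus = [
    ([1.0, 0.1], "description of case A"),
    ([0.0, 1.0], "description of case B"),
    ([0.9, 0.5], "description of case C"),
]
top = retrieve([1.0, 0.0], corpus, k=2)
```

The retrieved descriptions would then be passed to the MLLM together with the query image, as in step (3).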

Title: Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation

Authors: Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, Qingyi Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19835
Pdf URL: https://arxiv.org/pdf/2511.19835
Copy Paste: [[2511.19835]] Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation(https://arxiv.org/abs/2511.19835)
Keywords: generation
Abstract: Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving speedups of up to 3.33x and 2.08x on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open source at this https URL.
摘要：扩散变压器在视频生成中占主导地位，但注意力计算的二次复杂度会带来大量延迟。注意力稀疏性通过关注关键标记而忽略非关键标记来降低计算成本。然而，现有方法的性能严重下降。在本文中，我们重新审视注意力稀疏性，并揭示现有方法会导致注意力分配中的系统偏差：（1）过度关注关键标记会放大其注意力权重；（2）完全忽略非关键标记会导致相关注意力权重的损失。为了解决这些问题，我们提出了 Rectified SpaAttn，它通过隐式全注意力参考来纠正注意力分配，从而增强稀疏注意力图和全注意力图之间的对齐。具体来说：（1）对于关键标记，我们表明它们的偏差与稀疏注意力权重成正比，该比率由放大的权重控制。因此，我们提出了孤立池注意力重新分配，它通过重新分配多模态池权重来计算准确的校正因子。 (2) 对于非关键令牌，从池化查询密钥中恢复注意力权重会产生注意力增益，但也会引入池化错误。因此，我们提出增益感知池整流，确保整流后的增益始终超过引起的误差。此外，我们使用 Triton 定制并集成了 Rectified SpaAttn 内核，在 HunyuanVideo 和 Wan 2.1 上分别实现了高达 3.33 和 2.08 倍的加速，同时保持了高生成质量。我们在此 https URL 开源了 Rectified SpaAttn。
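
The two biases named in the abstract above can be reproduced with a toy example. The sketch assumes the simplest form of sparsity, keeping the top-k scores and renormalizing the softmax over them. That choice is made for this illustration and may differ from the sparsity patterns the paper analyzes. The scores are arbitrary.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def topk_sparse_attention(scores, k):
    """Keep the k largest scores, renormalize over them, and give the rest zero weight."""
    keep = sorted(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    kept_weights = softmax([scores[i] for i in keep])
    weights = [0.0] * len(scores)
    for i, w in zip(keep, kept_weights):
        weights[i] = w
    return weights

scores = [2.0, 1.0, 0.5, 0.0]      # one query's scores against four keys
full = softmax(scores)             # reference full attention
sparse = topk_sparse_attention(scores, k=2)

# Bias (1): kept tokens are amplified by a common factor 1 / (sum of their full weights).
amplification = [sparse[i] / full[i] for i in (0, 1)]
# Bias (2): dropped tokens lose all of their weight.
lost_mass = full[2] + full[3]
```

In this toy setting the error on each kept token equals its sparse weight times `lost_mass`, which is one way to see why the bias on critical tokens scales with the sparse attention weights.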

Title: 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

Authors: Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu, Zihao Yu, Xingrui Wang, Xinyi Chen, Xinge Peng, Xin Li, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19836
Pdf URL: https://arxiv.org/pdf/2511.19836
Copy Paste: [[2511.19836]] 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models(https://arxiv.org/abs/2511.19836)
Keywords: generation
Abstract: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation." Our project can be found at this https URL.
摘要：世界一代模型正在成为下一代多模式智能系统的基石。与传统的 2D 视觉生成不同，世界模型旨在从图像、视频或文本构建真实、动态且物理一致的 3D/4D 世界。这些模型不仅需要产生高保真视觉内容，还需要保持空间、时间、物理和指令控制的连贯性，从而能够在虚拟现实、自动驾驶、实体智能和内容创建中应用。然而，现有基准强调不同的评估维度，缺乏对世界现实能力的统一评估。为了系统地评估世界模型，我们引入了 4DWorldBench，它跨四个关键维度衡量模型：感知质量、条件 4D 对齐、物理真实性和 4D 一致性。该基准测试涵盖图像转 3D/4D、视频转 4D、文本转 3D/4D 等任务。除此之外，我们创新性地引入了跨多种模式的适应性调节，这不仅集成而且扩展了传统的评估范式。为了适应不同的模态条件输入，我们在评估过程中将所有模态条件映射到统一的文本空间中，并进一步集成LLM作为法官、MLLM作为法官和传统的基于网络的方法。这种统一的自适应设计可以对对齐、物理真实性和跨模式一致性进行更全面、一致的评估。初步的人类研究进一步表明，我们的自适应工具选择与人类的主观判断更加一致。我们希望这个基准能够成为客观比较和改进的基础，加速从“视觉一代”到“世界一代”的转变。我们的项目可以在这个 https URL 中找到。

Title: Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks

Authors: Xiangkai Ma, Han Zhang, Wenzhong Li, Sanglu Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19856
Pdf URL: https://arxiv.org/pdf/2511.19856
Copy Paste: [[2511.19856]] Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks(https://arxiv.org/abs/2511.19856)
Keywords: generation
Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential data as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert series into "pseudo-images" for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It adopts a "warmup-align" paradigm: first, a dual-autoencoder and shared quantizer are trained in a self-supervised manner on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the representation level. TimeArtist establishes a versatile cross-modal framework, enabling high-quality, diverse image generation directly from time series, while capturing temporal fluctuation patterns to render images in a style-transfer fashion. Extensive experiments show that TimeArtist achieves satisfactory performance in image generation metrics, while also attaining superior results in zero-shot temporal tasks. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.
摘要：大型多模态模型 (LMM) 在跨文本和图像模态对齐和生成内容方面取得了显着进展。然而，使用非视觉、连续序列作为高保真图像生成的条件信号的潜力仍然很大程度上未被开发。此外，将序列转换为“伪图像”以进行时间预测的现有方法无法建立语义级对齐。在本文中，我们提出了 TimeArtist，这是一种时间-视觉转换框架，开创了时间序列波动和视觉概念之间的语义级别对齐。它开创了“预热对齐”范例：首先，双自动编码器和共享量化器在大规模数据集上进行自我监督训练，以学习模态共享表示。然后，编码器和量化器被冻结，并引入投影以在表示级别上对齐时间和视觉样本。 TimeArtist 建立了一个多功能的跨模式框架，可以直接从时间序列生成高质量、多样化的图像，同时捕获时间波动模式以在风格转移时渲染图像。大量实验表明，TimeArtist 在图像生成指标方面取得了令人满意的性能，同时在零样本时间任务中也取得了优异的结果。我们的工作为跨模式生成建立了一个新的范式，弥合了时间动态和视觉语义之间的差距。

Title: GigaWorld-0: World Models as Data Engine to Empower Embodied AI

Authors: GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, Zheng Zhu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.19861
Pdf URL: https://arxiv.org/pdf/2511.19861
Copy Paste: [[2511.19861]] GigaWorld-0: World Models as Data Engine to Empower Embodied AI(https://arxiv.org/abs/2511.19861)
Keywords: generation, generative
Abstract: World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.
摘要：世界模型正在成为可扩展、数据高效的具体人工智能的基础范例。在这项工作中，我们提出了 GigaWorld-0，一个统一的世界模型框架，明确设计为视觉-语言-动作（VLA）学习的数据引擎。 GigaWorld-0 集成了两个协同组件： GigaWorld-0-Video，它利用大规模视频生成，在外观、相机视点和动作语义的细粒度控制下产生多样化、纹理丰富且时间连贯的体现序列； GigaWorld-0-3D，它结合了 3D 生成建模、3D 高斯喷射重建、物理可微分系统识别和可执行运动规划，以确保几何一致性和物理真实性。它们的联合优化能够实现视觉上引人注目、空间连贯、物理上合理且指令一致的具体交互数据的可扩展合成。通过我们高效的 GigaTrain 框架，大规模训练变得可行，该框架利用 FP8 精度和稀疏注意力来大幅减少内存和计算需求。我们进行的综合评估表明，GigaWorld-0在多个维度上生成了高质量、多样化、可控的数据。至关重要的是，在 GigaWorld-0 生成的数据上训练的 VLA 模型（例如 GigaBrain-0）实现了强大的现实世界性能，显着提高了物理机器人的泛化能力和任务成功率，而无需在训练期间进行任何现实世界交互。

Title: Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance

Authors: Haoxuan Wang, Jiachen Tao, Junyi Wu, Gaowen Liu, Ramana Rao Kompella, Yan Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19909
Pdf URL: https://arxiv.org/pdf/2511.19909
Copy Paste: [[2511.19909]] Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance(https://arxiv.org/abs/2511.19909)
Keywords: generation, generative
Abstract: We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.
摘要：我们提出了 Motion Marionette，这是一种零镜头框架，用于从单目源视频到单视图目标图像的刚性运动传输。以前的工作通常采用几何、生成或模拟先验来指导传输过程，但这些外部先验引入了辅助约束，导致在泛化性和时间一致性之间进行权衡。为了解决这些限制，我们建议通过内部先验来指导运动传输过程，该内部先验专门捕获时空变换并在源视频和任何传输的目标视频之间共享。具体来说，我们首先将源视频和目标图像提升到统一的 3D 表示空间中。然后从源视频中提取运动轨迹，以构建独立于对象几何和语义的时空 (SpaT) 先验，编码随时间的相对空间变化。该先验进一步与目标对象集成，以合成可控速度场，随后使用基于位置的动力学对其进行细化，以减轻伪影并增强视觉连贯性。由此产生的速度场可以灵活地用于高效的视频制作。实证结果表明，Motion Marionette 可以泛化不同的对象，生成与源运动良好匹配的时间一致的视频，并支持可控视频生成。

Title: Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

Authors: Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, Tat-Seng Chua
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.19912
Pdf URL: https://arxiv.org/pdf/2511.19912
Copy Paste: [[2511.19912]] Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving(https://arxiv.org/abs/2511.19912)
Keywords: generation
Abstract: Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. The model is trained with both supervised learning and reinforcement learning fine-tuning. Extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and excellent inference speed.
摘要：视觉-语言-动作（VLA）模型最近在自动驾驶方面表现出了强大的决策能力。然而，现有的 VLA 常常难以实现有效的推理并推广到新颖的自动驾驶车辆配置和驾驶场景。在本文中，我们提出 Reasoning-VLA，一种通用且快速的动作生成 VLA 框架。所提出的模型采用一组可学习的动作查询，通过从训练语料库内的地面真实轨迹进行高斯采样来初始化。这些可学习的查询与推理增强的视觉语言功能交互，以并行生成连续的动作轨迹。为了促进稳健的泛化，我们将八个公开的自动驾驶数据集整合为标准化、基于思想链推理且易于使用的数据格式，用于模型训练。利用监督学习和强化学习微调，跨多个基准的广泛实证评估表明，Reasoning-VLA 实现了迄今为止最先进的性能、卓越的泛化能力和出色的推理速度。
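
The query initialization described in this abstract can be illustrated with a minimal sketch. It assumes that "Gaussian sampling from ground-truth trajectories" means estimating a per-waypoint mean and standard deviation over the training trajectories and drawing each query from that distribution; the abstract does not give the exact procedure. The function name `init_action_queries` and the 1D waypoint representation are hypothetical.

```python
import random
import statistics

def init_action_queries(trajectories, num_queries, seed=0):
    """Draw initial action queries from a per-waypoint Gaussian.

    trajectories: list of equal-length lists of floats (hypothetical 1D waypoints).
    Assumption: each waypoint is an independent Gaussian whose mean and
    standard deviation are estimated from the training trajectories.
    """
    rng = random.Random(seed)
    horizon = len(trajectories[0])
    columns = [[t[i] for t in trajectories] for i in range(horizon)]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [rng.gauss(means[i], stds[i]) for i in range(horizon)]
        for _ in range(num_queries)
    ]

# Toy training set: three trajectories with three waypoints each.
train = [[0.0, 1.0, 2.0], [0.0, 1.2, 2.6], [0.0, 0.8, 1.4]]
queries = init_action_queries(train, num_queries=4)
```

In a real model the sampled values would only be the starting point of learnable parameters; the sketch covers the initialization step alone.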

Title: Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

Authors: Qin Ren, Yufei Wang, Lanqing Guo, Wen Zhang, Zhiwen Fan, Chenyu You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19917
Pdf URL: https://arxiv.org/pdf/2511.19917
Copy Paste: [[2511.19917]] Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models(https://arxiv.org/abs/2511.19917)
Keywords: generation
Abstract: Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.
摘要：扩散模型已成为文本到图像生成的主导范例，测试时间缩放 (TTS) 通过在推理期间分配更多计算进一步提高质量。然而，现有的 TTS 方法在全图像级别上运行，忽略了图像质量通常在空间上异质的事实。这导致对已经令人满意的区域进行不必要的计算以及对局部缺陷的校正不充分。在本文中，我们探索了一个新方向 - 本地化 TTS - 自适应地对缺陷区域进行重采样，同时保留高质量区域，从而大大减少搜索空间。这种范式提出了两个核心挑战：准确定位缺陷和保持全局一致性。我们提出 LoTTS，这是第一个完全免培训的本地化 TTS 框架。对于缺陷定位，LoTTS 在质量感知提示（例如，高质量与低质量）下对比交叉和自注意力信号，以识别缺陷区域，然后将其细化为相干掩模。为了保持一致性，LoTTS 仅干扰有缺陷的区域并对其进行局部去噪，确保校正保持有限，而图像的其余部分保持不受干扰。 SD2.1、SDXL 和 FLUX 上的大量实验表明，LoTTS 实现了最先进的性能：它持续提高本地质量和全局保真度，同时与 Best-of-N 采样相比，将 GPU 成本降低 2-4 倍。这些发现将局部 TTS 确立为在推理时扩展扩散模型的一个有前景的新方向。
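
The idea of resampling only defective regions can be sketched as follows. This is a simplified illustration, not the LoTTS implementation: the latent is a flat list of floats, the mask is assumed to be given, the local denoising step is omitted, and `score_fn` is a hypothetical quality scorer. The names `resample_defective` and `localized_search` are invented for this example.

```python
import random

def resample_defective(latent, mask, noise_scale=0.5, seed=0):
    """Perturb only masked (defective) positions; leave the rest untouched."""
    rng = random.Random(seed)
    return [
        x + noise_scale * rng.gauss(0.0, 1.0) if m else x
        for x, m in zip(latent, mask)
    ]

def localized_search(latent, mask, score_fn, num_candidates=4):
    """Try several local perturbations and keep the best-scoring candidate.

    The original latent is kept if no candidate scores higher.
    """
    best, best_score = latent, score_fn(latent)
    for k in range(num_candidates):
        candidate = resample_defective(latent, mask, seed=k)
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Toy example: position 2 is "defective" (far from the target value 1.0).
latent = [1.0, 1.0, 5.0, 1.0]
mask = [0, 0, 1, 0]
quality = lambda z: -sum((x - 1.0) ** 2 for x in z)  # hypothetical scorer
refined = localized_search(latent, mask, quality)
```

Because only masked positions change, the search space is limited to the defective region, which is the source of the cost saving the abstract reports relative to full-image Best-of-N sampling.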

Title: HybriDLA: Hybrid Generation for Document Layout Analysis

Authors: Yufan Chen, Omar Moured, Ruiping Liu, Junwei Zheng, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19919
Pdf URL: https://arxiv.org/pdf/2511.19919
Copy Paste: [[2511.19919]] HybriDLA: Hybrid Generation for Document Layout Analysis(https://arxiv.org/abs/2511.19919)
Keywords: generation, generative
Abstract: Document layout analysis (DLA) conventionally depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M$^6$Doc benchmarks demonstrate that HybriDLA sets a new state of the art, outperforming previous approaches. All data and models will be made publicly available at this https URL.
摘要：传统的文档布局分析 (DLA) 传统上依赖于经验先验或在单次前向传递中执行的一组固定的可学习查询。虽然对于具有少量预定区域的早期文档来说已经足够了，但这种范式对于表现出不同元素数量和日益复杂的布局的当代文档来说却很困难。为了解决现代文档带来的挑战，我们提出了 HybriDLA，这是一种新颖的生成框架，它将扩散和自回归解码统一在单层内。扩散组件迭代地完善边界框假设，而自回归组件则注入语义和上下文感知，即使在高度变化的布局中也能实现精确的区域预测。为了进一步提高检测质量，我们设计了一种多尺度特征融合编码器，可以捕获细粒度和高级视觉线索。该架构将性能提升至 83.5% 的平均精度 (mAP)。对 DocLayNet 和 M$^6$Doc 基准的大量实验表明，HybriDLA 具有最先进的性能，优于以前的方法。所有数据和模型都将在此 https URL 上公开提供。

Title: Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

Authors: Youngseo Kim, Dohyun Kim, Geohee Han, Paul Hongsuck Seo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19936
Pdf URL: https://arxiv.org/pdf/2511.19936
Copy Paste: [[2511.19936]] Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos(https://arxiv.org/abs/2511.19936)
Keywords: generation
Abstract: Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate how their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
摘要：图像扩散模型虽然最初是为图像生成而开发的，但它隐含地捕获了丰富的语义结构，使各种识别和定位任务能够超越合成。在这项工作中，我们研究了它们的自注意力图可以被重新解释为语义标签传播内核，从而提供相关图像区域之间强大的像素级对应关系。跨帧扩展这种机制会产生一个时间传播内核，该内核可以通过视频分段实现零镜头对象跟踪。我们进一步证明了测试时优化策略（DDIM 反转、文本反转和自适应头加权）在适应扩散特征以实现稳健且一致的标签传播方面的有效性。基于这些发现，我们引入了 DRIFT，这是一种视频中的对象跟踪框架，利用预训练的图像扩散模型和 SAM 引导的掩模细化，在标准视频对象分割基准上实现了最先进的零样本性能。
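
Treating attention as a label propagation kernel can be illustrated with a small sketch. It assumes the kernel is a softmax over dot-product affinities between query-frame and reference-frame features; in the paper the affinities come from a diffusion model's self-attention, whereas the feature vectors, the function name `propagate_labels`, and the temperature parameter here are hypothetical stand-ins.

```python
import math

def propagate_labels(query_feats, key_feats, key_labels, temperature=1.0):
    """Propagate per-pixel label distributions from key pixels to query pixels.

    Each query pixel receives a softmax-weighted average of the key labels,
    with weights given by dot-product affinity between features.
    """
    num_classes = len(key_labels[0])
    propagated = []
    for q in query_feats:
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in key_feats]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        propagated.append([
            sum(w * lab[c] for w, lab in zip(weights, key_labels)) / total
            for c in range(num_classes)
        ])
    return propagated

# Two reference pixels with one-hot labels, two query pixels with matching features.
key_feats = [[1.0, 0.0], [0.0, 1.0]]
key_labels = [[1.0, 0.0], [0.0, 1.0]]
query_feats = [[5.0, 0.0], [0.0, 5.0]]
labels = propagate_labels(query_feats, key_feats, key_labels)
```

Applying the same operation from frame to frame gives the temporal propagation the abstract describes; mask refinement is not covered by this sketch.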

Title: Low-Resolution Editing is All You Need for High-Resolution Editing

Authors: Junsung Lee, Hyunsoo Lee, Yong Jae Lee, Bohyung Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19945
Pdf URL: https://arxiv.org/pdf/2511.19945
Copy Paste: [[2511.19945]] Low-Resolution Editing is All You Need for High-Resolution Editing(https://arxiv.org/abs/2511.19945)
Keywords: generation
Abstract: High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. While images serve as the most fundamental modality for visual expression, content generation that aligns with user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, paving the way toward high-resolution content creation.
摘要：高分辨率内容创建正在迅速成为视觉和图形社区的核心挑战。虽然图像是视觉表达的最基本形式，但符合用户意图的内容生成需要有效、可控的高分辨率图像处理机制。然而，现有方法仍然仅限于低分辨率设置，通常仅支持高达 1K 的分辨率。在这项工作中，我们介绍了高分辨率图像编辑的任务，并提出了一个测试时优化框架来解决它。我们的方法对高分辨率源图像进行逐块优化，然后采用细粒度细节传输模块和新颖的同步策略来保持块之间的一致性。大量的实验表明，我们的方法可以产生高质量的编辑，从而促进高分辨率内容的创建。

Title: Prompt Fairness: Sub-group Disparities in LLMs

Authors: Meiyu Zhong, Noel Teku, Ravi Tandon
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2511.19956
Pdf URL: https://arxiv.org/pdf/2511.19956
Copy Paste: [[2511.19956]] Prompt Fairness: Sub-group Disparities in LLMs(https://arxiv.org/abs/2511.19956)
Keywords: generation
Abstract: Large Language Models (LLMs), though shown to be effective in many applications, can vary significantly in their response quality. In this paper, we investigate the problem of prompt fairness: the phrasing of a prompt by different users or in different styles, despite the same question being asked in principle, may elicit different responses from an LLM. To quantify this disparity, we propose information-theoretic metrics that capture two dimensions of bias: subgroup sensitivity (the variability of responses within a subgroup) and cross-group consistency (the variability of responses across subgroups). Our empirical analysis reveals that certain demographic subgroups experience both higher internal variability and greater divergence from others, indicating structural inequities in model behavior. To mitigate these disparities, we propose practical interventions, including majority voting across multiple generations and prompt neutralization, which together improve response stability and enhance fairness across user populations. In the experiments, we observe clear prompt-sensitivity disparities across demographic subgroups: before mitigation, cross-group divergence values reach 0.28 and typically fall in the 0.14 to 0.22 range. After applying our neutralization and multi-generation strategy, these divergences consistently decrease, with the largest gap reduced to 0.22 and many distances falling to 0.17 or below, indicating more stable and consistent outputs across subgroups.
摘要：大型语言模型 (LLM) 虽然在许多应用中被证明是有效的，但其响应质量可能存在很大差异。在本文中，我们研究了提示公平性的问题：具体来说，尽管原则上提出了相同的问题，但不同用户/风格的提示措辞可能会引起法学硕士的不同反应。为了量化这种差异，我们建议使用信息论指标来捕获偏差的两个维度：子组敏感性、子组内响应的变异性和组间一致性、子组间响应的变异性。我们的分析表明，某些亚组表现出较高的内部变异性和与其他亚组的较大差异。我们的实证分析表明，某些人口亚群体经历了较高的内部变异性和与其他群体的较大差异，表明模型行为中存在结构性不平等。为了减轻这些差异，我们提出了实际干预措施，包括跨多代的多数投票和及时中和，这些措施共同提高了响应稳定性并增强了用户群体的公平性。在实验中，我们观察到人口统计亚组之间明显的即时敏感性差异：在缓解之前，跨组差异值达到 0.28，通常落在 0.14 到 0.22 的范围内。应用我们的中和和多代策略后，这些差异持续减少，最大差距减少到 0.22，许多距离下降到 0.17 或更低，这表明子组之间的输出更加稳定和一致。
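
Two ingredients mentioned in this abstract can be sketched briefly: majority voting across multiple generations, and a divergence between the response distributions of two subgroups. The abstract does not name its metrics, so the Jensen-Shannon divergence below is an assumed stand-in chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def majority_vote(responses):
    """Return the most frequent response among several generations."""
    return Counter(responses).most_common(1)[0][0]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Assumed stand-in for a cross-group divergence metric; it is symmetric
    and bounded between 0 and 1.
    """
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

# Three generations for one prompt, then answer distributions for two subgroups.
answer = majority_vote(["A", "B", "A"])
same = js_divergence([0.5, 0.5], [0.5, 0.5])
opposite = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give a divergence of 0 and fully disjoint ones give 1, so lower values correspond to more consistent behavior across subgroups.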

Title: HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19965
Pdf URL: https://arxiv.org/pdf/2511.19965
Copy Paste: [[2511.19965]] HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning(https://arxiv.org/abs/2511.19965)
Keywords: generation, generative
Abstract: Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
摘要：扩散模型的最新进展已经证明了在为简单提示生成高质量图像方面具有令人印象深刻的能力。然而，当面对涉及多个对象和层次结构的复杂提示时，现有模型难以准确遵循指令，导致概念遗漏、混乱和组合性差等问题。为了解决这些限制，我们提出了一种基于新颖的合成链（CoS）范式的分层组合生成框架（HiCoGen）。 HiCoGen 首先利用大型语言模型 (LLM) 将复杂的提示分解为最小的语义单元，而不是整体生成。然后，它迭代地合成这些单元，其中每个步骤中生成的图像为下一步提供关键的视觉上下文，确保所有文本概念都忠实地构建到最终场景中。为了进一步优化这个过程，我们引入了强化学习（RL）框架。至关重要的是，我们发现标准扩散采样器的有限探索阻碍了有效的强化学习。我们从理论上证明，通过在早期生成阶段集中随机性可以最大化样本多样性，并基于这一见解，提出一种新颖的衰减随机性计划来增强探索。然后，我们的强化学习算法在分层奖励机制的指导下，在全局、主题和关系级别上联合评估图像。我们还构建了 HiCoPrompt，一个新的文本到图像基准，具有用于严格评估的分层提示。实验表明，我们的方法在概念覆盖率和构图准确性方面都显着优于现有方法。
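
The idea of concentrating stochasticity in the early generation steps can be illustrated with a simple schedule. The exponential form, the function name `decaying_stochasticity`, and the `decay` parameter are assumptions made for this sketch; the abstract does not specify the functional form of the Decaying Stochasticity Schedule.

```python
import math

def decaying_stochasticity(num_steps, initial=1.0, decay=4.0):
    """Return one noise scale per sampling step, largest at the first step.

    Assumption: the scale decays exponentially over normalized step index.
    """
    last = max(num_steps - 1, 1)
    return [initial * math.exp(-decay * t / last) for t in range(num_steps)]

# Ten sampling steps: most of the injected noise falls in the first few.
schedule = decaying_stochasticity(10)
```

A sampler would multiply its per-step noise term by the corresponding entry, so that exploration happens early and later steps are close to deterministic.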

Title: EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback

Authors: Jingyang Jia, Kai Shu, Gang Yang, Long Xing, Xun Chen, Aiping Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19982
Pdf URL: https://arxiv.org/pdf/2511.19982
Copy Paste: [[2511.19982]] EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback(https://arxiv.org/abs/2511.19982)
Keywords: generation, generative
Abstract: Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback2) for C-EICG, which exploits the reasoning capability of a fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods on our custom dataset. The code and dataset will be released soon.
摘要：连续情感图像生成（C-EICG）由于能够生成与用户描述和连续情感值一致的图像而迅速兴起。然而，现有的方法缺乏生成图像的情感反馈，限制了情感连续性的控制。此外，它们在情感和天真生成的文本之间的简单对齐无法根据图像内容自适应地调整情感提示，导致情感保真度不足。为了解决这些问题，我们为 C-EICG 提出了一种新颖的生成理解反馈强化范例（EmoFeedback2），它利用微调大型视觉语言模型（LVLM）的推理能力来提供奖励和文本反馈，以生成具有连续情感的高质量图像。具体来说，我们引入了一种情感感知奖励反馈策略，其中LVLM评估生成图像的情感值并计算针对目标情感的奖励，指导生成模型的强化微调并增强图像的情感连续性。此外，我们设计了一个自我推销文本反馈框架，其中LVLM迭代分析生成图像的情感内容，并自适应地为下一轮提示提供细化建议，以细粒度的内容提高情感保真度。大量的实验结果表明，我们的方法可以有效地生成具有所需情感的高质量图像，优于我们的自定义数据集中现有的最先进的方法。代码和数据集即将发布。
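
Computing a reward against target emotions can be sketched as a distance-based score. The abstract does not define the reward, so everything here is an assumption: emotions are represented as points in a continuous space (for example two coordinates per image), the predicted point would come from the LVLM, and the reward decays exponentially with Euclidean distance. The name `emotion_reward` and the `scale` parameter are hypothetical.

```python
import math

def emotion_reward(predicted, target, scale=1.0):
    """Map the distance between predicted and target emotion values to (0, 1].

    A perfect match yields 1.0; the reward shrinks as the distance grows.
    """
    return math.exp(-math.dist(predicted, target) / scale)

# Hypothetical emotion coordinates for a target and two generated images.
target = (0.8, 0.3)
close_reward = emotion_reward((0.7, 0.35), target)
far_reward = emotion_reward((0.1, 0.9), target)
```

A reward of this shape is smooth in the predicted values, which suits fine-tuning toward continuous rather than categorical emotion targets.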

Title: OmniRefiner: Reinforcement-Guided Local Diffusion Refinement

Authors: Yaoli Liu, Ziheng Ouyang, Shengtao Lou, Yiren Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19990
Pdf URL: https://arxiv.org/pdf/2511.19990
Copy Paste: [[2511.19990]] OmniRefiner: Reinforcement-Guided Local Diffusion Refinement(https://arxiv.org/abs/2511.19990)
Keywords: restoration, generation
Abstract: Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches that amplify local details based on existing methods often produce results inconsistent with the original image in terms of lighting, texture, or shape. To address this, we introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.
摘要：参考引导图像生成进展迅速，但当前的扩散模型在使用参考细化生成的图像时仍然难以保留细粒度的视觉细节。出现这种限制的原因是基于 VAE 的潜在压缩本质上会丢弃微妙的纹理信息，导致特定于身份和属性的线索消失。此外，基于现有方法放大局部细节的后期编辑方法通常会产生在光照、纹理或形状方面与原始图像不一致的结果。为了解决这个问题，我们引入了 \ourMthd{}，这是一个细节感知的细化框架，它执行两个连续阶段的参考驱动校正以增强像素级一致性。我们首先通过微调单图像扩散编辑器来共同摄取草稿图像和参考图像，从而在保持结构保真度的同时实现全局连贯的细化。然后，我们应用强化学习来进一步增强本地化编辑能力，明确优化细节准确性和语义一致性。大量实验表明，\ourMthd{} 显着改善了参考对齐和细粒度细节保留，产生忠实且视觉连贯的编辑，在具有挑战性的参考引导恢复基准上超越了开源和商业模型。

Title: CREward: A Type-Specific Creativity Reward Model

Authors: Jiyeon Han, Ali Mahdavi-Amiri, Hao Zhang, Haedong Jeong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19995
Pdf URL: https://arxiv.org/pdf/2511.19995
Copy Paste: [[2511.19995]] CREward: A Type-Specific Creativity Reward Model(https://arxiv.org/abs/2511.19995)
Keywords: generation
Abstract: Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity "axes," geometry, material, and texture, to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.
摘要：创造力是一种复杂的现象。当谈到表达和评估创造力时，将其视为一个单一的、无差异的量会显得幼稚且平淡无奇。在这项工作中，我们学习了\emph{第一个特定类型的创造力奖励模型}，创造了 CREward，它跨越三个创造力“轴”，即几何、材料和纹理，使我们能够通过图像形成管道的镜头来看待创造力。为了构建我们的奖励模型，我们首先进行人类基准评估，以捕获人类对各种创意图像中每种类型的创造力的感知。然后，我们通过大型视觉语言模型（LVLM）分析人类判断和预测之间的相关性，确认 LVLM基于这一观察，我们收集了 LVLM 生成的标签来训练适用于创意图像的评估和生成的 CREward 模型，我们探索了 CREward 的三种应用：创造力评估、可解释的创造力以及通过低等级适应来获取创意样本。

Title: iRadioDiff: Physics-Informed Diffusion Model for Indoor Radio Map Construction and Localization

Authors: Xiucheng Wang, Tingwei Yuan, Yang Cao, Nan Cheng, Ruijin Sun, Weihua Zhuang
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2511.20015
Pdf URL: https://arxiv.org/pdf/2511.20015
Copy Paste: [[2511.20015]] iRadioDiff: Physics-Informed Diffusion Model for Indoor Radio Map Construction and Localization(https://arxiv.org/abs/2511.20015)
Keywords: generative
Abstract: Radio maps (RMs) serve as environment-aware electromagnetic (EM) representations that connect scenario geometry and material properties to the spatial distribution of signal strength, enabling localization without costly in-situ measurements. However, constructing high-fidelity indoor RMs remains challenging due to the prohibitive latency of EM solvers and the limitations of learning-based methods, which often rely on sparse measurements or assumptions of homogeneous material that are misaligned with the heterogeneous and multipath-rich nature of indoor environments. To overcome these challenges, we propose iRadioDiff, a sampling-free diffusion-based framework for indoor RM construction. iRadioDiff is conditioned on access point (AP) positions and a physics-informed prompt encoded by material reflection and transmission coefficients. It further incorporates multipath-critical priors, including diffraction points, strong transmission boundaries, and line-of-sight (LoS) contours, to guide the generative process via conditional channels and boundary-weighted objectives. This design enables accurate modeling of nonstationary field discontinuities and efficient construction of physically consistent RMs. Experiments demonstrate that iRadioDiff achieves state-of-the-art performance in indoor RM construction and received-signal-strength-based indoor localization, and generalizes effectively across layouts and material configurations. Code is available at this https URL.
摘要：无线电地图（RM）作为环境感知电磁（EM）表示，将场景几何和材料属性与信号强度的空间分布联系起来，无需昂贵的现场测量即可实现定位。然而，由于 EM 求解器的延迟过高以及基于学习的方法的局限性，构建高保真室内 RM 仍然具有挑战性，这些方法通常依赖于稀疏测量或均质材料的假设，而这与室内环境的异构性和多路径丰富的性质不相符。为了克服这些挑战，我们提出了 iRadioDiff，一种用于室内 RM 构建的基于无采样扩散的框架。 iRadioDiff 以接入点 (AP) 位置以及由材料反射和传输系数编码的物理提示为条件。它进一步结合了多路径关键先验，包括衍射点、强传输边界和视线 (LoS) 轮廓，通过条件通道和边界加权目标来指导生成过程。该设计能够对非平稳场不连续性进行精确建模，并有效构建物理一致的 RM。实验表明，iRadioDiff 在室内 RM 结构和基于接收信号强度的室内定位方面实现了最先进的性能，从而提供了跨布局和材料配置的有效泛化。代码可从此 https URL 获取。

Title: SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM

Authors: Lin Chen, Yingjian Zhu, Qi Yang, Xin Niu, Kun Ding, Shiming Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20027
Pdf URL: https://arxiv.org/pdf/2511.20027
Copy Paste: [[2511.20027]] SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM(https://arxiv.org/abs/2511.20027)
Keywords: generation
Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment and recognize objects universally. Trained on extensive high-quality segmentation data, the segment anything model (SAM) has demonstrated remarkable universal segmentation capabilities, offering valuable support for OVSS. Although previous methods have made progress in leveraging SAM for OVSS, some challenges remain: (1) SAM's tendency to over-segment and (2) hard combinations between fixed masks and labels. This paper introduces a novel mask-injected framework, SAM-MI, which effectively integrates SAM with OVSS models to address these challenges. First, SAM-MI employs a Text-guided Sparse Point Prompter to sample sparse prompts for SAM instead of the dense grid-like prompts used previously, significantly accelerating the mask generation process. The framework then introduces Shallow Mask Aggregation (SMAgg) to merge partial masks and mitigate SAM's over-segmentation issue. Finally, Decoupled Mask Injection (DMI) incorporates SAM-generated masks for guidance at low and high frequencies separately, rather than directly combining them with labels. Extensive experiments on multiple benchmarks validate the superiority of SAM-MI. Notably, the proposed method achieves a 16.7% relative improvement in mIoU over Grounded-SAM on the MESS benchmark, along with a 1.6x speedup. We hope SAM-MI can serve as an alternative methodology to effectively equip the OVSS model with SAM.
摘要：开放词汇语义分割（OVSS）旨在通用地分割和识别对象。经过大量高质量分割数据的训练，Segment Anything 模型 (SAM) 展示了卓越的通用分割功能，为 OVSS 提供了宝贵的支持。尽管以前的方法在利用 SAM 进行 OVSS 方面取得了进展，但仍然存在一些挑战：（1）SAM 倾向于过度分割；（2）固定掩模和标签之间的硬组合。本文介绍了一种新颖的掩模注入框架 SAM-MI，该框架有效地将 SAM 与 OVSS 模型集成以应对这些挑战。最初，SAM-MI 采用文本引导的稀疏点提示器来对 SAM 的稀疏提示进行采样，而不是以前的密集网格状提示，从而显着加速了掩模生成过程。然后，该框架引入浅掩模聚合 (SMAgg) 来合并部分掩模，以缓解 SAM 的过度分割问题。最后，解耦掩模注入 (DMI) 结合了 SAM 生成的掩模，分别用于低频和高频引导，而不是直接将它们与标签相结合。多个基准的大量实验验证了 SAM-MI 的优越性。值得注意的是，所提出的方法在 MESS 基准上比 Grounded-SAM 的 mIoU 相对提高了 16.7%，并且加速了 1.6 倍。我们希望 SAM-MI 能够作为一种替代方法，有效地为 OVSS 模型配备 SAM。

Title: Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention

Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Zhixing Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20032
Pdf URL: https://arxiv.org/pdf/2511.20032
Copy Paste: [[2511.20032]] Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention(https://arxiv.org/abs/2511.20032)
Keywords: generation
Abstract: Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.
摘要：视觉注意力是 MLLM 解释视觉信息的主要机制；然而，其有限的定位能力常常会导致幻觉。我们观察到，尽管 MLLM 可以准确地从视觉标记中提取视觉语义，但它们在后续推理过程中未能充分利用这一优势。为了解决这个限制，我们提出了视觉引导注意力（VGA），这是一种免训练的方法，首先通过利用视觉标记的语义内容构建精确的视觉基础，然后使用这种基础来引导模型将注意力集中到相关的视觉区域。在图像字幕中，VGA 在生成过程中通过抑制已经描述的区域进一步动态地细化这种指导。在 VGA 中，每个令牌仅经历一次前向传递，引入的延迟开销可忽略不计，仅为 4.36\%。此外，VGA 与 FlashAttention 等高效注意力实现完全兼容。跨不同 MLLM 和多个幻觉基准的大量实验表明 VGA 实现了最先进的去幻觉性能。进一步的分析证实，明确的视觉引导在增强 MLLM 的视觉理解能力方面发挥着至关重要的作用。

Title: MFM-point: Multi-scale Flow Matching for Point Cloud Generation

Authors: Petr Molodyk, Jaemoo Choi, David W. Romero, Ming-Yu Liu, Yongxin Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20041
Pdf URL: https://arxiv.org/pdf/2511.20041
Copy Paste: [[2511.20041]] MFM-point: Multi-scale Flow Matching for Point Cloud Generation(https://arxiv.org/abs/2511.20041)
Keywords: generation, generative
Abstract: In recent years, point cloud generation has gained significant attention in 3D generative modeling. Among existing approaches, point-based methods directly generate point clouds without relying on other representations such as latent features, meshes, or voxels. These methods offer low training cost and algorithmic simplicity, but often underperform compared to representation-based approaches. In this paper, we propose MFM-Point, a multi-scale Flow Matching framework for point cloud generation that substantially improves the scalability and performance of point-based methods while preserving their simplicity and efficiency. Our multi-scale generation algorithm adopts a coarse-to-fine generation paradigm, enhancing generation quality and scalability without incurring additional training or inference overhead. A key challenge in developing such a multi-scale framework lies in preserving the geometric structure of unordered point clouds while ensuring smooth and consistent distributional transitions across resolutions. To address this, we introduce a structured downsampling and upsampling strategy that preserves geometry and maintains alignment between coarse and fine resolutions. Our experimental results demonstrate that MFM-Point achieves best-in-class performance among point-based methods and challenges the best representation-based methods. In particular, MFM-Point demonstrates strong results in multi-category and high-resolution generation tasks.
摘要：近年来，点云生成在 3D 生成建模中受到了广泛关注。在现有的方法中，基于点的方法直接生成点云，而不依赖于其他表示，例如潜在特征、网格或体素。这些方法训练成本低且算法简单，但与基于表示的方法相比通常表现不佳。在本文中，我们提出了 MFM-Point，一种用于点云生成的多尺度流匹配框架，它大大提高了基于点的方法的可扩展性和性能，同时保持了其简单性和效率。我们的多尺度生成算法采用从粗到细的生成范式，提高生成质量和可扩展性，而不会产生额外的训练或推理开销。开发这种多尺度框架的一个关键挑战在于保留无序点云的几何结构，同时确保跨分辨率的平滑和一致的分布过渡。为了解决这个问题，我们引入了一种结构化的下采样和上采样策略，该策略可以保留几何形状并保持粗分辨率和精细分辨率之间的对齐。我们的实验结果表明，MFM-Point 在基于点的方法中实现了同类最佳的性能，并对基于表示的最佳方法提出了挑战。特别是，MFM-point 在多类别和高分辨率生成任务中表现出了强劲的结果。

Title: History-Augmented Contrastive Meta-Learning for Unsupervised Blind Super-Resolution of Planetary Remote Sensing Images

Authors: Huijia Zhao, Jie Lu, Yunqing Jiang, Xiao-Ping Lu, Kaichang Di
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20045
Pdf URL: https://arxiv.org/pdf/2511.20045
Copy Paste: [[2511.20045]] History-Augmented Contrastive Meta-Learning for Unsupervised Blind Super-Resolution of Planetary Remote Sensing Images(https://arxiv.org/abs/2511.20045)
Keywords: super-resolution
Abstract: Planetary remote sensing images are affected by diverse and unknown degradations caused by imaging environments and hardware constraints. These factors limit image quality and hinder supervised blind super-resolution due to the lack of ground-truth images. This work presents History-Augmented Contrastive Blind Super-Resolution (HACBSR), an unsupervised framework for blind super-resolution that operates without ground-truth images or external kernel priors. HACBSR comprises two components: (1) a contrastive kernel sampling mechanism with kernel similarity control to mitigate distribution bias from Gaussian sampling, and (2) a history-augmented contrastive learning scheme that uses historical models to generate negative samples, enabling less greedy optimization and inducing strong convexity without ground truth. A convergence analysis of the history-augmented contrastive learning is given in the Appendix. To support evaluation in planetary applications, we introduce Ceres-50, a dataset with diverse geological features and simulated degradation patterns. Experiments show that HACBSR achieves competitive performance compared with state-of-the-art unsupervised methods across multiple upscaling factors. The code is available at this https URL, and the dataset is available at this https URL.
摘要：行星遥感图像受到成像环境和硬件限制引起的各种未知退化的影响。由于缺乏真实图像，这些因素限制了图像质量并阻碍了监督盲超分辨率。这项工作提出了历史增强对比盲超分辨率（HACBSR），这是一种用于盲超分辨率的无监督框架，无需地面实况图像和外部内核先验即可运行。 HACBSR 包含两个组成部分：(1) 具有内核相似性控制的对比内核采样机制，以减轻高斯采样的分布偏差；(2) 历史增强对比学习，使用历史模型生成负样本，以实现较少的贪婪优化并在没有基本事实的情况下诱导强凸性。附录中给出了历史增强对比学习的收敛分析。为了支持行星应用中的评估，我们引入了 Ceres-50，这是一个具有多种地质特征的数据集，模拟了退化模式。实验表明，与最先进的无监督方法相比，HACBSR 在多个升级因素上实现了具有竞争力的性能。代码可从此 https URL 获取，数据集可从此 https URL 获取。

Title: PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images

Authors: Simon Damm, Jonas Ricker, Henning Petzka, Asja Fischer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20068
Pdf URL: https://arxiv.org/pdf/2511.20068
Copy Paste: [[2511.20068]] PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images(https://arxiv.org/abs/2511.20068)
Keywords: generation
Abstract: Autoregressive (AR) image generation has recently emerged as a powerful paradigm for image synthesis. Leveraging the generation principle of large language models, AR image generators can efficiently produce deceptively real-looking images, further increasing the need for reliable detection methods. However, to date there is a lack of work specifically targeting the detection of images generated by AR image generators. In this work, we present PRADA (Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images), a simple and interpretable approach that can reliably detect AR-generated images and attribute them to their respective source model. The key idea is to inspect the ratio of a model's conditional and unconditional probability for the autoregressive token sequence representing a given image. Whenever an image is generated by a particular model, its probability ratio shows unique characteristics which are not present for images generated by other models or real images. We exploit these characteristics for threshold-based attribution and detection by calibrating a simple, model-specific score function. Our experimental evaluation shows that PRADA is highly effective against eight class-to-image and four text-to-image models.
摘要：自回归 (AR) 图像生成最近已成为图像合成的强大范例。利用大型语言模型的生成原理，它们可以有效地生成看似真实的图像，进一步增加了对可靠检测方法的需求。然而，迄今为止，还缺乏专门针对 AR 图像生成器生成的图像检测的工作。在这项工作中，我们提出了 PRADA（基于概率比的自回归生成图像的归因和检测），这是一种简单且可解释的方法，可以可靠地检测 AR 生成的图像并将其归因于各自的源模型。关键思想是检查表示给定图像的自回归标记序列的模型的条件概率和无条件概率的比率。每当特定模型生成图像时，其概率比就会显示出其他模型生成的图像或真实图像所不存在的独特特征。我们通过校准简单的、特定于模型的评分函数，利用这些特征进行基于阈值的归因和检测。我们的实验评估表明，PRADA 对于 8 个类到图像模型和 4 个文本到图像模型非常有效。
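
The probability-ratio idea in the PRADA abstract can be sketched roughly as follows. This is only an illustration, assuming per-token log-probabilities under the conditional and unconditional settings have already been obtained from a candidate model. The function names, the mean-over-tokens aggregation, the threshold value, and the toy numbers are all hypothetical and are not the paper's calibrated score function.

```python
from typing import Sequence


def log_prob_ratio(cond_logps: Sequence[float], uncond_logps: Sequence[float]) -> float:
    """Mean per-token log ratio: log p(x_t | c) - log p(x_t), averaged over the sequence.

    Averaging is an assumption made here for illustration; the abstract only
    says that a simple, model-specific score function is calibrated.
    """
    if len(cond_logps) != len(uncond_logps) or not cond_logps:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(c - u for c, u in zip(cond_logps, uncond_logps)) / len(cond_logps)


def is_attributed(score: float, threshold: float) -> bool:
    """Threshold-based decision: attribute the image to the model if the score is high enough."""
    return score > threshold


# Made-up log-probabilities for a three-token image sequence.
cond = [-1.0, -0.5, -2.0]
uncond = [-2.0, -1.5, -2.5]
score = log_prob_ratio(cond, uncond)
decision = is_attributed(score, threshold=0.5)  # hypothetical threshold
```

In practice the threshold would be chosen per model on held-out data, which is what "calibrating" suggests in the abstract.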

Title: QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression

Authors: Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
Subjects: cs.LG, cs.AR, cs.PL
Abstract URL: https://arxiv.org/abs/2511.20099
Pdf URL: https://arxiv.org/pdf/2511.20099
Copy Paste: [[2511.20099]] QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression(https://arxiv.org/abs/2511.20099)
Keywords: generation
Abstract: Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches typically rely on free-form natural language descriptions that are often ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an open-ended natural language space to a domain-specific, highly constrained target space. To bridge this gap, we introduce Core Refined Understanding eXpression (CRUX), a structured intermediate space that captures the essential semantics of user intent while organizing the expression for precise Verilog code generation. We further design a two-stage training framework, comprising Joint Expression Modeling and Dual-Space Optimization, to enhance the quality of both CRUX and Verilog code. Experiments across multiple Verilog generation benchmarks demonstrate that our model, CRUX-V, achieves state-of-the-art performance among general models, particularly under challenging design tasks. Furthermore, the CRUX space proves transferable and beneficial when used as input prompts for other code models, highlighting its effectiveness in narrowing the gap between free-form natural language descriptions and precise Verilog generation.
摘要：大型语言模型 (LLM) 在硬件描述语言 (HDL) 生成方面显示出有前景的功能。然而，现有方法通常依赖于自由形式的自然语言描述，这些描述通常是模糊的、冗余的和非结构化的，这给下游 Verilog 代码生成带来了重大挑战。我们将硬件代码生成视为从开放式自然语言空间到特定领域的、高度受限的目标空间的复杂转换。为了弥补这一差距，我们引入了 Core Refined Understanding eXpression (CRUX)，这是一种结构化的中间空间，可以捕获用户意图的基本语义，同时组织表达式以进行精确的 Verilog 代码生成。我们进一步设计了一个两阶段训练框架，包括联合表达建模和双空间优化，以提高 CRUX 和 Verilog 代码的质量。跨多个 Verilog 生成基准的实验表明，我们的模型 CRUX-V 在通用模型中实现了最先进的性能，特别是在具有挑战性的设计任务下。此外，CRUX 空间在用作其他代码模型的输入提示时被证明是可转移的且有益的，突出了它在缩小自由形式自然语言描述和精确的 Verilog 生成之间的差距方面的有效性。

Title: The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs

Authors: Craig Dickson
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.20104
Pdf URL: https://arxiv.org/pdf/2511.20104
Copy Paste: [[2511.20104]] The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs(https://arxiv.org/abs/2511.20104)
Keywords: generation
Abstract: Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment, a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically, the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate whether current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family and measure misalignment robustness over a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o's 20%. We identify a critical format-dependent vulnerability: requiring JSON output more than doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model's "degrees of freedom" to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems.
摘要：先前的工作表明，在具有未对齐数据的狭窄域上微调模型可能会导致广泛的未对齐，这种现象称为“紧急未对齐”（Betley et al. 2025）。虽然所有测试的模型都容易受到紧急错位的影响，但某些模型比其他模型表现出更大的阻力。具体而言，Qwen-2.5 系列被证明具有相对抵抗性，而 GPT-4o 表现出最强的错位。在本文中，我们评估当前一代开放权重模型是否表现出与 Qwen-2.5 系列类似的抵抗力，并测量一系列模型架构和规模上的失准鲁棒性。我们在九个现代开放权重模型（Gemma 3 和 Qwen 3 系列，1B-32B 参数）中复制了该效果。针对不安全代码生成进行微调的模型显示出 0.68% 的错位率（基础模型为 0.07%），与之前开放模型结果的下限相符，但显着低于 GPT-4o 的 20%。我们发现了一个与格式相关的关键漏洞：与自然语言提示相比，需要 JSON 输出，错位率增加了一倍（0.96% vs 0.42%）。这表明结构约束可能通过降低模型拒绝的“自由度”来绕过安全训练。这些发现证实，在现代开放权重模型中，紧急失准是一种可重现的现象，其发生率远低于专有系统中观察到的发生率。
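
To put the reported rates in proportion, the short calculation below works out the ratios implied by the figures in the abstract. It uses only numbers quoted there; the variable names are illustrative, and the GPT-4o comparison treats the quoted 20% as directly comparable, which is an assumption.

```python
# Misalignment rates quoted in the abstract, in percent.
finetuned_rate = 0.68   # open-weights models fine-tuned on insecure code
base_rate = 0.07        # corresponding base models
json_rate = 0.96        # prompts requiring JSON output
natural_rate = 0.42     # natural language prompts
gpt4o_rate = 20.0       # prior GPT-4o result cited in the abstract

finetuned_vs_base = finetuned_rate / base_rate   # roughly 9.7x increase after fine-tuning
json_vs_natural = json_rate / natural_rate       # roughly 2.3x, i.e. more than double
gpt4o_vs_open = gpt4o_rate / finetuned_rate      # roughly 29x higher for GPT-4o

print(f"fine-tuned vs base: {finetuned_vs_base:.1f}x")
print(f"JSON vs natural language: {json_vs_natural:.2f}x")
print(f"GPT-4o vs fine-tuned open models: {gpt4o_vs_open:.1f}x")
```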

Title: Vision-Language Models for Automated 3D PET/CT Report Generation

Authors: Wenpei Jiao, Kun Shang, Hui Li, Ke Yan, Jiajin Zhang, Guangjie Yang, Lijuan Guo, Yan Wan, Xing Yang, Dakai Jin, Zhaoheng Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20145
Pdf URL: https://arxiv.org/pdf/2511.20145
Copy Paste: [[2511.20145]] Vision-Language Models for Automated 3D PET/CT Report Generation(https://arxiv.org/abs/2511.20145)
Keywords: generation
Abstract: Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D contextual information is required rather than local-region interpretation. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports with 245,509 paired PET/CT slices), and construct AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data but equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Codes, models, and AutoPET-RG-Lym will be released.
摘要：正电子发射断层扫描/计算机断层扫描 (PET/CT) 在肿瘤学中至关重要，但扫描仪的快速扩展已经超出了训练有素的专家的能力，使得自动 PET/CT 报告生成 (PETRG) 对于减少临床工作量变得越来越重要。与结构成像（例如 X 射线、CT 和 MRI）相比，功能性 PET 提出了明显的挑战：代谢模式随示踪生理学而变化，并且需要全身 3D 背景信息而不是局部区域解释。为了推进 PETRG，我们提出了 PETRG-3D，这是一种端到端 3D 双分支框架，可单独编码 PET 和 CT 体积，并结合风格自适应提示来减轻报告实践中医院间的差异。我们构建了 PETRG-Lym，这是一个从四家医院收集的多中心淋巴瘤数据集（824 份报告，包含 245,509 个配对 PET/CT 切片），并构建了 AutoPET-RG-Lym，这是一个可公开访问的 PETRG 基准，源自开放成像数据，但配备了新的专家撰写的、经过临床验证的报告（135 例）。为了评估临床实用性，我们引入了 PETRG-Score，这是一种淋巴瘤特异性评估方案，可联合测量指定解剖区域的代谢和结构发现。实验表明，PETRG-3D 在自然语言指标（例如，+31.49\% ROUGE-L）和临床疗效指标（例如，+8.18\% PET-All）方面均明显优于现有方法，凸显了体积双模态建模和风格感知提示的优势。总体而言，这项工作为未来 PET/CT 特定模型奠定了基础，强调疾病感知推理和临床可靠评估。代码、型号和 AutoPET-RG-Lym 将发布。

Title: Restora-Flow: Mask-Guided Image Restoration with Flow Matching

Authors: Arnela Hadzic, Franz Thaler, Lea Bogensperger, Simon Johannes Joham, Martin Urschler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20152
Pdf URL: https://arxiv.org/pdf/2511.20152
Copy Paste: [[2511.20152]] Restora-Flow: Mask-Guided Image Restoration with Flow Matching(https://arxiv.org/abs/2511.20152)
Keywords: restoration, super-resolution, generation, generative
Abstract: Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.
摘要：流匹配已成为一种有前途的生成方法，它可以解决与最先进的扩散模型相关的冗长采样时间问题，并实现更灵活的轨迹设计，同时保持高质量的图像生成。这种能力使其适合作为图像恢复任务的生成先验。尽管当前利用流模型的方法在恢复方面显示出了良好的结果，但有些方法仍然存在处理时间长或产生过度平滑的结果的问题。为了应对这些挑战，我们引入了 Restora-Flow，这是一种免训练方法，通过降级掩模指导流匹配采样，并结合轨迹校正机制来强制降级输入的一致性。我们在涉及基于掩模的退化（即修复、超分辨率和去噪）的多个图像恢复任务中评估了我们在自然和医学数据集上的方法。与基于扩散和流量匹配的参考方法相比，我们表现出卓越的感知质量和处理时间。

Title: Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware

Authors: Federico Paredes-Valles, Yoshitaka Miyatani, Kirk Y. W. Scheper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20175
Pdf URL: https://arxiv.org/pdf/2511.20175
Copy Paste: [[2511.20175]] Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware(https://arxiv.org/abs/2511.20175)
Keywords: generation
Abstract: Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution and sparse data streams, they have lacked fully integrated, low-power processing solutions capable of real-time inference. In this work, we present the first battery-powered, wearable pupil-center-tracking system with complete on-device integration, combining event-based sensing and neuromorphic processing on the commercially available Speck2f system-on-chip with lightweight coordinate decoding on a low-power microcontroller. Our solution features a novel uncertainty-quantifying spiking neural network with gated temporal decoding, optimized for strict memory and bandwidth constraints, complemented by systematic deployment mechanisms that bridge the reality gap. We validate our system on a new multi-user dataset and demonstrate a wearable prototype with dual neuromorphic devices achieving robust binocular pupil tracking at 100 Hz with an average power consumption below 5 mW per eye. Our work demonstrates that end-to-end neuromorphic computing enables practical, always-on eye tracking for next-generation energy-efficient wearable systems.
摘要：眼动追踪是众多应用的基础，但以超低功耗实现稳健的高频追踪对于可穿戴平台来说仍然具有挑战性。虽然基于事件的视觉传感器提供微秒分辨率和稀疏数据流，但它们缺乏能够实时推理的完全集成、低功耗处理解决方案。在这项工作中，我们提出了第一个具有完整设备集成的电池供电、可穿戴瞳孔中心跟踪系统，将商用 Speck2f 片上系统上基于事件的传感和神经形态处理与低功耗微控制器上的轻量级坐标解码相结合。我们的解决方案采用新颖的不确定性量化尖峰神经网络，具有门控时间解码功能，针对严格的内存和带宽限制进行了优化，并辅以弥合现实差距的系统部署机制。我们在新的多用户数据集上验证了我们的系统，并演示了具有双神经形态设备的可穿戴原型，可实现 100 Hz 下稳健的双眼瞳孔跟踪，每只眼睛的平均功耗低于 5 mW。我们的工作表明，端到端神经拟态计算可以为下一代节能可穿戴系统提供实用、始终在线的眼球追踪。

Title: Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

Authors: Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel, Luc Van Gool
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20186
Pdf URL: https://arxiv.org/pdf/2511.20186
Copy Paste: [[2511.20186]] Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis(https://arxiv.org/abs/2511.20186)
Keywords: generation, generative
Abstract: Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.
摘要：WAN 2.2 等基础视频生成模型表现出强大的文本和图像条件合成能力，但仍然受限于相同视图生成设置。在这项工作中，我们介绍了 Exo2EgoSyn，它是 WAN 2.2 的改编版本，可解锁 Exocentric-to-Egocentric (Exo2Ego) 跨视图视频合成。我们的框架由三个关键模块组成。自我-外在视图对齐（EgoExo-Align）强制外向心和自我中心第一帧表示之间的潜在空间对齐，将生成空间从给定的外向视图重新定向到自我视图。多视图外中心视频调节 (MultiExoCon) 将多视图外中心视频聚合成统一的调节信号，将 WAN2.2 扩展到普通的单图像或文本调节之外。此外，姿势感知潜在注入（PoseInj）将相关的外部到自我相机姿势信息注入到潜在状态中，指导跨视点的几何感知合成。这些模块共同实现了从第三人称观察生成高保真自我视图视频，而无需从头开始重新训练。 ExoEgo4D 上的实验验证了 Exo2EgoSyn 显着改进了 Ego2Exo 合成，为使用基础模型生成可扩展的跨视图视频铺平了道路。源代码和模型将公开发布。

Title: OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

Authors: Hao Yu, Jiabo Zhan, Zile Wang, Jinglin Wang, Huaisong Zhang, Hongyu Li, Xinrui Chen, Yongxian Wei, Chun Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20211
Pdf URL: https://arxiv.org/pdf/2511.20211
Copy Paste: [[2511.20211]] OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation(https://arxiv.org/abs/2511.20211)
Keywords: generation, generative
Abstract: Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. We jointly train OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, and extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
摘要：生成模型在 RGB 合成方面表现出色，但实际应用需要 RGBA 操作。这导致了碎片化的格局：专门的单任务模型处理 alpha 但缺乏通用性，而统一的多任务框架仅限于 RGB 域。为了弥补这一关键差距，我们提出了 OmniAlpha，这是第一个用于序列到序列 RGBA 图像生成和编辑的统一多任务生成框架。其架构采用 MSRoPE-BiL，这是一种新颖的 RoPE 方法，其 Diffusion Transformer (DiT) 主干具有双向可扩展层轴，可同时处理多个输入和目标 RGBA 层。为了支持这个框架，我们引入了 AlphaLayers，这是一个包含 1,000 个高质量多层三元组的新数据集，通过新颖的自动合成和过滤管道构建。在该数据集上跨 21 个不同任务的综合套件联合训练 OmniAlpha，大量实验表明我们的统一方法始终优于强大的专业基线。最值得注意的是，OmniAlpha 在 AIM-500 上实现了无掩模抠图的 SAD 相对大幅降低 84.8%，并在图层条件完成方面赢得了超过 90% 的人类偏好。我们的工作证明，统一的多任务模型可以学习 RGBA 的高级共享表示，为更强大的分层感知生成系统铺平道路。
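
For readers unfamiliar with the matting metric, the sketch below shows how a sum of absolute differences (SAD) between alpha mattes and a relative reduction can be computed. It is a minimal illustration, not the paper's evaluation code: the function names and toy mattes are made up, and matting papers often report SAD divided by 1000, a scaling omitted here.

```python
from typing import List

Matte = List[List[float]]  # alpha values in [0, 1], row-major


def sad(pred: Matte, gt: Matte) -> float:
    """Sum of absolute differences between a predicted and a ground-truth alpha matte."""
    if len(pred) != len(gt) or any(len(rp) != len(rg) for rp, rg in zip(pred, gt)):
        raise ValueError("mattes must have the same shape")
    return sum(abs(p - g) for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))


def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of an error metric relative to a baseline value."""
    return (baseline - ours) / baseline


# Toy 2x2 mattes (made-up values).
gt = [[0.0, 1.0], [1.0, 0.0]]
pred = [[0.1, 0.9], [1.0, 0.2]]
toy_sad = sad(pred, gt)  # 0.1 + 0.1 + 0.0 + 0.2

# An 84.8% relative reduction corresponds to, e.g., an error of 100.0 falling to 15.2.
example_reduction = relative_reduction(100.0, 15.2)
```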

Title: Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

Authors: Yuhang Qian, Haiyan Chen, Wentong Li, Ningzhong Liu, Jie Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20218
Pdf URL: https://arxiv.org/pdf/2511.20218
Copy Paste: [[2511.20218]] Text-guided Controllable Diffusion for Realistic Camouflage Images Generation(https://arxiv.org/abs/2511.20218)
Keywords: generation
Abstract: Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG's ability to produce photorealistic camouflage images.
摘要：迷彩图像生成（CIG）是一个新兴的研究领域，专注于合成图像，其中物体和谐地混合在一起，并与周围环境表现出高度的视觉一致性。现有方法通过将对象融合到特定背景中或通过前景对象引导扩散来覆盖周围环境来执行 CIG。然而，由于忽略了伪装物体与背景环境之间的逻辑关系，它们常常无法获得自然的结果。为了解决这个问题，我们提出了 CT-CIG，一种可控文本引导迷彩图像生成方法，可以生成逼真且逻辑上合理的迷彩图像。利用大型视觉语言模型（VLM），我们设计了一种伪装揭示对话机制（CRDM），用高质量的文本提示来注释现有的伪装数据集。随后，构建的图像提示对用于微调稳定扩散，结合轻量级控制器来引导伪装物体的位置和形状，以增强伪装场景的适应性。此外，我们设计了频率交互细化模块（FIRM）来捕获高频纹理特征，促进复杂迷彩图案的学习。广泛的实验，包括 CLIPScore 评估和伪装效果评估，证明了我们生成的文本提示的语义对齐以及 CT-CIG 生成逼真伪装图像的能力。

Title: PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

Authors: Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Yi-Lun Wu, Hong-Han Shuai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20251
Pdf URL: https://arxiv.org/pdf/2511.20251
Copy Paste: [[2511.20251]] PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling(https://arxiv.org/abs/2511.20251)
Keywords: generation
Abstract: Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.
摘要：文本到图像（T2I）生成的最新进展通过大规模整流流模型取得了显着的视觉效果。然而，这些模型在长时间提示下的表现仍有待探索。长提示编码丰富的内容、空间和风格信息，这些信息增强了保真度，但常常抑制多样性，导致重复且缺乏创意的输出。在这项工作中，我们系统地研究了这种保真度与多样性的困境，并揭示了随着即时长度的增加，最先进的模型表现出多样性的明显下降。为了实现一致的评估，我们引入了 LPD-Bench，这是一个旨在评估长提示生成的保真度和多样性的基准。基于我们的分析，我们开发了一个理论框架，通过提示重构来增加采样熵，并提出了一种免训练方法 PromptMoG，该方法从嵌入空间中的混合高斯中采样提示嵌入，以增强多样性，同时保留语义。对四种最先进的模型 SD3.5-Large、Flux.1-Krea-Dev、CogView4 和 Qwen-Image 进行的大量实验表明，PromptMoG 始终如一地提高了长提示生成多样性，而没有语义漂移。
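
The sampling step described in the PromptMoG abstract can be illustrated with a small mixture-of-Gaussians sampler over embedding vectors. This is a sketch under stated assumptions rather than the authors' method: it assumes the mixture centers are embeddings of reformulated prompts, that components are isotropic with a shared standard deviation `sigma`, and that embeddings are plain lists of floats. The function name and all values are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_prompt_embedding(
    centers: Sequence[Sequence[float]],
    sigma: float,
    rng: random.Random,
    weights: Optional[Sequence[float]] = None,
) -> Tuple[List[float], int]:
    """Draw one embedding from a mixture of isotropic Gaussians.

    First pick a component (uniformly if `weights` is None), then add
    independent Gaussian noise with standard deviation `sigma` to its center.
    Returns the sampled embedding and the index of the chosen component.
    """
    if not centers:
        raise ValueError("need at least one mixture center")
    k = rng.choices(range(len(centers)), weights=weights, k=1)[0]
    sample = [rng.gauss(mu, sigma) for mu in centers[k]]
    return sample, k


# Toy 3-dimensional "embeddings" for two hypothetical prompt reformulations.
centers = [[0.0, 1.0, 2.0], [0.5, 0.5, 0.5]]
rng = random.Random(0)
embedding, component = sample_prompt_embedding(centers, sigma=0.05, rng=rng)
```

A small `sigma` keeps samples close to a reformulated prompt's embedding, which is one plausible way to trade diversity against semantic drift.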

Title: Zoo3D: Zero-Shot 3D Object Detection at Scene Level

Authors: Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20253
Pdf URL: https://arxiv.org/pdf/2511.20253
Copy Paste: [[2511.20253]] Zoo3D: Zero-Shot 3D Object Detection at Scene Level(https://arxiv.org/abs/2511.20253)
Keywords: generation
Abstract: 3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at this https URL.
摘要：3D 对象检测是空间理解的基础。现实世界的环境需要能够识别各种以前未见过的物体的模型，这仍然是封闭集方法的主要限制。现有的开放词汇 3D 检测器放宽了注释要求，但仍然依赖于训练场景，例如点云或图像。我们更进一步引入了 Zoo3D，这是第一个免训练的 3D 对象检测框架。我们的方法通过 2D 实例掩码的图形聚类构建 3D 边界框，然后使用具有最佳视图选择和视图共识掩码生成的新型开放词汇模块分配语义标签。 Zoo3D 以两种模式运行：零样本 Zoo3D$_0$（根本不需要训练）和自监督 Zoo3D$_1$（通过在 Zoo3D$_0$ 生成的伪标签上训练与类无关的检测器来完善 3D 框预测）。此外，我们将 Zoo3D 扩展到点云之外，可以直接处理摆姿势甚至未摆姿势的图像。在 ScanNet200 和 ARKitScenes 基准测试中，Zoo3D$_0$ 和 Zoo3D$_1$ 在开放词汇 3D 对象检测中均取得了最先进的结果。值得注意的是，我们的零样本 Zoo3D$_0$ 优于所有现有的自监督方法，因此展示了免训练、现成的现实世界 3D 理解方法的强大功能和适应性。代码可在此 https URL 获取。

Title: The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

Authors: Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20256
Pdf URL: https://arxiv.org/pdf/2511.20256
Copy Paste: [[2511.20256]] The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation(https://arxiv.org/abs/2511.20256)
Keywords: generation
Abstract: A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
摘要：可靠的奖励函数对于图像生成中的强化学习 (RL) 至关重要。目前大多数强化学习方法都依赖于预先训练的偏好模型，该模型输出标量奖励来近似人类的偏好。然而，这些奖励通常无法捕捉人类的感知，并且容易受到奖励黑客攻击，即较高的分数并不对应于更好的图像。为了解决这个问题，我们引入了 Adv-GRPO，这是一种具有对抗性奖励的 RL 框架，可以迭代更新奖励模型和生成器。该奖励模型使用参考图像作为正样本进行监督，可以很大程度上避免被黑客攻击。与限制参数更新的 KL 正则化不同，我们学习到的奖励直接引导生成器通过其视觉输出，从而产生更高质量的图像。此外，虽然优化现有奖励函数可以减轻奖励黑客行为，但其固有的偏见仍然存在。例如，PickScore 可能会降低图像质量，而基于 OCR 的奖励通常会降低审美保真度。为了解决这个问题，我们将图像本身作为奖励，使用参考图像和视觉基础模型（例如 DINO）来提供丰富的视觉奖励。这些密集的视觉信号，而不是单个标量，可以在图像质量、美观和特定任务指标方面带来一致的增益。最后，我们表明，将参考样本与基础模型奖励相结合可以实现分布转移和灵活的风格定制。在人类评估中，我们的方法优于 Flow-GRPO 和 SD3，在图像质量和美观方面分别实现了 70.0% 和 72.4% 的胜率。代码和模型已经发布。

Title: Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

Authors: Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20280
Pdf URL: https://arxiv.org/pdf/2511.20280
Copy Paste: [[2511.20280]] Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement(https://arxiv.org/abs/2511.20280)
Keywords: generation
Abstract: Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
摘要：视频生成领域的最新进展带来了令人印象深刻的视觉质量，但当前的模型仍然难以产生符合现实世界物理原理的结果。为此，我们提出了一种迭代自我完善框架，利用大型语言模型和视觉语言模型为视频生成提供物理感知指导。具体来说，我们引入了多模式思想链（MM-CoT）流程，该流程根据物理不一致的反馈来完善提示，从而逐步提高生成质量。该方法无需训练且即插即用，使其易于适用于各种视频生成模型。 PhyIQ 基准测试表明，我们的方法将Physics-IQ 分数从 56.31 提高到了 62.38。我们希望这项工作能够作为物理一致视频生成的初步探索，并为未来的研究提供见解。
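
The refinement loop described above can be sketched as follows. The callables `generate`, `critique`, and `rewrite` are hypothetical placeholders for a video generator, a VLM that reports physical inconsistencies, and an LLM that rewrites the prompt. The abstract does not specify their interfaces or a stopping rule, so both are assumptions made for illustration.

```python
def refine_prompt(prompt, generate, critique, rewrite, max_rounds=3):
    """Iteratively rewrite a prompt until the critic reports no physics issues.

    All three callables are assumed interfaces, not the paper's API.
    """
    history = []
    for _ in range(max_rounds):
        video = generate(prompt)
        issues = critique(prompt, video)   # list of detected inconsistencies
        history.append((prompt, issues))
        if not issues:                     # nothing left to fix
            break
        prompt = rewrite(prompt, issues)   # fold the feedback into the prompt
    return prompt, history

# Toy stand-ins so the loop can be run without any model.
def toy_generate(prompt):
    return {"prompt": prompt}

def toy_critique(prompt, video):
    if "gravity" in video["prompt"]:
        return []
    return ["ball floats instead of falling"]

def toy_rewrite(prompt, issues):
    return prompt + " The ball falls under gravity."

final_prompt, history = refine_prompt(
    "A ball rolls off a table.", toy_generate, toy_critique, toy_rewrite
)
```

Because the loop only touches the prompt, it stays training-free and can wrap any generator that accepts text.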

Title: Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

Authors: Chao Wang, Chengan Che, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20295
Pdf URL: https://arxiv.org/pdf/2511.20295
Copy Paste: [[2511.20295]] Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations(https://arxiv.org/abs/2511.20295)
Keywords: generation
Abstract: Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For the counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing image-based CFE methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features: 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, and 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.
摘要：反事实解释（CFE）是对模型输入进行最小且语义上有意义的修改，从而改变模型的预测。它们强调了模型所依赖的决定性特征，为分类器提供了对比解释。最先进的视觉反事实解释方法旨在解释图像分类器。用于视频分类器的 CFE 的生成在很大程度上仍未得到充分探索。为了使反事实视频有用，它们必须在物理上合理、时间上连贯，并且表现出平滑的运动轨迹。现有的基于图像的 CFE 方法旨在解释图像分类器，但缺乏生成时间连贯、平滑且物理上合理的视频 CFE 的能力。为了解决这个问题，我们提出了 Back To The Feature (BTTF)，这是一个生成视频 CFE 的优化框架。我们的方法引入了两个新颖的功能，1）一种优化方案，用于检索由输入视频第一帧调节的初始潜在噪声，2）一种两阶段优化策略，用于搜索输入视频附近的反事实视频。两个优化过程均仅由目标分类器指导，确保解释的真实性。为了加速收敛，我们还引入了渐进式优化策略，逐步增加去噪步骤的数量。对 Shape-Moving（运动分类）、MEAD（情感分类）和 NTU RGB+D（动作分类）等视频数据集进行的大量实验表明，我们的 BTTF 有效地生成了有效的、视觉上相似且真实的反事实视频，为分类器的决策机制提供了具体的见解。

Title: MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers

Authors: Audrey Pei-Hsuan Chen
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2511.20382
Pdf URL: https://arxiv.org/pdf/2511.20382
Copy Paste: [[2511.20382]] MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers(https://arxiv.org/abs/2511.20382)
Keywords: generative
Abstract: Representation learning on multi-omics data is challenging due to extreme dimensionality, modality heterogeneity, and cohort-specific batch effects. While pre-trained transformer backbones have shown broad generalization capabilities in biological sequence modeling, their application to multi-omics integration remains underexplored. We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Unlike purely generative approaches, MoRE employs a parameter-efficient fine-tuning (PEFT) strategy, prioritizing cross-sample and cross-modality alignment over simple sequence reconstruction. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. It optimizes a masked modeling objective jointly with supervised contrastive and batch-invariant alignment losses, yielding structure-preserving embeddings that generalize across unseen cell types and platforms. We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with scArches, evaluating integration fidelity, rare population detection, and modality transfer. Our results demonstrate that MoRE achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. This work positions MoRE as a practical step toward general-purpose omics foundation models.
摘要：由于极端的维度、模态异质性和队列特定的批次效应，多组学数据的表示学习具有挑战性。虽然预训练的 Transformer 主干在生物序列建模中显示出广泛的泛化能力，但它们在多组学集成中的应用仍未得到充分探索。我们提出了 MoRE（多组学表示嵌入），这是一个框架，可以重新利用冷冻的预训练变压器，将异质分析对齐到共享的潜在空间中。与纯粹的生成方法不同，MoRE 采用参数有效的微调 (PEFT) 策略，优先考虑跨样本和跨模态对齐而不是简单的序列重建。具体来说，MoRE 将轻量级、特定于模态的适配器和任务自适应融合层附加到冻结的主干上。它结合监督对比和批次不变对齐损失来优化掩蔽建模目标，产生可泛化到看不见的细胞类型和平台的结构保留嵌入。我们根据既定基线（包括 scGPT、scVI 和 Harmony with scArches）对 MoRE 进行基准测试，评估集成保真度、稀有群体检测和模态转移。我们的结果表明，与完全微调的模型相比，MoRE 实现了具有竞争力的批次鲁棒性和生物保护，同时显着减少了可训练参数。这项工作将 MoRE 定位为迈向通用组学基础模型的实际一步。

Title: FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers

Authors: Xinwan Wen, Bowen Li, Jiajun Luo, Ye Li, Zhi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20390
Pdf URL: https://arxiv.org/pdf/2511.20390
Copy Paste: [[2511.20390]] FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers(https://arxiv.org/abs/2511.20390)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their acceleration is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs' feature dynamics and find that the features of the final transformer layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless acceleration with theoretical and empirical support. Meanwhile, the prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet-$512^2$ show that FREE achieves up to $1.86 \times$ acceleration, and FREE (relax) further reaches $2.25 \times$ speedup while maintaining high perceptual and quantitative fidelity in generation quality.
摘要：扩散变压器 (DiT) 实现了最先进的生成质量，但需要较长的连续去噪轨迹，导致推理延迟较高。最近的推测推理方法通过起草者-验证者方案在基于 U-Net 的扩散模型中实现无损并行采样，但由于验证期间草稿精度不足，它们的加速在 DiT 上受到限制。为了解决这个限制，我们分析了 DiT 的特征动态，发现最终变压器层（顶部块）的特征表现出很强的时间一致性和丰富的语义抽象。基于这一见解，我们提出了 FREE，这是一种新颖的框架，它采用轻量级绘图器来执行具有并行验证的特征级自回归，在理论和经验支持下保证无损加速。同时，DiT 的预测方差（不确定性）在后续的去噪步骤中自然会增加，从而降低了推测采样下的接受率。为了减轻这种影响，我们进一步引入了一种不确定性引导的松弛策略，形成FREE（松弛），它根据不确定性水平动态调整接受概率。在 ImageNet-$512^2$ 上的实验表明，FREE 实现了高达 $1.86 \times$ 的加速，FREE (relax) 进一步达到了 $2.25 \times$ 的加速，同时保持了生成质量的高感知和定量保真度。

Title: A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control

Authors: Jiawei Lin, Guanlong Jiao, Jianjin Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20401
Pdf URL: https://arxiv.org/pdf/2511.20401
Copy Paste: [[2511.20401]] A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control(https://arxiv.org/abs/2511.20401)
Keywords: generation
Abstract: Multi-ID customization is an interesting topic in computer vision and has attracted considerable attention recently. Given the ID images of multiple individuals, its purpose is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, since the multi-ID customization model is trained to reconstruct an image from the cropped person regions, it often encounters the copy-paste issue during inference, leading to lower quality. Second, the model also suffers from inferior text controllability: the generated result simply combines multiple persons into one image, regardless of whether it is aligned with the input text. In this work, we propose MultiID to tackle this challenging task in a training-free manner. Since the existing single-ID customization models suffer less from the copy-paste issue, our key idea is to adapt these models to achieve multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism, injecting distinct ID embeddings into the corresponding image regions and thus generating multi-ID outputs. To enhance the generation controllability, we introduce three critical strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, making the results more consistent with the text prompts and ID images. We also carefully build a benchmark, called IDBench, for evaluation. The extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in solving the aforementioned two challenges. Its performance is comparable to or even better than that of training-based multi-ID customization methods.
摘要：多ID定制是计算机视觉中一个有趣的话题，最近引起了相当多的关注。给定多个人的身份证图像，其目的是生成一个定制的图像，将他们无缝集成，同时保留他们各自的身份。与单ID定制相比，多ID定制难度要大得多，并带来两大挑战。首先，由于多ID定制模型被训练为从裁剪的人物区域重建图像，因此在推理过程中经常遇到复制粘贴问题，导致质量较低。其次，该模型还存在文本可控性较差的问题。生成的结果只是将多个人组合成一张图像，而不管它是否与输入文本对齐。在这项工作中，我们提出 MultiID 以免训练的方式解决这一具有挑战性的任务。由于现有的单ID定制模型较少存在复制粘贴问题，因此我们的关键思想是调整这些模型以实现多ID定制。为此，我们提出了一种 ID 解耦的交叉注意机制，将不同的 ID 嵌入注入相应的图像区域，从而生成多 ID 输出。为了增强生成的可控性，我们引入了三种关键策略，即局部提示、深度引导空间控制和扩展自注意力，使结果与文本提示和ID图像更加一致。我们还精心建立了一个基准，称为 IDBench，用于评估。广泛的定性和定量结果证明了 MultiID 在解决上述两个挑战方面的有效性。其性能与基于训练的多ID定制方法相当甚至更好。

Title: Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

Authors: Bao Tang, Shuai Zhang, Yueting Zhu, Jijun Xiang, Xin Yang, Li Yu, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20410
Pdf URL: https://arxiv.org/pdf/2511.20410
Copy Paste: [[2511.20410]] Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs(https://arxiv.org/abs/2511.20410)
Keywords: generation
Abstract: Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: this https URL.
摘要：时间步蒸馏是提高扩散模型生成效率的有效方法。一致性模型（CM）作为一种基于轨迹的框架，由于其强大的理论基础和高质量的少步生成而展现出巨大的潜力。然而，当前的连续时间一致性蒸馏方法仍然严重依赖训练数据和计算资源，阻碍了它们在资源受限场景中的部署，并限制了它们在不同领域的可扩展性。为了解决这个问题，我们提出了轨迹向后一致性模型（TBCM），它通过直接从教师模型的生成轨迹中提取潜在表示来消除对外部训练数据的依赖。与需要 VAE 编码和大规模数据集的传统方法不同，我们独立的蒸馏范例显着提高了效率和简单性。此外，轨迹提取的样本自然地弥合了训练和推理之间的分布差距，从而实现更有效的知识转移。根据经验，TBCM 在一步生成下在 MJHQ-30k 上实现了 6.52 FID 和 28.08 CLIP 分数，同时与 Sana-Sprint 相比减少了约 40% 的训练时间，并节省了大量 GPU 内存，在不牺牲质量的情况下展示了卓越的效率。我们进一步揭示了连续时间稠度蒸馏中的扩散生成空间差异，并分析了采样策略如何影响蒸馏性能，为未来的蒸馏研究提供了见解。 GitHub 链接：此 https URL。

Title: MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

Authors: Zilong Huang, Jun He, Xiaobin Huang, Ziyi Xiong, Yang Luo, Junyan Ye, Weijia Li, Yiping Chen, Ting Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20415
Pdf URL: https://arxiv.org/pdf/2511.20415
Copy Paste: [[2511.20415]] MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts(https://arxiv.org/abs/2511.20415)
Keywords: generation
Abstract: Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must be stylistically diverse, fine-grained, and controllable. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% compared with CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect that our framework can inspire new avenues of research in 3D city generation. Our dataset and code will be released at this https URL.
摘要：生成逼真的3D城市是世界模型、虚拟现实和游戏开发的基础，理想的城市场景必须满足风格多样性、细粒度和可控性。然而，现有的方法很难平衡基于文本的生成提供的创造性灵活性与显式结构表示实现的对象级可编辑性。我们介绍 MajutsuCity，这是一种自然语言驱动且具有美学适应性的框架，用于合成结构一致且风格多样的 3D 城市场景。 MajutsuCity 将城市描述为可控布局、资产和材料的组合，并通过四阶段管道运营。为了将可控性扩展到初始生成之外，我们进一步集成了 MajutsuAgent，这是一种基于交互式语言的编辑代理，支持五种对象级操作。为了支持逼真和可定制的场景合成，我们还构建了 MajutsuDataset，这是一个高质量的多模式数据集，其中包含 2D 语义布局和高度图、各种 3D 建筑资源以及精心策划的 PBR 材质和天空盒，每个都附有详细的注释。同时，我们制定了一套实用的评估指标，涵盖结构一致性、场景复杂性、材质保真度、灯光氛围等关键维度。大量实验表明，MajutsuCity 与 CityDreamer 相比，布局 FID 减少了 83.7%，比 CityCraft 减少了 20.1%。我们的方法在所有 AQS 和 RDR 分数中排名第一，明显优于现有方法。这些结果证实 MajutsuCity 是 3D 城市生成的几何保真度、风格适应性和语义可控性方面的最新技术。我们期望我们的框架能够激发 3D 城市生成的新研究途径。我们的数据集和代码将在此 https URL 发布。

Title: Block Cascading: Training Free Acceleration of Block-Causal Video Models

Authors: Hmrishav Bandyopadhyay, Nikhil Pinnaparaju, Rahim Entezari, Jim Scott, Yi-Zhe Song, Varun Jampani
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20426
Pdf URL: https://arxiv.org/pdf/2511.20426
Copy Paste: [[2511.20426]] Block Cascading: Training Free Acceleration of Block-Causal Video Models(https://arxiv.org/abs/2511.20426)
Keywords: generation
Abstract: Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: this https URL
摘要：块因果视频生成面临着明显的速度与质量权衡：小型 1.3B 模型只能管理 16 FPS，而大型 14B 模型则以 4.5 FPS 爬行，迫使用户在响应能力和质量之间做出选择。块级联通过免训练并行化显着减轻了这种权衡。我们的主要见解：未来的视频块不需要完全降噪的当前块来开始生成。通过使用来自前辈的部分去噪上下文开始块生成，我们将顺序管道转换为并行级联，其中多个块同时去噪。通过利用时间并行性的 5 个 GPU，我们在所有模型规模上实现了大约 2 倍的加速：1.3B 模型从 16 FPS 加速到 30 FPS，14B 模型从 4.5 FPS 加速到 12.5 FPS。除了推理速度之外，块级联还消除了交互式生成的上下文切换期间 KV 重新缓存（约 200 毫秒）的开销。针对多个块因果管道进行的广泛评估表明，从块因果管道切换到块级联管道进行推理时，生成质量没有显着损失。项目页面：此 https URL
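
The scheduling idea above can be made concrete with a small timing model. It is a simplified sketch under stated assumptions, not the paper's implementation: every denoising step costs one time unit, a block may start once its predecessor has finished `start_after` steps, and enough devices are available to run all overlapping blocks at once. The function and parameter names are hypothetical.

```python
def sequential_steps(num_blocks, steps_per_block):
    """Time units when each block waits for the previous one to finish."""
    return num_blocks * steps_per_block

def cascaded_steps(num_blocks, steps_per_block, start_after):
    """Time units when block i starts after block i-1 completed `start_after` steps.

    Assumes unlimited parallel devices and uniform step cost.
    """
    if not 1 <= start_after <= steps_per_block:
        raise ValueError("start_after must be between 1 and steps_per_block")
    # The last block starts at (num_blocks - 1) * start_after and then
    # needs its own steps_per_block steps to finish.
    return (num_blocks - 1) * start_after + steps_per_block

# Toy configuration: 10 blocks, 4 denoising steps each, start after 1 step.
toy_speedup = sequential_steps(10, 4) / cascaded_steps(10, 4, 1)

# Ratios implied by the FPS figures quoted in the abstract.
small_model_ratio = 30 / 16      # 1.3B model: 16 -> 30 FPS
large_model_ratio = 12.5 / 4.5   # 14B model: 4.5 -> 12.5 FPS
```

Setting `start_after` equal to `steps_per_block` recovers the sequential schedule. The FPS figures in the abstract correspond to ratios of about 1.9x for the 1.3B model and about 2.8x for the 14B model.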

Title: BRIC: Bridging Kinematic Plans and Physical Control at Test Time

Authors: Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.20431
Pdf URL: https://arxiv.org/pdf/2511.20431
Copy Paste: [[2511.20431]] BRIC: Bridging Kinematic Plans and Physical Control at Test Time(https://arxiv.org/abs/2511.20431)
Keywords: generation
Abstract: We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.
摘要：我们提出了 BRIC，一种新颖的测试时间适应（TTA）框架，它通过解决基于扩散的运动学运动规划器和基于强化学习的物理控制器之间的执行差异来实现长期人体运动生成。虽然扩散模型可以根据文本和场景上下文生成多样化且富有表现力的运动，但它们通常会产生物理上难以置信的输出，从而导致模拟过程中的执行漂移。为了解决这个问题，BRIC 在测试时动态地使物理控制器适应嘈杂的运动计划，同时通过减轻灾难性遗忘的损失函数保留预先训练的技能。此外，BRIC还引入了一种轻量级的测试时引导机制，可以在不更新参数的情况下引导信号空间中的扩散模型。通过结合两种适应策略，BRIC 确保以有效和高效的方式在不同环境中实现一致且物理上合理的长期执行。我们验证了 BRIC 在各种长期任务上的有效性，包括运动合成、避障和人景交互，在所有任务中实现了最先进的性能。

Title: Diffusion for Fusion: Designing Stellarators with Generative AI

Authors: Misha Padidar, Teresa Huang, Andrew Giuliani, Marina Spivak
Subjects: cs.LG, physics.plasm-ph
Abstract URL: https://arxiv.org/abs/2511.20445
Pdf URL: https://arxiv.org/pdf/2511.20445
Copy Paste: [[2511.20445]] Diffusion for Fusion: Designing Stellarators with Generative AI(https://arxiv.org/abs/2511.20445)
Keywords: generative
Abstract: Stellarators are a prospective class of fusion-based power plants that confine a hot plasma with three-dimensional magnetic fields. Typically framed as a PDE-constrained optimization problem, stellarator design is a time-consuming process that can take hours to solve on a computing cluster. Developing fast methods for designing stellarators is crucial for advancing fusion research. Given the recent development of large datasets of optimized stellarators, machine learning approaches have emerged as a potential candidate. Motivated by this, we present an open inverse problem to the machine learning community: to rapidly generate high-quality stellarator designs that have a set of desirable characteristics. As a case study in the problem space, we train a conditional diffusion model on data from the QUASR database to generate quasisymmetric stellarator designs with desirable characteristics (aspect ratio and mean rotational transform). The diffusion model is applied to design stellarators with characteristics not seen during training. We provide evaluation protocols and show that many of the generated stellarators exhibit solid performance: less than 5% deviation from quasisymmetry and the target characteristics. The modest deviation from quasisymmetry highlights an opportunity to reach the sub-1% target. Beyond the case study, we share multiple promising avenues for generative modeling to advance stellarator design.
摘要：仿星器是一类有前景的基于聚变的发电厂，它用三维磁场限制热等离子体。仿星器设计通常被视为偏微分方程约束的优化问题，是一个耗时的过程，在计算集群上可能需要数小时才能解决。开发快速设计仿星器的方法对于推进聚变研究至关重要。鉴于最近优化仿星器大型数据集的发展，机器学习方法已成为潜在的候选者。受此启发，我们向机器学习社区提出了一个开放的逆问题：快速生成具有一组理想特性的高质量仿星器设计。作为问题空间中的案例研究，我们使用 QUASR 数据库中的数据训练条件扩散模型，以生成具有所需特性（纵横比和平均旋转变换）的准对称仿星器设计。扩散模型用于设计具有训练期间未见特征的仿星器。我们提供了评估协议，并表明许多生成的仿星器表现出可靠的性能：与准对称性和目标特性的偏差小于 5%。与准对称性的适度偏差凸显了实现低于 1% 目标的机会。除了案例研究之外，我们还分享了多种有前景的生成模型途径，以推进仿星器设计。

Title: Learning to Generate Human-Human-Object Interactions from Textual Descriptions

Authors: Jeonghyeon Na, Sangwon Baik, Inhee Lee, Junyoung Lee, Hanbyul Joo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20446
Pdf URL: https://arxiv.org/pdf/2511.20446
Copy Paste: [[2511.20446]] Learning to Generate Human-Human-Object Interactions from Textual Descriptions(https://arxiv.org/abs/2511.20446)
Keywords: generation, generative
Abstract: The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem to model the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOI dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train text-to-HOI and text-to-HHI models using a score-based diffusion model. Finally, we present a unified generative framework that integrates the two individual models and is capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.
摘要：人类彼此互动的方式，包括人际距离、空间配置和运动，在不同情况下有很大差异。为了使机器能够理解这种复杂的、依赖于上下文的行为，有必要对与周围场景上下文相关的多个人进行建模。在本文中，我们提出了一个新的研究问题来模拟参与涉及对象的共享交互的两个人之间的相关性。我们将这种表述称为人-人-物交互（HHOIs）。为了克服缺乏 HHOI 专用数据集的问题，我们提出了一个新捕获的 HHOI 数据集以及一种利用图像生成模型合成 HHOI 数据的方法。作为中介，我们从 HHOI 中获取个体人与对象交互 (HOI) 和人与人交互 (HHI)，并利用这些数据，使用基于分数的扩散模型训练文本到 HOI 和文本到 HHI 模型。最后，我们提出了一个统一的生成框架，该框架集成了两个单独的模型，能够在单个高级采样过程中合成完整的 HHOI。我们的方法将 HHOI 生成扩展到多人环境，从而实现涉及两个以上个体的交互。实验结果表明，我们的方法根据文本描述生成现实的 HHOI，优于之前仅关注单人 HOI 的方法。此外，我们引入了涉及对象的多人运动生成作为我们框架的应用。

Title: Towards Trustworthy Wi-Fi Sensing: Systematic Evaluation of Deep Learning Model Robustness to Adversarial Attacks

Authors: Shreevanth Krishnaa Gopalakrishnan, Stephen Hailes
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20456
Pdf URL: https://arxiv.org/pdf/2511.20456
Copy Paste: [[2511.20456]] Towards Trustworthy Wi-Fi Sensing: Systematic Evaluation of Deep Learning Model Robustness to Adversarial Attacks(https://arxiv.org/abs/2511.20456)
Keywords: generation
Abstract: Machine learning has become integral to Channel State Information (CSI)-based human sensing systems and is expected to power applications such as device-free activity recognition and identity detection in future cellular and Wi-Fi generations. However, these systems rely on models whose decisions can be subtly perturbed, raising concerns for security and reliability in ubiquitous sensing. Quantifying and understanding the robustness of such models, defined as their ability to maintain accurate predictions under adversarial perturbations, is therefore critical before wireless sensing can be safely deployed in real-world environments. This work presents a systematic evaluation of the robustness of CSI deep learning models under diverse threat models (white-box, black-box/transfer, and universal perturbations) and varying degrees of attack realism. We establish a framework to compare compact temporal autoencoder models with larger deep architectures across three public datasets, quantifying how model scale, training regime, and physical constraints influence robustness. Our experiments show that smaller models, while efficient and equally performant on clean data, are markedly less robust. We further confirm that physically realizable signal-space perturbations, designed to be feasible in real wireless channels, significantly reduce attack success compared to unconstrained feature-space attacks. Adversarial training mitigates these vulnerabilities, improving mean robust accuracy with only moderate degradation in clean performance across both model classes. As wireless sensing advances towards reliable, cross-domain operation, these findings provide quantitative baselines for robustness estimation and inform design principles for secure and trustworthy human-centered sensing systems.
摘要：机器学习已成为基于信道状态信息 (CSI) 的人类传感系统不可或缺的一部分，并有望为未来蜂窝和 Wi-Fi 世代中的无设备活动识别和身份检测等应用提供动力。然而，这些系统所依赖的模型的决策可能会受到微妙的干扰，引发了人们对普适传感的安全性和可靠性的担忧。因此，在无线传感能够安全部署在现实环境中之前，量化和理解此类模型的鲁棒性（定义为它们在对抗性扰动下保持准确预测的能力）至关重要。这项工作对 CSI 深度学习模型在不同威胁模型（白盒、黑盒/传输和通用扰动）和不同程度的攻击现实性下的鲁棒性进行了系统评估。我们建立了一个框架来比较紧凑的时间自动编码器模型与跨三个公共数据集的更大的深度架构，量化模型规模、训练机制和物理约束如何影响鲁棒性。我们的实验表明，较小的模型虽然在干净数据上高效且性能相同，但稳健性明显较差。我们进一步证实，与不受约束的特征空间攻击相比，物理上可实现的信号空间扰动（设计为在真实无线信道中可行）会显着降低攻击成功率。对抗性训练缓解了这些漏洞，提高了平均鲁棒精度，而两个模型类别的清洁性能仅适度下降。随着无线传感朝着可靠、跨域操作的方向发展，这些发现为鲁棒性估计提供了定量基线，并为安全可靠的以人为中心的传感系统的设计原则提供了信息。

Title: STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

Authors: Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20462
Pdf URL: https://arxiv.org/pdf/2511.20462
Copy Paste: [[2511.20462]] STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow(https://arxiv.org/abs/2511.20462)
Keywords: generation, generative
Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at this https URL.
摘要：标准化流（NF）是基于端到端可能性的连续数据生成模型，最近随着图像生成方面的令人鼓舞的进展重新受到关注。然而，在视频生成领域，时空复杂性和计算成本要高得多，最先进的系统几乎完全依赖于基于扩散的模型。在这项工作中，我们通过展示 STARFlow-V 重新审视这个设计空间，这是一种基于流的归一化视频生成器，具有端到端学习、强大的因果预测和本机似然估计等显着优势。 STARFlow-V 以最近提出的 STARFlow 为基础，在时空潜在空间中运行，具有全局局部架构，该架构将因果依赖性限制在全局潜在空间中，同时保留丰富的局部帧内交互。这可以缓解随着时间的推移而积累的误差，这是标准自回归扩散模型生成的常见陷阱。此外，我们提出了流分数匹配，它为模型配备了轻量级因果降噪器，以自回归方式提高视频生成的一致性。为了提高采样效率，STARFlow-V 采用视频感知雅可比迭代方案，将内部更新重新构建为可并行迭代，而不会破坏因果关系。由于可逆结构，同一模型可以原生支持文本到视频、图像到视频以及视频到视频生成任务。根据经验，STARFlow-V 相对于基于扩散的基线，通过实际采样吞吐量实现了强大的视觉保真度和时间一致性。据我们所知，这些结果首次证明 NF 能够生成高质量的自回归视频，使它们成为构建世界模型的一个有前途的研究方向。代码和生成的示例可从此 https URL 获取。
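
The Jacobi-style sampling mentioned above can be illustrated with a generic sketch. It shows plain Jacobi fixed-point iteration on a causal (autoregressive) recurrence, not the video-aware scheme of the paper. The `step` function is a hypothetical stand-in for one causal update, and all names are assumptions. Each Jacobi sweep recomputes every position from the previous iterate, so the positions within a sweep are independent and could run in parallel. Because position `i` only depends on earlier positions, the result matches sequential decoding after at most `length` sweeps.

```python
def sequential_decode(step, length):
    """Reference: compute positions one by one, each from its true prefix."""
    xs = []
    for i in range(length):
        xs.append(step(xs, i))
    return xs

def jacobi_decode(step, length, init=0.0, tol=0.0):
    """Jacobi iteration: update all positions from the previous iterate.

    Returns the final sequence and the number of sweeps performed.
    """
    xs = [init] * length
    for sweep in range(1, length + 1):
        # Every entry below depends only on the old `xs`, so the list
        # comprehension is conceptually a parallel update.
        new_xs = [step(xs[:i], i) for i in range(length)]
        if max(abs(a - b) for a, b in zip(new_xs, xs)) <= tol:
            return new_xs, sweep      # fixed point reached early
        xs = new_xs
    return xs, length

def toy_step(prefix, i):
    """Hypothetical causal update: depends only on earlier positions."""
    return 0.5 * sum(prefix) + 1.0

reference = sequential_decode(toy_step, 4)
parallel, sweeps = jacobi_decode(toy_step, 4)
```

In practice the benefit comes from stopping early once the iterate is within a tolerance, which trades a small number of parallel sweeps for many sequential steps.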

Title: DesignPref: Capturing Personal Preferences in Visual Design Generation

Authors: Yi-Hao Peng, Jeffrey P. Bigham, Jason Wu
Subjects: cs.CV, cs.AI, cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2511.20513
Pdf URL: https://arxiv.org/pdf/2511.20513
Copy Paste: [[2511.20513]] DesignPref: Capturing Personal Preferences in Visual Design Generation(https://arxiv.org/abs/2511.20513)
Keywords: generation, generative
Abstract: Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preferences vary widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of the importance of various design aspects and from individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and supports future research into modeling individual design taste.
摘要：大型语言模型和文本到图像扩散模型等生成模型越来越多地用于创建用户界面 (UI) 和演示幻灯片等视觉设计。这些生成模型的微调和基准测试通常依赖于人类注释的设计偏好的数据集。然而，由于视觉设计的主观性和高度个性化的性质，个人之间的偏好差异很大。在本文中，我们通过引入 DesignPref 来研究这个问题，DesignPref 是一个由 20 位具有多级偏好评级的专业设计师注释的 UI 设计生成的 12k 成对比较的数据集。我们发现，在训练有素的设计师中，存在很大程度的分歧（对于二元偏好，Krippendorff 的 alpha = 0.25）。这些设计师提供的自然语言原理表明，分歧源于对各种设计方面重要性和个人偏好的不同看法。通过 DesignPref，我们证明了用于训练聚合判断模型的传统多数投票方法通常不能准确反映个人偏好。为了应对这一挑战，我们研究了多种个性化策略，特别是微调或将设计者特定的注释合并到 RAG 管道中。我们的结果表明，即使使用的示例数量减少了 20 倍，个性化模型在预测个体设计师偏好方面始终优于聚合基线模型。我们的工作提供了第一个数据集来研究个性化视觉设计评估并支持未来对个人设计品味建模的研究。
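
Illustrative note: the abstract reports Krippendorff's alpha = 0.25 for binary preferences. As a minimal sketch of how such an agreement score can be computed for nominal labels, the function below uses the standard coincidence-matrix formulation. The function name, the input layout (one list of labels per rated item) and the toy data are assumptions for illustration; the abstract does not say how the authors computed their value.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list with one entry per rated item; each entry is the list
    of labels that item received (missing ratings are simply left out).
    Items with fewer than two labels carry no agreement information.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Each ordered pair of labels from different raters counts 1 / (m - 1).
        for c, k in permutations(ratings, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean systematic disagreement. For example, `krippendorff_alpha_nominal([[0, 0], [1, 1]])` gives 1.0, while two items that each receive one 0 and one 1 give -0.5.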

Title: HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

Authors: Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou, Qing Liu, Shiwei Zhang, Yijun Li, Shaoteng Liu, Haitian Zheng, Jason Kuen, Yuehuan Wang, Changxin Gao, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20520
Pdf URL: https://arxiv.org/pdf/2511.20520
Copy Paste: [[2511.20520]] HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation(https://arxiv.org/abs/2511.20520)
Keywords: generation, generative
Abstract: Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
摘要：最近的统一模型将理解专家（例如法学硕士）与生成专家（例如扩散模型）结合起来，实现了强大的多模式性能。然而，最近的先进方法，如 BAGEL 和 LMFusion，遵循变压器混合 (MoT) 范式，采用对称设计，将一位专家镜像到另一位专家，以方便初始化和融合，但由于固有的模态差异，这种设计仍然不是最理想的。在这项工作中，我们提出了 HBridge，一种非对称 H 形架构，使异构专家能够最佳地利用来自各自模态域的预训练先验。与之前通过共享注意力直接连接专家之间的所有层的密集融合策略不同，HBridge 有选择地桥接中间层，减少了 40% 以上的注意力共享，从而提高了效率并提高了生成质量。捕获特定模态表示的浅层和深层是解耦的，而中间层桥接则促进语义对齐。为了进一步加强跨模态一致性，我们引入了语义重建标记，明确指导生成专家重建目标图像的视觉语义标记。跨多个基准的大量实验证明了 HBridge 的有效性和卓越性能，为统一多模态生成建立了新的范例。

Title: Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Authors: Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20549
Pdf URL: https://arxiv.org/pdf/2511.20549
Copy Paste: [[2511.20549]] Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning(https://arxiv.org/abs/2511.20549)
Keywords: generation, generative
Abstract: Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 at only $2.1\%$ of its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Codes are coming soon.
摘要：扩散模型已成为生成模型的领先类别，但其迭代采样过程的计算成本仍然很高。时间步长蒸馏是一种很有前景的加速生成技术，但它通常需要大量训练并导致图像质量下降。此外，使用强化学习（RL）针对特定目标（例如审美吸引力或用户偏好）对这些蒸馏模型进行微调是出了名的不稳定，并且很容易陷入奖励黑客行为。在这项工作中，我们介绍了 Flash-DMD，这是一种新颖的框架，可以通过蒸馏和基于 RL 的联合细化实现快速收敛。具体来说，我们首先提出了一种有效的时间步感知蒸馏策略，该策略在增强真实性的同时显着降低了训练成本，仅用 DMD2 训练成本的 $2.1\%$ 即优于 DMD2。其次，我们引入了一种联合训练方案，其中模型使用 RL 目标进行微调，同时时间步蒸馏训练继续进行。我们证明，持续蒸馏产生的稳定、明确的损失可以作为强大的正则化器，有效稳定 RL 训练过程并防止策略崩溃。基于分数和流匹配模型的大量实验表明，我们提出的 Flash-DMD 不仅收敛速度明显更快，而且在少步采样机制中实现了最先进的生成质量，在视觉质量、人类偏好和文本图像对齐指标方面优于现有方法。我们的工作为训练高效、高保真和稳定的生成模型提供了一个有效的范例。代码即将推出。

Title: Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Authors: Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2511.20561
Pdf URL: https://arxiv.org/pdf/2511.20561
Copy Paste: [[2511.20561]] Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward(https://arxiv.org/abs/2511.20561)
Keywords: generation, generative
Abstract: Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at this https URL.
摘要：近年来，统一多模式模型取得了重大进展，但仍然存在一个基本问题：理解真的能够为一代人提供信息吗？为了研究这个问题，我们引入了 UniSandbox，这是一个解耦的评估框架，与受控的合成数据集配对，以避免数据泄漏并实现详细分析。我们的研究结果揭示了显着的理解代沟，这主要体现在两个关键维度：推理生成和知识转移。具体来说，对于推理生成任务，我们观察到理解模块中的显式思维链（CoT）有效地弥补了这一差距，并进一步证明自我训练方法可以成功地内化这种能力，从而在生成过程中实现隐式推理。此外，对于知识转移任务，我们发现 CoT 通过帮助检索新学习的知识来协助生成过程，并且还发现基于查询的架构本质上表现出影响这种转移的潜在类似 CoT 的属性。 UniSandbox 为设计未来统一架构和培训策略提供了初步见解，真正弥合了理解和生成之间的差距。代码和数据可在此 https URL 获取

Title: PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

Authors: Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang, Hui Li, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20562
Pdf URL: https://arxiv.org/pdf/2511.20562
Copy Paste: [[2511.20562]] PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding(https://arxiv.org/abs/2511.20562)
Keywords: generation
Abstract: While recent video generation models have achieved significant visual fidelity, they often suffer from a lack of explicit physical controllability and plausibility. To address this, some recent studies have attempted to guide video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively controlling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.
摘要：虽然最近的视频生成模型已经实现了显着的视觉保真度，但它们常常缺乏明确的物理可控性和合理性。为了解决这个问题，最近的一些研究试图通过基于物理的渲染来指导视频生成。然而，这些方法在准确建模复杂的物理特性和有效控制扩展时间序列上产生的物理行为方面面临着固有的挑战。在这项工作中，我们介绍了 PhysChoreo，这是一种新颖的框架，可以从单个图像生成具有多种可控性和物理真实感的视频。我们的方法由两个阶段组成：首先，它通过部分感知的物理属性重建来估计图像中所有对象的静态初始物理属性。然后，通过时间指导和物理可编辑的模拟，它合成具有丰富动态行为和物理真实感的高质量视频。实验结果表明，PhysChoreo 可以生成具有丰富行为和物理真实感的视频，在多个评估指标上优于最先进的方法。

Title: A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

Authors: Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai, Hao Fei, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20563
Pdf URL: https://arxiv.org/pdf/2511.20563
Copy Paste: [[2511.20563]] A Reason-then-Describe Instruction Interpreter for Controllable Video Generation(https://arxiv.org/abs/2511.20563)
Keywords: generation
Abstract: Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: this https URL.
摘要：扩散变压器显着提高了视频保真度和时间一致性，但实际的可控性仍然有限。简洁、模糊且结构复杂的用户输入与训练中使用的详细提示形成对比，导致意图输出不匹配。我们提出了 ReaDe，这是一种与模型无关的通用解释器，可将原始指令转换为下游视频生成器的精确、可操作的规范。 ReaDe 遵循先推理后描述的范式：它首先分析用户请求，以确定核心需求并解决歧义，然后生成详细的指导，以实现忠实、可控的生成。我们通过两阶段优化来训练 ReaDe：(i) 推理增强监督通过逐步跟踪和密集字幕进行分析解析，(ii) 多维奖励分配器可以对自然风格字幕进行稳定、反馈驱动的细化。跨单条件和多条件场景的实验表明，指令保真度、字幕准确性和下游视频质量得到了一致的提升，并且对推理密集型和看不见的输入具有很强的泛化能力。 ReaDe 提供了一种将可控视频生成与准确解释的用户意图结合起来的实用途径。项目页面：此 https URL。

Title: DINO-Tok: Adapting DINO for Visual Tokenizers

Authors: Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo, Zeming Li, Xiaoyang Guo, Xiao-Xiao Long, Qian Zhang, Ping Tan, Wei Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20565
Pdf URL: https://arxiv.org/pdf/2511.20565
Copy Paste: [[2511.20565]] DINO-Tok: Adapting DINO for Visual Tokenizers(https://arxiv.org/abs/2511.20565)
Keywords: generation, generative
Abstract: Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256$\times$256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and comparable to billion-level data trained models (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization enables semantically aligned and high-fidelity latent representations, enabling next-generation visual generative models. Code will be publicly available at this https URL.
摘要：视觉生成领域的最新进展凸显了潜在生成模型（LGM）的兴起，它依赖于有效的视觉分词器来桥接像素和语义。然而，现有的分词器通常是从头开始训练的，很难平衡语义表示和重建保真度，特别是在高维潜在空间中。在这项工作中，我们引入了 DINO-Tok，这是一种基于 DINO 的视觉分词器，它将分层表示统一到信息完整的潜在空间中。通过将保留细粒度细节的浅层特征与编码全局语义的深层特征相集成，DINO-Tok 有效地连接了预训练表示和视觉生成。我们进一步分析了矢量量化（VQ）在这个高维空间中的挑战，其中关键信息经常丢失并且发生码本崩溃。因此，我们提出了一种全局 PCA 重新加权机制来稳定 VQ 并保留跨维度的基本信息。在 ImageNet 256$\times$256 上，DINO-Tok 实现了最先进的重建性能，自动编码达到 28.54 PSNR，基于 VQ 的建模达到 23.98 PSNR，显着优于先前的分词器，可与十亿级数据训练的模型（例如 Hunyuan 和 Wan）相媲美。这些结果表明，采用 DINO 等强大的预训练视觉模型进行标记化可以实现语义对齐和高保真潜在表示，从而实现下一代视觉生成模型。代码将在此 https URL 公开提供。
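
Illustrative note: the reconstruction results above are reported as PSNR (peak signal-to-noise ratio). As a reminder of what that number measures, the sketch below computes PSNR = 10 * log10(MAX^2 / MSE) over flat sequences of pixel values. The function name `psnr`, the flat-list input, and the default peak value of 255 are assumptions for illustration; the abstract does not state the evaluation protocol (color space, value range, averaging over images).

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    `max_value` is the largest possible pixel value (255 for 8-bit images,
    1.0 for images scaled to [0, 1]). Higher PSNR means a closer reconstruction.
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no error to measure
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, every 10 dB corresponds to a tenfold reduction in mean squared error, so gaps of a few dB between tokenizers reflect sizeable differences in pixel-level error.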

Title: Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models

Authors: Karim Kadry, Abdallah Abdelwahed, Shoaib Goraya, Ajay Manicka, Naravich Chutisilp, Farhad Nezami, Elazer Edelman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20587
Pdf URL: https://arxiv.org/pdf/2511.20587
Copy Paste: [[2511.20587]] Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models(https://arxiv.org/abs/2511.20587)
Keywords: generation
Abstract: We present Anatomica: an inference-time framework for generating multi-class anatomical voxel maps with localized geo-topological control. During generation, we use cuboidal control domains of varying dimensionality, location, and shape to slice out relevant substructures. These local substructures are used to compute differentiable penalty functions that steer the sample towards target constraints. We control geometric features such as size, shape, and position through voxel-wise moments, while topological features such as connected components, loops, and voids are enforced through persistent homology. Lastly, we implement Anatomica for latent diffusion models, where neural field decoders partially extract substructures, enabling the efficient control of anatomical properties. Anatomica applies flexibly across diverse anatomical systems, composing constraints to control complex structures over arbitrary dimensions and coordinate systems, thereby enabling the rational design of synthetic datasets for virtual trials or machine learning workflows.
摘要：我们提出了 Anatomica：一个推理时间框架，用于生成具有局部地理拓扑控制的多类解剖体素图。在生成过程中，我们使用不同维度、位置和形状的立方体控制域来分割相关的子结构。这些局部子结构用于计算可微分惩罚函数，将样本引导至目标约束。我们通过体素矩控制尺寸、形状和位置等几何特征，而连接组件、环路和空隙等拓扑特征则通过持久同源性来强制执行。最后，我们为潜在扩散模型实现了 Anatomica，其中神经场解码器部分提取子结构，从而能够有效控制解剖特性。 Anatomica 灵活地应用于不同的解剖系统，构成约束来控制任意维度和坐标系上的复杂结构，从而实现虚拟试验或机器学习工作流程的合成数据集的合理设计。

Title: Latent Diffusion Inversion Requires Understanding the Latent Space

Authors: Mingxing Rao, Bowen Qu, Daniel Moyer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.20592
Pdf URL: https://arxiv.org/pdf/2511.20592
Copy Paste: [[2511.20592]] Latent Diffusion Inversion Requires Understanding the Latent Space(https://arxiv.org/abs/2511.20592)
Keywords: generative
Abstract: The recovery of training data from generative models ("model inversion") has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding latent codes have largely been ignored by inversion techniques applied to latent space generative models, e.g., Latent Diffusion Models (LDMs). In this work we describe two key findings: (1) The diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric. (2) Even within a single latent code, different dimensions contribute unequally to memorization. We introduce a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying those most responsible for memorization. Empirically, removing less-memorizing dimensions when computing attack statistics for score-based membership inference attackers significantly improves performance, with average AUROC gains of 2.7% and substantial increases in TPR@1%FPR (6.42%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokémon, MS-COCO, and Flickr. This indicates stronger confidence in identifying members under extremely low false-positive tolerance. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.
摘要：从生成模型恢复训练数据（“模型反演”）已针对数据域中的扩散模型进行了广泛的研究。编码器/解码器对和相应的潜在代码在很大程度上被应用于潜在空间生成模型（例如潜在扩散模型（LDM））的反演技术所忽略。在这项工作中，我们描述了两个关键发现：（1）扩散模型在潜在代码中表现出不均匀的记忆，倾向于过度拟合位于解码器回拉度量的高失真区域的样本。（2）即使在单个潜在代码中，不同维度对记忆的贡献也不相同。我们引入了一种原则性方法，根据潜在维度对解码器回拉度量的每维贡献进行排名，从而识别那些对记忆最有影响的维度。根据经验，在计算基于分数的成员推理攻击者的攻击统计数据时，删除记忆较少的维度可显着提高性能，在 CIFAR-10、CelebA、ImageNet-1K、Pokémon、MS-COCO 和 Flickr 等不同数据集上，平均 AUROC 增益为 2.7%，TPR@1%FPR (6.42%) 大幅增加。这表明在极低的假阳性容忍度下识别成员的信心更强。我们的结果强调了自动编码器几何对 LDM 记忆的被忽视的影响，并为分析基于扩散的生成模型中的隐私风险提供了新的视角。
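
Illustrative note: the abstract reports gains in TPR@1%FPR, the true-positive rate of a membership inference attack when its false-positive rate is capped at 1%. The sketch below shows one common way to compute such a number from attack scores, assuming that a higher score means "more likely a training member". The function name `tpr_at_fpr` and the toy scores are hypothetical, and the paper's exact thresholding or interpolation convention is not given in the abstract.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """True-positive rate at a capped false-positive rate.

    A sample is predicted "member" when its score is strictly above a
    threshold. The threshold is taken from the non-member scores so that at
    most `target_fpr` of the non-members are (wrongly) flagged as members.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    negatives = sorted(nonmember_scores, reverse=True)
    allowed_false_positives = int(target_fpr * len(negatives))
    if allowed_false_positives >= len(negatives):
        return 1.0  # every non-member may be flagged, so every member is caught
    # At most `allowed_false_positives` non-members score strictly above this.
    threshold = negatives[allowed_false_positives]
    hits = sum(1 for score in member_scores if score > threshold)
    return hits / len(member_scores)
```

With only a handful of non-member scores, a 1% cap allows zero false positives, so the threshold becomes the highest non-member score; meaningful TPR@1%FPR estimates therefore need at least a few hundred non-member samples.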

Title: Adaptive Hopfield Network: Rethinking Similarities in Associative Memory

Authors: Shurong Wang, Yuqi Pan, Zhuoyang Shen, Meng Zhang, Hongwei Wang, Guoqi Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20609
Pdf URL: https://arxiv.org/pdf/2511.20609
Copy Paste: [[2511.20609]] Adaptive Hopfield Network: Rethinking Similarities in Associative Memory(https://arxiv.org/abs/2511.20609)
Keywords: generative
Abstract: Associative memory models are content-addressable memory systems fundamental to biological intelligence and are notable for their high interpretability. However, existing models evaluate the quality of retrieval based on proximity, which cannot guarantee that the retrieved pattern has the strongest association with the query and therefore cannot guarantee correctness. We reframe this problem by proposing that a query is a generative variant of a stored memory pattern, and define a variant distribution to model this subtle context-dependent generative process. Consequently, correct retrieval should return the memory pattern with the maximum a posteriori probability of being the query's origin. This perspective reveals that an ideal similarity measure should approximate the likelihood of each stored pattern generating the query in accordance with the variant distribution, which is impossible for the fixed and pre-defined similarities used by existing associative memories. To this end, we develop adaptive similarity, a novel mechanism that learns to approximate this insightful but unknown likelihood from samples drawn from context, aiming for correct retrieval. We theoretically prove that our proposed adaptive similarity achieves optimal correct retrieval under three canonical and widely applicable types of variants: noisy, masked, and biased. We integrate this mechanism into a novel adaptive Hopfield network (A-Hop), and empirical results show that it achieves state-of-the-art performance across diverse tasks, including memory retrieval, tabular classification, image classification, and multiple instance learning.
摘要：联想记忆模型是生物智能的基础内容可寻址记忆系统，并以其高可解释性而闻名。然而，现有模型根据邻近度来评估检索质量，这不能保证检索到的模式与查询具有最强的关联性，从而无法保证正确性。我们通过提出查询是存储的内存模式的生成变体来重构这个问题，并定义变体分布来模拟这种微妙的依赖于上下文的生成过程。因此，正确的检索应该返回具有作为查询来源的最大后验概率的记忆模式。这个观点揭示了理想的相似性度量应该近似每个存储模式根据变体分布生成查询的可能性，这对于现有联想记忆使用的固定和预定义的相似性来说是不可能的。为此，我们开发了自适应相似性，这是一种新颖的机制，可以从上下文中提取的样本中学习近似这种有洞察力但未知的可能性，以实现正确的检索。我们从理论上证明，我们提出的自适应相似性在三种规范且广泛适用的变体类型（噪声、屏蔽和偏差）下实现了最佳正确检索。我们将该机制集成到一种新颖的自适应 Hopfield 网络（A-Hop）中，实证结果表明，它在不同的任务中实现了最先进的性能，包括记忆检索、表格分类、图像分类和多实例学习。

Title: Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

Authors: Panayiotis Danassis, Naman Goel
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2511.20613
Pdf URL: https://arxiv.org/pdf/2511.20613
Copy Paste: [[2511.20613]] Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning(https://arxiv.org/abs/2511.20613)
Keywords: generation
Abstract: The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and $\sim 40$k matches demonstrate (i) a clear superiority of human-coded agents (written by graduate students): the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon it, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
摘要：大型语言模型 (LLM) 的快速普及彻底改变了人工智能辅助代码生成。法学硕士的快速发展超出了我们对其进行正确基准测试的能力。流行的基准强调单元测试通过率和语法正确性。这些指标低估了许多需要规划、优化和战略交互的现实问题的难度。我们引入了基于现实世界物流优化问题（拍卖、提货和交付问题）的多智能体推理驱动基准，该基准将竞争性拍卖与容量受限的路线结合起来。该基准要求构建能够（i）在不确定性下进行战略投标和（ii）优化计划者以实现利润最大化的同时交付任务的代理。我们将 40 个 LLM 编码的代理（由各种最先进的 LLM 在多种提示方法下进行，包括振动编码）与 LLM 出现之前开发的 17 个人工编码代理进行了比较。我们在 12 场双打全场比赛和 $\sim 40$k 比赛中的结果证明了 (i) 人类（研究生）编码的智能体具有明显的优势：前 5 名位置始终由人类编码的智能体赢得，(ii) 大多数 LLM 编码的智能体（40 名中的 33 名）被非常简单的基线击败，以及 (iii) 给出最佳的人类解决方案作为输入并提示改进，表现最好的 LLM 使解决方案明显更差，而不是改进它。我们的结果凸显了法学硕士在生成在现实世界中具有竞争力的代码的能力方面的差距，并激发了强调现实世界场景中推理驱动的代码合成的新评估。

Title: The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

Authors: Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20614
Pdf URL: https://arxiv.org/pdf/2511.20614
Copy Paste: [[2511.20614]] The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment(https://arxiv.org/abs/2511.20614)
Keywords: generation
Abstract: Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
摘要：之前的工作已经探索了给定参考图像的各种定制生成任务，但它们在生成一致的细粒度细节方面仍然面临限制。在本文中，我们的目标是通过应用参考引导的后期编辑方法来解决生成图像的不一致问题，并介绍我们的 ImageCritic。我们首先构建通过基于 VLM 的选择和显式降级获得的参考降级目标三元组的数据集，该数据集有效地模拟了现有生成模型中观察到的常见不准确或不一致之处。此外，在对模型的注意力机制和内在表示进行彻底检查的基础上，我们相应地设计了注意力对齐损失和细节编码器来精确纠正不一致之处。 ImageCritic可以集成到代理框架中，以自动检测不一致之处，并在复杂场景中通过多轮和本地编辑来纠正它们。大量实验表明，ImageCritic可以有效解决各种定制生成场景中与细节相关的问题，对现有方法进行了显着改进。

Title: ShapeGen: Towards High-Quality 3D Shape Synthesis

Authors: Yangguang Li, Xianglong He, Zi-Xin Zou, Zexiang Liu, Wanli Ouyang, Ding Liang, Yan-Pei Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20624
Pdf URL: https://arxiv.org/pdf/2511.20624
Copy Paste: [[2511.20624]] ShapeGen: Towards High-Quality 3D Shape Synthesis(https://arxiv.org/abs/2511.20624)
Keywords: generation, generative
Abstract: Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.
摘要：受图像和视频生成范式的启发，3D 形状生成取得了显着进展，能够从单个图像快速合成高保真 3D 资产。然而，当前的方法仍然面临挑战，包括缺乏复杂的细节、过度平滑的表面和支离破碎的薄壳结构。这些限制使得生成的 3D 资产距离艺术家青睐的标准还差一步。在本文中，我们介绍了 ShapeGen，它通过 3D 表示和监督改进、分辨率放大以及线性变换器的优点实现了高质量图像到 3D 形状的生成。这些进步使得生成的资产能够无缝集成到 3D 管道中，从而促进它们在各种应用程序中的广泛采用。通过大量的实验，我们验证了这些改进对整体性能的影响。最终，由于这些增强功能的协同效应，ShapeGen 在图像到 3D 生成方面实现了重大飞跃，建立了新的最先进的性能。

Title: MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Authors: Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, Humphrey Shi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20629
Pdf URL: https://arxiv.org/pdf/2511.20629
Copy Paste: [[2511.20629]] MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models(https://arxiv.org/abs/2511.20629)
Keywords: generation, generative
Abstract: Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
摘要：通过人类反馈（RLHF）和奖励模型进行强化学习，可以使生成模型与人类审美和感知偏好保持一致。然而，联合优化多个奖励通常会产生调整税，改善一个维度，同时降低其他维度。为了解决这个问题，我们引入了两种补充方法：MapReduce LoRA 和奖励感知令牌嵌入（RaTE）。 MapReduce LoRA 并行训练特定偏好的 LoRA 专家，并迭代地合并它们以完善共享基础模型； RaTE 学习特定奖励的令牌嵌入，这些令牌嵌入在推理中组成，以实现灵活的偏好控制。文本到图像生成（Stable Diffusion 3.5 Medium 和 FLUX.1-dev）的实验表明，GenEval、PickScore 和 OCR 分别提高了 36.1%、4.6% 和 55.7%，以及 32.7%、4.3% 和 67.1%。在文本转视频生成 (HunyuanVideo) 中，视觉和运动质量分别提高了 48.1% 和 90.0%。在语言任务上，Helpful Assistant 与 Llama-2 7B 的帮助和无害分别提高了 43.4% 和 136.7%。我们的框架设置了一个新的最先进的跨模式多偏好调整方案。

Title: iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

Authors: Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen, Junyi Chen, Jiacheng Wei, Wei Cheng, Shiyu Liu, Yunuo Chen, Gang Yu, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20635
Pdf URL: https://arxiv.org/pdf/2511.20635
Copy Paste: [[2511.20635]] iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation(https://arxiv.org/abs/2511.20635)
Keywords: generation
Abstract: Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes.

Title: Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model

Authors: Ziyue Wang, Yayati Jadhav, Peter Pak, Amir Barati Farimani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20636
Pdf URL: https://arxiv.org/pdf/2511.20636
Copy Paste: [[2511.20636]] Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model(https://arxiv.org/abs/2511.20636)
Keywords: generation
Abstract: Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.

Title: MotionV2V: Editing Motion in a Video

Authors: Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa, Andrey Voynov, Nataniel Ruiz
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20640
Pdf URL: https://arxiv.org/pdf/2511.20640
Copy Paste: [[2511.20640]] MotionV2V: Editing Motion in a Video(https://arxiv.org/abs/2511.20640)
Keywords: generation, generative
Abstract: While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work.

Title: PixelDiT: Pixel Diffusion Transformers for Image Generation

Authors: Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20645
Pdf URL: https://arxiv.org/pdf/2511.20645
Copy Paste: [[2511.20645]] PixelDiT: Pixel Diffusion Transformers for Image Generation(https://arxiv.org/abs/2511.20645)
Keywords: generation, generative
Abstract: Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.

Title: Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization

Authors: Tahira Kazimi, Connor Dunlop, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20647
Pdf URL: https://arxiv.org/pdf/2511.20647
Copy Paste: [[2511.20647]] Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization(https://arxiv.org/abs/2511.20647)
Keywords: generation
Abstract: While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplying groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
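Illustrative sketch, not the released code: the two ingredients named in the abstract can be shown in isolation. A DPP scores a set by the determinant of its similarity kernel, so near-duplicate samples drive the score toward zero, and GRPO standardizes each reward against its own group. The cosine kernel, the function names, and the toy embeddings are assumptions, and how the paper combines the two signals is not stated in the abstract.

```python
import math


def gram_matrix(vectors):
    """Cosine-similarity kernel of the given embeddings."""
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return det


def dpp_score(vectors):
    """Set-level diversity: high for dissimilar items, zero for duplicates."""
    return determinant(gram_matrix(vectors))


def group_relative_advantages(rewards):
    """GRPO-style advantage: reward standardized within its sampling group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


diverse = dpp_score([[1.0, 0.0], [0.0, 1.0]])    # orthogonal embeddings
redundant = dpp_score([[1.0, 0.0], [2.0, 0.0]])  # same direction
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(diverse, redundant, advantages)
```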

Title: Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Authors: Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20649
Pdf URL: https://arxiv.org/pdf/2511.20649
Copy Paste: [[2511.20649]] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout(https://arxiv.org/abs/2511.20649)
Keywords: generation
Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
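Illustrative sketch, not the authors' implementation: rotary embeddings make attention scores depend only on the offset between positions, which is why re-indexing cached blocks around a moving reference frame can leave relative temporal geometry intact. The Python below checks that property on a 2D toy vector and shows one possible index assignment in which the newest block ends at the horizon and older blocks shift backward. The function names, `theta`, and the indexing rule are assumptions based on one reading of the abstract.

```python
import math


def rotate(vec, position, theta=0.1):
    """2D rotary embedding: rotate vec by position * theta radians."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def attention_score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of the rotated query and key."""
    rq = rotate(q, q_pos, theta)
    rk = rotate(k, k_pos, theta)
    return rq[0] * rk[0] + rq[1] * rk[1]


def block_positions(num_blocks, block_len, horizon):
    """Temporal indices per cached block, with the newest block ending at
    horizon - 1 and each older block placed one block length earlier.
    Indices may become negative, which is still a valid rotation."""
    newest_start = horizon - block_len
    return [
        [newest_start - (num_blocks - 1 - b) * block_len + t for t in range(block_len)]
        for b in range(num_blocks)
    ]


q, k = (1.0, 2.0), (0.5, -1.0)
# Same offset (2 frames) at two different absolute positions.
same_offset_gap = abs(attention_score(q, k, 7, 5) - attention_score(q, k, 1, -1))
positions = block_positions(num_blocks=3, block_len=2, horizon=8)
print(same_offset_gap, positions)
```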

Title: RubricRL: Simple Generalizable Rewards for Text-to-Image Generation

Authors: Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, Chunming Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20651
Pdf URL: https://arxiv.org/pdf/2511.20651
Copy Paste: [[2511.20651]] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation(https://arxiv.org/abs/2511.20651)
Keywords: generation, generative
Abstract: Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt--a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism--tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
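Illustrative sketch, not the authors' code: a rubric-based reward can be read as a weighted average of per-criterion judge scores, with the weights playing the role of the prompt-adaptive emphasis and of the user-adjustable controls mentioned in the abstract. The criterion names, the score range of 0 to 1, and the normalization by total weight are assumptions. The judge scores are hard-coded here rather than produced by a multimodal model.

```python
def rubric_reward(scores, weights):
    """Weighted average of per-criterion scores.

    scores:  criterion -> judge score in [0, 1] (hypothetical values)
    weights: criterion -> non-negative importance for this prompt
    """
    total = sum(weights[c] for c in scores)
    if total <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(weights[c] * scores[c] for c in scores) / total


# Hypothetical rubric for a prompt that asks for a sign with readable text.
scores = {"object_correctness": 1.0, "ocr_fidelity": 0.0}
weights = {"object_correctness": 3.0, "ocr_fidelity": 1.0}
reward = rubric_reward(scores, weights)

# A user who cares more about text rendering can raise that weight.
text_focused = rubric_reward(scores, {"object_correctness": 1.0, "ocr_fidelity": 3.0})
print(reward, text_focused)
```

Because each criterion keeps its own score, the per-criterion breakdown stays inspectable, which a single scalar reward does not offer.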