2025-11-25

Title: Learning Straight Flows: Variational Flow Matching for Efficient Generation

Authors: Chenrui Ma, Xi Xiao, Tianyang Wang, Xiao Wang, Yanning Shen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.17583
Pdf URL: https://arxiv.org/pdf/2511.17583
Copy Paste: [[2511.17583]] Learning Straight Flows: Variational Flow Matching for Efficient Generation(https://arxiv.org/abs/2511.17583)
Keywords: generation
Abstract: Flow Matching has limited ability in achieving one-step generation due to its reliance on learned curved trajectories. Previous studies have attempted to address this limitation by either modifying the coupling distribution to prevent interpolant intersections or introducing consistency and mean-velocity modeling to promote straight trajectory learning. However, these approaches often suffer from discrete approximation errors, training instability, and convergence difficulties. To tackle these issues, we propose Straight Variational Flow Matching (S-VFM), which integrates a variational latent code representing the "generation overview" into the Flow Matching framework. S-VFM explicitly enforces trajectory straightness, ideally producing linear generation paths. The proposed method achieves competitive performance across three challenging benchmarks and demonstrates advantages in both training and inference efficiency compared with existing methods.
Summary: Because Flow Matching relies on learned curved trajectories, its ability to achieve one-step generation is limited. Earlier work tried to address this either by modifying the coupling distribution so that interpolants do not intersect, or by introducing consistency and mean-velocity modeling to encourage straight trajectories. These methods, however, often face discrete approximation errors, unstable training, and convergence difficulties. To address these problems, we propose Straight Variational Flow Matching (S-VFM), which integrates a variational latent code representing a "generation overview" into the Flow Matching framework. S-VFM explicitly enforces trajectory straightness and, ideally, produces linear generation paths. The method achieves competitive performance on three challenging benchmarks and shows advantages in training and inference efficiency over existing methods.
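The link between straight trajectories and one-step generation can be illustrated with a minimal sketch. It uses the standard linear interpolant of Flow Matching, x_t = (1 - t) * x0 + t * x1, whose velocity x1 - x0 does not depend on t. This is not the S-VFM method itself (the abstract gives no implementation details), and all function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight path from x0 (noise) to x1 (data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def straight_velocity(x0, x1):
    """Velocity of the linear interpolant; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def euler_generate(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy 2-D example: x0 plays the role of a noise sample, x1 of a data sample.
x0 = [0.5, -1.0]
x1 = [2.0, 3.0]
v = straight_velocity(x0, x1)

# With a constant velocity field, one Euler step and many Euler steps
# reach the same endpoint (up to floating-point rounding).
one_step = euler_generate(x0, lambda x, t: v, steps=1)
many_steps = euler_generate(x0, lambda x, t: v, steps=100)
```

When the learned velocity field is curved, the single-step result generally drifts away from the many-step result, which is why straightness matters for one-step sampling.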

Title: LLM-Powered Text-Attributed Graph Anomaly Detection via Retrieval-Augmented Reasoning

Authors: Haoyan Xu, Ruizhi Qian, Zhengtao Yao, Ziyi Liu, Li Li, Yuqi Li, Yanshu Li, Wenqing Zheng, Daniele Rosa, Daniel Barcklow, Senthil Kumar, Jieyu Zhao, Yue Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17584
Pdf URL: https://arxiv.org/pdf/2511.17584
Copy Paste: [[2511.17584]] LLM-Powered Text-Attributed Graph Anomaly Detection via Retrieval-Augmented Reasoning(https://arxiv.org/abs/2511.17584)
Keywords: generation
Abstract: Anomaly detection on attributed graphs plays an essential role in applications such as fraud detection, intrusion monitoring, and misinformation analysis. However, text-attributed graphs (TAGs), in which node information is expressed in natural language, remain underexplored, largely due to the absence of standardized benchmark datasets. In this work, we introduce TAG-AD, a comprehensive benchmark for anomaly node detection on TAGs. TAG-AD leverages large language models (LLMs) to generate realistic anomalous node texts directly in the raw text space, producing anomalies that are semantically coherent yet contextually inconsistent and thus more reflective of real-world irregularities. In addition, TAG-AD incorporates multiple other anomaly types, enabling thorough and reproducible evaluation of graph anomaly detection (GAD) methods. With these datasets, we further benchmark existing unsupervised GNN-based GAD methods as well as zero-shot LLMs for GAD. As part of our zero-shot detection setup, we propose a retrieval-augmented generation (RAG)-assisted, LLM-based zero-shot anomaly detection framework. The framework mitigates reliance on brittle, hand-crafted prompts by constructing a global anomaly knowledge base and distilling it into reusable analysis frameworks. Our experimental results reveal a clear division of strengths: LLMs are particularly effective at detecting contextual anomalies, whereas GNN-based methods remain superior for structural anomaly detection. Moreover, RAG-assisted prompting achieves performance comparable to human-designed prompts while eliminating manual prompt engineering, underscoring the practical value of our RAG-assisted zero-shot LLM anomaly detection framework.
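Summary: Anomaly detection on attributed graphs plays an important role in applications such as fraud detection, intrusion monitoring, and misinformation analysis. Text-attributed graphs (TAGs), whose node information is expressed in natural language, remain underexplored, mainly because standardized benchmark datasets are lacking. In this work we introduce TAG-AD, a comprehensive benchmark for anomalous node detection on TAGs. TAG-AD uses large language models (LLMs) to generate realistic anomalous node texts directly in the raw text space, yielding anomalies that are semantically coherent but contextually inconsistent and therefore closer to real-world irregularities. TAG-AD also includes several other anomaly types, allowing thorough and reproducible evaluation of graph anomaly detection (GAD) methods. Using these datasets, we benchmark existing unsupervised GNN-based GAD methods as well as zero-shot LLMs for GAD. As part of the zero-shot setup, we propose a retrieval-augmented generation (RAG)-assisted, LLM-based zero-shot anomaly detection framework, which reduces reliance on brittle hand-crafted prompts by building a global anomaly knowledge base and distilling it into reusable analysis frameworks. The experiments show a clear division of strengths: LLMs are particularly effective at detecting contextual anomalies, while GNN-based methods remain better at structural anomaly detection. RAG-assisted prompting also matches the performance of human-designed prompts while removing manual prompt engineering, which underlines the practical value of the framework.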

Title: Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI

Authors: Saicharan Kolluru
Subjects: cs.LG, cs.DC, cs.PF
Abstract URL: https://arxiv.org/abs/2511.17593
Pdf URL: https://arxiv.org/pdf/2511.17593
Copy Paste: [[2511.17593]] Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI(https://arxiv.org/abs/2511.17593)
Keywords: generation
Abstract: The deployment of Large Language Models (LLMs) in production environments requires efficient inference serving systems that balance throughput, latency, and resource utilization. This paper presents a comprehensive empirical evaluation of two prominent open-source LLM serving frameworks: vLLM and HuggingFace Text Generation Inference (TGI). We benchmark these systems across multiple dimensions including throughput performance, end-to-end latency, GPU memory utilization, and scalability characteristics using LLaMA-2 models ranging from 7B to 70B parameters. Our experiments reveal that vLLM achieves up to 24x higher throughput than TGI under high-concurrency workloads through its novel PagedAttention mechanism, while TGI demonstrates lower tail latencies for interactive single-user scenarios. We provide detailed performance profiles for different deployment scenarios and offer practical recommendations for system selection based on workload characteristics. Our findings indicate that the choice between these frameworks should be guided by specific use-case requirements: vLLM excels in high-throughput batch processing scenarios, while TGI is better suited for latency-sensitive interactive applications with moderate concurrency.
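Summary: Deploying large language models (LLMs) in production requires efficient inference serving systems that balance throughput, latency, and resource utilization. This paper presents a comprehensive empirical evaluation of two well-known open-source LLM serving frameworks, vLLM and HuggingFace Text Generation Inference (TGI). We benchmark both systems along several dimensions, including throughput, end-to-end latency, GPU memory utilization, and scalability, using LLaMA-2 models from 7B to 70B parameters. The experiments show that vLLM, through its PagedAttention mechanism, achieves up to 24x higher throughput than TGI under high-concurrency workloads, while TGI shows lower tail latency in interactive single-user scenarios. We provide detailed performance profiles for different deployment scenarios and practical recommendations for choosing a system based on workload characteristics. The findings indicate that the choice should follow the specific use case: vLLM excels at high-throughput batch processing, while TGI is better suited to latency-sensitive interactive applications with moderate concurrency.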
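The two headline metrics in this comparison, throughput and tail latency, can be computed from per-request timings as in the sketch below. It is an illustrative assumption about how such a benchmark might be summarized, not the paper's measurement harness. The record layout `(start_s, end_s, generated_tokens)` and both function names are hypothetical.

```python
def nearest_rank_percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # ceil(p * n / 100) computed with integers to avoid float rounding.
    rank = -(-p * len(ordered) // 100)
    return ordered[rank - 1]


def summarize(requests):
    """Summarize (start_s, end_s, generated_tokens) records.

    Throughput is total generated tokens divided by the wall-clock span
    from the earliest start to the latest end. Latency is end - start.
    """
    if not requests:
        raise ValueError("requests must be non-empty")
    starts = [r[0] for r in requests]
    ends = [r[1] for r in requests]
    latencies = [end - start for start, end, _ in requests]
    wall_clock = max(ends) - min(starts)
    total_tokens = sum(r[2] for r in requests)
    return {
        "throughput_tokens_per_s": total_tokens / wall_clock,
        "p50_latency_s": nearest_rank_percentile(latencies, 50),
        "p99_latency_s": nearest_rank_percentile(latencies, 99),
    }


# Three overlapping requests with made-up timings.
stats = summarize([(0.0, 1.0, 100), (0.0, 2.0, 100), (1.0, 4.0, 200)])
```

Reporting both numbers matters because a system can win on aggregate throughput under concurrency while losing on per-request tail latency, which is the trade-off the abstract describes.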

Title: Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding

Authors: Yassir Benhammou, Suman Kalyan, Sujay Kumar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17596
Pdf URL: https://arxiv.org/pdf/2511.17596
Copy Paste: [[2511.17596]] Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding(https://arxiv.org/abs/2511.17596)
Keywords: generation
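Abstract: Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality, such as video, audio, or text, limiting their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive datasets. We demonstrate significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI) compared to linear baselines, indicating that reconstruction-based multimodal embeddings can serve as a foundation for scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to enhance automation, searchability, and content management efficiency in modern broadcast workflows.
Summary: Broadcast and media organizations increasingly rely on artificial intelligence to automate labor-intensive processes such as content indexing, tagging, and metadata generation. Existing AI systems, however, usually operate on a single modality (for example video, audio, or text), which limits their understanding of complex cross-modal relationships in broadcast material. In this work we propose a Multimodal Autoencoder (MMAE) that learns unified representations of text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structure without relying on large paired or contrastive datasets. Compared with linear baselines, we show significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI), indicating that reconstruction-based multimodal embeddings can serve as a foundation for scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to improve automation, searchability, and content management efficiency in modern broadcast workflows.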

Title: Energy-based Autoregressive Generation for Neural Population Dynamics

Authors: Ningling Ge, Sicheng Dai, Yu Zhu, Shan Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17606
Pdf URL: https://arxiv.org/pdf/2511.17606
Copy Paste: [[2511.17606]] Energy-based Autoregressive Generation for Neural Population Dynamics(https://arxiv.org/abs/2511.17606)
Keywords: generation
Abstract: Understanding brain function represents a fundamental goal in neuroscience, with critical implications for therapeutic interventions and neural engineering applications. Computational modeling provides a quantitative framework for accelerating this understanding, but faces a fundamental trade-off between computational efficiency and high-fidelity modeling. To address this limitation, we introduce a novel Energy-based Autoregressive Generation (EAG) framework that employs an energy-based transformer learning temporal dynamics in latent space through strictly proper scoring rules, enabling efficient generation with realistic population and single-neuron spiking statistics. Evaluation on synthetic Lorenz datasets and two Neural Latents Benchmark datasets (MC_Maze and Area2_bump) demonstrates that EAG achieves state-of-the-art generation quality with substantial computational efficiency improvements, particularly over diffusion-based methods. Beyond optimal performance, conditional generation applications show two capabilities: generalizing to unseen behavioral contexts and improving motor brain-computer interface decoding accuracy using synthetic neural data. These results demonstrate the effectiveness of energy-based modeling for neural population dynamics with applications in neuroscience research and neural engineering. Code is available at this https URL.
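Summary: Understanding brain function is a fundamental goal of neuroscience, with important implications for therapeutic interventions and neural engineering applications. Computational modeling offers a quantitative framework for accelerating this understanding, but it faces a basic trade-off between computational efficiency and high-fidelity modeling. To address this limitation, we introduce a novel Energy-based Autoregressive Generation (EAG) framework that uses an energy-based transformer to learn temporal dynamics in latent space through strictly proper scoring rules, enabling efficient generation with realistic population and single-neuron spiking statistics. Evaluation on synthetic Lorenz datasets and two Neural Latents Benchmark datasets (MC_Maze and Area2_bump) shows that EAG achieves state-of-the-art generation quality with substantial gains in computational efficiency, particularly over diffusion-based methods. Beyond this performance, conditional generation applications show two capabilities: generalizing to unseen behavioral contexts and improving motor brain-computer interface decoding accuracy using synthetic neural data. These results demonstrate the effectiveness of energy-based modeling of neural population dynamics for neuroscience research and neural engineering. Code is available at this https URL.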

Title: Finding Pre-Injury Patterns in Triathletes from Lifestyle, Recovery and Load Dynamics Features

Authors: Leonardo Rossi, Bruno Rodrigues
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17610
Pdf URL: https://arxiv.org/pdf/2511.17610
Copy Paste: [[2511.17610]] Finding Pre-Injury Patterns in Triathletes from Lifestyle, Recovery and Load Dynamics Features(https://arxiv.org/abs/2511.17610)
Keywords: generation
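Abstract: Triathlon training, which involves high-volume swimming, cycling, and running, places athletes at substantial risk for overuse injuries due to repetitive physiological stress. Current injury prediction approaches primarily rely on training load metrics, often neglecting critical factors such as sleep quality, stress, and individual lifestyle patterns that significantly influence recovery and injury susceptibility. We introduce a novel synthetic data generation framework tailored explicitly for triathlon. This framework generates physiologically plausible athlete profiles, simulates individualized training programs that incorporate periodization and load-management principles, and integrates daily-life factors such as sleep quality, stress levels, and recovery states. We evaluated machine learning models (LASSO, Random Forest, and XGBoost) showing high predictive performance (AUC up to 0.86), identifying sleep disturbances, heart rate variability, and stress as critical early indicators of injury risk. This wearable-driven approach not only enhances injury prediction accuracy but also provides a practical solution to overcoming real-world data limitations, offering a pathway toward holistic, context-aware athlete monitoring.
Summary: Triathlon training involves high volumes of swimming, cycling, and running, which places athletes at substantial risk of overuse injuries from repetitive physiological stress. Current injury prediction methods rely mainly on training load metrics and often neglect key factors such as sleep quality, stress, and individual lifestyle patterns that significantly affect recovery and injury susceptibility. We introduce a novel synthetic data generation framework tailored specifically to triathlon. The framework generates physiologically plausible athlete profiles, simulates individualized training programs that incorporate periodization and load-management principles, and integrates daily-life factors such as sleep quality, stress levels, and recovery states. We evaluated machine learning models (LASSO, Random Forest, and XGBoost) that showed high predictive performance (AUC up to 0.86) and identified sleep disturbances, heart rate variability, and stress as key early indicators of injury risk. This wearable-driven approach not only improves injury prediction accuracy but also offers a practical way to overcome real-world data limitations, providing a path toward holistic, context-aware athlete monitoring.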

Title: AI-driven Generation of MALDI-TOF MS for Microbial Characterization

Authors: Lucía Schmidt-Santiago, David Rodríguez-Temporal, Carlos Sevilla-Salcedo, Vanessa Gómez-Verdejo
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2511.17611
Pdf URL: https://arxiv.org/pdf/2511.17611
Copy Paste: [[2511.17611]] AI-driven Generation of MALDI-TOF MS for Microbial Characterization(https://arxiv.org/abs/2511.17611)
Keywords: generation, generative
Abstract: Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (MALDI-TOF MS) has become a cornerstone technology in clinical microbiology, enabling rapid and accurate microbial identification. However, the development of data-driven diagnostic models remains limited by the lack of sufficiently large, balanced, and standardized spectral datasets. This study investigates the use of deep generative models to synthesize realistic MALDI-TOF MS spectra, aiming to overcome data scarcity and support the development of robust machine learning tools in microbiology. We adapt and evaluate three generative models, Variational Autoencoders (MALDIVAEs), Generative Adversarial Networks (MALDIGANs), and Denoising Diffusion Probabilistic Model (MALDIffusion), for the conditional generation of microbial spectra guided by species labels. Generation is conditioned on species labels, and spectral fidelity and diversity are assessed using diverse metrics. Our experiments show that synthetic data generated by MALDIVAE, MALDIGAN, and MALDIffusion are statistically and diagnostically comparable to real measurements, enabling classifiers trained exclusively on synthetic samples to reach performance levels similar to those trained on real data. While all models faithfully reproduce the peak structure and variability of MALDI-TOF spectra, MALDIffusion obtains this fidelity at a substantially higher computational cost, and MALDIGAN shows competitive but slightly less stable behaviour. In contrast, MALDIVAE offers the most favorable balance between realism, stability, and efficiency. Furthermore, augmenting minority species with synthetic spectra markedly improves classification accuracy, effectively mitigating class imbalance and domain mismatch without compromising the authenticity of the generated data.
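Summary: Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (MALDI-TOF MS) has become a cornerstone technology in clinical microbiology, enabling rapid and accurate microbial identification. The development of data-driven diagnostic models, however, remains limited by the lack of sufficiently large, balanced, and standardized spectral datasets. This study investigates deep generative models for synthesizing realistic MALDI-TOF MS spectra, with the aim of overcoming data scarcity and supporting the development of robust machine learning tools in microbiology. We adapt and evaluate three generative models, Variational Autoencoders (MALDIVAE), Generative Adversarial Networks (MALDIGAN), and a Denoising Diffusion Probabilistic Model (MALDIffusion), for conditional generation of microbial spectra guided by species labels. Spectral fidelity and diversity are assessed with a range of metrics. The experiments show that synthetic data generated by MALDIVAE, MALDIGAN, and MALDIffusion are statistically and diagnostically comparable to real measurements, so that classifiers trained only on synthetic samples reach performance similar to classifiers trained on real data. All models faithfully reproduce the peak structure and variability of MALDI-TOF spectra, but MALDIffusion obtains this fidelity at a substantially higher computational cost, and MALDIGAN is competitive but slightly less stable. MALDIVAE offers the most favorable balance between realism, stability, and efficiency. In addition, augmenting minority species with synthetic spectra markedly improves classification accuracy, effectively mitigating class imbalance and domain mismatch without compromising the authenticity of the generated data.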

Title: Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression

Authors: Siddiqua Namrah
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17612
Pdf URL: https://arxiv.org/pdf/2511.17612
Copy Paste: [[2511.17612]] Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression(https://arxiv.org/abs/2511.17612)
Keywords: restoration
Abstract: Enhancing low-light traffic images is crucial for reliable perception in autonomous driving, intelligent transportation, and urban surveillance systems. Nighttime and dimly lit traffic scenes often suffer from poor visibility due to low illumination, noise, motion blur, non-uniform lighting, and glare from vehicle headlights or street lamps, which hinder tasks such as object detection and scene understanding. To address these challenges, we propose a fully unsupervised multi-stage deep learning framework for low-light traffic image enhancement. The model decomposes images into illumination and reflectance components, progressively refined by three specialized modules: (1) Illumination Adaptation, for global and local brightness correction; (2) Reflectance Restoration, for noise suppression and structural detail recovery using spatial-channel attention; and (3) Over-Exposure Compensation, for reconstructing saturated regions and balancing scene luminance. The network is trained using self-supervised reconstruction, reflectance smoothness, perceptual consistency, and domain-aware regularization losses, eliminating the need for paired ground-truth images. Experiments on general and traffic-specific datasets demonstrate superior performance over state-of-the-art methods in both quantitative metrics (PSNR, SSIM, LPIPS, NIQE) and qualitative visual quality. Our approach enhances visibility, preserves structure, and improves downstream perception reliability in real-world low-light traffic scenarios.
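Summary: Enhancing low-light traffic images is essential for reliable perception in autonomous driving, intelligent transportation, and urban surveillance systems. Nighttime and dimly lit traffic scenes often suffer from poor visibility caused by low illumination, noise, motion blur, non-uniform lighting, and glare from vehicle headlights or street lamps, which hinders tasks such as object detection and scene understanding. To address these challenges, we propose a fully unsupervised multi-stage deep learning framework for low-light traffic image enhancement. The model decomposes images into illumination and reflectance components, which are progressively refined by three specialized modules: (1) Illumination Adaptation, for global and local brightness correction; (2) Reflectance Restoration, for noise suppression and structural detail recovery using spatial-channel attention; and (3) Over-Exposure Compensation, for reconstructing saturated regions and balancing scene luminance. The network is trained with self-supervised reconstruction, reflectance smoothness, perceptual consistency, and domain-aware regularization losses, so paired ground-truth images are not required. Experiments on general and traffic-specific datasets show better performance than state-of-the-art methods in both quantitative metrics (PSNR, SSIM, LPIPS, NIQE) and qualitative visual quality. The approach improves visibility, preserves structure, and increases the reliability of downstream perception in real-world low-light traffic scenarios.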

Title: Plug-and-Play Multi-Concept Adaptive Blending for High-Fidelity Text-to-Image Synthesis

Authors: Young-Beom Woo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17615
Pdf URL: https://arxiv.org/pdf/2511.17615
Copy Paste: [[2511.17615]] Plug-and-Play Multi-Concept Adaptive Blending for High-Fidelity Text-to-Image Synthesis(https://arxiv.org/abs/2511.17615)
Keywords: generation
Abstract: Integrating multiple personalized concepts into a single image has recently become a significant area of focus within Text-to-Image (T2I) generation. However, existing methods often underperform on complex multi-object scenes due to unintended alterations in both personalized and non-personalized regions. This not only fails to preserve the intended prompt structure but also disrupts interactions among regions, leading to semantic inconsistencies. To address this limitation, we introduce plug-and-play multi-concept adaptive blending for high-fidelity text-to-image synthesis (PnP-MIX), an innovative, tuning-free approach designed to seamlessly embed multiple personalized concepts into a single generated image. Our method leverages guided appearance attention to faithfully reflect the intended appearance of each personalized concept. To further enhance compositional fidelity, we present a mask-guided noise mixing strategy that preserves the integrity of non-personalized regions such as the background or unrelated objects while enabling the precise integration of personalized objects. Finally, to mitigate concept leakage, i.e., the inadvertent leakage of personalized concept features into other regions, we propose background dilution++, a novel strategy that effectively reduces such leakage and promotes accurate localization of features within personalized regions. Extensive experimental results demonstrate that PnP-MIX consistently surpasses existing methodologies in both single- and multi-concept personalization scenarios, underscoring its robustness and superior performance without additional model tuning.
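Summary: Integrating multiple personalized concepts into a single image has recently become an important focus within Text-to-Image (T2I) generation. Existing methods, however, often perform poorly on complex multi-object scenes because of unintended changes in both personalized and non-personalized regions. This not only fails to preserve the intended prompt structure but also disrupts interactions among regions, leading to semantic inconsistencies. To address this limitation, we introduce plug-and-play multi-concept adaptive blending for high-fidelity text-to-image synthesis (PnP-MIX), an innovative tuning-free approach designed to embed multiple personalized concepts seamlessly into a single generated image. The method uses guided appearance attention to faithfully reflect the intended appearance of each personalized concept. To further improve compositional fidelity, we present a mask-guided noise mixing strategy that preserves the integrity of non-personalized regions, such as the background or unrelated objects, while enabling precise integration of personalized objects. Finally, to mitigate concept leakage, that is, the inadvertent leakage of personalized concept features into other regions, we propose background dilution++, a novel strategy that effectively reduces such leakage and promotes accurate localization of features within personalized regions. Extensive experiments show that PnP-MIX consistently outperforms existing methods in both single- and multi-concept personalization scenarios, underscoring its robustness and strong performance without additional model tuning.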
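The mask-guided mixing idea can be sketched as a per-element blend: where the mask is 1 the personalized latent is kept, where it is 0 the background latent is kept. The abstract does not give the PnP-MIX formulation, so the convex blend, the flat-list representation, and the name `mix_with_mask` are all assumptions made for illustration.

```python
def mix_with_mask(personalized, background, mask):
    """Blend two equally sized latents element by element.

    mask values are expected in [0, 1]: 1 keeps the personalized value,
    0 keeps the background value, and values in between interpolate.
    """
    if not (len(personalized) == len(background) == len(mask)):
        raise ValueError("all inputs must have the same length")
    if any(m < 0.0 or m > 1.0 for m in mask):
        raise ValueError("mask values must lie in [0, 1]")
    return [
        m * p + (1.0 - m) * b
        for p, b, m in zip(personalized, background, mask)
    ]


# Toy latents flattened to three elements.
personalized = [1.0, 2.0, 3.0]
background = [10.0, 20.0, 30.0]
mask = [1.0, 0.0, 0.5]  # object region, background region, soft boundary
mixed = mix_with_mask(personalized, background, mask)
```

A hard 0/1 mask leaves non-personalized elements exactly equal to the background, which is the property the abstract attributes to the strategy.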

Title: Tensor Gauge Flow Models

Authors: Alexander Strunk, Roland Assam
Subjects: cs.LG, cs.AI, math.DG
Abstract URL: https://arxiv.org/abs/2511.17616
Pdf URL: https://arxiv.org/pdf/2511.17616
Copy Paste: [[2511.17616]] Tensor Gauge Flow Models(https://arxiv.org/abs/2511.17616)
Keywords: generative
Abstract: This paper introduces Tensor Gauge Flow Models, a new class of Generative Flow Models that generalize Gauge Flow Models and Higher Gauge Flow Models by incorporating higher-order Tensor Gauge Fields into the Flow Equation. This extension allows the model to encode richer geometric and gauge-theoretic structure in the data, leading to more expressive flow dynamics. Experiments on Gaussian mixture models show that Tensor Gauge Flow Models achieve improved generative performance compared to both standard and gauge flow baselines.
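Summary: This paper introduces Tensor Gauge Flow Models, a new class of Generative Flow Models that generalize Gauge Flow Models and Higher Gauge Flow Models by incorporating higher-order Tensor Gauge Fields into the flow equation. The extension lets the model encode richer geometric and gauge-theoretic structure in the data, leading to more expressive flow dynamics. Experiments on Gaussian mixture models show that Tensor Gauge Flow Models achieve better generative performance than both standard flow and gauge flow baselines.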

Title: Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach

Authors: Ju-Young Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17618
Pdf URL: https://arxiv.org/pdf/2511.17618
Copy Paste: [[2511.17618]] Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach(https://arxiv.org/abs/2511.17618)
Keywords: generation
Abstract: Conventional VQA approaches primarily rely on question-answer (Q&A) pairs to learn the spatio-temporal dynamics of video content. However, most existing annotations are event-centric, which restricts the model's ability to capture the comprehensive context of a scene. The lack of fundamental information such as object categories, spatial configurations, and descriptive visual attributes prevents the model from forming a complete understanding of the environment, ultimately limiting its generalization and reasoning capability. In this paper, we introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. FIQ generates Q&A pairs from descriptive information extracted directly from videos, thereby enriching the dataset with core scene-level attributes. These generated pairs help the model develop a more holistic understanding of the video, leading to improved generalizability and reasoning performance. In addition, we propose a VQ-CAlign module that aligns task-specific question embeddings with corresponding visual features, preserving essential contextual cues and enhancing adaptability to downstream tasks. Experimental results on the SUTD-TrafficQA dataset demonstrate that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.
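Summary: Conventional VQA approaches rely mainly on question-answer (Q&A) pairs to learn the spatio-temporal dynamics of video content. Most existing annotations, however, are event-centric, which limits the model's ability to capture the full context of a scene. The lack of fundamental information such as object categories, spatial configurations, and descriptive visual attributes prevents the model from forming a complete understanding of the environment and ultimately limits its generalization and reasoning ability. In this paper we introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to strengthen the reasoning ability of VQA models by improving their foundational comprehension of video content. FIQ generates Q&A pairs from descriptive information extracted directly from videos, enriching the dataset with core scene-level attributes. These generated pairs help the model build a more holistic understanding of the video, which improves generalization and reasoning performance. We also propose a VQ-CAlign module that aligns task-specific question embeddings with the corresponding visual features, preserving essential contextual cues and improving adaptability to downstream tasks. Experiments on the SUTD-TrafficQA dataset show that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.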

Title: Efficient Large-Scale Learning of Minimax Risk Classifiers

Authors: Kartheek Bondugula, Santiago Mazuelas, Aritz Pérez
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.17626
Pdf URL: https://arxiv.org/pdf/2511.17626
Copy Paste: [[2511.17626]] Efficient Large-Scale Learning of Minimax Risk Classifiers(https://arxiv.org/abs/2511.17626)
Keywords: generation
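Abstract: Supervised learning with large-scale data usually leads to complex optimization problems, especially for classification tasks with multiple classes. Stochastic subgradient methods can enable efficient learning with a large number of samples for classification techniques that minimize the average loss over the training samples. However, recent techniques, such as minimax risk classifiers (MRCs), minimize the maximum expected loss and are not amenable to stochastic subgradient methods. In this paper, we present a learning algorithm based on the combination of constraint and column generation that enables efficient learning of MRCs with large-scale data for classification tasks with multiple classes. Experiments on multiple benchmark datasets show that the proposed algorithm provides up to a 10x speedup for general large-scale data and around a 100x speedup with a sizeable number of classes.
Summary: Supervised learning with large-scale data usually leads to complex optimization problems, especially for classification tasks with many classes. Stochastic subgradient methods allow efficient learning from large numbers of samples for classification techniques that minimize the average loss over the training samples. Recent techniques such as minimax risk classifiers (MRCs), however, minimize the maximum expected loss and are not amenable to stochastic subgradient methods. In this paper we present a learning algorithm based on a combination of constraint generation and column generation that enables efficient learning of MRCs on large-scale data for multi-class classification tasks. Experiments on multiple benchmark datasets show that the proposed algorithm provides up to a 10x speedup on general large-scale data and around a 100x speedup when the number of classes is large.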

Title: Rectifying Mean-Shift in Cascaded Precipitation Nowcasting

Authors: Fanbo Ju, Haiyuan Shi, Qingjian Ni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17628
Pdf URL: https://arxiv.org/pdf/2511.17628
Copy Paste: [[2511.17628]] Rectifying Mean-Shift in Cascaded Precipitation Nowcasting(https://arxiv.org/abs/2511.17628)
Keywords: generation
Abstract: Precipitation nowcasting, which aims to provide high spatio-temporal resolution precipitation forecasts by leveraging current radar observations, is a core task in regional weather forecasting. The cascaded architecture has emerged as the mainstream paradigm for deep learning-based precipitation nowcasting. This paradigm involves a deterministic model to predict macroscopic trends (or posterior mean), followed by a probabilistic model to generate local details (or local stochasticity). However, existing methods commonly overlook the conflation of the systematic distribution shift in deterministic predictions and the local stochasticity. As a result, the deterministic component's distribution shift contaminates the predictions of the probabilistic component, leading to inaccuracies in precipitation patterns and intensity, particularly over longer lead times. To address this issue, we introduce RectiCast, a two-stage framework that explicitly decouples the correction of mean-field shift from the generation of local stochasticity via a dual Flow Matching model. In the first stage, a deterministic model generates the posterior mean. In the second stage, we introduce a Rectifier to explicitly learn the distribution shift and produce a rectified mean. Subsequently, a Generator focuses on modeling the local stochasticity conditioned on the rectified mean. Experiments on SEVIR and MeteoNet demonstrate that RectiCast achieves significant performance improvements over existing state-of-the-art methods.
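Summary: Precipitation nowcasting, which aims to provide precipitation forecasts at high spatio-temporal resolution from current radar observations, is a core task in regional weather forecasting. The cascaded architecture has become the mainstream paradigm for deep learning-based precipitation nowcasting. It uses a deterministic model to predict macroscopic trends (the posterior mean), followed by a probabilistic model that generates local details (local stochasticity). Existing methods, however, commonly overlook the conflation of the systematic distribution shift in deterministic predictions with local stochasticity. As a result, the distribution shift of the deterministic component contaminates the predictions of the probabilistic component, leading to inaccurate precipitation patterns and intensity, particularly at longer lead times. To address this issue, we introduce RectiCast, a two-stage framework that explicitly decouples the correction of mean-field shift from the generation of local stochasticity via a dual Flow Matching model. In the first stage, a deterministic model generates the posterior mean. In the second stage, a Rectifier explicitly learns the distribution shift and produces a rectified mean, and a Generator then models the local stochasticity conditioned on the rectified mean. Experiments on SEVIR and MeteoNet show that RectiCast achieves significant performance improvements over existing state-of-the-art methods.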

Title: Efficient Score Pre-computation for Diffusion Models via Cross-Matrix Krylov Projection

Authors: Kaikwan Lau, Andrew S. Na, Justin W.L. Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17634
Pdf URL: https://arxiv.org/pdf/2511.17634
Copy Paste: [[2511.17634]] Efficient Score Pre-computation for Diffusion Models via Cross-Matrix Krylov Projection(https://arxiv.org/abs/2511.17634)
Keywords: generation
Abstract: This paper presents a novel framework to accelerate score-based diffusion models. It first converts the standard stable diffusion model into the Fokker-Planck formulation, which results in solving large linear systems for each image. For training involving many images, this can lead to a high computational cost. The core innovation is a cross-matrix Krylov projection method that exploits mathematical similarities between matrices, using a shared subspace built from "seed" matrices to rapidly solve for subsequent "target" matrices. Our experiments show that this technique achieves a 15.8% to 43.7% time reduction over standard sparse solvers. Additionally, we compare our method against DDPM baselines in denoising tasks, showing a speedup of up to 115x. Furthermore, under a fixed computational budget, our model is able to produce high-quality images while DDPM fails to generate recognizable content, illustrating that our approach is a practical method for efficient generation in resource-limited settings.
Summary: This paper proposes a novel framework for accelerating score-based diffusion models. It first converts the standard stable diffusion model into the Fokker-Planck formulation, which requires solving large linear systems for each image. For training that involves many images, this can lead to a high computational cost. The core innovation is a cross-matrix Krylov projection method that exploits mathematical similarities between matrices, using a shared subspace built from "seed" matrices to rapidly solve for subsequent "target" matrices. The experiments show that the technique reduces solve time by 15.8% to 43.7% compared with standard sparse solvers. We also compare the method against DDPM baselines on denoising tasks and observe a speedup of up to 115x. In addition, under a fixed computational budget, the model produces high-quality images while DDPM fails to generate recognizable content, showing that the approach is a practical method for efficient generation in resource-limited settings.
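The idea of reusing a subspace built from a seed matrix to solve a nearby target system can be sketched with a generic Galerkin projection. This is a textbook construction under stated assumptions, not the paper's cross-matrix algorithm (the abstract does not specify it). The small dense matrices, the helper names, and the seed/target values are all hypothetical.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def krylov_basis(A, b, m):
    """Orthonormal basis of span{b, Ab, ..., A^(m-1) b} (Arnoldi-style)."""
    basis = []
    w = list(b)
    for _ in range(m):
        for q in basis:  # modified Gram-Schmidt against earlier vectors
            c = dot(q, w)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:  # the Krylov subspace is exhausted
            break
        q = [wi / norm for wi in w]
        basis.append(q)
        w = matvec(A, q)
    return basis


def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small system."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def projected_solve(A_target, b, basis):
    """Galerkin projection: solve (V^T A V) y = V^T b, return x = V y."""
    AV = [matvec(A_target, q) for q in basis]
    small = [[dot(qi, avj) for avj in AV] for qi in basis]
    y = solve_dense(small, [dot(q, b) for q in basis])
    return [sum(y[j] * basis[j][i] for j in range(len(basis)))
            for i in range(len(b))]


# Seed matrix and a slightly perturbed target (both symmetric positive
# definite); the basis is built once from the seed and reused.
A_seed = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
A_target = [[4.1, 1.0, 0.0], [1.0, 3.0, 1.1], [0.0, 1.1, 2.0]]
b = [1.0, 2.0, 3.0]

V_full = krylov_basis(A_seed, b, 3)         # spans the whole space here
x_full = projected_solve(A_target, b, V_full)
V_small = krylov_basis(A_seed, b, 2)        # reduced subspace
x_approx = projected_solve(A_target, b, V_small)
```

With a basis that spans the whole space the projected solve reproduces the exact solution of the target system; with fewer basis vectors it returns an approximation whose quality depends on how similar the seed and target matrices are.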

Title: Model-to-Model Knowledge Transmission (M2KT): A Data-Free Framework for Cross-Model Understanding Transfer

Authors: Pratham Sorte
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17638
Pdf URL: https://arxiv.org/pdf/2511.17638
Copy Paste: [[2511.17638]] Model-to-Model Knowledge Transmission (M2KT): A Data-Free Framework for Cross-Model Understanding Transfer(https://arxiv.org/abs/2511.17638)
Keywords: generation
Abstract: Modern artificial intelligence systems depend heavily on large datasets for both training and transferring knowledge between models. Knowledge distillation, transfer learning, and dataset distillation have made such transfers more efficient, yet they remain fundamentally data-driven: a teacher must produce examples, logits, or gradients for a student to learn. In this work, we introduce Model-to-Model Knowledge Transmission (M2KT), a novel paradigm for data-free conceptual transfer between neural networks. M2KT enables models to exchange knowledge packets that encapsulate structured concept embeddings, abstraction graphs, reasoning traces, and provenance metadata. Unlike classical distillation, M2KT operates primarily in concept space rather than example space, and it does not require labeled datasets or teacher-generated outputs during transfer. We formalize the notion of concept manifolds, introduce an inter-model alignment mapping between teacher and student latent spaces, and derive a composite loss that enforces geometric, structural, and reasoning consistency together with explicit safety constraints. We further present algorithmic procedures for teacher-side packet generation and student-side ingestion and verification. Experiments on symbolic reasoning with large language models show that M2KT can achieve approximately 85 to 90 percent of teacher performance while reducing data usage by over 98 percent compared to standard knowledge distillation. This work establishes a theoretical and practical foundation for data-free AI-to-AI knowledge transfer and self-improving model ecosystems.
摘要：现代人工智能系统在很大程度上依赖于大型数据集来进行模型之间的训练和知识转移。知识蒸馏、迁移学习和数据集蒸馏使此类迁移更加高效，但它们从根本上仍然是数据驱动的：教师必须生成示例、逻辑或梯度供学生学习。在这项工作中，我们介绍了模型到模型知识传输（M2KT），这是一种神经网络之间无数据概念传输的新范例。 M2KT 使模型能够交换封装结构化概念嵌入、抽象图、推理轨迹和来源元数据的知识包。与经典蒸馏不同，M2KT 主要在概念空间而不是示例空间中运行，并且在传输过程中不需要标记数据集或教师生成的输出。我们形式化了概念流形的概念，引入了教师和学生潜在空间之间的模型间对齐映射，并导出了一个复合损失，该复合损失增强了几何、结构和推理的一致性以及明确的安全约束。我们进一步提出了教师端数据包生成和学生端摄取和验证的算法程序。使用大型语言模型进行符号推理的实验表明，与标准知识蒸馏相比，M2KT 可以实现约 85% 至 90% 的教师绩效，同时减少 98% 以上的数据使用量。这项工作为无数据的人工智能到人工智能的知识转移和自我改进的模型生态系统奠定了理论和实践基础。

Title: MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence

Authors: Liyuan Deng, Yunpeng Bai, Yongkang Dai, Xiaoshui Huang, Hongping Gan, Dongshuo Huang, Hao jiacheng, Yilei Shi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17647
Pdf URL: https://arxiv.org/pdf/2511.17647
Copy Paste: [[2511.17647]] MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence(https://arxiv.org/abs/2511.17647)
Keywords: generation
Abstract: Parametric Computer-Aided Design (CAD) is crucial in industrial applications, yet existing approaches often struggle to generate long-sequence parametric commands due to complex CAD models' geometric and topological constraints. To address this challenge, we propose MamTiff-CAD, a novel framework for generating CAD parametric command sequences that leverages a Transformer-based diffusion model for multi-scale latent representations. Specifically, we design a novel autoencoder that integrates Mamba+ and Transformer to transfer parameterized CAD sequences into latent representations. The Mamba+ block incorporates a forget gate mechanism to effectively capture long-range dependencies. The non-autoregressive Transformer decoder reconstructs the latent representations. A diffusion model based on a multi-scale Transformer is then trained on these latent embeddings to learn the distribution of long-sequence commands. In addition, we construct a dataset of long parametric sequences, with up to 256 commands for a single CAD model. Experiments demonstrate that MamTiff-CAD achieves state-of-the-art performance on both reconstruction and generation tasks, confirming its effectiveness for long-sequence (60-256 commands) CAD model generation.
摘要：参数化计算机辅助设计 (CAD) 在工业应用中至关重要，但由于复杂 CAD 模型的几何和拓扑限制，现有方法常常难以生成长序列参数化命令。为了应对这一挑战，我们提出了 MamTiff-CAD，这是一种新颖的 CAD 参数化命令序列生成框架，它利用基于 Transformer 的扩散模型进行多尺度潜在表示。具体来说，我们设计了一种集成 Mamba+ 和 Transformer 的新型自动编码器，将参数化的 CAD 序列转换为潜在表示。 Mamba+ 块采用了遗忘门机制来有效捕获远程依赖关系。非自回归 Transformer 解码器重建潜在表示。然后在这些潜在嵌入上训练基于多尺度 Transformer 的扩散模型，以学习长序列命令的分布。此外，我们还构建了一个由长参数序列组成的数据集，单个 CAD 模型最多包含 256 个命令。实验表明，MamTiff-CAD 在重建和生成任务上均实现了最先进的性能，证实了其对于长序列 (60-256) CAD 模型生成的有效性。

Title: SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

Authors: Jieru Lin, Zhiwei Yu, Börje F. Karlsson
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2511.17649
Pdf URL: https://arxiv.org/pdf/2511.17649
Copy Paste: [[2511.17649]] SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios(https://arxiv.org/abs/2511.17649)
Keywords: generation
Abstract: Autonomous intelligence requires not only perception and reasoning, but critically, effective interaction with the existing world and its infrastructure. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, that demand commonsense and physics reasoning, but also causal prediction and outcome verification in time and space (e.g., delayed heating, remote lights). Moreover, failures here have potential safety implications, yet current benchmarks rarely test grounding, partial observability (video), or post-hoc verification in situated settings. We introduce SWITCH (Semantic World Interface Tasks for Control and Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities: task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification, under egocentric RGB video input and device diversity. Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMs exhibit inconsistent performance even on single-step interactions, often over-relying on textual cues and under-using visual or video evidence (and high aggregate scores can mask such failures). SWITCH provides data, code, and held-out splits to enable reproducible evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of training datasets. Benchmark resources are available at: this https URL.
摘要：自主智能不仅需要感知和推理，更重要的是，需要与现有世界及其基础设施进行有效的互动。日常环境中存在着丰富的有形控制接口 (TCI)，例如电灯开关、电器面板和嵌入式 GUI，这些接口需要常识和物理推理，还需要时间和空间上的因果预测和结果验证（例如延迟加热、远程灯光）。此外，这里的故障具有潜在的安全影响，但当前的基准很少测试接地、部分可观察性（视频）或特定环境中的事后验证。我们引入了 SWITCH（用于控制和处理的语义世界接口任务），这是一个通过迭代版本创建的具体的、任务驱动的基准测试，旨在探索这些差距。它的第一次迭代 SWITCH-Basic 在以自我为中心的 RGB 视频输入和设备多样性下评估了五种互补能力：任务感知 VQA、语义 UI 基础、动作生成、状态转换预测和结果验证。在跨越 98 个真实设备和设备的 351 个任务中，商业和开放的 LMMM 即使在单步交互中也表现出不一致的性能，通常过度依赖文本提示而未充分使用视觉或视频证据（高总分可以掩盖此类失败）。 SWITCH 提供数据、代码和保留的分割，以实现可重复的评估和社区贡献，以实现更具挑战性的未来基准迭代和训练数据集的创建。基准资源可在以下位置获得：此 https URL。

Title: GANGR: GAN-Assisted Scalable and Efficient Global Routing Parallelization

Authors: Hadi Khodaei Jooshin, Inna Partin-Vaisband
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17665
Pdf URL: https://arxiv.org/pdf/2511.17665
Copy Paste: [[2511.17665]] GANGR: GAN-Assisted Scalable and Efficient Global Routing Parallelization(https://arxiv.org/abs/2511.17665)
Keywords: generation, generative
Abstract: Global routing is a critical stage in electronic design automation (EDA) that enables early estimation and optimization of the routability of modern integrated circuits with respect to congestion, power dissipation, and design complexity. Batching is a primary concern in top-performing global routers, grouping nets into manageable sets to enable parallel processing and efficient resource usage. This process improves memory usage, scalable parallelization on modern hardware, and routing congestion by controlling net interactions within each batch. However, conventional batching methods typically depend on heuristics that are computationally expensive and can lead to suboptimal results (oversized batches with conflicting nets, excessive batch counts degrading parallelization, and longer batch generation times), ultimately limiting scalability and efficiency. To address these limitations, a novel batching algorithm enhanced with Wasserstein generative adversarial networks (WGANs) is introduced in this paper, enabling more effective parallelization by generating fewer, higher-quality batches in less time. The proposed algorithm is tested on the latest ISPD'24 contest benchmarks, demonstrating up to 40% runtime reduction with only 0.002% degradation in routing quality compared to a state-of-the-art router.
摘要：全局布线是电子设计自动化 (EDA) 的关键阶段，可以在拥塞、功耗和设计复杂性方面对现代集成电路的可布线性进行早期评估和优化。批处理是高性能全局路由器的主要关注点，它将网络分组为可管理的集合，以实现并行处理和高效的资源利用。此过程通过控制每个批次内的网络交互来改善内存使用、现代硬件上的可扩展并行化以及路由拥塞。然而，传统的批处理方法通常依赖于计算成本高昂的启发式方法，并且可能导致次优结果（具有冲突网络的过大批次、过多的批次计数降低了并行性以及较长的批次生成时间），最终限制了可扩展性和效率。为了解决这些限制，本文引入了一种利用 Wasserstein 生成对抗网络 (WGAN) 增强的新型批处理算法，通过在更短的时间内生成更少的更高质量的批次来实现更有效的并行化。所提出的算法在最新的 ISPD'24 竞赛基准上进行了测试，与最先进的路由器相比，运行时间减少了 40%，而路由质量仅下降了 0.002%。

Title: VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

Authors: Lingxiao Li, Yifan Wang, Xinyan Gao, Chen Tang, Xiangyu Yue, Chenyu You
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.17731
Pdf URL: https://arxiv.org/pdf/2511.17731
Copy Paste: [[2511.17731]] VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning(https://arxiv.org/abs/2511.17731)
Keywords: generation
Abstract: Chain-of-Thought (CoT) prompting has proven remarkably effective for eliciting complex reasoning in large language models (LLMs). Yet, its potential in multimodal large language models (MLLMs) remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. Existing visual-CoT resources are typically small, domain-specific, or lack the human-like stepwise structure necessary for compositional visual reasoning. In this paper, we introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning. VisReason comprises 489K annotated examples spanning four diverse domains, each featuring multi-round, human-like rationales that guide MLLMs through interpretable visual reasoning steps. Building upon this, we curate VisReason-Pro, a 165K subset produced with a stronger expert-level GPT annotator, enriched with detailed reasoning traces and 3D spatial grounding via depth-informed annotations. Fine-tuning the state-of-the-art Qwen2.5-VL model on VisReason and VisReason-Pro yields substantial improvements in step-by-step visual reasoning accuracy, interpretability, and cross-benchmark generalization. These results demonstrate that VisReason equips MLLMs with more systematic and generalizable reasoning capabilities. We envision VisReason as a cornerstone for cultivating human-like visual reasoning, paving the way toward the next generation of multimodal intelligence.
摘要：事实证明，思想链 (CoT) 提示对于在大型语言模型 (LLM) 中引发复杂推理非常有效。然而，它在多模态大语言模型（MLLM）中的潜力在很大程度上仍未得到开发，因为缺乏捕获视觉理解固有的丰富的、基于空间的推理的大规模数据集，这阻碍了它的发展。现有的视觉 CoT 资源通常较小、特定于领域，或者缺乏组合视觉推理所需的类人逐步结构。在本文中，我们介绍了 VisReason，这是一个旨在推进视觉思维链推理的大型数据集。 VisReason 包含跨越四个不同领域的 489K 个带注释的示例，每个示例都具有多轮、类人的基本原理，可指导 MLLM 通过可解释的视觉推理步骤。在此基础上，我们策划了 VisReason-Pro，这是一个由更强大的专家级 GPT 注释器生成的 165K 子集，通过深度信息注释丰富了详细的推理轨迹和 3D 空间基础。在 VisReason 和 VisReason-Pro 上对最先进的 Qwen2.5-VL 模型进行微调，可以在逐步视觉推理的准确性、可解释性和跨基准泛化方面产生重大改进。这些结果表明 VisReason 为 MLLM 配备了更加系统化和通用化的推理能力。我们将 VisReason 视为培养类人视觉推理的基石，为下一代多模态智能铺平道路。

Title: Deepfake Geography: Detecting AI-Generated Satellite Images

Authors: Mansur Yerzhanuly
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17766
Pdf URL: https://arxiv.org/pdf/2511.17766
Copy Paste: [[2511.17766]] Deepfake Geography: Detecting AI-Generated Satellite Images(https://arxiv.org/abs/2511.17766)
Keywords: generative
Abstract: The rapid advancement of generative models such as StyleGAN2 and Stable Diffusion poses a growing threat to the authenticity of satellite imagery, which is increasingly vital for reliable analysis and decision-making across scientific and security domains. While deepfake detection has been extensively studied in facial contexts, satellite imagery presents distinct challenges, including terrain-level inconsistencies and structural artifacts. In this study, we conduct a comprehensive comparison between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting AI-generated satellite images. Using a curated dataset of over 130,000 labeled RGB images from the DM-AER and FSI datasets, we show that ViTs significantly outperform CNNs in both accuracy (95.11 percent vs. 87.02 percent) and overall robustness, owing to their ability to model long-range dependencies and global semantic structures. We further enhance model transparency using architecture-specific interpretability methods, including Grad-CAM for CNNs and Chefer's attention attribution for ViTs, revealing distinct detection behaviors and validating model trustworthiness. Our results highlight the ViT's superior performance in detecting structural inconsistencies and repetitive textural patterns characteristic of synthetic imagery. Future work will extend this research to multispectral and SAR modalities and integrate frequency-domain analysis to further strengthen detection capabilities and safeguard satellite imagery integrity in high-stakes applications.
摘要：StyleGAN2 和 Stable Diffusion 等生成模型的快速发展对卫星图像的真实性构成了越来越大的威胁，而卫星图像对于科学和安全领域的可靠分析和决策越来越重要。虽然深度换脸检测已在面部环境中得到广泛研究，但卫星图像带来了明显的挑战，包括地形水平的不一致和结构伪影。在这项研究中，我们对用于检测人工智能生成的卫星图像的卷积神经网络（CNN）和视觉变换器（ViT）进行了全面比较。使用来自 DM-AER 和 FSI 数据集的超过 130,000 张标记 RGB 图像的精选数据集，我们发现 ViT 在准确性（95.11% vs. 87.02%）和整体鲁棒性方面都显着优于 CNN，因为它们能够对远程依赖性和全局语义结构进行建模。我们使用特定于架构的可解释性方法（包括用于 CNN 的 Grad-CAM 和用于 ViT 的 Chefer 注意力归因）进一步增强模型透明度，揭示不同的检测行为并验证模型的可信度。我们的结果凸显了 ViT 在检测合成图像的结构不一致和重复纹理模式特征方面的卓越性能。未来的工作将把这项研究扩展到多光谱和合成孔径雷达模式，并集成频域分析，以进一步增强探测能力并在高风险应用中保护卫星图像的完整性。

Title: QAL: A Loss for Recall Precision Balance in 3D Reconstruction

Authors: Pranay Meshram, Yash Turkar, Kartikeya Singh, Praveen Raj Masilamani, Charuvahan Adhivarahan, Karthik Dantu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.17824
Pdf URL: https://arxiv.org/pdf/2511.17824
Copy Paste: [[2511.17824]] QAL: A Loss for Recall Precision Balance in 3D Reconstruction(https://arxiv.org/abs/2511.17824)
Keywords: generation
Abstract: Volumetric learning underpins many 3D vision tasks such as completion, reconstruction, and mesh generation, yet training objectives still rely on Chamfer Distance (CD) or Earth Mover's Distance (EMD), which fail to balance recall and precision. We propose Quality-Aware Loss (QAL), a drop-in replacement for CD/EMD that combines a coverage-weighted nearest-neighbor term with an uncovered-ground-truth attraction term, explicitly decoupling recall and precision into tunable components. Across diverse pipelines, QAL achieves consistent coverage gains, improving by an average of +4.3 pts over CD and +2.8 pts over the best alternatives. Though modest in percentage, these improvements reliably recover thin structures and under-represented regions that CD/EMD overlook. Extensive ablations confirm stable performance across hyperparameters and across output resolutions, while full retraining on PCN and ShapeNet demonstrates generalization across datasets and backbones. Moreover, QAL-trained completions yield higher grasp scores under GraspNet evaluation, showing that improved coverage translates directly into more reliable robotic manipulation. QAL thus offers a principled, interpretable, and practical objective for robust 3D vision and safety-critical robotics pipelines.
摘要：体积学习支持许多 3D 视觉任务，例如完成、重建和网格生成，但训练目标仍然依赖于倒角距离 (CD) 或推土机距离 (EMD)，这无法平衡召回率和精度。我们提出了质量感知损失（QAL），它是 CD/EMD 的直接替代品，它将覆盖加权的最近邻项与未覆盖的地面实况吸引项相结合，将召回率和精度明确地解耦为可调组件。在不同的管道中，QAL 实现了一致的覆盖率增益，比 CD 平均提高了 4.3 点，比最佳替代方案平均提高了 2.8 点。尽管百分比不大，但这些改进可靠地恢复了 CD/EMD 忽略的薄结构和代表性不足的区域。广泛的消融确认了跨超参数和跨输出分辨率的稳定性能，而 PCN 和 ShapeNet 上的全面再训练则证明了跨数据集和主干网的泛化。此外，经过 QAL 训练的完成在 GraspNet 评估下产生了更高的掌握分数，这表明覆盖范围的提高可以直接转化为更可靠的机器人操作。因此，QAL 为强大的 3D 视觉和安全关键型机器人管道提供了原则性的、可解释的和实用的目标
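
Illustrative note (not from the paper): the QAL abstract contrasts recall and precision in Chamfer Distance. The sketch below is a minimal NumPy version of the standard symmetric Chamfer Distance under one common convention (mean of squared nearest-neighbor distances), split into its two directed terms. It is not an implementation of QAL; the function names `directed_nn_dist` and `chamfer_terms` and the toy point sets are assumptions made for illustration.

```python
import numpy as np

def directed_nn_dist(src, dst):
    """Mean squared distance from each point in src to its nearest neighbor in dst."""
    diff = src[:, None, :] - dst[None, :, :]   # pairwise differences, shape (N, M, 3)
    d2 = np.sum(diff ** 2, axis=-1)            # pairwise squared distances, shape (N, M)
    return float(d2.min(axis=1).mean())

def chamfer_terms(pred, gt):
    """Return the two directed terms of the symmetric Chamfer Distance."""
    precision_term = directed_nn_dist(pred, gt)  # penalizes predicted points far from the ground truth
    recall_term = directed_nn_dist(gt, pred)     # penalizes ground-truth points left uncovered
    return precision_term, recall_term

# Toy example (hypothetical data): the prediction misses the third ground-truth point.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])

precision_term, recall_term = chamfer_terms(pred, gt)
chamfer = precision_term + recall_term
```

In this toy case every predicted point lies on the ground truth, so the precision-side term is zero, while the uncovered point only shows up in the recall-side term. Plain CD adds the two terms with fixed equal weight, which is the coupling the abstract says QAL replaces with tunable components.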

Title: Show Me: Unifying Instructional Image and Video Generation with Diffusion Models

Authors: Yujiang Pu, Zhanbo Huang, Vishnu Boddeti, Yu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17839
Pdf URL: https://arxiv.org/pdf/2511.17839
Copy Paste: [[2511.17839]] Show Me: Unifying Instructional Image and Video Generation with Diffusion Models(https://arxiv.org/abs/2511.17839)
Keywords: generation
Abstract: Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.
摘要：在给定上下文中生成视觉指令对于开发交互式世界模拟器至关重要。虽然之前的工作通过文本引导的图像处理或视频预测来解决这个问题，但这些任务通常是孤立处理的。这种分离揭示了一个基本问题：图像处理方法忽略了动作如何随时间展开，而视频预测模型通常忽略预期结果。为此，我们提出了 ShowMe，一个统一的框架，通过有选择地激活视频扩散模型的空间和时间组件来实现这两项任务。此外，我们引入结构和运动一致性奖励来提高结构保真度和时间连贯性。值得注意的是，这种统一带来了双重好处：通过视频预训练获得的空间知识增强了非刚性图像编辑中的上下文一致性和真实感，而指令引导的操作阶段为模型提供了更强的视频预测目标导向推理能力。对不同基准的实验表明，我们的方法在教学图像和视频生成方面都优于专家模型，凸显了视频扩散模型作为统一动作对象状态转换器的优势。

Title: Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Authors: Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17844
Pdf URL: https://arxiv.org/pdf/2511.17844
Copy Paste: [[2511.17844]] Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation(https://arxiv.org/abs/2511.17844)
Keywords: generation, generative
Abstract: Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
摘要：微调大规模文本到视频扩散模型以添加新的生成控制，例如对物理相机参数（例如快门速度或光圈）的控制，通常需要难以获取的大量高保真数据集。在这项工作中，我们提出了一种数据高效的微调策略，可以从稀疏、低质量的合成数据中学习这些控制。我们证明，对如此简单的数据进行微调不仅可以实现所需的控制，而且实际上比在逼真的“真实”数据上微调的模型产生更好的结果。除了展示这些结果之外，我们还提供了一个框架，可以直观地和定量地证明这种现象的合理性。

Title: Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Authors: Yusong Wu, Stephen Brade, Teng Ma, Tia-Jane Fowler, Enning Yang, Berker Banar, Aaron Courville, Natasha Jaques, Cheng-Zhi Anna Huang
Subjects: cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2511.17879
Pdf URL: https://arxiv.org/pdf/2511.17879
Copy Paste: [[2511.17879]] Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction(https://arxiv.org/abs/2511.17879)
Keywords: generative
Abstract: Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.
摘要：生成式人工智能的大多数应用都涉及顺序交互，其中人输入提示并等待响应，反应时间和适应性并不是重要因素。相比之下，现场干扰是一种协作互动，需要实时协调和适应，而无需了解其他玩家未来的动作，同时保留多样性以维持创意流程。强化学习后训练可以通过策略交互实现有效的适应，但它通常会通过利用基于一致性的奖励来减少输出多样性。这种崩溃被称为“奖励黑客”，会影响许多 RL 训练后流程，但在现场即兴演奏中尤其有害，因为音乐创造力依赖于动态变化和相互响应。在本文中，我们提出了一种针对策略生成轨迹的新颖对抗性训练方法，以减轻旋律到和弦伴奏的 RL 后期训练中的奖励黑客行为。共同进化的判别器将策略轨迹与数据分布分开，而策略除了一致性奖励之外还最大化判别器的输出，以防止崩溃到微不足道的输出。我们使用固定的测试旋律和学习的旋律代理来评估模拟中的伴奏质量和输出多样性，并使用部署在与专家音乐家的实时交互系统中的模型进行用户研究。定量评估和用户反馈表明输出多样性、谐波一致性、适应速度和用户代理得到改善。我们的结果展示了一种简单而有效的方法，可以减轻生成序列模型的强化学习后训练中的奖励黑客行为。
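
Illustrative note (not from the paper): the abstract says the policy maximizes the discriminator output in addition to coherence rewards. A minimal sketch of that idea follows, assuming a simple additive combination. The function name `combined_reward`, the weight `adv_weight`, and all numeric scores are hypothetical; the paper's actual reward shaping may differ.

```python
def combined_reward(coherence, disc_score, adv_weight=1.0):
    """Coherence reward plus a weighted discriminator score.

    disc_score is assumed to be higher when a trajectory looks like it
    came from the data distribution (an assumption for this sketch).
    """
    return coherence + adv_weight * disc_score

# Hypothetical scores: a collapsed, repetitive output is slightly more
# "coherent" but is easily recognized by the discriminator as policy-generated.
collapsed = combined_reward(coherence=0.9, disc_score=0.1)
varied = combined_reward(coherence=0.8, disc_score=0.7)

# With the adversarial term switched off, only coherence counts.
collapsed_no_adv = combined_reward(coherence=0.9, disc_score=0.1, adv_weight=0.0)
varied_no_adv = combined_reward(coherence=0.8, disc_score=0.7, adv_weight=0.0)
```

Under these made-up numbers the collapsed output wins on coherence alone, and the varied output wins once the discriminator term is included, which is the mechanism the abstract credits with preventing collapse to trivial outputs.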

Title: ArticFlow: Generative Simulation of Articulated Mechanisms

Authors: Jiong Lin, Jinchen Ruan, Hod Lipson
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.17883
Pdf URL: https://arxiv.org/pdf/2511.17883
Copy Paste: [[2511.17883]] ArticFlow: Generative Simulation of Articulated Mechanisms(https://arxiv.org/abs/2511.17883)
Keywords: generation, generative
Abstract: Recent advances in generative models have produced strong results for static 3D shapes, whereas articulated 3D generation remains challenging due to action-dependent deformations and limited datasets. We introduce ArticFlow, a two-stage flow matching framework that learns a controllable velocity field from noise to target point sets under explicit action control. ArticFlow couples (i) a latent flow that transports noise to a shape-prior code and (ii) a point flow that transports points conditioned on the action and the shape prior, enabling a single model to represent diverse articulated categories and generalize across actions. On MuJoCo Menagerie, ArticFlow functions both as a generative model and as a neural simulator: it predicts action-conditioned kinematics from a compact prior and synthesizes novel morphologies via latent interpolation. Compared with object-specific simulators and an action-conditioned variant of static point-cloud generators, ArticFlow achieves higher kinematic accuracy and better shape quality. Results show that action-conditioned flow matching is a practical route to controllable and high-quality articulated mechanism generation.
摘要：生成模型的最新进展已经为静态 3D 形状产生了强大的结果，而由于动作相关的变形和有限的数据集，铰接式 3D 生成仍然具有挑战性。我们引入了 ArticFlow，一个两阶段流匹配框架，它在显式动作控制下学习从噪声到目标点集的可控速度场。 ArticFlow 将 (i) 将噪声传输到形状先验代码的潜在流和 (ii) 传输以动作和形状先验为条件的点的点流，使单个模型能够表示不同的铰接类别并跨动作进行泛化。在 MuJoCo Menagerie 上，ArticFlow 既充当生成模型又充当神经模拟器：它根据紧凑的先验预测动作条件运动学，并通过潜在插值合成新颖的形态。与特定于对象的模拟器和静态点云生成器的动作条件变体相比，ArticFlow 实现了更高的运动学精度和更好的形状质量。结果表明，动作条件流匹配是生成可控且高质量铰接机构的实用途径。
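
Illustrative note (not from the paper): ArticFlow is described as a flow matching framework that learns a velocity field from noise to target point sets. The sketch below shows the generic flow matching training target with a linear interpolant, a common choice in the flow matching literature. Whether ArticFlow uses exactly this interpolant is an assumption, and the action and shape-prior conditioning as well as the network itself are omitted. The names `flow_matching_pair` and `flow_matching_loss` are hypothetical.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Build the interpolated points x_t and the target velocity for a linear path.

    x0: noise samples, x1: target points, t: times in [0, 1], one per sample.
    """
    t = np.asarray(t).reshape(-1, 1)
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight line from x0 to x1
    v_target = x1 - x0              # constant velocity along that line
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean(np.sum((v_pred - v_target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 3))   # stand-in for target points
x0 = rng.normal(size=(8, 3))   # stand-in for noise
t = rng.uniform(size=8)

x_t, v_target = flow_matching_pair(x0, x1, t)
# A perfect velocity predictor would reach zero loss.
loss_oracle = flow_matching_loss(v_target, v_target)
# Following the target velocity for the remaining time lands on x1.
x_end = x_t + (1.0 - t[:, None]) * v_target
```

In training, `v_pred` would come from a network evaluated at `(x_t, t)` plus any conditioning. Here the target itself is reused only to check the bookkeeping.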

Title: Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion

Authors: Yan Xu, Yixing Wang, Stella X. Yu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2511.17932
Pdf URL: https://arxiv.org/pdf/2511.17932
Copy Paste: [[2511.17932]] Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion(https://arxiv.org/abs/2511.17932)
Keywords: generation
Abstract: Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That's the lens we take on sparse-input novel view synthesis, not only as filling spatial gaps between widely spaced views, but also as completing a natural video unfolding through space. We recast the task as test-time natural video completion, using powerful priors from pretrained video diffusion models to hallucinate plausible in-between views. Our zero-shot, generation-guided framework produces pseudo views at novel camera poses, modulated by an uncertainty-aware mechanism for spatial coherence. These synthesized frames densify supervision for 3D Gaussian Splatting (3D-GS) for scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views. The result is coherent, high-fidelity renderings from sparse inputs without any scene-specific training or fine-tuning. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity.
摘要：只要瞥见一个场景，你能想象当镜头滑过时电影正在播放吗？这就是我们在 \emph{稀疏输入新颖视图合成} 上采用的镜头，不仅可以填充宽间隔视图之间的空间间隙，而且还可以 \emph{完成自然视频} 在空间中展开。我们将任务重新定义为 \emph{测试时自然视频完成}，使用 \emph{预训练视频扩散模型} 中的强大先验来幻化出可信的中间视图。我们的\emph{零镜头，生成引导}框架在新颖的相机姿势下生成伪视图，并通过\emph{不确定性感知机制}进行调制以实现空间一致性。这些合成帧强化了用于场景重建的 \emph{3D Gaussian Splatting} (3D-GS) 的监督，特别是在观察不足的区域。迭代反馈循环让 3D 几何和 2D 视图合成相互通知，从而改进场景重建和生成的视图。结果是从稀疏输入得到连贯的高保真渲染\emph{无需任何特定于场景的训练或微调}。在 LLFF、DTU、DL3DV 和 MipNeRF-360 上，我们的方法在极度稀疏的情况下显着优于强大的 3D-GS 基线。

Title: Mitigating Catastrophic Forgetting in Streaming Generative and Predictive Learning via Stateful Replay

Authors: Wenzhang Du
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.17936
Pdf URL: https://arxiv.org/pdf/2511.17936
Copy Paste: [[2511.17936]] Mitigating Catastrophic Forgetting in Streaming Generative and Predictive Learning via Stateful Replay(https://arxiv.org/abs/2511.17936)
Keywords: generative
Abstract: Many deployed learning systems must update models on streaming data under memory constraints. The default strategy, sequential fine-tuning on each new phase, is architecture-agnostic but often suffers catastrophic forgetting when later phases correspond to different sub-populations or tasks. Replay with a finite buffer is a simple alternative, yet its behaviour across generative and predictive objectives is not well understood. We present a unified study of stateful replay for streaming autoencoding, time series forecasting, and classification. We view both sequential fine-tuning and replay as stochastic gradient methods for an ideal joint objective, and use a gradient alignment analysis to show when mixing current and historical samples should reduce forgetting. We then evaluate a single replay mechanism on six streaming scenarios built from Rotated MNIST, ElectricityLoadDiagrams 2011-2014, and Airlines delay data, using matched training budgets and three seeds. On heterogeneous multi-task streams, replay reduces average forgetting by a factor of two to three, while on benign time-based streams both methods perform similarly. These results position stateful replay as a strong and simple baseline for continual learning in streaming environments.
摘要：许多已部署的学习系统必须在内存限制下更新流数据模型。默认策略是对每个新阶段进行顺序微调，与架构无关，但当后续阶段对应于不同的子群体或任务时，通常会遭受灾难性的遗忘。使用有限缓冲区重放是一种简单的替代方案，但其在生成和预测目标方面的行为尚不清楚。我们提出了针对流自动编码、时间序列预测和分类的状态重放的统一研究。我们将顺序微调和重放视为理想联合目标的随机梯度方法，并使用梯度对齐分析来显示何时混合当前和历史样本可以减少遗忘。然后，我们使用匹配的训练预算和三个种子，在根据 Rotated MNIST、ElectricityLoadDiagrams 2011-2014 和航空公司延误数据构建的六个流场景上评估单一重播机制。在异构多任务流上，重放将平均遗忘减少了两到三倍，而在基于良性时间的流上，两种方法的性能相似。这些结果将状态重播定位为流环境中持续学习的强大且简单的基线。
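
Illustrative note (not from the paper): the abstract describes replay with a finite buffer that mixes current and historical samples. The sketch below is one minimal way to do that in Python, using reservoir sampling to keep a bounded, uniformly sampled memory of the stream. The abstract does not say which buffer policy the paper uses, so reservoir sampling, the class name `ReservoirReplayBuffer`, and the helper `mixed_batch` are all assumptions.

```python
import random

class ReservoirReplayBuffer:
    """Fixed-capacity buffer holding a uniform random sample of all items seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)

def mixed_batch(current_batch, buffer, replay_size):
    """Combine the current batch with replayed items, then store the current items."""
    replayed = buffer.sample(replay_size)
    for item in current_batch:
        buffer.add(item)
    return list(current_batch) + replayed

# Toy stream with two phases (hypothetical data).
buffer = ReservoirReplayBuffer(capacity=4)
phase_a = [("A", i) for i in range(10)]
phase_b = [("B", i) for i in range(10)]

for i in range(0, 10, 2):
    mixed_batch(phase_a[i:i + 2], buffer, replay_size=2)

# The first batch of phase B is mixed with samples remembered from phase A.
batch = mixed_batch(phase_b[:2], buffer, replay_size=2)
```

A training loop would take its gradient step on `batch` rather than on the current samples alone, which is the mixing of current and historical samples that the abstract analyzes.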

Title: VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment

Authors: Ziheng Jia, Linhan Cao, Jinliang Han, Zicheng Zhang, Jiaying Qian, Jiarui Wang, Zijian Chen, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17962
Pdf URL: https://arxiv.org/pdf/2511.17962
Copy Paste: [[2511.17962]] VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment(https://arxiv.org/abs/2511.17962)
Keywords: generative, quality assessment
Abstract: Developing a robust visual quality assessment (VQualA) large multi-modal model (LMM) requires achieving versatility, powerfulness, and transferability. However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types, thereby limiting their generalization capacity and transferability. To address this, we propose a vision-encoder-centered generative pre-training pipeline and develop the VITAL-Series LMMs. (1) We adopt a machine-executed annotation-scrutiny paradigm, constructing over 4.5M vision-language (VL) pairs, the largest VQualA training dataset to date. (2) We employ a multi-task training workflow that simultaneously enhances the model's quantitative scoring precision and strengthens its capability for quality interpretation across both image and video modalities. (3) Building upon the vision encoder, we realize an efficient model zoo extension: the model zoo exhibits strong zero-shot performance, and each paired decoder requires only a swift warm-up using less than 1/1000 of the pre-training data to achieve performance comparable to the fully trained counterpart. Overall, our work lays a cornerstone for advancing toward the foundation LMM for VQualA.
摘要：开发强大的视觉质量评估 (VQualA) 大型多模态模型 (LMM) 需要实现多功能性、强大性和可转移性。然而，现有的VQualA LMM通常专注于单一任务并依赖于全参数微调，这使得它们容易对特定模式或任务类型过度拟合，从而限制了它们的泛化能力和可迁移性。为了解决这个问题，我们提出了一种以视觉编码器为中心的生成预训练流程，并开发了 VITAL 系列 LMM。 (1) 我们采用机器执行的注释审查范例，构建了超过 450 万个视觉语言 (VL) 对，这是迄今为止最大的 VQualA 训练数据集。 (2)我们采用多任务训练工作流程，同时提高模型的定量评分精度，并增强其跨图像和视频模式的质量解释能力。 (3) 在视觉编码器的基础上，我们实现了高效的模型动物园扩展：模型动物园表现出强大的零样本性能，并且每个配对解码器仅需要使用不到 1/1000 的预训练数据进行快速预热，即可实现与完全训练的对应解码器相当的性能。总的来说，我们的工作为 VQualA 的基础 LMM 的发展奠定了基石。

Title: FeRA: Frequency-Energy Constrained Routing for Effective Diffusion Adaptation Fine-Tuning

Authors: Bo Yin, Xiaobin Hu, Xingyu Zhou, Peng-Tao Jiang, Yue Liao, Junwei Zhu, Jiangning Zhang, Ying Tai, Chengjie Wang, Shuicheng Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17979
Pdf URL: https://arxiv.org/pdf/2511.17979
Copy Paste: [[2511.17979]] FeRA: Frequency-Energy Constrained Routing for Effective Diffusion Adaptation Fine-Tuning(https://arxiv.org/abs/2511.17979)
Keywords: generative
Abstract: Diffusion models have achieved remarkable success in generative modeling, yet how to effectively adapt large pretrained models to new tasks remains challenging. We revisit the reconstruction behavior of diffusion models during denoising to unveil the underlying frequency-energy mechanism governing this process. Building upon this observation, we propose FeRA, a frequency-driven fine-tuning framework that aligns parameter updates with the intrinsic frequency-energy progression of diffusion. FeRA establishes a comprehensive frequency-energy framework for effective diffusion adaptation fine-tuning, comprising three synergistic components: (i) a compact frequency-energy indicator that characterizes the latent bandwise energy distribution, (ii) a soft frequency router that adaptively fuses multiple frequency-specific adapter experts, and (iii) a frequency-energy consistency regularization that stabilizes diffusion optimization and ensures coherent adaptation across bands. Routing operates in both training and inference, with inference-time routing dynamically determined by the latent frequency energy. FeRA integrates seamlessly with adapter-based tuning schemes and generalizes well across diffusion backbones and resolutions. By aligning adaptation with the frequency-energy mechanism, FeRA provides a simple, stable, and compatible paradigm for effective and robust diffusion model adaptation.
摘要：扩散模型在生成建模方面取得了巨大的成功，但如何有效地使大型预训练模型适应新任务仍然具有挑战性。我们重新审视去噪期间扩散模型的重建行为，以揭示控制该过程的潜在频率能量机制。基于这一观察，我们提出了 FeRA，一种频率驱动的微调框架，可将参数更新与扩散的固有频率能量级数保持一致。 FeRA 建立了一个用于有效扩散自适应微调的综合频率能量框架，包括三个协同组件：(i) 一个紧凑的频率能量指示器，用于表征潜在的频带能量分布；(ii) 一个软频率路由器，可自适应地融合多个频率特定适配器专家；(iii) 一个频率能量一致性正则化，可稳定扩散优化并确保跨频带的一致自适应。路由在训练和推理中都起作用，推理时间路由由潜在频率能量动态确定。它与基于适配器的调整方案无缝集成，并在扩散主干和分辨率之间很好地推广。通过将适应与频率能量机制结合起来，FeRA 为有效且稳健的扩散模型适应提供了简单、稳定且兼容的范例。

Title: Plan-X: Instruct Video Generation via Semantic Planning

Authors: Lun Huang, You Xie, Hongyi Xu, Tianpei Gu, Chenxu Zhang, Guoxian Song, Zenan Li, Xiaochen Zhao, Linjie Luo, Guillermo Sapiro
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17986
Pdf URL: https://arxiv.org/pdf/2511.17986
Copy Paste: [[2511.17986]] Plan-X: Instruct Video Generation via Semantic Planning(https://arxiv.org/abs/2511.17986)
Keywords: generation
Abstract: Diffusion Transformers have demonstrated remarkable capabilities in visual synthesis, yet they often struggle with high-level semantic reasoning and long-horizon planning. This limitation frequently leads to visual hallucinations and misalignments with user instructions, especially in scenarios involving complex scene understanding, human-object interactions, multi-stage actions, and in-context motion reasoning. To address these challenges, we propose Plan-X, a framework that explicitly enforces high-level semantic planning to instruct the video generation process. At its core lies a Semantic Planner, a learnable multimodal language model that reasons over the user's intent from both text prompts and visual context, and autoregressively generates a sequence of text-grounded spatio-temporal semantic tokens. These semantic tokens, complementary to high-level text prompt guidance, serve as structured "semantic sketches" over time for the video diffusion model, whose strength lies in synthesizing high-fidelity visual details. Plan-X effectively integrates the strength of language models in multimodal in-context reasoning and planning with the strength of diffusion models in photorealistic video synthesis. Extensive experiments demonstrate that our framework substantially reduces visual hallucinations and enables fine-grained, instruction-aligned video generation consistent with multimodal context.
摘要：扩散 Transformer 在视觉合成方面表现出了卓越的能力，但它们经常在高级语义推理和长期规划方面遇到困难。这种限制经常导致视觉幻觉和与用户指令的错位，特别是在涉及复杂场景理解、人与物体交互、多阶段动作和上下文运动推理的场景中。为了应对这些挑战，我们提出了 Plan-X，这是一个明确执行高级语义规划来指导视频生成过程的框架。其核心是语义规划器，这是一种可学习的多模态语言模型，可以根据文本提示和视觉上下文推理用户的意图，并自回归生成一系列基于文本的时空语义标记。这些语义标记与高级文本提示指导相辅相成，随着时间的推移，可以充当视频扩散模型的结构化“语义草图”，该模型在合成高保真视觉细节方面具有优势。 Plan-X 有效地整合了语言模型在多模态上下文推理和规划中的优势以及扩散模型在逼真视频合成中的优势。大量的实验表明，我们的框架大大减少了视觉幻觉，并能够实现与多模态上下文一致的细粒度、指令对齐的视频生成。

Title: SD-PSFNet: Sequential and Dynamic Point Spread Function Network for Image Deraining

Authors: Jiayu Wang, Haoyu Bian, Haoran Sun, Shaoning Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17993
Pdf URL: https://arxiv.org/pdf/2511.17993
Copy Paste: [[2511.17993]] SD-PSFNet: Sequential and Dynamic Point Spread Function Network for Image Deraining(https://arxiv.org/abs/2511.17993)
Keywords: restoration
Abstract: Image deraining is crucial for vision applications but is challenged by the complex multi-scale physics of rain and its coupling with scenes. To address this challenge, a novel approach inspired by multi-stage image restoration is proposed, incorporating Point Spread Function (PSF) mechanisms to reveal the image degradation process while combining dynamic physical modeling with sequential feature fusion transfer, named SD-PSFNet. Specifically, SD-PSFNet employs a sequential restoration architecture with three cascaded stages, allowing multiple dynamic evaluations and refinements of the degradation process estimation. The network utilizes components with learned PSF mechanisms to dynamically simulate rain streak optics, enabling effective rain-background separation while progressively enhancing outputs through novel PSF components at each stage. Additionally, SD-PSFNet incorporates adaptive gated fusion for optimal cross-stage feature integration, enabling sequential refinement from coarse rain removal to fine detail restoration. Our model achieves state-of-the-art PSNR/SSIM metrics on Rain100H (33.12dB/0.9371), RealRain-1k-L (42.28dB/0.9872), and RealRain-1k-H (41.08dB/0.9838). In summary, SD-PSFNet demonstrates excellent capability in complex scenes and dense rainfall conditions, providing a new physics-aware approach to image deraining.
摘要：图像去雨对于视觉应用至关重要，但受到雨的复杂多尺度物理及其与场景耦合的挑战。为了应对这一挑战，人们提出了一种受多阶段图像恢复启发的新方法，结合点扩散函数（PSF）机制来揭示图像退化过程，同时将动态物理建模与顺序特征融合传输相结合，称为SD-PSFNet。具体来说，SD-PSFNet 采用具有三个级联阶段的顺序恢复架构，允许对退化过程估计进行多次动态评估和细化。该网络利用具有学习 PSF 机制的组件来动态模拟雨条纹光学，实现有效的雨背景分离，同时通过每个阶段的新型 PSF 组件逐步增强输出。此外，SD-PSFNet 结合了自适应门控融合，以实现最佳的跨阶段特征集成，从而实现从粗略除雨到精细细节恢复的顺序细化。我们的模型在 Rain100H (33.12dB/0.9371)、RealRain-1k-L (42.28dB/0.9872) 和 RealRain-1k-H (41.08dB/0.9838) 上实现了最先进的 PSNR/SSIM 指标。综上所述，SD-PSFNet在复杂场景和密集降雨条件下表现出了优异的能力，为图像去雨提供了一种新的物理感知方法。
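
The PSNR values quoted above are in decibels. As a point of reference, the standard definition is PSNR = 10 * log10(MAX^2 / MSE). The block below is a minimal sketch of that formula, assuming 8-bit pixel intensities flattened into plain Python sequences. The function name `psnr` and the `max_value` parameter are illustrative and do not come from the paper, whose exact evaluation protocol (for example, which colour channel is scored) is not stated in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumption: both inputs are flat sequences of intensities in [0, max_value].
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Every pixel is off by one intensity level, so MSE = 1.
    print(round(psnr([10, 20, 30], [11, 21, 31]), 2))
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.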

Title: RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale

Authors: Shengyuan Wang, Zhiheng Zheng, Yu Shang, Lixuan He, Yangcheng Yu, Fan Hangyu, Jie Feng, Qingmin Liao, Yong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18005
Pdf URL: https://arxiv.org/pdf/2511.18005
Copy Paste: [[2511.18005]] RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale(https://arxiv.org/abs/2511.18005)
Keywords: generation
Abstract: City-scale 3D generation is of great importance for the development of embodied intelligence and world models. Existing methods, however, face significant challenges regarding quality, fidelity, and scalability in 3D world generation. Thus, we propose RAISECity, a Reality-Aligned Intelligent Synthesis Engine that creates detailed, City-scale 3D worlds. We introduce an agentic framework that leverages diverse multimodal foundation tools to acquire real-world knowledge, maintain robust intermediate representations, and construct complex 3D scenes. This agentic design, featuring dynamic data processing, iterative self-reflection and refinement, and the invocation of advanced multimodal tools, minimizes cumulative errors and enhances overall performance. Extensive quantitative experiments and qualitative analyses validate the superior performance of RAISECity in real-world alignment, shape precision, texture fidelity, and aesthetics level, achieving over a 90% win-rate against existing baselines for overall perceptual quality. This combination of 3D quality, reality alignment, scalability, and seamless compatibility with computer graphics pipelines makes RAISECity a promising foundation for applications in immersive media, embodied intelligence, and world models.
摘要：城市规模的 3D 生成对于具身智能和世界模型的发展具有重要意义。然而，现有方法在 3D 世界生成的质量、保真度和可扩展性方面面临着重大挑战。因此，我们提出 RAISECity，一个 Reality-Aligned Intelligent Synthesis Engine（现实对齐的智能合成引擎），用于创建精细的城市规模（City-scale）3D 世界。我们引入了一个代理框架，该框架利用不同的多模态基础工具来获取现实世界的知识、维护稳健的中间表示并构建复杂的 3D 场景。这种代理设计以动态数据处理、迭代自我反思和细化以及高级多模态工具的调用为特色，最大限度地减少了累积错误并提高了整体性能。大量的定量实验和定性分析验证了 RAISECity 在现实世界对齐、形状精度、纹理保真度和美观水平方面的卓越性能，在整体感知质量上相对于现有基线实现了 90% 以上的胜率。 3D 质量、现实对齐、可扩展性以及与计算机图形管道的无缝兼容性的结合使 RAISECity 成为沉浸式媒体、具身智能和世界模型应用的有前途的基础。

Title: State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection

Authors: Jiaying Zhou, Qingchao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18012
Pdf URL: https://arxiv.org/pdf/2511.18012
Copy Paste: [[2511.18012]] State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection(https://arxiv.org/abs/2511.18012)
Keywords: generation
Abstract: Open-Vocabulary Object Detection (OVOD) aims to generalize object recognition to novel categories, while Weakly Supervised OVOD (WS-OVOD) extends this by combining box-level annotations with image-level labels. Despite recent progress, two critical challenges persist in this setting. First, existing semantic prototypes, even when enriched by LLMs, are static and limited, failing to capture the rich intra-class visual variations induced by different object states (e.g., a cat's pose). Second, the standard pseudo-box generation introduces a semantic mismatch between visual region proposals (which contain context) and object-centric text embeddings. To tackle these issues, we introduce two complementary prototype enhancement strategies. To capture intra-class variations in appearance and state, we propose the State-Enhanced Semantic Prototypes (SESP), which generates state-aware textual descriptions (e.g., "a sleeping cat") to capture diverse object appearances, yielding more discriminative prototypes. Building on this, we further introduce Scene-Augmented Pseudo Prototypes (SAPP) to address the semantic mismatch. SAPP incorporates contextual semantics (e.g., "cat lying on sofa") and utilizes a soft alignment mechanism to promote contextually consistent visual-textual representations. By integrating SESP and SAPP, our method effectively enhances both the richness of semantic prototypes and the visual-textual alignment, achieving notable improvements.
摘要：开放词汇对象检测 (OVOD) 旨在将对象识别推广到新类别，而弱监督 OVOD (WS-OVOD) 通过将框级注释与图像级标签相结合来扩展此功能。尽管最近取得了进展，但在这种情况下仍然存在两个关键挑战。首先，现有的语义原型即使通过 LLM 丰富，也是静态的和有限的，无法捕获由不同对象状态（例如猫的姿势）引起的丰富的类内视觉变化。其次，标准伪框生成引入了视觉区域提议（包含上下文）和以对象为中心的文本嵌入之间的语义不匹配。为了解决这些问题，我们引入了两种互补的原型增强策略。为了捕获外观和状态的类内变化，我们提出了状态增强语义原型（SESP），它生成状态感知的文本描述（例如“一只熟睡的猫”）来捕获不同的对象外观，从而产生更具辨别力的原型。在此基础上，我们进一步引入场景增强伪原型（SAPP）来解决语义不匹配的问题。 SAPP 结合了上下文语义（例如，“猫躺在沙发上”），并利用软对齐机制来促进上下文一致的视觉文本表示。通过集成 SESP 和 SAPP，我们的方法有效地增强了语义原型的丰富性和视觉文本对齐，取得了显著的改进。

Title: MambaX: Image Super-Resolution with State Predictive Control

Authors: Chenyu Li, Danfeng Hong, Bing Zhang, Zhaojie Pan, Naoto Yokoya, Jocelyn Chanussot
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18028
Pdf URL: https://arxiv.org/pdf/2511.18028
Copy Paste: [[2511.18028]] MambaX: Image Super-Resolution with State Predictive Control(https://arxiv.org/abs/2511.18028)
Keywords: super-resolution
Abstract: Image super-resolution (SR) is a critical technology for overcoming the inherent hardware limitations of sensors. However, existing approaches mainly focus on directly enhancing the final resolution, often neglecting effective control over error propagation and accumulation during intermediate stages. Recently, Mamba has emerged as a promising approach that can represent the entire reconstruction process as a state sequence with multiple nodes, allowing for intermediate intervention. Nonetheless, its fixed linear mapper is limited by a narrow receptive field and restricted flexibility, which hampers its effectiveness in fine-grained images. To address this, we created MambaX, a nonlinear state predictive control model that maps consecutive spectral bands into a latent state space and generalizes the SR task by dynamically learning the nonlinear state parameters of control equations. Compared to existing sequence models, MambaX 1) employs dynamic state predictive control learning to approximate the nonlinear differential coefficients of state-space models; 2) introduces a novel state cross-control paradigm for multimodal SR fusion; and 3) utilizes progressive transitional learning to mitigate heterogeneity caused by domain and modality shifts. Our evaluation demonstrates the superior performance of the dynamic spectrum-state representation model in both single-image SR and multimodal fusion-based SR tasks, highlighting its substantial potential to advance spectrally generalized modeling across arbitrary dimensions and modalities.
摘要：图像超分辨率（SR）是克服传感器固有硬件限制的关键技术。然而，现有的方法主要侧重于直接提高最终分辨率，往往忽略了对中间阶段误差传播和累积的有效控制。最近，Mamba 成为一种有前途的方法，它可以将整个重建过程表示为具有多个节点的状态序列，从而允许中间干预。尽管如此，其固定的线性映射器受到狭窄的感受野和有限的灵活性的限制，这阻碍了其在细粒度图像中的有效性。为了解决这个问题，我们创建了一个非线性状态预测控制模型 MambaX，它将连续的谱带映射到潜在状态空间，并通过动态学习控制方程的非线性状态参数来泛化 SR 任务。与现有序列模型相比，MambaX 1）采用动态状态预测控制学习来逼近状态空间模型的非线性微分系数； 2）引入了一种用于多模态 SR 融合的新型状态交叉控制范式； 3）利用渐进式过渡学习来减轻由领域和模态转变引起的异质性。我们的评估证明了动态谱状态表示模型在单图像 SR 和基于多模态融合的 SR 任务中的卓越性能，突显了其在任意维度和模态上推进谱广义建模的巨大潜力。

Title: Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Authors: Thong Bach, Thanh Nguyen-Tang, Dung Nguyen, Thao Minh Le, Truyen Tran
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.18039
Pdf URL: https://arxiv.org/pdf/2511.18039
Copy Paste: [[2511.18039]] Curvature-Aware Safety Restoration In LLMs Fine-Tuning(https://arxiv.org/abs/2511.18039)
Keywords: restoration
Abstract: Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even when using parameter-efficient methods like LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric structure of their loss landscapes concerning harmful content, regardless of the fine-tuning method employed. This suggests that safety behaviors are not erased but shifted to less influential regions of the parameter space. Building on this insight, we propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization to selectively increase loss on harmful inputs while preserving task performance. By navigating the shared geometry between base and fine-tuned models, our method discourages unsafe outputs while preserving task-relevant performance, avoiding full reversion and enabling precise, low-impact updates. Extensive evaluations across multiple model families and adversarial settings show that our approach efficiently reduces harmful responses while maintaining or even improving utility and few-shot learning performance.
摘要：为下游任务微调大型语言模型 (LLM) 通常会损害安全对齐，即使使用 LoRA 等参数高效的方法也是如此。在这项工作中，我们发现了一个显着的特性：无论采用何种微调方法，微调模型都保留了与有害内容有关的损失景观的几何结构。这表明安全行为并未被消除，而是转移到参数空间影响较小的区域。基于这一见解，我们提出了一种曲率感知对齐恢复方法，该方法利用影响函数和二阶优化来有选择地增加有害输入的损失，同时保持任务性能。通过导航基础模型和微调模型之间的共享几何图形，我们的方法可以阻止不安全的输出，同时保留任务相关的性能，避免完全恢复并实现精确、低影响的更新。对多个模型系列和对抗性设置的广泛评估表明，我们的方法有效地减少了有害反应，同时保持甚至提高了效用和小样本学习性能。

Title: UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Authors: Tian Ye, Song Fei, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18050
Pdf URL: https://arxiv.org/pdf/2511.18050
Copy Paste: [[2511.18050]] UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios(https://arxiv.org/abs/2511.18050)
Keywords: generation
Abstract: Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
摘要：Diffusion Transformer 最近在 1K 分辨率左右提供了强大的文本到图像生成功能，但我们表明，将它们扩展到不同纵横比的原生 4K 会暴露出跨越位置编码、VAE 压缩和优化的紧密耦合故障模式。孤立地解决这些因素中的任何一个都会留下实质性的质量问题。因此，我们采取数据模型协同设计的观点，并引入 UltraFlux，这是一种基于 Flux 的 DiT，在 MultiAspect-4K-1M 上以 4K 进行本机训练，MultiAspect-4K-1M 是一个 100 万图像 4K 语料库，具有受控的多 AR 覆盖、双语字幕和丰富的 VLM/IQA 元数据，用于分辨率和 AR 感知采样。在模型方面，UltraFlux 将 (i) Resonance 2D RoPE 与 YaRN 结合起来，以进行 4K 训练窗口、频率和 AR 感知位置编码； (ii) 一种简单的、非对抗性的 VAE 训练后方案，可提高 4K 重建保真度； (iii) SNR-Aware Huber Wavelet 目标，可跨时间步长和频带重新平衡梯度；（iv）分阶段的审美课程学习策略，将高审美监督集中在由模型先验控制的高噪声步骤上。这些组件共同产生稳定、保留细节的 4K DiT，适用于宽、方形和高的 AR。在 4096 基准和多 AR 4K 设置的美学评估中，UltraFlux 在保真度、美观和对齐指标方面始终优于强大的开源基线，并且通过 LLM 提示细化器匹配或超越专有的 Seedream 4.0。

Title: IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment

Authors: Bowen Qu, Shangkun Sun, Xiaoyu Liang, Wei Gao
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.18055
Pdf URL: https://arxiv.org/pdf/2511.18055
Copy Paste: [[2511.18055]] IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment(https://arxiv.org/abs/2511.18055)
Keywords: generation, quality assessment
Abstract: Recent advances in text-driven image editing have been significant, yet the task of accurately evaluating these edited images continues to pose a considerable challenge. Different from the assessment of text-driven image generation, text-driven image editing is characterized by simultaneously conditioning on both text and a source image. The edited images often retain an intrinsic connection to the original image, which dynamically changes with the semantics of the text. However, previous methods tend to focus solely on text-image alignment or are not well aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts and the corresponding edited results from different editing methods, and nearly 4,000 samples with corresponding Mean Opinion Scores (MOS) provided by 15 human subjects. Furthermore, we introduce IE-Critic-R1, which, benefiting from Reinforcement Learning from Verifiable Rewards (RLVR), provides more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception. Extensive experiments demonstrate IE-Critic-R1's superior subjective alignment on the text-driven image editing task compared with previous metrics. Related data and code are available to the public.
摘要：文本驱动的图像编辑的最新进展非常显着，但准确评估这些编辑图像的任务仍然构成相当大的挑战。与文本驱动图像生成的评估不同，文本驱动图像编辑的特点是同时对文本和源图像进行调节。编辑后的图像通常保留与原始图像的内在联系，并随着文本的语义而动态变化。然而，以前的方法往往只关注文本-图像对齐，或者没有很好地符合人类的感知。在这项工作中，我们引入了文本驱动图像编辑基准套件（IE-Bench）来增强对文本驱动编辑图像的评估。 IE-Bench 包括一个数据库，其中包含不同的源图像、各种编辑提示以及不同编辑方法对应的编辑结果，以及由 15 位人类受试者提供的近 4,000 个样本及其相应的平均意见得分 (MOS)。此外，我们还引入了 IE-Critic-R1，它受益于可验证奖励的强化学习 (RLVR)，为符合人类感知的文本驱动图像编辑提供了更全面、可解释的质量评估。大量实验证明，与之前的指标相比，IE-Critic-R1 在文本驱动的图像编辑任务上具有优越的主观一致性。相关数据和代码向公众开放。

Title: Versatile Recompression-Aware Perceptual Image Super-Resolution

Authors: Mingwei He, Tongda Xu, Xingtong Ge, Ming Sun, Chao Zhou, Yan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18090
Pdf URL: https://arxiv.org/pdf/2511.18090
Copy Paste: [[2511.18090]] Versatile Recompression-Aware Perceptual Image Super-Resolution(https://arxiv.org/abs/2511.18090)
Keywords: super-resolution, generation
Abstract: Perceptual image super-resolution (SR) methods restore degraded images and produce sharp outputs. In practice, those outputs are usually recompressed for storage and transmission. Ignoring recompression is suboptimal as the downstream codec might add additional artifacts to restored images. However, jointly optimizing SR and recompression is challenging, as the codecs are not differentiable and vary in configuration. In this paper, we present Versatile Recompression-Aware Perceptual Super-Resolution (VRPSR), which makes existing perceptual SR aware of versatile compression. First, we formulate compression as conditional text-to-image generation and utilize a pre-trained diffusion model to build a generalizable codec simulator. Next, we propose a set of training techniques tailored for perceptual SR, including optimizing the simulator using perceptual targets and adopting slightly compressed images as the training target. Empirically, our VRPSR saves more than 10% bitrate based on Real-ESRGAN and S3Diff under H.264/H.265/H.266 compression. Besides, our VRPSR facilitates joint optimization of the SR and post-processing model after recompression.
摘要：感知图像超分辨率（SR）方法可以恢复退化的图像并产生清晰的输出。在实践中，这些输出通常会被重新压缩以进行存储和传输。忽略重新压缩并不是最理想的，因为下游编解码器可能会向恢复的图像添加额外的伪影。然而，联合优化 SR 和重新压缩具有挑战性，因为编解码器不可微分且配置各异。在本文中，我们提出了通用再压缩感知感知超分辨率（VRPSR），它使现有的感知 SR 能够感知通用压缩。首先，我们将压缩制定为条件文本到图像的生成，并利用预先训练的扩散模型来构建可通用的编解码器模拟器。接下来，我们提出了一套针对感知SR量身定制的训练技术，包括使用感知目标优化模拟器以及采用稍微压缩的图像作为训练目标。根据经验，我们的 VRPSR 基于 Real-ESRGAN 和 S3Diff 在 H.264/H.265/H.266 压缩下节省了超过 10% 的比特率。此外，我们的 VRPSR 有助于在重新压缩后联合优化 SR 和后处理模型。

Title: Spotlight: Identifying and Localizing Video Generation Errors Using VLMs

Authors: Aditya Chinchure, Sahithya Ravi, Pushkar Shukla, Vered Shwartz, Leonid Sigal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18102
Pdf URL: https://arxiv.org/pdf/2511.18102
Copy Paste: [[2511.18102]] Spotlight: Identifying and Localizing Video Generation Errors Using VLMs(https://arxiv.org/abs/2511.18102)
Keywords: generation
Abstract: Current text-to-video models (T2V) can generate high-quality, temporally coherent, and visually realistic videos. Nonetheless, errors still often occur, and are more nuanced and local compared to the previous generation of T2V models. While current evaluation paradigms assess video models across diverse dimensions, they typically evaluate videos holistically without identifying when specific errors occur or describing their nature. We address this gap by introducing Spotlight, a novel task aimed at localizing and explaining video-generation errors. We generate 600 videos using 200 diverse textual prompts and three state-of-the-art video generators (Veo 3, Seedance, and LTX-2), and annotate over 1600 fine-grained errors across six types, including motion, physics, and prompt adherence. We observe that adherence and physics errors are predominant and persist across longer segments, whereas appearance-disappearance and body pose errors manifest in shorter segments. We then evaluate current VLMs on Spotlight and find that VLMs lag significantly behind humans in error identification and localization in videos. We propose inference-time strategies to probe the limits of current VLMs on our task, improving performance by nearly 2x. Our task paves a way forward to building fine-grained evaluation tools and more sophisticated reward models for video generators.
摘要：当前的文本到视频模型（T2V）可以生成高质量、时间连贯且视觉逼真的视频。尽管如此，错误仍然经常发生，并且与上一代 T2V 模型相比，错误更加细微和局部。虽然当前的评估范式评估不同维度的视频模型，但它们通常会整体评估视频，而不会识别特定错误何时发生或描述其性质。我们通过引入 Spotlight 来解决这一差距，这是一项旨在定位和解释视频生成错误的新颖任务。我们使用 200 个不同的文本提示和三个最先进的视频生成器（Veo 3、Seedance 和 LTX-2）生成 600 个视频，并注释了六种类型的 1600 多个细粒度错误，包括运动、物理和提示遵守情况。我们观察到，依从性和物理错误是主要的，并且在较长的片段中持续存在，而外观消失和身体姿势错误则在较短的片段中表现出来。然后，我们在 Spotlight 上评估当前的 VLM，发现 VLM 在视频中的错误识别和定位方面明显落后于人类。我们提出了推理时间策略来探索当前 VLM 对我们任务的限制，将性能提高了近 2 倍。我们的任务为构建细粒度的评估工具和更复杂的视频生成器奖励模型铺平了道路。

Title: VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging

Authors: Ming Zhong, Yuanlei Wang, Liuzhou Zhang, Arctanx An, Renrui Zhang, Hao Liang, Ming Lu, Ying Shen, Wentao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18121
Pdf URL: https://arxiv.org/pdf/2511.18121
Copy Paste: [[2511.18121]] VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging(https://arxiv.org/abs/2511.18121)
Keywords: generation
Abstract: While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, it not only improves on HVCU-Bench but also brings benefits on general benchmarks (average +2.53%), especially with substantial gains on MMStar (+7.26%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities. The project page is at this https URL .
摘要：虽然多模态大型语言模型 (MLLM) 在基准测试中表现出色，但其处理范式与人类集成视觉信息的能力不同。与自然地连接细节和高级概念的人类不同，模型倾向于孤立地对待这些元素。流行的评估协议通常将低级感知与高级推理分离，忽略它们的语义和因果依赖性，从而产生非诊断结果并掩盖性能瓶颈。我们提出了 VCU-Bridge，一个可操作类人视觉内涵理解层次结构的框架：从基础感知通过语义桥接发展到抽象内涵的多层次推理，具有从具体线索到抽象结论的明确证据到推理的轨迹。在此框架的基础上，我们构建了 HVCU-Bench，这是一个通过显式、逐级诊断进行分层视觉内涵理解的基准。综合实验表明，随着推理发展到更高水平，性能会持续下降。我们进一步开发了一个由蒙特卡罗树搜索（MCTS）引导的指令调整数据生成管道，并表明加强低级功能可以在更高级别产生可衡量的收益。有趣的是，它不仅在 HVCU-Bench 上有所改进，而且在一般基准上也带来了好处（平均+2.53%），尤其是在 MMStar 上有大幅提升（+7.26%），这证明了分层思维模式的重要性及其在增强 MLLM 能力方面的有效性。项目页面位于此 https URL 。

Title: Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models

Authors: Dachuan Zhao, Weiyue Li, Zhenda Shen, Yushu Qiu, Bowen Xu, Haoyu Chen, Yongchao Chen
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.18123
Pdf URL: https://arxiv.org/pdf/2511.18123
Copy Paste: [[2511.18123]] Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models(https://arxiv.org/abs/2511.18123)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical failures of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose Subspace Projection Debiasing (SPD), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of 18.5% across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.
摘要：视觉语言模型 (VLM) 已成为多模态推理不可或缺的一部分，但它们的表示通常会编码和放大人口统计偏差，从而导致下游任务中出现有偏差的关联和不一致的预测。这种行为破坏了公平性并扭曲了视觉和语言之间的预期一致性。最近的事后方法试图通过用中性值替换与属性最相关的嵌入坐标来减轻偏差。然而，我们的系统分析揭示了这种坐标方式方法的三个关键失败：特征纠缠、跨数据集泛化不良和偏差消除不完整。我们发现偏差并不局限于几个坐标，而是分布在几个线性子空间上。为了解决这些限制，我们提出 Subspace Projection Debiasing (SPD)，这是一个几何原理框架，可以识别并删除线性可解码偏差的整个子空间，同时重新插入中性均值分量以保持语义保真度。零样本分类、文本到图像检索和图像生成的广泛实验验证了 SPD 的有效性：我们的方法实现了更稳健的去偏，在四个公平性指标上平均提高了 18.5%，同时与最佳去偏基线相比，任务性能损失最小。
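
The abstract above describes removing a whole bias subspace and then reinserting a neutral mean component. The block below is a minimal NumPy sketch of that general idea, not the authors' implementation. It assumes the bias subspace is already given as a set of linearly independent direction vectors, and it takes the "neutral" component to be the dataset mean restricted to that subspace. The function name `project_out_subspace` and both of these choices are illustrative assumptions; how SPD actually identifies the subspace is not described in the abstract.

```python
import numpy as np


def project_out_subspace(embeddings, bias_directions):
    """Remove the span of `bias_directions` from every embedding, then add
    back one shared (mean) component inside that span.

    Assumptions (illustrative, not from the paper):
      - embeddings: array of shape (n, d)
      - bias_directions: array of shape (k, d), linearly independent rows
    """
    X = np.asarray(embeddings, dtype=float)
    B = np.asarray(bias_directions, dtype=float)
    # Orthonormal basis Q (d x k) for the bias subspace via reduced QR.
    Q, _ = np.linalg.qr(B.T)
    P = Q @ Q.T                      # orthogonal projector onto the subspace
    neutral = X.mean(axis=0) @ P     # mean component inside the subspace
    # Strip each row's own subspace component and substitute the shared one.
    return X - X @ P + neutral


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 4))
    B = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
    out = project_out_subspace(X, B)
    # Inside the subspace, all rows now share the same coordinates.
    print(np.allclose(out[:, :2], out[0, :2]))
```

After the projection, every embedding has the same component inside the chosen subspace, so a linear probe restricted to those directions can no longer separate samples. The orthogonal complement is left untouched, which is the sense in which this differs from overwriting individual coordinates.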

Title: Video4Edit: Viewing Image Editing as a Degenerate Temporal Process

Authors: Xiaofan Li, Yanpeng Sun, Chenming Wu, Fan Duan, YuAn Wang, Weihao Bo, Yumeng Zhang, Dingkang Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18131
Pdf URL: https://arxiv.org/pdf/2511.18131
Copy Paste: [[2511.18131]] Video4Edit: Viewing Image Editing as a Degenerate Temporal Process(https://arxiv.org/abs/2511.18131)
Keywords: generation
Abstract: We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of {instruction, source image, edited image} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about one percent of the supervision demanded by mainstream editing models.
摘要：我们观察到，多模态基础模型的最新进展推动了指令驱动的图像生成和编辑成为真正的跨模态合作机制。然而，最先进的编辑管道仍然成本高昂：除了训练大型扩散/流动模型之外，它们还需要策划大量高质量的“指令、源图像、编辑图像”三元组来覆盖不同的用户意图。此外，视觉替换的保真度取决于指令引用目标语义的精确程度。我们通过时间建模的角度重新审视这一挑战：如果视频可以被视为一个完整的时间过程，那么图像编辑可以被视为一个退化的时间过程。这种观点使我们能够从视频预训练中转移单帧进化先验，从而实现高度数据效率的微调机制。根据经验，我们的方法与领先的开源基线的性能相匹配，同时仅使用主流编辑模型所需的约百分之一的监督。

Title: UnfoldLDM: Deep Unfolding-based Blind Image Restoration with Latent Diffusion Priors

Authors: Chunming He, Rihan Zhang, Zheng Chen, Bowen Yang, CHengyu Fang, Yunlong Lin, Fengyang Xiao, Sina Farsiu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18152
Pdf URL: https://arxiv.org/pdf/2511.18152
Copy Paste: [[2511.18152]] UnfoldLDM: Deep Unfolding-based Blind Image Restoration with Latent Diffusion Priors(https://arxiv.org/abs/2511.18152)
Keywords: restoration
Abstract: Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) degradation-specific dependency, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) over-smoothing bias, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM to integrate DUNs with a latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This unique combination ensures the final result is degradation-free and visually rich. Experiments show that our UnfoldLDM achieves a leading place on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.
摘要：深度展开网络（DUN）将基于模型的方法的可解释性与深度网络的学习能力结合起来，但对于盲图像恢复（BIR）仍然有限。现有的 DUN 存在以下问题： (1) 退化特定依赖性，因为它们的优化框架与已知的退化模型相关联，使得它们不适合 BIR 任务； (2) 过度平滑偏差，由将低频内容主导的梯度下降输出直接馈送到近端项中产生，抑制精细纹理。为了克服这些问题，我们提出 UnfoldLDM，将 DUN 与潜在扩散模型 (LDM) 相结合用于 BIR。在每个阶段，UnfoldLDM 都采用多粒度退化感知 (MGDA) 模块作为梯度下降步骤。 MGDA 将 BIR 建模为未知的退化估计问题，并估计整体退化矩阵及其分解形式，从而实现稳健的退化消除。对于近端步骤，我们设计了一个抗退化 LDM (DR-LDM)，以从 MGDA 输出中提取紧凑的退化不变先验。在此先验的指导下，过度平滑校正 Transformer (OCFormer) 显式地恢复高频分量并增强纹理细节。这种独特的组合确保最终结果不含退化且视觉效果丰富。实验表明，我们的 UnfoldLDM 在各种 BIR 任务上取得了领先地位，并有利于下游任务。此外，我们的设计与现有的基于 DUN 的方法兼容，可作为即插即用的框架。代码将被发布。

Title: Nested Unfolding Network for Real-World Concealed Object Segmentation

Authors: Chunming He, Rihan Zhang, Dingming Zhang, Fengyang Xiao, Deng-Ping Fan, Sina Farsiu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18164
Pdf URL: https://arxiv.org/pdf/2511.18164
Copy Paste: [[2511.18164]] Nested Unfolding Network for Real-World Concealed Object Segmentation(https://arxiv.org/abs/2511.18164)
Keywords: restoration, quality assessment
Abstract: Deep unfolding networks (DUNs) have recently advanced concealed object segmentation (COS) by modeling segmentation as iterative foreground-background separation. However, existing DUN-based methods (RUN) inherently couple background estimation with image restoration, leading to conflicting objectives and requiring pre-defined degradation types, which are unrealistic in real-world scenarios. To address this, we propose the nested unfolding network (NUN), a unified framework for real-world COS. NUN adopts a DUN-in-DUN design, embedding a degradation-resistant unfolding network (DeRUN) within each stage of a segmentation-oriented unfolding network (SODUN). This design decouples restoration from segmentation while allowing mutual refinement. Guided by a vision-language model (VLM), DeRUN dynamically infers degradation semantics and restores high-quality images without explicit priors, whereas SODUN performs reversible estimation to refine foreground and background. Leveraging the multi-stage nature of unfolding, NUN employs image-quality assessment to select the best DeRUN outputs for subsequent stages, naturally introducing a self-consistency loss that enhances robustness. Extensive experiments show that NUN achieves a leading place on both clean and degraded benchmarks. Code will be released.

Title: EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses

Authors: Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, Juergen Gall
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18173
Pdf URL: https://arxiv.org/pdf/2511.18173
Copy Paste: [[2511.18173]] EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses(https://arxiv.org/abs/2511.18173)
Keywords: generation
Abstract: Egocentric video generation with fine-grained control through body motion is a key requirement towards embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.

Title: Early Lung Cancer Diagnosis from Virtual Follow-up LDCT Generation via Correlational Autoencoder and Latent Flow Matching

Authors: Yutong Wu, Yifan Wang, Qining Zhang, Chuan Zhou, Lei Ying
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18185
Pdf URL: https://arxiv.org/pdf/2511.18185
Copy Paste: [[2511.18185]] Early Lung Cancer Diagnosis from Virtual Follow-up LDCT Generation via Correlational Autoencoder and Latent Flow Matching(https://arxiv.org/abs/2511.18185)
Keywords: generation, generative
Abstract: Lung cancer is one of the most commonly diagnosed cancers, and early diagnosis is critical because the survival rate declines sharply once the disease progresses to advanced stages. However, achieving an early diagnosis remains challenging, particularly in distinguishing subtle early signals of malignancy from those of benign conditions. In clinical practice, a high-risk patient may need to undergo an initial baseline and several annual follow-up examinations (e.g., CT scans) before receiving a definitive diagnosis, which can result in missing the optimal treatment. Recently, Artificial Intelligence (AI) methods have been increasingly used for early diagnosis of lung cancer, but most existing algorithms focus on radiomic feature extraction from single early-stage CT scans. Inspired by recent advances in diffusion models for image generation, this paper proposes a generative method, named CorrFlowNet, which creates a virtual, one-year follow-up CT scan after the initial baseline scan. This virtual follow-up would allow for early detection of malignant/benign nodules, reducing the need to wait for clinical follow-ups. During training, our approach employs a correlational autoencoder to encode both early baseline and follow-up CT images into a latent space that captures the dynamics of nodule progression as well as the correlations between them, followed by a flow matching algorithm on the latent space with a neural ordinary differential equation. An auxiliary classifier is used to further enhance the diagnostic accuracy. Evaluations on a real clinical dataset show our method can significantly improve downstream lung nodule risk assessment compared with existing baseline models. Moreover, its diagnostic accuracy is comparable with real clinical CT follow-ups, highlighting its potential to improve cancer diagnosis.
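Illustrative sketch (not the authors' code): the abstract describes flow matching on a latent space, integrated as an ordinary differential equation. The block below shows the common linear-interpolant form of flow matching, assuming the baseline latent is the source point and the follow-up latent is the target. That pairing, the Euler integrator, and all function names are assumptions made for illustration; the paper may use a different path or solver.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(z0: Vector, z1: Vector, t: float) -> Vector:
    """Point on the straight path from z0 (t=0) to z1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0: Vector, z1: Vector) -> Vector:
    """Velocity of the straight path; constant in t for a linear interpolant."""
    return [b - a for a, b in zip(z0, z1)]


def flow_matching_loss(velocity_fn: VelocityFn, z0: Vector, z1: Vector, t: float) -> float:
    """Mean squared error between predicted and target velocity at time t."""
    predicted = velocity_fn(interpolate(z0, z1, t), t)
    target = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_integrate(velocity_fn: VelocityFn, z0: Vector, steps: int = 10) -> Vector:
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = list(z0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(z, i * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy usage: an "oracle" velocity field that already knows the target latent.
baseline_latent = [0.0, 1.0]    # hypothetical encoded baseline scan
followup_latent = [2.0, -1.0]   # hypothetical encoded follow-up scan


def oracle(z: Vector, t: float) -> Vector:
    return target_velocity(baseline_latent, followup_latent)


generated = euler_integrate(oracle, baseline_latent)
```

In a trained model the oracle would be replaced by a neural network fitted by minimizing `flow_matching_loss` over sampled pairs and times; the generated latent would then be decoded into the virtual follow-up image.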

Title: ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization

Authors: Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18192
Pdf URL: https://arxiv.org/pdf/2511.18192
Copy Paste: [[2511.18192]] ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization(https://arxiv.org/abs/2511.18192)
Keywords: generation
Abstract: Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region alignment. This modular architecture produces transparent reasoning traces, enabling tool-level auditability and independent component optimization. We evaluate ARIAL on four benchmarks (DocVQA, FUNSD, CORD, and SROIE) using both textual accuracy (ANLS) and spatial precision (mAP at IoU 0.50 to 0.95). ARIAL achieves state-of-the-art results across all datasets: 88.7 ANLS and 50.1 mAP on DocVQA, 90.0 ANLS and 50.3 mAP on FUNSD, 85.5 ANLS and 60.2 mAP on CORD, and 93.1 ANLS on SROIE, surpassing the previous best method (DLaVA) by +2.8 ANLS and +3.9 mAP on DocVQA. Our work demonstrates how agentic orchestration of specialized tools can simultaneously improve performance and interpretability, providing a pathway toward trustworthy, explainable document AI systems.
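Illustrative sketch: the textual-accuracy numbers above are reported in ANLS (Average Normalized Levenshtein Similarity). The block below follows the usual definition of that metric, with a 0.5 threshold on the normalized edit distance and the best score taken over the acceptable answers. The lower-casing and whitespace stripping are assumptions about normalization, and the function names are hypothetical; the result lies in [0, 1], whereas the abstract reports it on a 0 to 100 scale.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str], references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

The threshold is what makes the metric tolerant of small OCR slips while still giving no credit to answers that are mostly wrong.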

Title: InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity

Authors: Haoming Wang, Qiyao Xue, Wei Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18200
Pdf URL: https://arxiv.org/pdf/2511.18200
Copy Paste: [[2511.18200]] InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity(https://arxiv.org/abs/2511.18200)
Keywords: generation
Abstract: Modern vision-language models (VLMs) are expected to have spatial reasoning abilities across diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking.

Title: Generating Synthetic Human Blastocyst Images for In-Vitro Fertilization Blastocyst Grading

Authors: Pavan Narahari, Suraj Rajendran, Lorena Bori, Jonas E. Malmsten, Qiansheng Zhan, Zev Rosenwaks, Nikica Zaninovic, Iman Hajirasouliha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18204
Pdf URL: https://arxiv.org/pdf/2511.18204
Copy Paste: [[2511.18204]] Generating Synthetic Human Blastocyst Images for In-Vitro Fertilization Blastocyst Grading(https://arxiv.org/abs/2511.18204)
Keywords: generation, generative
Abstract: The success of in vitro fertilization (IVF) at many clinics relies on the accurate morphological assessment of day 5 blastocysts, a process that is often subjective and inconsistent. While artificial intelligence can help standardize this evaluation, models require large, diverse, and balanced datasets, which are often unavailable due to data scarcity, natural class imbalance, and privacy constraints. Existing generative embryo models can mitigate these issues but face several limitations, such as poor image quality, small training datasets, non-robust evaluation, and lack of clinically relevant image generation for effective data augmentation. Here, we present the Diffusion Based Imaging Model for Artificial Blastocysts (DIA) framework, a set of latent diffusion models trained to generate high-fidelity, novel day 5 blastocyst images. Our models provide granular control by conditioning on Gardner-based morphological categories and z-axis focal depth. We rigorously evaluated the models using FID, a memorization metric, an embryologist Turing test, and three downstream classification tasks. Our results show that DIA models generate realistic images that embryologists could not reliably distinguish from real images. Most importantly, we demonstrated clear clinical value. Augmenting an imbalanced dataset with synthetic images significantly improved classification accuracy (p < 0.05). Also, adding synthetic images to an already large, balanced dataset yielded statistically significant performance gains, and synthetic data could replace up to 40% of real data in some cases without a statistically significant loss in accuracy. DIA provides a robust solution for mitigating data scarcity and class imbalance in embryo datasets. By generating novel, high-fidelity, and controllable synthetic images, our models can improve the performance, fairness, and standardization of AI embryo assessment tools.

Title: MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation

Authors: Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, Qi She, Chang Liu, Zhenbang Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18262
Pdf URL: https://arxiv.org/pdf/2511.18262
Copy Paste: [[2511.18262]] MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation(https://arxiv.org/abs/2511.18262)
Keywords: generation
Abstract: Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.

Title: Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models

Authors: Tianyang Han, Junhao Su, Junjie Hu, Peizhen Yang, Hengyu Shi, Junfeng Luo, Jialin Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18271
Pdf URL: https://arxiv.org/pdf/2511.18271
Copy Paste: [[2511.18271]] Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models(https://arxiv.org/abs/2511.18271)
Keywords: generative
Abstract: Text-to-image (T2I) models today are capable of producing photorealistic, instruction-following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single-round VQA-based scoring, leaving critical dimensions such as knowledge grounding, multi-physics interactions, and auditable evidence substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine-grained evaluation, we propose PW-Agent, an evidence-grounded multi-agent evaluator that hierarchically assesses images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, showing that all of them exhibit, to varying degrees, a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning. The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems.

Title: Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation

Authors: Yara Bahram, Melodie Desbos, Mohammadhadi Shateri, Eric Granger
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18281
Pdf URL: https://arxiv.org/pdf/2511.18281
Copy Paste: [[2511.18281]] Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation(https://arxiv.org/abs/2511.18281)
Keywords: generation, generative
Abstract: Diffusion models (DMs) produce high-quality images, yet their sampling remains costly when adapted to new domains. Distilled DMs are faster but typically remain confined within their teacher's domain. Thus, fast and high-quality generation for novel domains relies on two-stage training pipelines: Adapt-then-Distill or Distill-then-Adapt. However, both add design complexity and suffer from degraded quality or diversity. We introduce Uni-DAD, a single-stage pipeline that unifies distillation and adaptation of DMs. It couples two signals during training: (i) a dual-domain distribution-matching distillation objective that guides the student toward the distributions of the source teacher and a target teacher, and (ii) a multi-head generative adversarial network (GAN) loss that encourages target realism across multiple feature scales. The source domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting, especially in few-shot regimes. The inclusion of a target teacher facilitates adaptation to more structurally distant domains. We perform evaluations on a variety of datasets for few-shot image generation (FSIG) and subject-driven personalization (SDP). Uni-DAD delivers higher quality than state-of-the-art (SoTA) adaptation methods even with fewer than 4 sampling steps, and outperforms two-stage training pipelines in both quality and diversity.

Title: TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis

Authors: Rui Peng, Ziru Liu, Lingyuan Ye, Yuxing Lu, Boxin Shi, Jinzhuo Wang
Subjects: cs.LG, cs.CV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2511.18287
Pdf URL: https://arxiv.org/pdf/2511.18287
Copy Paste: [[2511.18287]] TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis(https://arxiv.org/abs/2511.18287)
Keywords: generative
Abstract: Accurately modeling the relationship between perturbations, transcriptional responses, and phenotypic changes is essential for building an AI Virtual Cell (AIVC). However, existing methods are typically constrained to modeling direct associations, such as Perturbation → RNA or Perturbation → Morphology, and overlook the crucial causal link from RNA to morphology. To bridge this gap, we propose TRIDENT, a cascade generative framework that synthesizes realistic cellular morphology by conditioning on both the perturbation and the corresponding gene expression profile. To train and evaluate this task, we construct MorphoGene, a new dataset pairing L1000 gene expression with Cell Painting images for 98 compounds. TRIDENT significantly outperforms state-of-the-art approaches, achieving up to 7-fold improvement with strong generalization to unseen compounds. In a case study on docetaxel, we validate that RNA-guided synthesis accurately produces the corresponding phenotype. An ablation study further confirms that this RNA conditioning is essential for the model's high fidelity. By explicitly modeling transcriptome-phenome mapping, TRIDENT provides a powerful in silico tool and moves us closer to a predictive virtual cell.

Title: MultiDiffNet: A Multi-Objective Diffusion Framework for Generalizable Brain Decoding

Authors: Mengchun Zhang, Kateryna Shapovalenko, Yucheng Shao, Eddie Guo, Parusha Pradhan
Subjects: cs.LG, cs.AI, cs.HC, q-bio.NC
Abstract URL: https://arxiv.org/abs/2511.18294
Pdf URL: https://arxiv.org/pdf/2511.18294
Copy Paste: [[2511.18294]] MultiDiffNet: A Multi-Objective Diffusion Framework for Generalizable Brain Decoding(https://arxiv.org/abs/2511.18294)
Keywords: generation, generative
Abstract: Neural decoding from electroencephalography (EEG) remains fundamentally limited by poor generalization to unseen subjects, driven by high inter-subject variability and the lack of large-scale datasets to model it effectively. Existing methods often rely on synthetic subject generation or simplistic data augmentation, but these strategies fail to scale or generalize reliably. We introduce MultiDiffNet, a diffusion-based framework that bypasses generative augmentation entirely by learning a compact latent space optimized for multiple objectives. We decode directly from this space and achieve state-of-the-art generalization across various neural decoding tasks using subject and session disjoint evaluation. We also curate and release a unified benchmark suite spanning four EEG decoding tasks of increasing complexity (SSVEP, Motor Imagery, P300, and Imagined Speech) and an evaluation protocol that addresses inconsistent split practices in prior EEG research. Finally, we develop a statistical reporting framework tailored for low-trial EEG settings. Our work provides a reproducible and open-source foundation for subject-agnostic EEG decoding in real-world BCI systems.

Title: Hierarchical Deep Research with Local-Web RAG: Toward Automated System-Level Materials Discovery

Authors: Rui Ding, Rodrigo Pires Ferreira, Yuxin Chen, Junhong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.18303
Pdf URL: https://arxiv.org/pdf/2511.18303
Copy Paste: [[2511.18303]] Hierarchical Deep Research with Local-Web RAG: Toward Automated System-Level Materials Discovery(https://arxiv.org/abs/2511.18303)
Keywords: generation
Abstract: We present a long-horizon, hierarchical deep research (DR) agent designed for complex materials and device discovery problems that exceed the scope of existing Machine Learning (ML) surrogates and closed-source commercial agents. Our framework instantiates a locally deployable DR instance that integrates local retrieval-augmented generation with large language model reasoners, enhanced by a Deep Tree of Research (DToR) mechanism that adaptively expands and prunes research branches to maximize coverage, depth, and coherence. We systematically evaluate across 27 nanomaterials/device topics using a large language model (LLM)-as-judge rubric with five web-enabled state-of-the-art models as jurors. In addition, we conduct dry-lab validations on five representative tasks, where human experts use domain simulations (e.g., density functional theory, DFT) to verify whether DR-agent proposals are actionable. Results show that our DR agent produces reports with quality comparable to--and often exceeding--those of commercial systems (ChatGPT-5-thinking/o3/o4-mini-high Deep Research) at a substantially lower cost, while enabling on-prem integration with local data and tools.

Title: DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition

Authors: Raja Kumar, Arka Sadhu, Ram Nevatia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18305
Pdf URL: https://arxiv.org/pdf/2511.18305
Copy Paste: [[2511.18305]] DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition(https://arxiv.org/abs/2511.18305)
Keywords: generation
Abstract: Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes. To address this, we propose DiVE-k (Differential Visual rEasoning using top-k generations), a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios.
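Illustrative sketch (not the authors' code): the abstract says each training image yields a multiple-choice question built from the model's own top-k outputs, with a simple verifiable reward for picking the right option. The block below shows one way this could look. The prompt wording, the option shuffling, and the rule of swapping the ground-truth label into the options when the top-k list misses it are all assumptions, and every function name is hypothetical. The harmonic mean helper is the usual base-to-novel summary metric.

```python
import random
from typing import List, Tuple

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumes k <= 26


def build_question(top_k: List[str], gold: str, rng: random.Random) -> Tuple[str, str]:
    """Turn top-k predicted labels into a lettered question; return (prompt, answer letter)."""
    options = list(dict.fromkeys(top_k))  # drop duplicates, keep order
    if gold not in options:
        # Assumption: guarantee the correct label is selectable.
        options = options[:-1] + [gold]
    rng.shuffle(options)  # avoid a fixed position for the correct answer
    lines = ["Which category does the image show?"]
    lines += [f"{letter}. {name}" for letter, name in zip(LETTERS, options)]
    return "\n".join(lines), LETTERS[options.index(gold)]


def reward(model_choice: str, answer: str) -> float:
    """Binary, verifiable reward: 1.0 for the correct letter, else 0.0."""
    return 1.0 if model_choice.strip().upper() == answer else 0.0


def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy."""
    total = base_acc + novel_acc
    return 2.0 * base_acc * novel_acc / total if total else 0.0


# Toy usage with made-up labels.
prompt, answer = build_question(
    ["Indigo Bunting", "Blue Grosbeak", "Lazuli Bunting"],
    gold="Blue Grosbeak",
    rng=random.Random(0),
)
```

Because the distractors come from the model's own confusions, the question forces a comparison among near-miss categories instead of an exact string match against a single label.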

Title: ScriptViT: Vision Transformer-Based Personalized Handwriting Generation

Authors: Sajjan Acharya, Rajendra Baskota
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.18307
Pdf URL: https://arxiv.org/pdf/2511.18307
Copy Paste: [[2511.18307]] ScriptViT: Vision Transformer-Based Personalized Handwriting Generation(https://arxiv.org/abs/2511.18307)
Keywords: generation
Abstract: Styled handwriting generation aims to synthesize handwritten text that looks both realistic and aligned with a specific writer's style. While recent approaches involving GAN, transformer and diffusion-based models have made progress, they often struggle to capture the full spectrum of writer-specific attributes, particularly global stylistic patterns that span long-range spatial dependencies. As a result, capturing subtle writer-specific traits such as consistent slant, curvature or stroke pressure, while keeping the generated text accurate is still an open problem. In this work, we present a unified framework designed to address these limitations. We introduce a Vision Transformer-based style encoder that learns global stylistic patterns from multiple reference images, allowing the model to better represent long-range structural characteristics of handwriting. We then integrate these style cues with the target text using a cross-attention mechanism, enabling the system to produce handwritten images that more faithfully reflect the intended style. To make the process more interpretable, we utilize Salient Stroke Attention Analysis (SSAA), which reveals the stroke-level features the model focuses on during style transfer. Together, these components lead to handwriting synthesis that is not only more stylistically coherent, but also easier to understand and analyze.
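Illustrative sketch (not the authors' code): the abstract says style cues from the Vision Transformer encoder are combined with the target text through cross-attention. The block below shows plain scaled dot-product cross-attention, assuming text tokens act as queries and style tokens act as keys and values. That role assignment is an assumption, learned projection matrices and multiple heads are omitted, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """For each query row, return the attention-weighted average of the value rows."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Toy usage: one text token attending over two style tokens.
text_tokens = [[1.0, 0.0]]
style_keys = [[0.0, 0.0], [0.0, 0.0]]      # identical keys give uniform weights
style_values = [[1.0, 2.0], [3.0, 4.0]]
fused = cross_attention(text_tokens, style_keys, style_values)
```

Each output row is a mixture of style features chosen per text token, which is how a global style pattern can influence every generated character.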

Title: DiM-TS: Bridge the Gap between Selective State Space Models and Time Series for Generative Modeling

Authors: Zihao Yao, Jiankai Zuo, Yaying Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.18312
Pdf URL: https://arxiv.org/pdf/2511.18312
Copy Paste: [[2511.18312]] DiM-TS: Bridge the Gap between Selective State Space Models and Time Series for Generative Modeling(https://arxiv.org/abs/2511.18312)
Keywords: generation, generative
Abstract: Time series data plays a pivotal role in a wide variety of fields but faces challenges related to privacy concerns. Recently, synthesizing data via diffusion models has been viewed as a promising solution. However, existing methods still struggle to capture long-range temporal dependencies and complex channel interrelations. In this research, we aim to utilize the sequence modeling capability of a State Space Model called Mamba to extend its applicability to time series data generation. We first analyze the core limitations of State Space Models, namely the lack of consideration for correlated temporal lag and channel permutation. Building upon this insight, we propose Lag Fusion Mamba and Permutation Scanning Mamba, which enhance the model's ability to discern significant patterns during the denoising process. Theoretical analysis reveals that both variants exhibit a unified matrix multiplication framework with the original Mamba, offering a deeper understanding of our method. Finally, we integrate the two variants and introduce Diffusion Mamba for Time Series (DiM-TS), a high-quality time series generation model that better preserves the temporal periodicity and inter-channel correlations. Comprehensive experiments on public datasets demonstrate the superiority of DiM-TS in generating realistic time series while preserving diverse properties of data.
摘要：时间序列数据在各个领域发挥着关键作用，但面临与隐私问题相关的挑战。最近，通过扩散模型合成数据被视为一种有前途的解决方案。然而，现有的方法仍然难以捕获长程时间依赖性和复杂的通道相互关系。在这项研究中，我们的目标是利用称为 Mamba 的状态空间模型的序列建模功能，将其适用性扩展到时间序列数据生成。我们首先分析状态空间模型的核心局限性，即缺乏对相关时间滞后和通道排列的考虑。基于这一见解，我们提出了滞后融合 Mamba 和排列扫描 Mamba，它们增强了模型在去噪过程中辨别重要模式的能力。理论分析表明，这两种变体都与原始 Mamba 表现出统一的矩阵乘法框架，使我们能够更深入地理解我们的方法。最后，我们集成了两个变体并引入了 Diffusion Mamba for Time Series (DiM-TS)，这是一种高质量的时间序列生成模型，可以更好地保留时间周期性和通道间相关性。对公共数据集的综合实验证明了 DiM-TS 在生成真实时间序列同时保留数据不同属性方面的优越性。

Title: ConsistCompose: Unified Multimodal Layout Control for Image Composition

Authors: Xuanke Shi, Boxuan Li, Xiaoyang Han, Zhongang Cai, Lei Yang, Dahua Lin, Quan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18333
Pdf URL: https://arxiv.org/pdf/2511.18333
Copy Paste: [[2511.18333]] ConsistCompose: Unified Multimodal Layout Control for Image Composition(https://arxiv.org/abs/2511.18333)
Keywords: generation, generative
Abstract: Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored and limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from Interleaved Image-Text within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance-coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.
摘要：将视觉理解与图像生成相结合的统一多模态模型发展迅速，但大多数系统仍然专注于视觉基础——将语言与图像区域对齐——而其生成对应物，即用于布局可控多实例生成的语言嵌入式布局基础生成（LELG），仍处于探索之中，并限制了精确的构图控制。我们提出了 ConsistCompose，这是一个统一的多模式框架，它将布局坐标直接嵌入到语言提示中，从而能够在单个生成界面中从交错图像文本生成布局控制的多实例图像。我们进一步构建了 ConsistCompose3M，一个 3.4M 多实例生成数据集，具有布局和身份注释（2.6M 文本引导和 0.8M 图像引导数据对），为布局条件生成提供大规模监督。在此框架内，LELG 通过实例坐标绑定提示和坐标感知的无分类器指导进行实例化，将语言布局线索转化为精确的空间控制，而无需特定于任务的分支。 COCO-Position 和 MS-Bench 上的实验表明，ConsistCompose 显着提高了布局控制基线的空间精度，同时保留了身份保真度和有竞争力的通用多模态理解，为布局可控多模态图像生成建立了统一的范例。

Title: FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement

Authors: Wenshuo Gao, Junyi Fan, Jiangyue Zeng, Shuai Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18346
Pdf URL: https://arxiv.org/pdf/2511.18346
Copy Paste: [[2511.18346]] FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement(https://arxiv.org/abs/2511.18346)
Keywords: generation
Abstract: Video relighting with background replacement is a challenging task critical for applications in film production and creative media. Existing methods struggle to balance temporal consistency, spatial fidelity, and illumination naturalness. To address these issues, we introduce FlowPortal, a novel training-free flow-based video relighting framework. Our core innovation is a Residual-Corrected Flow mechanism that transforms a standard flow-based model into an editing model, guaranteeing perfect reconstruction when input conditions are identical and enabling faithful relighting when they differ, resulting in high structural consistency. This is further enhanced by a Decoupled Condition Design for precise lighting control and a High-Frequency Transfer mechanism for detail preservation. Additionally, a masking strategy isolates foreground relighting from the pure generation process of the background. Experiments demonstrate that FlowPortal achieves superior performance in temporal coherence, structural preservation, and lighting realism, while maintaining high efficiency. Project Page: this https URL.
摘要：具有背景替换的视频重新照明是一项具有挑战性的任务，对于电影制作和创意媒体中的应用至关重要。现有方法很难平衡时间一致性、空间保真度和照明自然度。为了解决这些问题，我们引入了 FlowPortal，这是一种新颖的免训练的基于流的视频重照框架。我们的核心创新是残差校正流机制，将标准的基于流的模型转换为编辑模型，保证输入条件相同时完美重建，不同时忠实重亮，从而实现高度的结构一致性。用于精确照明控制的解耦条件设计和用于保留细节的高频传输机制进一步增强了这一点。此外，掩蔽策略将前景重新照明与背景纯生成过程隔离。实验表明，FlowPortal 在时间连贯性、结构保存和光照真实感方面实现了卓越的性能，同时保持了高效率。项目页面：此 https URL。

Title: MagicWand: A Universal Agent for Generation and Evaluation Aligned with User Preference

Authors: Zitong Xu, Dake Shen, Yaosong Du, Kexiang Hao, Jinghan Huang, Xiande Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18352
Pdf URL: https://arxiv.org/pdf/2511.18352
Copy Paste: [[2511.18352]] MagicWand: A Universal Agent for Generation and Evaluation Aligned with User Preference(https://arxiv.org/abs/2511.18352)
Keywords: generation
Abstract: Recent advances in AIGC (Artificial Intelligence Generated Content) models have enabled significant progress in image and video generation. However, users still struggle to obtain content that aligns with their preferences due to the difficulty of crafting detailed prompts and the lack of mechanisms to retain their preferences. To address these challenges, we construct UniPrefer-100K, a large-scale dataset comprising images, videos, and associated text that describes the styles users tend to prefer. Based on UniPrefer-100K, we propose MagicWand, a universal generation and evaluation agent that enhances prompts based on user preferences, leverages advanced generation models for high-quality content, and applies preference-aligned evaluation and refinement. In addition, we introduce UniPreferBench, the first large-scale benchmark with over 120K annotations for assessing user preference alignment across diverse AIGC tasks. Experiments on UniPreferBench demonstrate that MagicWand consistently generates content and evaluations that are well aligned with user preferences across a wide range of scenarios.
摘要：AIGC（人工智能生成内容）模型的最新进展使图像和视频生成取得了重大进展。然而，由于难以制作详细的提示以及缺乏保留其偏好的机制，用户仍然难以获得符合其偏好的内容。为了解决这些挑战，我们构建了 UniPrefer-100K，这是一个包含图像、视频和描述用户倾向于喜欢的样式的相关文本的大型数据集。基于 UniPrefer-100K，我们提出了 MagicWand，这是一种通用生成和评估代理，它可以根据用户偏好增强提示，利用先进的生成模型来生成高质量内容，并应用符合偏好的评估和细化。此外，我们还引入了 UniPreferBench，这是第一个具有超过 120K 注释的大规模基准测试，用于评估跨不同 AIGC 任务的用户偏好一致性。 UniPreferBench 上的实验表明，MagicWand 在各种场景中始终能够生成与用户偏好非常一致的内容和评估。

Title: TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

Authors: Alexandros Stergiou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18359
Pdf URL: https://arxiv.org/pdf/2511.18359
Copy Paste: [[2511.18359]] TRANSPORTER: Transferring Visual Semantics from VLM Manifolds(https://arxiv.org/abs/2511.18359)
Keywords: generation, generative
Abstract: How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high visual fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLMs' high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.
摘要：视频理解模型如何获得答案？尽管当前的视觉语言模型（VLM）能够对具有不同对象、动作表演和场景动态的复杂场景进行推理，但理解和控制其内部过程仍然是一个开放的挑战。受文本到视频 (T2V) 生成模型最新进展的推动，本文引入了逻辑到视频 (L2V) 任务以及独立于模型的方法 TRANSPORTER，以生成捕获 VLM 预测背后的基本规则的视频。鉴于 T2V 模型产生的高视觉保真度，TRANSPORTER 学习与 VLM 的高语义嵌入空间的最佳传输耦合。反过来，逻辑分数定义了条件视频生成的嵌入方向。 TRANSPORTER 生成的视频反映了不同对象属性、动作副词和场景上下文的字幕变化。 VLM 的定量和定性评估表明，L2V 可以为模型可解释性提供一个以前从未探索过的保真度丰富的新颖方向。

Title: MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer

Authors: Zenghao Chai, Chen Tang, Yongkang Wong, Xulei Yang, Mohan Kankanhalli
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2511.18370
Pdf URL: https://arxiv.org/pdf/2511.18370
Copy Paste: [[2511.18370]] MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer(https://arxiv.org/abs/2511.18370)
Keywords: generation
Abstract: 3D pose transfer aims to transfer the pose-style of a source mesh to a target character while preserving both the target's geometry and the source's pose characteristics. Existing methods are largely restricted to characters with similar structures and fail to generalize to category-free settings (e.g., transferring a humanoid's pose to a quadruped). The key challenge lies in the structural and transformation diversity inherent in distinct character types, which often leads to mismatched regions and poor transfer quality. To address these issues, we first construct a million-scale pose dataset across hundreds of distinct characters. We further propose MimiCAT, a cascade-transformer model designed for category-free 3D pose transfer. Instead of relying on strict one-to-one correspondence mappings, MimiCAT leverages semantic keypoint labels to learn a novel soft correspondence that enables flexible many-to-many matching across characters. The pose transfer is then formulated as a conditional generation process, in which the source transformations are first projected onto the target through soft correspondence matching and subsequently refined using shape-conditioned representations. Extensive qualitative and quantitative experiments demonstrate that MimiCAT transfers plausible poses across different characters, significantly outperforming prior methods that are limited to narrow category transfer (e.g., humanoid-to-humanoid).
摘要：3D 姿势迁移旨在将源网格的姿势风格迁移到目标角色，同时保留目标的几何形状和源的姿势特征。现有方法很大程度上仅限于具有相似结构的角色，并且无法推广到无类别设置（例如，将人形机器人的姿势转移到四足动物）。关键的挑战在于不同字符类型固有的结构和转换多样性，这通常会导致区域不匹配和传输质量差。为了解决这些问题，我们首先构建了一个涵盖数百个不同角色的百万级姿势数据集。我们进一步提出了 MimiCAT，一种级联转换器模型，专为无类别 3D 姿势传输而设计。 MimiCAT 不依赖严格的一对一对应映射，而是利用语义关键点标签来学习一种新颖的软对应，从而实现跨字符的灵活多对多匹配。然后将姿势转移公式化为条件生成过程，其中源变换首先通过软对应匹配投影到目标上，然后使用形状条件表示进行细化。广泛的定性和定量实验表明，MimiCAT 在不同角色之间转移合理的姿势，明显优于仅限于窄类别转移（例如，人形到人形）的现有方法。

Title: Synthetic Curriculum Reinforces Compositional Text-to-Image Generation

Authors: Shijian Wang, Runhao Fu, Siyi Zhao, Qingqin Zhan, Xingjian Wang, Jiarui Jin, Yuan Lu, Hanqian Wu, Cunjian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18378
Pdf URL: https://arxiv.org/pdf/2511.18378
Copy Paste: [[2511.18378]] Synthetic Curriculum Reinforces Compositional Text-to-Image Generation(https://arxiv.org/abs/2511.18378)
Keywords: generation
Abstract: Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses compositional weakness in existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of training curriculum data that progressively optimize T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving compositional T2I generation systems.
摘要：文本到图像（T2I）的生成长期以来一直是一个悬而未决的问题，其中合成合成仍然特别具有挑战性。这项任务需要准确渲染包含多个对象的复杂场景，这些对象表现出不同的属性以及复杂的空间和语义关系，要求精确的对象放置和连贯的对象间交互。在本文中，我们提出了一种名为 CompGen 的新型作文课程强化学习框架，该框架解决了现有 T2I 模型中的作文弱点。具体来说，我们利用场景图建立了一种新颖的构图能力难度标准，并开发了相应的自适应马尔可夫链蒙特卡罗图采样算法。这种难度感知方法可以合成培训课程数据，通过强化学习逐步优化 T2I 模型。我们将我们的课程学习方法整合到组相对策略优化（GRPO）中，并研究不同的课程安排策略。我们的实验表明，CompGen 在不同的课程安排策略下表现出不同的缩放曲线，与随机采样相比，由易到难和高斯采样策略产生了优越的缩放性能。大量实验表明，CompGen 显着增强了基于扩散和自回归 T2I 模型的组合生成能力，突出了其在改进组合 T2I 生成系统方面的有效性。

Title: ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access

Authors: Timing Yang, Sucheng Ren, Alan Yuille, Feng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18382
Pdf URL: https://arxiv.org/pdf/2511.18382
Copy Paste: [[2511.18382]] ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access(https://arxiv.org/abs/2511.18382)
Keywords: generation
Abstract: Text-to-video generation has surged in interest since Sora, yet open-source models still face a data bottleneck: there is no large, high-quality, easily obtainable video-text corpus. Existing public datasets typically require manual YouTube crawling, which yields low usable volume due to link rot and access limits, and raises licensing uncertainty. This work addresses this challenge by introducing ViMix-14M, a curated multi-source video-text dataset of around 14 million pairs that provides crawl-free, download-ready access and long-form, high-quality captions tightly aligned to video. ViMix-14M is built by merging diverse open video sources, followed by unified de-duplication and quality filtering, and a multi-granularity, ground-truth-guided re-captioning pipeline that refines descriptions to better match actions, scenes, and temporal structure. We evaluate the dataset on multimodal retrieval, text-to-video generation, and video question answering tasks, observing consistent improvements over counterpart datasets. We hope this work can help remove the key barrier to training and fine-tuning open-source video foundation models, and provide insights into building high-quality and generalizable video-text datasets.
摘要：自 Sora 以来，文本到视频生成的兴趣激增，但开源模型仍然面临数据瓶颈：没有大型、高质量、易于获取的视频文本语料库。现有的公共数据集通常需要手动进行 YouTube 抓取，由于链接失效和访问限制，可用量较低，并增加了许可的不确定性。这项工作通过引入 ViMix-14M 来解决这一挑战，ViMix-14M 是一个精心策划的多源视频文本数据集，包含约 1400 万对，可提供免爬行、可下载的访问以及与视频紧密结合的长格式、高质量字幕。 ViMix-14M 的构建方式是合并不同的开放视频源，然后进行统一的重复数据删除和质量过滤，以及多粒度、地面实况引导的重新字幕管道，该管道可细化描述以更好地匹配动作、场景和时间结构。我们通过多模式检索、文本到视频生成和视频问答任务来评估数据集，观察相对于对应数据集的一致改进。我们希望这项工作能够帮助消除训练和微调开源视频基础模型的关键障碍，并提供构建高质量和可泛化的视频文本数据集的见解。

Title: SegSplat: Feed-forward Gaussian Splatting and Open-Set Semantic Segmentation

Authors: Peter Siegel, Federico Tombari, Marc Pollefeys, Daniel Barath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18386
Pdf URL: https://arxiv.org/pdf/2511.18386
Copy Paste: [[2511.18386]] SegSplat: Feed-forward Gaussian Splatting and Open-Set Semantic Segmentation(https://arxiv.org/abs/2511.18386)
Keywords: generation
Abstract: We have introduced SegSplat, a novel framework designed to bridge the gap between rapid, feed-forward 3D reconstruction and rich, open-vocabulary semantic understanding. By constructing a compact semantic memory bank from multi-view 2D foundation model features and predicting discrete semantic indices alongside geometric and appearance attributes for each 3D Gaussian in a single pass, SegSplat efficiently imbues scenes with queryable semantics. Our experiments demonstrate that SegSplat achieves geometric fidelity comparable to state-of-the-art feed-forward 3D Gaussian Splatting methods while simultaneously enabling robust open-set semantic segmentation, crucially without requiring any per-scene optimization for semantic feature integration. This work represents a significant step towards practical, on-the-fly generation of semantically aware 3D environments, vital for advancing robotic interaction, augmented reality, and other intelligent systems.
摘要：我们推出了 SegSplat，这是一个新颖的框架，旨在弥合快速前馈 3D 重建与丰富的开放词汇语义理解之间的差距。通过从多视图 2D 基础模型特征构建紧凑的语义存储库，并在单次传递中预测离散语义索引以及每个 3D 高斯的几何和外观属性，SegSplat 有效地为场景注入可查询的语义。我们的实验表明，SegSplat 实现了与最先进的前馈 3D 高斯分布方法相当的几何保真度，同时实现了鲁棒的开放集语义分割，至关重要的是不需要任何针对语义特征集成的每场景优化。这项工作代表了朝着实际、动态生成语义感知 3D 环境迈出的重要一步，对于推进机器人交互、增强现实和其他智能系统至关重要。

Title: When Generative Replay Meets Evolving Deepfakes: Domain-Aware Relative Weighting for Incremental Face Forgery Detection

Authors: Hao Shen, Jikang Cheng, Renye Yan, Zhongyuan Wang, Wei Peng, Baojin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18436
Pdf URL: https://arxiv.org/pdf/2511.18436
Copy Paste: [[2511.18436]] When Generative Replay Meets Evolving Deepfakes: Domain-Aware Relative Weighting for Incremental Face Forgery Detection(https://arxiv.org/abs/2511.18436)
Keywords: generation, generative
Abstract: The rapid advancement of face generation techniques has led to a growing variety of forgery methods. Incremental forgery detection aims to gradually update existing models with new forgery data, yet current sample replay-based methods are limited by low diversity and privacy concerns. Generative replay offers a potential solution by synthesizing past data, but its feasibility for forgery detection remains unclear. In this work, we systematically investigate generative replay and identify two scenarios: when the replay generator closely resembles the new forgery model, generated real samples blur the domain boundary, creating domain-risky samples; when the replay generator differs significantly, generated samples can be safely supervised, forming domain-safe samples. To exploit generative replay effectively, we propose a novel Domain-Aware Relative Weighting (DARW) strategy. DARW directly supervises domain-safe samples while applying a Relative Separation Loss to balance supervision and potential confusion for domain-risky samples. A Domain Confusion Score dynamically adjusts this tradeoff according to sample reliability. Extensive experiments demonstrate that DARW consistently improves incremental learning performance for forgery detection under different generative replay settings and alleviates the adverse impact of domain overlap.
摘要：人脸生成技术的快速进步导致伪造方法的种类越来越多。增量伪造检测旨在用新的伪造数据逐步更新现有模型，但当前基于样本重放的方法受到多样性低和隐私问题的限制。生成重放通过综合过去的数据提供了一种潜在的解决方案，但其伪造检测的可行性仍不清楚。在这项工作中，我们系统地研究了生成重放并确定了两种情况：当重放生成器与新的伪造模型非常相似时，生成的真实样本模糊了域边界，创建了域风险样本；当重放生成器差异显着时，可以安全地监督生成的样本，形成域安全样本。为了有效地利用生成重放，我们提出了一种新颖的领域感知相对权重（DARW）策略。 DARW 直接监督域安全样本，同时应用相对分离损失来平衡域风险样本的监督和潜在混淆。域混淆分数根据样本可靠性动态调整这种权衡。大量实验表明，DARW 持续提高了不同生成重放设置下伪造检测的增量学习性能，并减轻了域重叠的不利影响。

Title: NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering

Authors: Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome, Matthieu Cord
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18452
Pdf URL: https://arxiv.org/pdf/2511.18452
Copy Paste: [[2511.18452]] NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering(https://arxiv.org/abs/2511.18452)
Keywords: restoration
Abstract: Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at this https URL.
摘要：视觉基础模型（VFM）提取空间下采样表示，给像素级任务带来了挑战。现有的上采样方法面临着一个基本的权衡：经典滤波器快速且适用范围广泛，但依赖于固定形式，而现代上采样器通过可学习的、VFM 特定的形式实现卓越的精度，但代价是对每个 VFM 进行重新训练。我们引入了邻域注意力过滤（NAF），它通过仅由高分辨率输入图像引导的跨尺度邻域注意力和旋转位置嵌入（RoPE）来学习自适应空间和内容权重，从而弥补了这一差距。 NAF 运行零样本：它无需重新训练即可对任何 VFM 的特征进行上采样，使其成为第一个与 VFM 无关的架构，其性能优于 VFM 特定的上采样器，并在多个下游任务中实现最先进的性能。它保持高效率，缩放至 2K 特征图并以 18 FPS 重建中间分辨率图。除了特征上采样之外，NAF 在图像恢复方面也表现出强大的性能，凸显了其多功能性。代码和检查点可从此 https URL 获取。

Title: Robust Posterior Diffusion-based Sampling via Adaptive Guidance Scale

Authors: Liav Hen, Tom Tirer, Raja Giryes, Shady Abu-Hussein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18471
Pdf URL: https://arxiv.org/pdf/2511.18471
Copy Paste: [[2511.18471]] Robust Posterior Diffusion-based Sampling via Adaptive Guidance Scale(https://arxiv.org/abs/2511.18471)
Keywords: super-resolution, generative
Abstract: Diffusion models have recently emerged as powerful generative priors for solving inverse problems, achieving state-of-the-art results across various imaging tasks. A central challenge in this setting lies in balancing the contribution of the prior with the data fidelity term: overly aggressive likelihood updates may introduce artifacts, while conservative updates can slow convergence or yield suboptimal reconstructions. In this work, we propose an adaptive likelihood step-size strategy to guide the diffusion process for inverse-problem formulations. Specifically, we develop an observation-dependent weighting scheme based on the agreement between two different approximations of the intractable intermediate likelihood gradients, that adapts naturally to the diffusion schedule, time re-spacing, and injected stochasticity. The resulting approach, Adaptive Posterior diffusion Sampling (AdaPS), is hyperparameter-free and improves reconstruction quality across diverse imaging tasks - including super-resolution, Gaussian deblurring, and motion deblurring - on CelebA-HQ and ImageNet-256 validation sets. AdaPS consistently surpasses existing diffusion-based baselines in perceptual quality with minimal or no loss in distortion, without any task-specific tuning. Extensive ablation studies further demonstrate its robustness to the number of diffusion steps, observation noise levels, and varying stochasticity.
摘要：扩散模型最近已成为解决逆问题的强大生成先验，在各种成像任务中取得了最先进的结果。这种设置中的一个核心挑战在于平衡先验的贡献与数据保真度项：过于激进的似然更新可能会引入伪影，而保守的更新可能会减慢收敛速度或产生次优的重建。在这项工作中，我们提出了一种自适应似然步长策略来指导反问题公式的扩散过程。具体来说，我们基于棘手的中间似然梯度的两个不同近似值之间的一致性，开发了一种依赖于观察的加权方案，该方案自然地适应扩散计划、时间重新间隔和注入随机性。由此产生的方法自适应后扩散采样 (AdaPS) 是无超参数的，可提高 CelebA-HQ 和 ImageNet-256 验证集上不同成像任务（包括超分辨率、高斯去模糊和运动去模糊）的重建质量。 AdaPS 在感知质量方面始终超越现有的基于扩散的基线，失真损失最小或没有损失，无需任何特定于任务的调整。广泛的消融研究进一步证明了其对扩散步骤数、观察噪声水平和变化随机性的鲁棒性。

Title: Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion

Authors: Haidong Kang, Ketong Qian, Yi Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18516
Pdf URL: https://arxiv.org/pdf/2511.18516
Copy Paste: [[2511.18516]] Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion(https://arxiv.org/abs/2511.18516)
Keywords: generative
Abstract: Efforts to overcome catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL) have primarily focused on developing more effective gradient-based optimization strategies. In contrast, little attention has been paid to the training cost explosion that inevitably arises as the number of novel classes increases, a consequence of relying on gradient learning even under extreme data scarcity. More critically, since FSCIL typically provides only a few samples for each new class, gradient-based updates not only induce severe catastrophic forgetting on base classes but also hinder adaptation to novel ones. This paper seeks to break this long-standing limitation by asking: Can we design a training-free FSCIL paradigm that entirely removes gradient optimization? We provide an affirmative answer by uncovering an intriguing connection between gradient-based optimization and the Conditional Diffusion process. Building on this observation, we propose a Conditional Diffusion-driven FSCIL (CD-FSCIL) framework that substitutes the conventional gradient update process with a diffusion-based generative transition, enabling training-free incremental adaptation while effectively mitigating forgetting. Furthermore, to enhance representation under few-shot constraints, we introduce a multimodal learning strategy that integrates visual features with natural language descriptions automatically generated by Large Language Models (LLMs). This synergy substantially alleviates the sample scarcity issue and improves generalization across novel classes. Extensive experiments on mainstream FSCIL benchmarks demonstrate that our method not only achieves state-of-the-art performance but also drastically reduces computational and memory overhead, marking a paradigm shift toward training-free continual adaptation.
摘要：在少样本类增量学习（FSCIL）中克服灾难性遗忘的努力主要集中在开发更有效的基于梯度的优化策略。相比之下，很少有人关注随着新类别数量的增加而不可避免地出现的训练成本爆炸，这是即使在数据极度稀缺的情况下也依赖梯度学习的结果。更关键的是，由于 FSCIL 通常只为每个新类提供几个样本，因此基于梯度的更新不仅会导致基类发生严重的灾难性遗忘，而且还会阻碍对新类的适应。本文试图通过提出以下问题来打破这一长期存在的限制：我们能否设计一种完全消除梯度优化的免训练 FSCIL 范式？我们通过揭示基于梯度的优化和条件扩散过程之间有趣的联系来提供肯定的答案。基于这一观察，我们提出了一种条件扩散驱动的 FSCIL (CD-FSCIL) 框架，该框架用基于扩散的生成过渡替代传统的梯度更新过程，从而实现无训练的增量适应，同时有效减轻遗忘。此外，为了增强少数镜头约束下的表示，我们引入了一种多模态学习策略，该策略将视觉特征与大型语言模型（LLM）自动生成的自然语言描述相结合。这种协同作用大大缓解了样本稀缺问题，并提高了跨新类别的泛化能力。对主流 FSCIL 基准的大量实验表明，我们的方法不仅实现了最先进的性能，而且还大大减少了计算和内存开销，标志着向免训练持续适应的范式转变。

Title: Hyperspectral Variational Autoencoders for Joint Data Compression and Component Extraction

Authors: Core Francisco Park, Manuel Perez-Carrasco, Caroline Nowlan, Cecilia Garraffo
Subjects: cs.LG, astro-ph.EP, astro-ph.IM
Abstract URL: https://arxiv.org/abs/2511.18521
Pdf URL: https://arxiv.org/pdf/2511.18521
Copy Paste: [[2511.18521]] Hyperspectral Variational Autoencoders for Joint Data Compression and Component Extraction(https://arxiv.org/abs/2511.18521)
Keywords: generation
Abstract: Geostationary hyperspectral satellites generate terabytes of data daily, creating critical challenges for storage, transmission, and distribution to the scientific community. We present a variational autoencoder (VAE) approach that achieves 514x compression of NASA's TEMPO satellite hyperspectral observations (1028 channels, 290-490nm) with reconstruction errors 1-2 orders of magnitude below the signal across all wavelengths. This dramatic data volume reduction enables efficient archival and sharing of satellite observations while preserving spectral fidelity. Beyond compression, we investigate to what extent atmospheric information is retained in the compressed latent space by training linear and nonlinear probes to extract Level-2 products (NO2, O3, HCHO, cloud fraction). Cloud fraction and total ozone achieve strong extraction performance (R^2 = 0.93 and 0.81 respectively), though these represent relatively straightforward retrievals given their distinct spectral signatures. In contrast, tropospheric trace gases pose genuine challenges for extraction (NO2 R^2 = 0.20, HCHO R^2 = 0.51) reflecting their weaker signals and complex atmospheric interactions. Critically, we find the VAE encodes atmospheric information in a semi-linear manner - nonlinear probes substantially outperform linear ones - and that explicit latent supervision during training provides minimal improvement, revealing fundamental encoding challenges for certain products. This work demonstrates that neural compression can dramatically reduce hyperspectral data volumes while preserving key atmospheric signals, addressing a critical bottleneck for next-generation Earth observation systems. Code - this https URL
摘要：对地静止高光谱卫星每天生成数 TB 的数据，给科学界的存储、传输和分发带来了严峻的挑战。我们提出了一种变分自动编码器 (VAE) 方法，可实现 NASA TEMPO 卫星高光谱观测数据（1028 通道，290-490nm）的 514 倍压缩，并且所有波长的重建误差均低于信号 1-2 个数量级。数据量的大幅减少可以实现卫星观测数据的高效归档和共享，同时保持光谱保真度。除了压缩之外，我们还通过训练线性和非线性探针来提取 2 级产品（NO2、O3、HCHO、云分数）来研究大气信息在压缩的潜在空间中保留的程度。云分数和总臭氧实现了强大的提取性能（R^2 分别 = 0.93 和 0.81），尽管考虑到其独特的光谱特征，这些代表相对简单的检索。相比之下，对流层痕量气体对提取提出了真正的挑战（NO2 R^2 = 0.20，HCHO R^2 = 0.51），反映了它们较弱的信号和复杂的大气相互作用。重要的是，我们发现 VAE 以半线性方式编码大气信息 - 非线性探针大大优于线性探针 - 并且训练期间的显式潜在监督提供了最小的改进，揭示了某些产品的基本编码挑战。这项工作表明，神经压缩可以显着减少高光谱数据量，同时保留关键的大气信号，解决下一代地球观测系统的关键瓶颈。代码 - 此 https URL
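
Note: the R^2 values quoted in the abstract above are coefficients of determination for the probe predictions. The abstract does not describe the evaluation code, so the following is only a minimal sketch assuming the standard definition R^2 = 1 - SS_res / SS_tot; the function name `r_squared` and the sample values are hypothetical and not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, assuming the standard definition
    R^2 = 1 - SS_res / SS_tot (not necessarily the authors' exact code)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the probe's prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variance of the targets around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the targets are constant")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical probe outputs, for illustration only.
    targets = [1.0, 2.0, 3.0, 4.0]
    predictions = [1.1, 1.9, 3.2, 3.8]
    print(round(r_squared(targets, predictions), 4))
```

Under this definition, a probe that always predicts the mean of the targets scores 0 and a perfect probe scores 1, which gives a sense of scale for the 0.20 reported for NO2 against the 0.93 reported for cloud fraction.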

Title: Zero-Shot Video Deraining with Video Diffusion Models

Authors: Tuomas Varanka, Juan Luis Gonzalez, Hyeongwoo Kim, Pablo Garrido, Xu Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18537
Pdf URL: https://arxiv.org/pdf/2511.18537
Copy Paste: [[2511.18537]] Zero-Shot Video Deraining with Video Diffusion Models(https://arxiv.org/abs/2511.18537)
Keywords: generative
Abstract: Existing video deraining methods are often trained on paired datasets, either synthetic, which limits their ability to generalize to real-world rain, or captured by static cameras, which restricts their effectiveness in dynamic scenes with background and camera motion. Furthermore, recent works in fine-tuning diffusion models have shown promising results, but the fine-tuning tends to weaken the generative prior, limiting generalization to unseen cases. In this paper, we introduce the first zero-shot video deraining method for complex dynamic scenes that requires neither synthetic data nor model fine-tuning, by leveraging a pretrained text-to-video diffusion model that demonstrates strong generalization capabilities. By inverting an input video into the latent space of diffusion models, its reconstruction process can be intervened and pushed away from the model's concept of rain using negative prompting. At the core of our approach is an attention switching mechanism that we found to be crucial for maintaining dynamic backgrounds as well as structural consistency between the input and the derained video, mitigating artifacts introduced by naive negative prompting. Our approach is validated through extensive experiments on real-world rain datasets, demonstrating substantial improvements over prior methods and showcasing robust generalization without the need for supervised training.
摘要：现有的视频去雨方法通常在配对数据集上进行训练，这些数据集要么是合成的，这限制了它们推广到现实世界降雨的能力，要么是由静态摄像机捕获的，这限制了它们在具有背景和摄像机运动的动态场景中的有效性。此外，最近在微调扩散模型方面的工作已经显示出有希望的结果，但微调往往会削弱生成先验，从而限制了对未见过情况的泛化。在本文中，我们通过利用预训练的文本到视频扩散模型，介绍了第一个针对复杂动态场景的零镜头视频去雨方法，该方法不需要合成数据，也不需要模型微调，该模型表现出强大的泛化能力。通过将输入视频反转到扩散模型的潜在空间中，可以使用负提示来干预其重建过程并使其远离模型的下雨概念。我们方法的核心是注意力切换机制，我们发现该机制对于维持动态背景以及输入和脱色视频之间的结构一致性至关重要，从而减轻天真的负面提示引入的伪影。我们的方法通过对现实世界降雨数据集的广泛实验进行了验证，证明了对先前方法的实质性改进，并展示了强大的泛化能力，而无需监督训练。

Title: TimePre: Bridging Accuracy, Efficiency, and Stability in Probabilistic Time-Series Forecasting

Authors: Lingyu Jiang, Lingyu Xu, Peiran Li, Qianwen Ge, Dingyi Zhuang, Shuo Xing, Wenjing Chen, Xiangbo Gao, Ting-Hsuan Chen, Xueying Zhan, Xin Zhang, Ziming Zhang, Zhengzhong Tu, Michael Zielewski, Kazunori Yamada, Fangzhou Lin
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.18539
Pdf URL: https://arxiv.org/pdf/2511.18539
Copy Paste: [[2511.18539]] TimePre: Bridging Accuracy, Efficiency, and Stability in Probabilistic Time-Series Forecasting(https://arxiv.org/abs/2511.18539)
Keywords: generative
Abstract: Probabilistic Time-Series Forecasting (PTSF) is critical for uncertainty-aware decision making, but existing generative models, such as diffusion-based approaches, are computationally prohibitive due to expensive iterative sampling. Non-sampling frameworks like Multiple Choice Learning (MCL) offer an efficient alternative, but suffer from severe training instability and hypothesis collapse, which has historically hindered their performance. This problem is dramatically exacerbated when attempting to combine them with modern, efficient MLP-based backbones. To resolve this fundamental incompatibility, we propose TimePre, a novel framework that successfully unifies the efficiency of MLP-based models with the distributional flexibility of the MCL paradigm. The core of our solution is Stabilized Instance Normalization (SIN), a novel normalization layer that explicitly remedies this incompatibility. SIN stabilizes the hybrid architecture by correcting channel-wise statistical shifts, definitively resolving the catastrophic hypothesis collapse. Extensive experiments on six benchmark datasets demonstrate that TimePre achieves new state-of-the-art accuracy on key probabilistic metrics. Critically, TimePre achieves inference speeds orders of magnitude faster than sampling-based models and, unlike prior MCL work, demonstrates stable performance scaling. It thus bridges the long-standing gap between accuracy, efficiency, and stability in probabilistic forecasting.
摘要：概率时间序列预测 (PTSF) 对于不确定性感知决策至关重要，但现有的生成模型（例如基于扩散的方法）由于昂贵的迭代采样而在计算上难以实现。像多重选择学习（MCL）这样的非抽样框架提供了一种有效的替代方案，但遭受严重的训练不稳定和假设崩溃的困扰，这在历史上阻碍了它们的性能。当尝试将它们与现代、高效的基于 MLP 的主干网结合起来时，这个问题会急剧恶化。为了解决这种根本性的不兼容性，我们提出了 TimePre，这是一种新颖的框架，成功地将基于 MLP 的模型的效率与 MCL 范式的分布灵活性结合起来。我们解决方案的核心是稳定实例标准化（SIN），这是一种新颖的标准化层，可以显式地解决这种不兼容性。 SIN 通过纠正通道统计变化来稳定混合架构，最终解决灾难性假设崩溃的问题。对六个基准数据集的广泛实验表明，TimePre 在关键概率指标上实现了新的最先进的准确性。至关重要的是，TimePre 的推理速度比基于采样的模型快几个数量级，并且与之前的 MCL 工作不同，它展示了稳定的性能扩展。因此，它弥合了概率预测的准确性、效率和稳定性之间长期存在的差距。

Title: Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation

Authors: Wei Dong, Han Zhou, Junwei Lin, Jun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18591
Pdf URL: https://arxiv.org/pdf/2511.18591
Copy Paste: [[2511.18591]] Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation(https://arxiv.org/abs/2511.18591)
Keywords: restoration, generative
Abstract: Real-world dark images commonly exhibit not only low visibility and contrast but also complex noise and blur, posing significant restoration challenges. Existing methods often rely on paired data or fail to model dynamic illumination and blur characteristics, leading to poor generalization. To tackle this, we propose a generative framework based on visual autoregressive (VAR) modeling, guided by perceptual priors from the vision-language model (VLM). Specifically, to supply informative conditioning cues for VAR models, we deploy an adaptive curve estimation scheme to modulate the diverse illumination based on VLM-derived visibility scores. In addition, we integrate dynamic and spatial-frequency-aware Rotary Positional Encodings (SF-RoPE) into VAR to enhance its ability to model structures degraded by blur. Furthermore, we propose a recursive phase-domain modulation strategy that mitigates blur-induced artifacts in the phase domain via bounded iterative refinement guided by VLM-assessed blur scores. Our framework is fully unsupervised and achieves state-of-the-art performance on benchmark datasets.
摘要：现实世界的暗图像通常不仅表现出低可见度和对比度，而且还表现出复杂的噪声和模糊，给恢复带来了巨大的挑战。现有方法通常依赖于成对数据或无法对动态照明和模糊特征进行建模，导致泛化能力较差。为了解决这个问题，我们提出了一个基于视觉自回归（VAR）模型的生成框架，以视觉语言模型（VLM）的感知先验为指导。具体来说，为了为 VAR 模型提供信息丰富的调节线索，我们部署了自适应曲线估计方案，以根据 VLM 导出的可见度分数来调节不同的照明。此外，我们将动态和空间频率感知旋转位置编码（SF-RoPE）集成到 VAR 中，以增强其对因模糊而退化的结构进行建模的能力。此外，我们提出了一种递归相域调制策略，该策略通过由 VLM 评估的模糊分数引导的有界迭代细化来减轻相域中模糊引起的伪影。我们的框架完全不受监督，并在基准数据集上实现了最先进的性能。

Title: Generative Myopia: Why Diffusion Models Fail at Structure

Authors: Milad Siami
Subjects: cs.LG, eess.SY, math.SP
Abstract URL: https://arxiv.org/abs/2511.18593
Pdf URL: https://arxiv.org/pdf/2511.18593
Copy Paste: [[2511.18593]] Generative Myopia: Why Diffusion Models Fail at Structure(https://arxiv.org/abs/2511.18593)
Keywords: generative
Abstract: Graph Diffusion Models (GDMs) optimize for statistical likelihood, implicitly acting as \textbf{frequency filters} that favor abundant substructures over spectrally critical ones. We term this phenomenon \textbf{Generative Myopia}. In combinatorial tasks like graph sparsification, this leads to the catastrophic removal of ``rare bridges,'' edges that are structurally mandatory ($R_{\text{eff}} \approx 1$) but statistically scarce. We prove theoretically and empirically that this failure is driven by \textbf{Gradient Starvation}: the optimization landscape itself suppresses rare structural signals, rendering them unlearnable regardless of model capacity. To resolve this, we introduce \textbf{Spectrally-Weighted Diffusion}, which re-aligns the variational objective using Effective Resistance. We demonstrate that spectral priors can be amortized into the training phase with zero inference overhead. Our method eliminates myopia, matching the performance of an optimal Spectral Oracle and achieving \textbf{100\% connectivity} on adversarial benchmarks where standard diffusion fails completely (0\%).
摘要：图扩散模型 (GDM) 针对统计可能性进行优化，隐式充当 \textbf{频率滤波器}，有利于丰富的子结构而不是光谱关键的子结构。我们将这种现象称为\textbf{生成性近视}。在图稀疏化等组合任务中，这会导致灾难性地删除“稀有桥梁”，这些边在结构上是强制性的（$R_{\text{eff}} \approx 1$），但在统计上却很少。我们从理论上和经验上证明，这种失败是由梯度饥饿驱动的：优化景观本身抑制了罕见的结构信号，无论模型容量如何，都使得它们无法学习。为了解决这个问题，我们引入了 \textbf{谱加权扩散}，它使用有效阻力重新调整变分目标。我们证明了谱先验可以以零推理开销分摊到训练阶段。我们的方法消除了近视，匹配最佳 Spectral Oracle 的性能，并在标准扩散完全失败 (0\%) 的对抗性基准上实现 \textbf{100\% 连接性}。
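The abstract above characterizes "rare bridges" by an effective resistance of roughly 1. As a minimal sketch of that quantity (not the paper's implementation), the snippet below computes the effective resistance of a node pair in an unweighted, connected graph by grounding one node and solving the reduced Laplacian system. The function name `effective_resistance` and the two-triangle example graph are illustrative assumptions.

```python
def effective_resistance(n, edges, u, v):
    """Effective resistance between distinct nodes u and v.

    Assumes an unweighted, connected graph on nodes 0..n-1 (hypothetical helper).
    """
    # Build the graph Laplacian L = D - A.
    L = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        L[a][a] += 1.0
        L[b][b] += 1.0
        L[a][b] -= 1.0
        L[b][a] -= 1.0

    # Ground node v: drop its row and column, then solve L_red x = e_u.
    keep = [i for i in range(n) if i != v]
    m = len(keep)
    A = [[L[i][j] for j in keep] + [1.0 if i == u else 0.0] for i in keep]

    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]

    # With unit current injected at u and extracted at v, the potential at u
    # equals the effective resistance between u and v.
    k = keep.index(u)
    return A[k][m] / A[k][k]


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
bridge_r = effective_resistance(6, edges, 2, 3)
triangle_r = effective_resistance(6, edges, 0, 1)
print(bridge_r, triangle_r)
```

In this toy graph the bridge edge has effective resistance 1, while an edge inside a triangle has 2/3 because a parallel two-edge path exists. How the paper turns this quantity into a loss weight is not specified in the abstract, so that step is left out.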

Title: Functional Localization Enforced Deep Anomaly Detection Using Fundus Images

Authors: Jan Benedikt Ruhland, Thorsten Papenbrock, Jan-Peter Sowa, Ali Canbay, Nicole Eter, Bernd Freisleben, Dominik Heider
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.18627
Pdf URL: https://arxiv.org/pdf/2511.18627
Copy Paste: [[2511.18627]] Functional Localization Enforced Deep Anomaly Detection Using Fundus Images(https://arxiv.org/abs/2511.18627)
Keywords: generation
Abstract: Reliable detection of retinal diseases from fundus images is challenged by the variability in imaging quality, subtle early-stage manifestations, and domain shift across datasets. In this study, we systematically evaluated a Vision Transformer (ViT) classifier under multiple augmentation and enhancement strategies across several heterogeneous public datasets, as well as the AEyeDB dataset, a high-quality fundus dataset created in-house and made available for the research community. The ViT demonstrated consistently strong performance, with accuracies ranging from 0.789 to 0.843 across datasets and diseases. Diabetic retinopathy and age-related macular degeneration were detected reliably, whereas glaucoma remained the most frequently misclassified disease. Geometric and color augmentations provided the most stable improvements, while histogram equalization benefited datasets dominated by structural subtlety. Laplacian enhancement reduced performance across different settings. On the Papila dataset, the ViT with geometric augmentation achieved an AUC of 0.91, outperforming previously reported convolutional ensemble baselines (AUC of 0.87), underscoring the advantages of transformer architectures and multi-dataset training. To complement the classifier, we developed a GANomaly-based anomaly detector, achieving an AUC of 0.76 while providing inherent reconstruction-based explainability and robust generalization to unseen data. Probabilistic calibration using GUESS enabled threshold-independent decision support for future clinical implementation.
摘要：从眼底图像可靠地检测视网膜疾病面临着成像质量的可变性、微妙的早期表现以及跨数据集的域转移的挑战。在这项研究中，我们在多个异构公共数据集以及 AEyeDB 数据集（内部创建并提供给研究界的高质量眼底数据集）的多种增强和增强策略下系统地评估了视觉变换器 (ViT) 分类器。 ViT 表现出一贯的强劲性能，在数据集和疾病方面的准确度范围为 0.789 到 0.843。糖尿病视网膜病变和年龄相关性黄斑变性得到了可靠检测，而青光眼仍然是最常被错误分类的疾病。几何和颜色增强提供了最稳定的改进，而直方图均衡化则有利于以结构微妙为主导的数据集。拉普拉斯增强降低了不同设置下的性能。在 Papila 数据集上，具有几何增强的 ViT 达到了 0.91 的 AUC，优于之前报道的卷积集成基线（AUC 为 0.87），凸显了 Transformer 架构和多数据集训练的优势。为了补充分类器，我们开发了一种基于 GANomaly 的异常检测器，实现了 0.76 的 AUC，同时提供了固有的基于重建的可解释性和对未见数据的稳健泛化。使用 GUESS 进行概率校准可为未来的临床实施提供与阈值无关的决策支持。

Title: Health system learning achieves generalist neuroimaging models

Authors: Akhil Kondepudi, Akshay Rao, Chenhui Zhao, Yiwei Lyu, Samir Harake, Soumyanil Banerjee, Rushikesh Joshi, Anna-Katharina Meissner, Renly Hou, Cheng Jiang, Asadur Chowdury, Ashok Srinivasan, Brian Athey, Vikas Gulani, Aditya Pandey, Honglak Lee, Todd Hollon
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.18640
Pdf URL: https://arxiv.org/pdf/2511.18640
Copy Paste: [[2511.18640]] Health system learning achieves generalist neuroimaging models(https://arxiv.org/abs/2511.18640)
Keywords: generation
Abstract: Frontier artificial intelligence (AI) models, such as OpenAI's GPT-5 and Meta's DINOv3, have advanced rapidly through training on internet-scale public data, yet such systems lack access to private clinical data. Neuroimaging, in particular, is underrepresented in the public domain due to identifiable facial features within MRI and CT scans, fundamentally restricting model performance in clinical medicine. Here, we show that frontier models underperform on neuroimaging tasks and that learning directly from uncurated data generated during routine clinical care at health systems, a paradigm we call health system learning, yields high-performance, generalist neuroimaging models. We introduce NeuroVFM, a visual foundation model trained on 5.24 million clinical MRI and CT volumes using a scalable volumetric joint-embedding predictive architecture. NeuroVFM learns comprehensive representations of brain anatomy and pathology, achieving state-of-the-art performance across multiple clinical tasks, including radiologic diagnosis and report generation. The model exhibits emergent neuroanatomic understanding and interpretable visual grounding of diagnostic findings. When paired with open-source language models through lightweight visual instruction tuning, NeuroVFM generates radiology reports that surpass frontier models in accuracy, clinical triage, and expert preference. Through clinically grounded visual understanding, NeuroVFM reduces hallucinated findings and critical errors, offering safer clinical decision support. These results establish health system learning as a paradigm for building generalist medical AI and provide a scalable framework for clinical foundation models.
摘要：前沿人工智能 (AI) 模型，例如 OpenAI 的 GPT-5 和 Meta 的 DINOv3，通过互联网规模公共数据的培训取得了快速发展，但此类系统缺乏对私人临床数据的访问。尤其是神经影像学，由于 MRI 和 CT 扫描中可识别的面部特征，在公共领域的代表性不足，从根本上限制了模型在临床医学中的表现。在这里，我们表明前沿模型在神经影像任务上表现不佳，并且直接从卫生系统常规临床护理期间生成的未经整理的数据中学习（我们称之为卫生系统学习的范例）可以产生高性能、通用的神经影像模型。我们推出 NeuroVFM，这是一种视觉基础模型，使用可扩展的体积关节嵌入预测架构，在 524 万临床 MRI 和 CT 体积上进行了训练。 NeuroVFM 学习大脑解剖学和病理学的全面表征，在多项临床任务（包括放射诊断和报告生成）中实现最先进的性能。该模型展示了新兴的神经解剖学理解和诊断结果的可解释的视觉基础。当通过轻量级视觉指令调整与开源语言模型配合使用时，NeuroVFM 生成的放射学报告在准确性、临床分类和专家偏好方面超越了前沿模型。通过基于临床的视觉理解，NeuroVFM 减少了幻觉结果和严重错误，提供更安全的临床决策支持。这些结果将卫生系统学习确立为构建通用医疗人工智能的范例，并为临床基础模型提供了可扩展的框架。

Title: From Healthy Scans to Annotated Tumors: A Tumor Fabrication Framework for 3D Brain MRI Synthesis

Authors: Nayu Dong, Townim Chowdhury, Hieu Phan, Mark Jenkinson, Johan Verjans, Zhibin Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18654
Pdf URL: https://arxiv.org/pdf/2511.18654
Copy Paste: [[2511.18654]] From Healthy Scans to Annotated Tumors: A Tumor Fabrication Framework for 3D Brain MRI Synthesis(https://arxiv.org/abs/2511.18654)
Keywords: generative
Abstract: The scarcity of annotated Magnetic Resonance Imaging (MRI) tumor data presents a major obstacle to accurate and automated tumor segmentation. While existing data synthesis methods offer promising solutions, they often suffer from key limitations: manual modeling is labor-intensive and requires expert knowledge, while deep generative models can augment data and annotations but typically demand large amounts of training pairs in the first place, which is impractical in data-limited clinical settings. In this work, we propose Tumor Fabrication (TF), a novel two-stage framework for unpaired 3D brain tumor synthesis. The framework comprises a coarse tumor synthesis process followed by a refinement process powered by a generative model. TF is fully automated and leverages only healthy image scans along with a limited amount of real annotated data to synthesize large volumes of paired synthetic data for enriching downstream supervised segmentation training. We demonstrate that our synthetic image-label pairs used as data enrichment can significantly improve performance on downstream tumor segmentation tasks in low-data regimes, offering a scalable and reliable solution for medical image enrichment and addressing critical challenges in data scarcity for clinical AI applications.
摘要：带注释的磁共振成像 (MRI) 肿瘤数据的缺乏给准确和自动化的肿瘤分割带来了主要障碍。虽然现有的数据合成方法提供了有前景的解决方案，但它们常常受到关键限制：手动建模是劳动密集型的，并且需要专业知识。深度生成模型可用于增强数据和注释，但它们通常首先需要大量的训练对，这在数据有限的临床环境中是不切实际的。在这项工作中，我们提出了肿瘤制造（TF），这是一种用于不配对 3D 脑肿瘤合成的新型两阶段框架。该框架包括粗略的肿瘤合成过程和由生成模型驱动的细化过程。 TF 是完全自动化的，仅利用健康的图像扫描以及有限数量的真实注释数据来合成大量配对合成数据，以丰富下游监督分割训练。我们证明，用作数据丰富的合成图像标签对可以显着提高低数据状态下下游肿瘤分割任务的性能，为医学图像丰富提供可扩展且可靠的解决方案，并解决临床人工智能应用数据稀缺的关键挑战。

Title: Data Augmentation Strategies for Robust Lane Marking Detection

Authors: Flora Lian, Dinh Quang Huynh, Hector Penades, J. Stephany Berrio Perez, Mao Shan, Stewart Worrall
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2511.18668
Pdf URL: https://arxiv.org/pdf/2511.18668
Copy Paste: [[2511.18668]] Data Augmentation Strategies for Robust Lane Marking Detection(https://arxiv.org/abs/2511.18668)
Keywords: generative
Abstract: Robust lane detection is essential for advanced driver assistance and autonomous driving, yet models trained on public datasets such as CULane often fail to generalise across different camera viewpoints. This paper addresses the challenge of domain shift for side-mounted cameras used in lane-wheel monitoring by introducing a generative AI-based data enhancement pipeline. The approach combines geometric perspective transformation, AI-driven inpainting, and vehicle body overlays to simulate deployment-specific viewpoints while preserving lane continuity. We evaluated the effectiveness of the proposed augmentation on two state-of-the-art models, SCNN and UFLDv2. When trained on the augmented data, both models show improved robustness to different conditions, including shadows. The experimental results demonstrate gains in precision, recall, and F1 score compared to the pre-trained model. By bridging the gap between widely available datasets and deployment-specific scenarios, our method provides a scalable and practical framework to improve the reliability of lane detection in a pilot deployment scenario.
摘要：强大的车道检测对于高级驾驶员辅助和自动驾驶至关重要，但在 CULane 等公共数据集上训练的模型通常无法泛化不同的摄像机视点。本文通过引入基于人工智能的生成数据增强管道，解决了用于车道车轮监控的侧装摄像头的域转移挑战。该方法结合了几何透视变换、人工智能驱动的修复和车身叠加，以模拟特定于部署的视点，同时保持车道连续性。我们在两个最先进的模型 SCNN 和 UFLDv2 中评估了所提出的增强的有效性。通过训练增强数据，这两个模型都显示出对不同条件（包括阴影）的鲁棒性有所提高。实验结果表明，与预训练模型相比，准确率、召回率和 F1 分数都有所提高。通过弥合广泛可用的数据集和特定于部署的场景之间的差距，我们的方法提供了一个可扩展且实用的框架，以提高试点部署场景中车道检测的可靠性。

Title: Sphinx: Efficiently Serving Novel View Synthesis using Regression-Guided Selective Refinement

Authors: Yuchen Xia, Souvik Kundu, Mosharaf Chowdhury, Nishil Talati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18672
Pdf URL: https://arxiv.org/pdf/2511.18672
Copy Paste: [[2511.18672]] Sphinx: Efficiently Serving Novel View Synthesis using Regression-Guided Selective Refinement(https://arxiv.org/abs/2511.18672)
Keywords: generation
Abstract: Novel View Synthesis (NVS) is the task of generating new images of a scene from viewpoints that were not part of the original input. Diffusion-based NVS can generate high-quality, temporally consistent images but remains computationally prohibitive. Conversely, regression-based NVS offers suboptimal generation quality despite requiring significantly less compute, leaving the design of a high-quality, inference-efficient NVS framework an open challenge. To close this critical gap, we present Sphinx, a training-free hybrid inference framework that achieves diffusion-level fidelity at significantly lower compute. Sphinx uses regression-based fast initialization to guide and reduce the denoising workload for the diffusion model. Additionally, it integrates selective refinement with adaptive noise scheduling, allocating more compute to uncertain regions and frames. This enables Sphinx to provide flexible navigation of the performance-quality trade-off, allowing adaptation to latency and fidelity requirements for dynamically changing inference scenarios. Our evaluation shows that Sphinx achieves an average 1.8x speedup over diffusion model inference with negligible perceptual degradation of less than 5%, establishing a new Pareto frontier between quality and latency in NVS serving.
摘要：新视图合成 (NVS) 是从不属于原始输入一部分的视点生成场景新图像的任务。基于扩散的 NVS 可以生成高质量、时间一致的图像，但计算量仍然过高。相反，基于回归的 NVS 提供了次优的生成质量，尽管所需的计算量明显较低；高质量、推理高效的 NVS 框架的设计目标成为一个开放的挑战。为了弥补这一关键差距，我们提出了 Sphinx，这是一种无需训练的混合推理框架，可以以显着较低的计算量实现扩散级保真度。 Sphinx提出使用基于回归的快速初始化来指导和减少扩散模型的去噪工作量。此外，它将选择性细化与自适应噪声调度相结合，允许对不确定区域和帧进行更多计算。这使得 Sphinx 能够提供性能与质量权衡的灵活导航，从而能够适应动态变化的推理场景的延迟和保真度要求。我们的评估表明，Sphinx 的速度比扩散模型推理平均提高了 1.8 倍，感知下降幅度可忽略不计，不到 5%，在 NVS 服务的质量和延迟之间建立了新的帕累托前沿。

Title: Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers

Authors: Yiqing Shi, Yiren Song, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18673
Pdf URL: https://arxiv.org/pdf/2511.18673
Copy Paste: [[2511.18673]] Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers(https://arxiv.org/abs/2511.18673)
Keywords: generation
Abstract: Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields up to faster runtime while training on relatively small datasets. Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.
摘要：扩散变换器的最新进展在视觉合成方面表现出了显着的通用性，但大多数密集感知方法仍然依赖于为随机生成而设计的文本到图像（T2I）生成器。我们重新审视这个范式，并表明图像编辑扩散模型本质上是图像到图像一致的，为密集感知任务提供了更合适的基础。我们引入了 Edit2Perceive，这是一个统一的扩散框架，可适应深度、法线和抠图的编辑模型。我们的方法基于 FLUX.1 Kontext 架构，采用全参数微调和像素空间一致性损失来强制跨中间去噪状态进行结构保留细化。此外，我们的单步确定性推理在相对较小的数据集上进行训练时可以产生更快的运行时间。大量的实验证明了所有三项任务的综合最先进的结果，揭示了面向编辑的扩散变压器在几何感知感知方面的强大潜力。

Title: Neural Geometry Image-Based Representations with Optimal Transport (OT)

Authors: Xiang Gao, Yuanpeng Liu, Xinmu Wang, Jiazhi Li, Minghao Guo, Yu Guo, Xiyun Song, Heather Yu, Zhiqiang Lao, Xianfeng David Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18679
Pdf URL: https://arxiv.org/pdf/2511.18679
Copy Paste: [[2511.18679]] Neural Geometry Image-Based Representations with Optimal Transport (OT)(https://arxiv.org/abs/2511.18679)
Keywords: restoration, super-resolution
Abstract: Neural representations for 3D meshes are emerging as an effective solution for compact storage and efficient processing. Existing methods often rely on neural overfitting, where a coarse mesh is stored and progressively refined through multiple decoder networks. While this can restore high-quality surfaces, it is computationally expensive due to successive decoding passes and the irregular structure of mesh data. In contrast, images have a regular structure that enables powerful super-resolution and restoration frameworks, but applying these advantages to meshes is difficult because their irregular connectivity demands complex encoder-decoder architectures. Our key insight is that a geometry image-based representation transforms irregular meshes into a regular image grid, making efficient image-based neural processing directly applicable. Building on this idea, we introduce our neural geometry image-based representation, which is decoder-free, storage-efficient, and naturally suited for neural processing. It stores a low-resolution geometry-image mipmap of the surface, from which high-quality meshes are restored in a single forward pass. To construct geometry images, we leverage Optimal Transport (OT), which resolves oversampling in flat regions and undersampling in feature-rich regions, and enables continuous levels of detail (LoD) through geometry-image mipmapping. Experimental results demonstrate state-of-the-art storage efficiency and restoration accuracy, measured by compression ratio (CR), Chamfer distance (CD), and Hausdorff distance (HD).
摘要：3D 网格的神经表示正在成为紧凑存储和高效处理的有效解决方案。现有方法通常依赖于神经过度拟合，其中存储粗网格并通过多个解码器网络逐步细化。虽然这可以恢复高质量的表面，但由于连续的解码过程和网格数据的不规则结构，计算成本很高。相比之下，图像具有规则的结构，可以实现强大的超分辨率和恢复框架，但将这些优势应用于网格是很困难的，因为它们的不规则连接需要复杂的编码器-解码器架构。我们的主要见解是，基于几何图像的表示将不规则网格转换为规则图像网格，从而使基于图像的高效神经处理直接适用。基于这个想法，我们引入了基于神经几何图像的表示，它无需解码器，存储效率高，并且自然适合神经处理。它存储表面的低分辨率几何图像 mipmap，通过一次前向传递即可从中恢复高质量网格。为了构建几何图像，我们利用了最佳传输 (OT)，它解决了平坦区域中的过采样和特征丰富区域中的欠采样问题，并通过几何图像 mipmapping 实现了连续的细节级别 (LoD)。实验结果证明了最先进的存储效率和恢复精度，通过压缩比 (CR)、倒角距离 (CD) 和豪斯多夫距离 (HD) 来衡量。
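The abstract above reports results in terms of compression ratio (CR), Chamfer distance (CD), and Hausdorff distance (HD). The sketch below shows one common way to compute these on small point sets. The exact conventions used in the paper are not stated in the abstract, so this assumes a symmetric Chamfer distance built from mean (unsquared) nearest-neighbor Euclidean distances, and CR as original size divided by compressed size. All function names are illustrative.

```python
import math


def _nearest(p, pts):
    # Euclidean distance from point p to its nearest neighbor in pts.
    return min(math.dist(p, q) for q in pts)


def chamfer_distance(a, b):
    # Assumed convention: sum of the two directed mean nearest-neighbor distances.
    return (sum(_nearest(p, b) for p in a) / len(a)
            + sum(_nearest(q, a) for q in b) / len(b))


def hausdorff_distance(a, b):
    # Symmetric Hausdorff distance: the worst-case nearest-neighbor distance.
    return max(max(_nearest(p, b) for p in a),
               max(_nearest(q, a) for q in b))


def compression_ratio(original_bytes, compressed_bytes):
    # Assumed convention: larger values mean stronger compression.
    return original_bytes / compressed_bytes


# Toy point sets: b contains one outlier that a does not cover.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
cd = chamfer_distance(a, b)
hd = hausdorff_distance(a, b)
print(cd, hd)
```

The single outlier in `b` is averaged away in the Chamfer distance but fully determines the Hausdorff distance, which is why the two metrics are usually reported together.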

Title: Now You See It, Now You Don't - Instant Concept Erasure for Safe Text-to-Image and Video Generation

Authors: Shristi Das Biswas, Arani Roy, Kaushik Roy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18684
Pdf URL: https://arxiv.org/pdf/2511.18684
Copy Paste: [[2511.18684]] Now You See It, Now You Don't - Instant Concept Erasure for Safe Text-to-Image and Video Generation(https://arxiv.org/abs/2511.18684)
Keywords: generation, generative
Abstract: Robust concept removal for text-to-image (T2I) and text-to-video (T2V) models is essential for their safe deployment. Existing methods, however, suffer from costly retraining, inference overhead, or vulnerability to adversarial attacks. Crucially, they rarely model the latent semantic overlap between the target erase concept and surrounding content -- causing collateral damage post-erasure -- and even fewer methods work reliably across both T2I and T2V domains. We introduce Instant Concept Erasure (ICE), a training-free, modality-agnostic, one-shot weight modification approach that achieves precise, persistent unlearning with zero overhead. ICE defines erase and preserve subspaces using anisotropic energy-weighted scaling, then explicitly regularises against their intersection using a unique, closed-form overlap projector. We pose a convex and Lipschitz-bounded Spectral Unlearning Objective, balancing erasure fidelity and intersection preservation, that admits a stable and unique analytical solution. This solution defines a dissociation operator that is translated to the model's text-conditioning layers, making the edit permanent and runtime-free. Across targeted removals of artistic styles, objects, identities, and explicit content, ICE efficiently achieves strong erasure with improved robustness to red-teaming, all while causing only minimal degradation of original generative abilities in both T2I and T2V models.
摘要：文本到图像 (T2I) 和文本到视频 (T2V) 模型的稳健概念删除对于其安全部署至关重要。然而，现有方法面临昂贵的再训练、推理开销或容易受到对抗性攻击的困扰。至关重要的是，他们很少对目标擦除概念与周围内容之间潜在的语义重叠进行建模，从而导致擦除后的附带损害，而且在 T2I 和 T2V 域中可靠工作的方法就更少了。我们引入了即时概念擦除（ICE），这是一种免训练、与模态无关的一次性权重修改方法，可以以零开销实现精确、持久的忘却。 ICE 使用各向异性能量加权缩放定义擦除和保留子空间，然后使用独特的封闭式重叠投影仪明确地规范它们的交集。我们提出了一个凸且 Lipschitz 有界的谱遗忘目标，平衡擦除保真度和交集保留，从而获得稳定且独特的分析解决方案。该解决方案定义了一个解离运算符，该运算符被转换为模型的文本调节层，从而使编辑永久且无需运行时。通过有针对性地删除艺术风格、对象、身份和露骨内容，ICE 有效地实现了强力擦除，并提高了对红队的鲁棒性，同时仅对 T2I 和 T2V 模型的原始生成能力造成最小程度的降低。

Title: CoD: A Diffusion Foundation Model for Image Compression

Authors: Zhaoyang Jia, Zihan Zheng, Naifu Xue, Jiahao Li, Bin Li, Zongyu Guo, Xiaoyi Zhang, Houqiang Li, Yan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18706
Pdf URL: https://arxiv.org/pdf/2511.18706
Copy Paste: [[2511.18706]] CoD: A Diffusion Foundation Model for Image Compression(https://arxiv.org/abs/2511.18706)
Keywords: generation
Abstract: Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address this, we introduce \textbf{CoD}, the first \textbf{Co}mpression-oriented \textbf{D}iffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed for various diffusion-based codecs. It offers several advantages: \textbf{High compression efficiency}, replacing Stable Diffusion with CoD in downstream codecs like DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); \textbf{Low-cost and reproducible training}, 300$\times$ faster training than Stable Diffusion ($\sim$ 20 vs. $\sim$ 6,250 A100 GPU days) on entirely open image-only datasets; \textbf{Providing new insights}, e.g., we find pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs using fewer parameters. We hope CoD lays the foundation for future diffusion codec research. Code will be released.
摘要：现有的扩散编解码器通常建立在文本到图像扩散基础模型（例如稳定扩散）之上。然而，从压缩的角度来看，文本调节并不是最理想的，阻碍了下游扩散编解码器的潜力，特别是在超低比特率下。为了解决这个问题，我们引入了 \textbf{CoD}，这是第一个 \textbf{Co}mpression-oriented \textbf{D}iffusion（面向压缩的扩散）基础模型，从头开始训练，以实现压缩和生成的端到端优化。 CoD不是固定的编解码器，而是为各种基于扩散的编解码器设计的通用基础模型。它具有以下几个优点： \textbf{高压缩效率}，在 DiffC 等下游编解码器中用 CoD 替换稳定扩散，实现了 SOTA 结果，尤其是在超低比特率（例如 0.0039 bpp）下； \textbf{低成本且可重复的训练}，在完全开放的纯图像数据集上，训练速度比稳定扩散快 300$\times$（$\sim$ 20 vs. $\sim$ 6,250 A100 GPU 天）； \textbf{提供新的见解}，例如，我们发现像素空间扩散可以实现具有高感知质量的 VTM 级 PSNR，并且可以使用更少的参数超越基于 GAN 的编解码器。我们希望 CoD 为未来的扩散编解码器研究奠定基础。代码将会发布。
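As a quick sanity check on the figures quoted in the CoD abstract, the arithmetic below relates the approximate GPU-day counts to the stated 300x figure and converts the 0.0039 bpp bitrate into a per-image payload. The 512 x 512 image size is an assumption chosen only for illustration and does not come from the abstract.

```python
# Figures quoted in the CoD abstract (both GPU-day counts are approximate).
cod_gpu_days = 20
sd_gpu_days = 6250
bpp = 0.0039  # ultra-low bitrate example, in bits per pixel

# Assumption for illustration only: a 512 x 512 image.
width, height = 512, 512

cost_ratio = sd_gpu_days / cod_gpu_days
bits_per_image = bpp * width * height
bytes_per_image = bits_per_image / 8

print(f"training-cost ratio: {cost_ratio:.1f}x")
print(f"payload at {bpp} bpp: {bits_per_image:.0f} bits (~{bytes_per_image:.0f} bytes)")
```

The ratio works out to 312.5, consistent with the rounded 300x in the abstract, and under the assumed image size the payload is roughly 1022 bits, or about 128 bytes per image.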

Title: Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

Authors: Ziqi Ni, Yuanzhi Liang, Rui Li, Yi Zhou, Haibing Huang, Chi Zhang, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18719
Pdf URL: https://arxiv.org/pdf/2511.18719
Copy Paste: [[2511.18719]] Seeing What Matters: Visual Preference Policy Optimization for Visual Generation(https://arxiv.org/abs/2511.18719)
Keywords: generation, generative
Abstract: Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
摘要：强化学习 (RL) 已成为训练后视觉生成模型的强大工具，组相对策略优化 (GRPO) 越来越多地用于使生成器与人类偏好保持一致。然而，现有的 GRPO 管道依赖于每个样本的单个标量奖励，将每个图像或视频视为一个整体实体，并忽略了视觉内容丰富的空间和时间结构。这种粗略的监督阻碍了局部伪影的纠正和细粒度感知线索的建模。我们引入了视觉偏好策略优化 (ViPO)，这是一种 GRPO 变体，可将标量反馈提升为结构化的像素级优势。 ViPO 采用感知结构模块，该模块使用预先训练的视觉主干来构建空间和时间感知优势图，将优化压力重新分配到感知重要区域，同时保持标准 GRPO 的稳定性。在图像和视频基准测试中，ViPO 始终优于普通 GRPO，改善了域内与人类偏好奖励的一致性，并增强了域外评估的泛化。该方法与架构无关、轻量级，并且与现有的 GRPO 训练流程完全兼容，为视觉生成提供更具表现力和信息量的学习信号。
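The ViPO abstract contrasts a single scalar reward per sample with structured, pixel-level advantages. The sketch below illustrates the general idea under stated assumptions: the group-relative advantage uses the commonly cited GRPO normalization (reward minus group mean, divided by group standard deviation), and the redistribution step simply rescales a non-negative importance map to mean 1 before multiplying. The functions `group_relative_advantages` and `spread_advantage`, and the toy importance map, are hypothetical stand-ins and not the paper's Perceptual Structuring Module.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each sample's scalar reward within its group.
    # Assumption: population standard deviation; implementations vary.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def spread_advantage(advantage, weights):
    # Redistribute one scalar advantage over a 2D map of importance weights.
    # Weights are rescaled to mean 1 so the map's average equals the scalar.
    # Assumes non-negative weights with a positive mean.
    flat = [w for row in weights for w in row]
    mean_w = sum(flat) / len(flat)
    return [[advantage * w / mean_w for w in row] for row in weights]


rewards = [1.0, 2.0, 3.0]          # one scalar reward per generated sample
adv = group_relative_advantages(rewards)

importance = [[1.0, 3.0],          # toy 2x2 importance map for one sample
              [0.0, 0.0]]
adv_map = spread_advantage(adv[2], importance)
print(adv, adv_map)
```

Rescaling the weights to mean 1 keeps the average pressure on each sample equal to its scalar advantage, which is one simple way to stay close to the standard GRPO update while concentrating it on selected regions.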

Title: LogSyn: A Few-Shot LLM Framework for Structured Insight Extraction from Unstructured General Aviation Maintenance Logs

Authors: Devansh Agarwal, Maitreyi Chatterjee, Biplab Chatterjee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.18727
Pdf URL: https://arxiv.org/pdf/2511.18727
Copy Paste: [[2511.18727]] LogSyn: A Few-Shot LLM Framework for Structured Insight Extraction from Unstructured General Aviation Maintenance Logs(https://arxiv.org/abs/2511.18727)
Keywords: generation
Abstract: Aircraft maintenance logs hold valuable safety data but remain underused due to their unstructured text format. This paper introduces LogSyn, a framework that uses Large Language Models (LLMs) to convert these logs into structured, machine-readable data. Using few-shot in-context learning on 6,169 records, LogSyn performs Controlled Abstraction Generation (CAG) to summarize problem-resolution narratives and classify events within a detailed hierarchical ontology. The framework identifies key failure patterns, offering a scalable method for semantic structuring and actionable insight extraction from maintenance logs. This work provides a practical path to improve maintenance workflows and predictive analytics in aviation and related industries.
摘要：飞机维护日志保存着宝贵的安全数据，但由于其非结构化文本格式而仍未得到充分利用。本文介绍了 LogSyn，这是一个使用大型语言模型 (LLM) 将这些日志转换为结构化的机器可读数据的框架。 LogSyn 通过对 6,169 条记录进行少量上下文学习，执行受控抽象生成 (CAG)，以总结问题解决叙述并在详细的分层本体中对事件进行分类。该框架识别关键故障模式，提供一种可扩展的方法，用于语义结构化和从维护日志中提取可操作的见解。这项工作为改进航空及相关行业的维护工作流程和预测分析提供了一条实用途径。

Title: GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving

Authors: Lin Liu, Caiyan Jia, Guanyi Yu, Ziying Song, JunQiao Li, Feiyang Jia, Peiliang Wu, Xiaoshuai Hao, Yandan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18729
Pdf URL: https://arxiv.org/pdf/2511.18729
Copy Paste: [[2511.18729]] GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving(https://arxiv.org/abs/2511.18729)
Keywords: generation, generative
Abstract: Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose \textit{\textbf{GuideFlow}}, a novel planning framework that leverages Constrained Flow Matching. Concretely, \textit{\textbf{GuideFlow}} explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, \textit{\textbf{GuideFlow}} unifies the training of the flow matching with the Energy-Based Model (EBM) to enhance the model's autonomous optimization capability to robustly satisfy physical constraints. Secondly, \textit{\textbf{GuideFlow}} parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Extensive evaluations on major driving benchmarks (Bench2Drive, NuScenes, NavSim and ADV-NuScenes) validate the effectiveness of \textit{\textbf{GuideFlow}}. Notably, on the NavSim test hard split (Navhard), \textit{\textbf{GuideFlow}} achieved SOTA with an EPDMS score of 43.0. The code will be released.
摘要：驾驶规划是端到端（E2E）自动驾驶的关键组成部分。然而，流行的模仿E2E规划器经常遭受多模态轨迹模式崩溃的困扰，无法产生多样化的轨迹建议。与此同时，生成式端到端规划人员努力将关键的安全和物理约束直接纳入生成过程，因此需要额外的优化阶段来完善其输出。在本文中，我们提出了 \textit{\textbf{GuideFlow}}，一种利用约束流匹配的新颖规划框架。具体来说， \textit{\textbf{GuideFlow}} 明确地模拟了流匹配过程，这本质上减轻了模式崩溃，并允许来自各种调节信号的灵活指导。我们的核心贡献在于直接在流匹配生成过程中强制执行显式约束，而不是依赖于隐式约束编码。至关重要的是，\textit{\textbf{GuideFlow}}将流匹配的训练与基于能量的模型（EBM）统一起来，增强模型的自主优化能力，稳健地满足物理约束。其次，\textit{\textbf{GuideFlow}}在生成过程中将驾驶攻击性参数化为控制信号，从而能够精确操纵轨迹风格。对主要驾驶基准（Bench2Drive、NuScenes、NavSim 和 ADV-NuScenes）的广泛评估验证了 \textit{\textbf{GuideFlow}} 的有效性。值得注意的是，在 NavSim 测试硬分割 (Navhard) 中，\textit{\textbf{GuideFlow}} 实现了 SOTA，EPDMS 分数为 43.0。代码将被发布。

Title: Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

Authors: Keyang Lu, Sifan Zhou, Hongbin Xu, Gang Xu, Zhifei Yang, Yikai Wang, Zhen Xiao, Jieyi Long, Ming Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18734
Pdf URL: https://arxiv.org/pdf/2511.18734
Copy Paste: [[2511.18734]] Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion(https://arxiv.org/abs/2511.18734)
Keywords: generation
Abstract: Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.
摘要：逼真的 3D 城市生成是虚拟现实和数字孪生等广泛应用的基础。然而，大多数现有方法依赖于训练单一扩散模型，这限制了它们生成个性化和无限城市规模场景的能力。在本文中，我们提出了 Yo'City，这是一种新颖的代理框架，通过利用现成的大型模型的推理和组合功能，可以实现用户定制和无限扩展的 3D 城市生成。具体来说，Yo'City 首先通过自上而下的规划策略对城市进行概念化，定义了分层的“城市-地区-网格”结构。全局规划师确定总体布局和潜在的功能区，而局部设计师则通过详细的网格级描述进一步细化每个区域。随后，通过“生成-细化-评估”等距图像合成循环实现网格级 3D 生成，然后生成图像到 3D。为了模拟持续的城市演化，Yo'City 进一步引入了用户交互、关系引导的扩展机制，该机制执行基于场景图的距离和语义感知的布局优化，确保空间上连贯的城市增长。为了全面评估我们的方法，我们构建了一个多样化的基准数据集，并设计了六个多维指标，从语义、几何、纹理和布局的角度评估生成质量。大量实验表明，Yo'City 在所有评估方面始终优于现有最先进的方法。

Title: Thinking Ahead: Foresight Intelligence in MLLMs and World Models

Authors: Zhantao Gong, Liaoyuan Fan, Qing Guo, Xun Xu, Xulei Yang, Shijie Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18735
Pdf URL: https://arxiv.org/pdf/2511.18735
Copy Paste: [[2511.18735]] Thinking Ahead: Foresight Intelligence in MLLMs and World Models(https://arxiv.org/abs/2511.18735)
Keywords: generation
Abstract: In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.
摘要：在这项工作中，我们将预见智能定义为预测和解释未来事件的能力，这种能力对于自动驾驶等应用至关重要，但在很大程度上被现有研究所忽视。为了弥补这一差距，我们引入了 FSU-QA，这是一种新的视觉问答 (VQA) 数据集，专门用于引发和评估预见智能。使用 FSU-QA，我们在面向前瞻的任务下对最先进的视觉语言模型 (VLM) 进行了首次全面研究，揭示了当前的模型仍然难以推理未来的情况。除了作为基准之外，FSU-QA 还可以通过测量生成的预测的语义一致性来评估世界模型，并通过使用此类输出增强 VLM 时的性能增益进行量化。我们的实验进一步证明 FSU-QA 可以有效地增强预见性推理：即使是在 FSU-QA 上微调的小型 VLM 也能大幅超越更大、更先进的模型。总之，这些发现使 FSU-QA 成为开发能够真正预测和理解未来事件的下一代模型的原则基础。

Title: ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion

Authors: Zhenghan Fang, Jian Zheng, Qiaozi Gao, Xiaofeng Gao, Jeremias Sulam
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.18742
Pdf URL: https://arxiv.org/pdf/2511.18742
Copy Paste: [[2511.18742]] ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion(https://arxiv.org/abs/2511.18742)
Keywords: generation, generative
Abstract: Diffusion models have emerged as a dominant paradigm for generative modeling across a wide range of domains, including prompt-conditional generation. The vast majority of samplers, however, rely on forward discretization of the reverse diffusion process and use score functions that are learned from data. Such forward and explicit discretizations can be slow and unstable, requiring a large number of sampling steps to produce good-quality samples. In this work we develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We further leverage recent advances in reinforcement learning and policy optimization to optimize our samplers for task-specific rewards. Additionally, we develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation. Our approach consistently enhances sampling efficiency and human-preference alignment compared to score-based baselines, and achieves results on par with existing state-of-the-art and open-source text-to-image models while requiring lower compute and smaller model size, offering a lightweight yet performant solution for human text-to-image generation.
摘要：扩散模型已成为跨广泛领域（包括提示条件生成）生成建模的主导范例。然而，绝大多数采样器依赖于反向扩散过程的前向离散化，并使用从数据中学习的评分函数。这种前向和显式离散化可能很慢且不稳定，需要大量采样步骤才能产生高质量的样本。在这项工作中，我们开发了一种基于后向离散化的文本到图像 (T2I) 扩散模型，称为 ProxT2I，依赖于学习的条件近端算子而不是评分函数。我们进一步利用强化学习和策略优化方面的最新进展来优化我们的采样器以获得特定于任务的奖励。此外，我们还开发了一个新的大规模开源数据集，其中包含 1500 万张带有细粒度字幕的高质量人类图像，称为 LAION-Face-T2I-15M，用于训练和评估。与基于分数的基线相比，我们的方法持续提高了采样效率和人类偏好对齐，并实现了与现有最先进的开源文本到图像模型相当的结果，同时需要更低的计算量和更小的模型大小，为人类文本到图像生成提供了轻量级且高性能的解决方案。
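The contrast the ProxT2I abstract draws between forward (explicit) and backward (implicit) discretization can be seen on a toy problem. The sketch below is not the paper's sampler: it assumes a hypothetical one-dimensional quadratic energy f(x) = 0.5 * curvature * x^2, for which the proximal operator has the closed form x / (1 + step * curvature). It only illustrates why a backward step stays stable at step sizes where a forward step diverges.

```python
def forward_euler_step(x, step, curvature):
    # Explicit step: x - step * f'(x), with f(x) = 0.5 * curvature * x**2.
    return x - step * curvature * x


def proximal_step(x, step, curvature):
    # Implicit (backward) step: argmin_z f(z) + (z - x)**2 / (2 * step).
    # For the quadratic f above this has the closed form below.
    return x / (1.0 + step * curvature)


def run(step_fn, x0, step, curvature, n_steps):
    # Apply the same update rule n_steps times from the starting point x0.
    x = x0
    for _ in range(n_steps):
        x = step_fn(x, step, curvature)
    return x


# Toy setting (assumed values): a large step relative to the curvature.
explicit = run(forward_euler_step, 1.0, 0.5, 10.0, 5)
implicit = run(proximal_step, 1.0, 0.5, 10.0, 5)
print(explicit, implicit)
```

With these assumed values the explicit iterate grows in magnitude at every step, while the proximal iterate contracts toward the minimizer at zero. In a learned setting the closed form is unavailable, which is where a learned proximal operator would take its place.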

Title: Any4D: Open-Prompt 4D Generation from Natural Language and Images

Authors: Hao Li, Qiao Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18746
Pdf URL: https://arxiv.org/pdf/2511.18746
Copy Paste: [[2511.18746]] Any4D: Open-Prompt 4D Generation from Natural Language and Images(https://arxiv.org/abs/2511.18746)
Keywords: generation, generative
Abstract: While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a \textit{"GPT moment"} in the embodied domain. There is a naive observation: \textit{the diversity of embodied data far exceeds the relatively small space of possible primitive motions}. Based on this insight, we propose \textbf{Primitive Embodied World Models} (PEWM), which restrict video generation to fixed, shorter horizons. Our approach \textit{1) enables} fine-grained alignment between linguistic concepts and visual representations of robotic actions, \textit{2) reduces} learning complexity, \textit{3) improves} data efficiency in embodied data collection, and \textit{4) decreases} inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
摘要：虽然基于视频生成的具体世界模型受到越来越多的关注，但它们对大规模具体交互数据的依赖仍然是一个关键瓶颈。具身数据的稀缺性、采集难度和高维度从根本上限制了语言和动作之间的对齐粒度，并加剧了长视域视频生成的挑战——阻碍生成模型在具身领域实现\textit{“GPT矩”}。有一个天真的观察：\textit{体现数据的多样性远远超过了可能的原始运动的相对较小的空间}。基于这一见解，我们提出 \textbf{原始体现世界模型} (PEWM)，它将视频生成限制在固定的较短范围内，我们的方法 \textit{1) 能够在语言概念和机器人动作的视觉表示之间进行细粒度对齐，\textit{2) 降低}学习复杂性，\textit{3) 提高}体现数据收集中的数据效率，\textit{4) 减少}推理延迟。通过配备模块化视觉语言模型 (VLM) 规划器和启动目标热图指导机制 (SGG)，PEWM 进一步实现灵活的闭环控制，并支持在扩展的复杂任务上对原始级别策略进行组合泛化。我们的框架利用视频模型中的时空视觉先验和 VLM 的语义意识来弥合细粒度物理交互和高级推理之间的差距，为可扩展、可解释和通用的体现智能铺平道路。

Title: NI-Tex: Non-isometric Image-based Garment Texture Generation

Authors: Hui Shan, Ming Li, Haitao Yang, Kai Zheng, Sizhe Zheng, Yanwei Fu, Xiangru Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18765
Pdf URL: https://arxiv.org/pdf/2511.18765
Copy Paste: [[2511.18765]] NI-Tex: Non-isometric Image-based Garment Texture Generation(https://arxiv.org/abs/2511.18765)
Keywords: generation, generative
Abstract: Existing industrial 3D garment meshes already cover most real-world clothing geometries, yet their texture diversity remains limited. To acquire more realistic textures, generative methods are often used to extract Physically-based Rendering (PBR) textures and materials from large collections of wild images and project them back onto garment meshes. However, most image-conditioned texture generation approaches require strict topological consistency between the input image and the input 3D mesh, or rely on accurate mesh deformation to match the image poses, which significantly constrains texture generation quality and flexibility. To address the challenging problem of non-isometric image-based garment texture generation, we construct 3D Garment Videos, a physically simulated, garment-centric dataset that provides consistent geometry and material supervision across diverse deformations, enabling robust cross-pose texture learning. We further employ Nano Banana for high-quality non-isometric image editing, achieving reliable cross-topology texture generation between non-isometric image-geometry pairs. Finally, we propose an iterative baking method via uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless, production-ready PBR textures. Through extensive experiments, we demonstrate that our feedforward dual-branch architecture generates versatile and spatially aligned PBR materials suitable for industry-level 3D garment design.
摘要：现有的工业 3D 服装网格已经涵盖了大多数现实世界的服装几何形状，但其纹理多样性仍然有限。为了获取更真实的纹理，通常使用生成方法从大量野生图像中提取基于物理的渲染 (PBR) 纹理和材质，并将它们投影回服装网格上。然而，大多数图像条件纹理生成方法需要输入图像和输入3D网格之间严格的拓扑一致性，或者依赖于精确的网格变形来匹配图像姿态，这极大地限制了纹理生成的质量和灵活性。为了解决基于非等距图像的服装纹理生成的挑战性问题，我们构建了 3D 服装视频，这是一个物理模拟的、以服装为中心的数据集，可在不同的变形中提供一致的几何形状和材料监督，从而实现强大的交叉姿势纹理学习。我们进一步采用 Nano Banana 进行高质量的非等距图像编辑，在非等距图像几何对之间实现可靠的跨拓扑纹理生成。最后，我们提出了一种迭代烘焙方法，通过不确定性引导的视图选择和重新加权将多视图预测融合到无缝、可用于生产的 PBR 纹理中。通过广泛的实验，我们证明了我们的前馈双分支架构可生成适合工业级 3D 服装设计的多功能且空间对齐的 PBR 材料。

Title: ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

Authors: Ruize Ma, Minghong Cai, Yilei Jiang, Jiaming Han, Yi Feng, Yingshui Tan, Xiaoyong Zhu, Bo Zhang, Bo Zheng, Xiangyu Yue
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18780
Pdf URL: https://arxiv.org/pdf/2511.18780
Copy Paste: [[2511.18780]] ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection(https://arxiv.org/abs/2511.18780)
Keywords: generation, generative
Abstract: Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: first, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt's multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation.
摘要：视频生成模型的最新进展使得能够根据结合文本和图像的多模式提示创建高质量视频。虽然这些系统提供了增强的可控性，但它们也引入了新的安全风险，因为有害内容可能从个体模式或其交互中出现。现有的安全方法通常是纯文本的，需要预先了解风险类别，或者作为生成后的审计员来操作，努力主动减轻这种组合的、多模式的风险。为了应对这一挑战，我们推出了 ConceptGuard，这是一个统一的防护框架，用于主动检测和减轻多模态视频生成中的不安全语义。 ConceptGuard 分两个阶段运行：首先，对比检测模块通过将融合的图像文本输入投影到结构化概念空间来识别潜在的安全风险；其次，语义抑制机制通过干预提示的多模态调节来引导生成过程远离不安全的概念。为了支持该框架的开发和严格评估，我们引入了两个新颖的基准：ConceptRisk（用于多模式风险训练的大型数据集）和 T2VSafetyBench-TI2V（第一个改编自 T2VSafetyBench 的用于文本和图像到视频 (TI2V) 安全设置的基准）。对这两个基准的综合实验表明，ConceptGuard 始终优于现有基准，在风险检测和安全视频生成方面均取得了最先进的结果。

Title: STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution

Authors: Junyang Chen, Jiangxin Dong, Long Sun, Yixin Yang, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18786
Pdf URL: https://arxiv.org/pdf/2511.18786
Copy Paste: [[2511.18786]] STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution(https://arxiv.org/abs/2511.18786)
Keywords: super-resolution, generation
Abstract: We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model, aiming to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motions. The main challenges lie in maintaining temporal stability during reconstruction and preserving structural fidelity during generation. To address these challenges, we first develop a motion-aware VAE reconstruction method that performs segment-wise reconstruction, with each segment clip exhibiting uniform motion characteristics, thereby effectively handling videos with complex camera motions. Moreover, we observe that the first-frame latent extracted by the VAE encoder in each clip, termed the anchor-frame latent, remains unaffected by temporal compression and retains richer spatial structural information than subsequent frame latents. We further develop an anchor-frame guidance approach that leverages structural information from anchor frames to constrain the generation process and improve structural fidelity of video features. Coupling these two designs enables the video diffusion model to achieve high-quality video super-resolution. Extensive experiments show that STCDiT outperforms state-of-the-art methods in terms of structural fidelity and temporal consistency.
摘要：我们提出了 STCDiT，这是一种基于预先训练的视频扩散模型构建的视频超分辨率框架，旨在从降级的输入中恢复结构忠实且时间稳定的视频，即使在复杂的相机运动下也是如此。主要挑战在于在重建过程中保持时间稳定性以及在生成过程中保持结构保真度。为了解决这些挑战，我们首先开发了一种运动感知 VAE 重建方法，该方法执行分段重建，每个片段片段都表现出均匀的运动特征，从而有效地处理具有复杂相机运动的视频。此外，我们观察到每个剪辑中 VAE 编码器提取的第一帧潜在信息（称为锚帧潜在信息）不受时间压缩的影响，并且保留了比后续帧潜在信息更丰富的空间结构信息。我们进一步开发了一种锚帧引导方法，该方法利用锚帧的结构信息来约束生成过程并提高视频特征的结构保真度。这两种设计的结合使得视频扩散模型能够实现高质量的视频超分辨率。大量实验表明，STCDiT 在结构保真度和时间一致性方面优于最先进的方法。

Title: Doubly Wild Refitting: Model-Free Evaluation of High Dimensional Black-Box Predictions under Convex Losses

Authors: Haichen Hu, David Simchi-Levi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.18789
Pdf URL: https://arxiv.org/pdf/2511.18789
Copy Paste: [[2511.18789]] Doubly Wild Refitting: Model-Free Evaluation of High Dimensional Black-Box Predictions under Convex Losses(https://arxiv.org/abs/2511.18789)
Keywords: generative
Abstract: We study the problem of excess risk evaluation for empirical risk minimization (ERM) under general convex loss functions. Our contribution is an efficient refitting procedure that computes the excess risk and provides high-probability upper bounds under the fixed-design setting. Assuming only black-box access to the training algorithm and a single dataset, we begin by generating two sets of artificially modified pseudo-outcomes termed wild responses, created by stochastically perturbing the gradient vectors with carefully chosen scaling. Using these two pseudo-labeled datasets, we then refit the black-box procedure twice to obtain two corresponding wild predictors. Finally, leveraging the original predictor, the two wild predictors, and the constructed wild responses, we derive an efficient excess risk upper bound. A key feature of our analysis is that it requires no prior knowledge of the complexity of the underlying function class. As a result, the method is essentially model-free and holds significant promise for theoretically evaluating modern opaque machine learning systems, such as deep neural networks and generative models, where traditional capacity-based learning theory becomes infeasible due to the extreme complexity of the hypothesis class.
摘要：我们研究一般凸损失函数下经验风险最小化（ERM）的超额风险评估问题。我们的贡献是一种有效的改装程序，可以计算超额风险并在固定设计设置下提供高概率上限。假设仅黑盒访问训练算法和单个数据集，我们首先生成两组人工修改的伪结果，称为狂野响应，通过精心选择的缩放比例随机扰动梯度向量来创建。使用这两个伪标记数据集，我们重新拟合黑盒程序两次以获得两个相应的野生预测变量。最后，利用原始预测变量、两个狂野预测变量和构建的狂野响应，我们得出有效的超额风险上限。我们分析的一个关键特征是它不需要事先了解底层函数类的复杂性。因此，该方法本质上是无模型的，并且对于从理论上评估现代不透明机器学习系统（例如深度神经网络和生成模型）具有重大前景，在这些系统中，由于假设类的极端复杂性，传统的基于能力的学习理论变得不可行。

Title: PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion

Authors: Yichen Yang, Hong Li, Haodong Zhu, Linin Yang, Guojun Lei, Sheng Xu, Baochang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18801
Pdf URL: https://arxiv.org/pdf/2511.18801
Copy Paste: [[2511.18801]] PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion(https://arxiv.org/abs/2511.18801)
Keywords: generation
Abstract: Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real-world applications.
摘要：现有的用于生成艺术家设计的网格的自回归 (AR) 方法很难平衡全局结构一致性与高保真局部细节，并且容易受到误差累积的影响。为了解决这个问题，我们提出了 PartDiffuser，一种用于点云到网格生成的新型半自回归扩散框架。该方法首先对网格进行语义分割，然后以“部分方式”进行操作：它利用部分之间的自回归来确保全局拓扑，同时利用每个语义部分内的并行离散扩散过程来精确重建高频几何特征。 PartDiffuser基于DiT架构，引入了part-aware cross-attention机制，使用点云作为分层几何条件来动态控制生成过程，从而有效解耦全局和局部生成任务。实验表明，该方法在生成具有丰富细节的 3D 网格方面显着优于最先进的 (SOTA) 模型，展现出适合实际应用的卓越细节表示。

Title: Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring

Authors: Siyuan Wei, Chunjie Wang, Xiao Liu, Xiaosheng Yan, Zhishan Zhou, Rui Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18817
Pdf URL: https://arxiv.org/pdf/2511.18817
Copy Paste: [[2511.18817]] Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring(https://arxiv.org/abs/2511.18817)
Keywords: generation
Abstract: 3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection harvesting object-, frame-, and scene-level captions, (2) scene graph construction with relation correction to capture proximal object relations, (3) discriminative object referring that generates exclusive and compact descriptions, and (4) multi-task data generation synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in source datasets and produces the final Disc3D dataset, over 2 million samples in 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly available.
摘要：3D 多模态大型语言模型 (MLLM) 仍然落后于 2D 同行，很大程度上是因为大规模、高质量的 3D 场景对话数据集仍然稀缺。先前的努力依赖于昂贵的人工注释，并留下了两个未解决的关键模糊性：视点模糊性，其中空间语言假定未知的相机姿势，以及对象引用模糊性，其中非排他性描述模糊了目标和干扰物之间的界限。因此，我们提出了一个完全自动化的管道，可以将原始 3D 扫描转换为明确的高质量对话数据，而成本只是以前的一小部分。通过将基于规则的约束与 2D MLLM 和 LLM 相结合，该管道无需人工干预即可实现可控、可扩展的生成。该管道包括四个阶段：（1）元注释收集收集对象、帧和场景级标题，（2）通过关系校正构建场景图以捕获邻近对象关系，（3）生成唯一且紧凑的描述的判别性对象引用，以及（4）合成不同对话的多任务数据生成。我们的管道系统地减轻了源数据集中的固有缺陷，并生成最终的 Disc3D 数据集，在 25K 混合 3D 场景中包含超过 200 万个样本，涵盖场景、视图和对象字幕、视觉基础和五个以对象为中心的 QA 任务。大量实验表明，使用 Disc3D 进行训练可以在公共基准测试和我们的多方面 Disc3D-QA 任务上产生一致、显着的改进。代码、数据和模型将公开。

Title: DiP: Taming Diffusion Models in Pixel Space

Authors: Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18822
Pdf URL: https://arxiv.org/pdf/2511.18822
Copy Paste: [[2511.18822]] DiP: Taming Diffusion Models in Pixel Space(https://arxiv.org/abs/2511.18822)
Keywords: generation
Abstract: Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to 10$\times$ faster inference than previous methods while increasing the total number of parameters by only 0.3%, and reaches a 1.90 FID score on ImageNet 256$\times$256.
摘要：扩散模型面临着生成质量和计算效率之间的基本权衡。潜在扩散模型 (LDM) 提供了一种有效的解决方案，但存在潜在的信息丢失和非端到端训练的问题。相比之下，现有的像素空间模型绕过了 VAE，但在计算上却无法实现高分辨率合成。为了解决这个困境，我们提出了 DiP，一种高效的像素空间扩散框架。 DiP 将生成解耦为全局和局部阶段：扩散变压器 (DiT) 主干在大型补丁上运行，以实现高效的全局结构构建，而联合训练的轻量级补丁细节头则利用上下文特征来恢复细粒度的局部细节。这种协同设计在不依赖 VAE 的情况下实现了与 LDM 相当的计算效率。 DiP 的推理速度比之前的方法快了 10$\times$，同时参数总数仅增加了 0.3%，并且在 ImageNet 256$\times$256 上获得了 1.90 FID 分数。

Title: Q-Save: Towards Scoring and Attribution for Generated Video Evaluation

Authors: Xiele Wu, Zicheng Zhang, Mingtao Chen, Yixian Liu, Yiming Liu, Shushi Wang, Zhichao Hu, Yuhong Liu, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18825
Pdf URL: https://arxiv.org/pdf/2511.18825
Copy Paste: [[2511.18825]] Q-Save: Towards Scoring and Attribution for Generated Video Evaluation(https://arxiv.org/abs/2511.18825)
Keywords: generation, generative, quality assessment
Abstract: We present Q-Save, a new benchmark dataset and model for holistic and explainable evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels along three core dimensions: visual quality, dynamic quality, and text-video alignment. These multi-aspect annotations enable both accurate quality assessment and interpretable reasoning behind the scores. To leverage this data, we propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation. The model adopts the SlowFast framework to distinguish between fast frames and slow frames: slow frames are processed at high resolution while fast frames use low resolution, balancing evaluation accuracy and computational efficiency. For training, we use data formatted in Chain-of-Thought (COT) style and employ a multi-stage strategy: we first conduct Supervised Fine-Tuning (SFT), then further enhance the model with Grouped Relative Policy Optimization (GRPO), and finally perform SFT again to improve model stability. Experimental results demonstrate that our model achieves state-of-the-art performance in video quality prediction while also providing human-aligned, interpretable justifications. Our dataset and model establish a strong foundation for explainable evaluation in generative video research, contributing to the development of multimodal generation and trustworthy AI. Code and dataset will be released upon publication.
摘要：我们推出了 Q-Save，这是一个新的基准数据集和模型，用于对人工智能生成的视频 (AIGV) 质量进行全面且可解释的评估。该数据集包含近 10000 个视频，每个视频都沿三个核心维度标注了标量平均意见得分 (MOS) 和细粒度归因标签：视觉质量、动态质量和文本视频对齐。这些多方面注释可以实现准确的质量评估和分数背后的可解释推理。为了利用这些数据，我们提出了一个统一的评估模型，联合执行质量评分和基于归因的解释。该模型采用SlowFast框架来区分快帧和慢帧——慢帧采用高分辨率处理，快帧采用低分辨率处理，平衡评估精度和计算效率。在训练中，我们使用思想链（COT）风格的数据并采用多阶段策略：首先进行监督微调（SFT），然后使用分组相对策略优化（GRPO）进一步增强模型，最后再次执行SFT以提高模型稳定性。实验结果表明，我们的模型在视频质量预测方面实现了最先进的性能，同时还提供了与人类一致的、可解释的理由。我们的数据集和模型为生成视频研究中的可解释评估奠定了坚实的基础，有助于多模式生成和值得信赖的人工智能的发展。代码和数据集将在发布后发布。
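The GRPO stage in the Q-Save training recipe depends on comparing several sampled responses to the same prompt. As commonly described, GRPO scores each response against the mean and spread of its own group instead of using a learned value function. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the `eps` guard are assumptions made for illustration, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    # rewards: scalar rewards for responses sampled from the same prompt.
    # Each reward is centered on the group mean and scaled by the group
    # standard deviation (population form, an assumption of this sketch).
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three sampled answers to one prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Responses that beat their group average get a positive advantage and the rest a negative one. A group with identical rewards yields all zeros and therefore no learning signal.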

Title: FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories

Authors: Lei Ke, Hubery Yin, Gongye Liu, Zhengyao Lv, Jingcai Guo, Chen Li, Wenhan Luo, Yujiu Yang, Jing Lyu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18834
Pdf URL: https://arxiv.org/pdf/2511.18834
Copy Paste: [[2511.18834]] FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories(https://arxiv.org/abs/2511.18834)
Keywords: generation
Abstract: With the success of flow matching in visual generation, sampling efficiency remains a critical bottleneck for its practical application. Among acceleration methods for flow models, ReFlow has been somewhat overlooked although it is theoretically consistent with flow matching. This is primarily due to its suboptimal performance in practical scenarios compared to consistency distillation and score distillation. In this work, we investigate this issue within the ReFlow framework and propose FlowSteer, a method that unlocks the potential of ReFlow-based distillation by guiding the student along the teacher's authentic generation trajectories. We first identify that Piecewised ReFlow's performance is hampered by a critical distribution mismatch during training and propose Online Trajectory Alignment (OTA) to resolve it. Then, we introduce an adversarial distillation objective applied directly on the ODE trajectory, improving the student's adherence to the teacher's generation trajectory. Furthermore, we find and fix a previously undiscovered flaw in the widely-used FlowMatchEulerDiscreteScheduler that largely degrades few-step inference quality. Our experimental results on SD3 demonstrate our method's efficacy.
摘要：随着流匹配在视觉生成中的成功，采样效率仍然是其实际应用的关键瓶颈。在流模型的加速方法中，ReFlow尽管与流匹配具有理论上的一致性，但在某种程度上却被忽视了。这主要是由于与一致性蒸馏和分数蒸馏相比，其在实际场景中的性能欠佳。在这项工作中，我们在 ReFlow 框架内研究了这个问题，并提出了 FlowSteer，这是一种通过引导学生沿着老师真实的生成轨迹来释放基于 ReFlow 的蒸馏潜力的方法。我们首先发现分段重流的性能受到训练期间关键分布不匹配的影响，并提出在线轨迹对齐（OTA）来解决它。然后，我们引入直接应用于 ODE 轨迹的对抗性蒸馏目标，提高学生对教师生成轨迹的依从性。此外，我们发现并修复了广泛使用的 FlowMatchEulerDiscreteScheduler 中以前未发现的缺陷，该缺陷很大程度上降低了少步推理质量。我们在 SD3 上的实验结果证明了我们方法的有效性。

Title: FVAR: Visual Autoregressive Modeling via Next Focus Prediction

Authors: Xiaofan Li, Chenming Wu, Yanpeng Sun, Jiaming Zhou, Delin Qu, Yansong Qu, Weihao Bo, Haibao Yu, Dingkang Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18838
Pdf URL: https://arxiv.org/pdf/2511.18838
Copy Paste: [[2511.18838]] FVAR: Visual Autoregressive Modeling via Next Focus Prediction(https://arxiv.org/abs/2511.18838)
Keywords: generation
Abstract: Visual autoregressive models achieve remarkable generation quality through next-scale predictions across multi-scale token pyramids. However, the conventional method uses uniform scale downsampling to build these pyramids, leading to aliasing artifacts that compromise fine details and introduce unwanted jaggies and moiré patterns. To tackle this issue, we present \textbf{FVAR}, which reframes the paradigm from \emph{next-scale prediction} to \emph{next-focus prediction}, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: \textbf{1) Next-Focus Prediction Paradigm} that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; \textbf{2) Progressive Refocusing Pyramid Construction} that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and \textbf{3) High-Frequency Residual Learning} that employs a specialized residual teacher network to effectively incorporate alias information during training while maintaining deployment simplicity. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, distilling this knowledge to a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine detail preservation, and enhances text readability, achieving superior performance with perfect compatibility to existing VAR frameworks.
摘要：视觉自回归模型通过跨多尺度令牌金字塔的下一尺度预测实现了卓越的生成质量。然而，传统方法使用统一比例下采样来构建这些金字塔，从而导致混叠伪影，从而损害精细细节并引入不需要的锯齿和莫尔图案。为了解决这个问题，我们提出了 \textbf{FVAR}，它将范式从 \emph{下一个尺度预测} 重新构建为 \emph{下一个焦点预测}，模仿相机对焦从模糊到清晰的自然过程。我们的方法引入了三个关键创新： \textbf{1) Next-Focus Prediction Paradigm} 通过逐步减少模糊而不是简单地下采样来转换多尺度自回归； \textbf{2) 渐进式重新聚焦金字塔构造}，使用物理一致的散焦内核来构建干净、无混叠的多尺度表示；和 \textbf{3) 高频残差学习}，它采用专门的残差教师网络在训练期间有效地合并别名信息，同时保持部署简单性。具体来说，我们使用半径递减的散焦点扩散函数 (PSF) 内核构建光学低通视图，从而创建平滑的模糊到清晰的过渡，从源头消除锯齿。为了进一步增强细节生成，我们引入了高频残差教师，它可以从干净的结构和别名残差中学习，并将这些知识提炼到普通的 VAR 部署网络中以实现无缝推理。 ImageNet 上的大量实验表明，FVAR 大大减少了混叠伪影，改善了精细细节保留，并增强了文本可读性，实现了卓越的性能，并与现有 VAR 框架完美兼容。
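
Illustration (not from the paper): the blur-to-clarity pyramid described above can be sketched with a normalized disk ("pillbox") kernel standing in for the defocus PSF. The function names `disk_kernel`, `blur`, and `focus_pyramid`, the radius schedule, and the edge padding are all assumptions made for this sketch; FVAR's actual kernels and tokenization are not specified in the abstract.

```python
import numpy as np

def disk_kernel(radius: float) -> np.ndarray:
    """Normalized disk kernel, a simple stand-in for a defocus PSF (assumption)."""
    r = int(np.ceil(radius))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x ** 2 + y ** 2 <= radius ** 2).astype(float)
    return k / k.sum()

def blur(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Same-size 2D filtering with edge padding (the kernel is symmetric)."""
    r = kernel.shape[0] // 2
    padded = np.pad(img, r, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    h, w = img.shape
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def focus_pyramid(img: np.ndarray, radii=(4.0, 2.0, 1.0, 0.0)):
    """Views ordered from most blurred to sharp; radius 0 keeps the image as is."""
    return [img.astype(float) if r == 0 else blur(img, disk_kernel(r)) for r in radii]
```

Every level keeps the full resolution here, so the only thing that changes between levels is the blur radius, which is the point of the "next-focus" framing.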

Title: Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling

Authors: Xiao Cui, Yulei Qin, Xinyue Li, Wengang Zhou, Hongsheng Li, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18858
Pdf URL: https://arxiv.org/pdf/2511.18858
Copy Paste: [[2511.18858]] Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling(https://arxiv.org/abs/2511.18858)
Keywords: generation
Abstract: Dataset distillation creates a small distilled set that enables efficient training by capturing key information from the full dataset. While existing dataset distillation methods perform well on balanced datasets, they struggle under long-tailed distributions, where imbalanced class frequencies induce biased model representations and corrupt statistical estimates such as Batch Normalization (BN) statistics. In this paper, we rethink long-tailed dataset distillation by revisiting the limitations of trajectory-based methods, and instead adopt the statistical alignment perspective to jointly mitigate model bias and restore fair supervision. To this end, we introduce three dedicated components that enable unbiased recovery of distilled images and soft relabeling: (1) enhancing expert models (an observer model for recovery and a teacher model for relabeling) to enable reliable statistics estimation and soft-label generation; (2) recalibrating BN statistics via a full forward pass with dynamically adjusted momentum to reduce representation skew; (3) initializing synthetic images by incrementally selecting high-confidence and diverse augmentations via a multi-round mechanism that promotes coverage and diversity. Extensive experiments on four long-tailed benchmarks show consistent improvements over state-of-the-art methods across varying degrees of class imbalance. Notably, our approach improves top-1 accuracy by 15.6% on CIFAR-100-LT and 11.8% on Tiny-ImageNet-LT under IPC=10 and IF=10.
摘要：数据集蒸馏创建一个小型蒸馏集，通过从完整数据集中捕获关键信息来实现高效训练。虽然现有的数据集蒸馏方法在平衡数据集上表现良好，但它们在长尾分布下表现不佳，其中不平衡的类频率会导致有偏差的模型表示和损坏的统计估计，例如批量归一化（BN）统计数据。在本文中，我们通过重新审视基于轨迹的方法的局限性来重新思考长尾数据集蒸馏，并采用统计对齐的视角来共同减轻模型偏差并恢复公平监督。为此，我们引入了三个专用组件，可以实现蒸馏图像的无偏恢复和软重新标记：（1）增强专家模型（用于恢复的观察者模型和用于重新标记的教师模型）以实现可靠的统计估计和软标签生成； (2) 通过动态调整动量的完整前向传递重新校准 BN 统计数据，以减少表示偏差；（3）通过促进覆盖率和多样性的多轮机制逐步选择高置信度和多样化的增强来初始化合成图像。对四个长尾基准的广泛实验表明，在不同程度的该 http URL 上，我们的方法比最先进的方法有了一致的改进，在 IPC=10 和 IF=10 下，我们的方法在 CIFAR-100-LT 上将 top-1 准确率提高了 15.6%，在 Tiny-ImageNet-LT 上提高了 11.8%。
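
Illustration (not from the paper): component (2) above re-estimates BN statistics in one forward pass with a momentum that changes per batch. A minimal sketch, assuming the schedule `momentum = 1 / (k + 1)` at batch `k`, which turns the usual exponential moving average into a plain average over batches. The function name `recalibrate_bn_stats` and this particular schedule are assumptions; the abstract does not state the authors' rule.

```python
import numpy as np

def recalibrate_bn_stats(batches):
    """Re-estimate per-channel mean and variance over one pass of the data.

    Assumed schedule: momentum = 1 / (k + 1) at batch k, so the running
    statistics equal the plain average of the per-batch statistics.
    """
    running_mean, running_var = 0.0, 0.0
    for k, x in enumerate(batches):            # x has shape (batch, channels)
        momentum = 1.0 / (k + 1)
        running_mean = (1 - momentum) * running_mean + momentum * x.mean(axis=0)
        running_var = (1 - momentum) * running_var + momentum * x.var(axis=0)
    return running_mean, running_var
```

With a fixed momentum, the last few batches would dominate the estimate. The decaying schedule weights every batch equally, which is one simple way to keep head-class batches from skewing the statistics late in the pass.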

Title: KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit

Authors: Dezhi Ran, Shuxiao Xie, Mingfang Ji, Ziyue Hua, Mengzhou Wu, Yuan Cao, Yuzhe Guo, Yu Hao, Linyi Li, Yitao Hu, Tao Xie
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18868
Pdf URL: https://arxiv.org/pdf/2511.18868
Copy Paste: [[2511.18868]] KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit(https://arxiv.org/abs/2511.18868)
Keywords: generation
Abstract: High quality kernels are critical for reducing training and inference costs of Large Language Models (LLMs), yet they traditionally require significant expertise in hardware architecture and software optimization. While recent advances in LLM-based code generation show promise for complex optimization, existing methods struggle with the vast optimization space due to insufficient hardware domain knowledge, failing to effectively balance exploration and exploitation. We present KernelBand, a novel framework that formulates kernel optimization as a hierarchical multi-armed bandit problem, enabling LLM agents to strategically navigate the optimization space by treating kernel selection and optimization strategy application as sequential decision-making processes. Our approach leverages hardware profiling information to identify promising optimization strategies and employs runtime behavior clustering to reduce exploration overhead across kernel candidates. Extensive experiments on TritonBench demonstrate that KernelBand significantly outperforms state-of-the-art methods, achieving superior performance with fewer tokens while exhibiting consistent improvement without saturation as computational resources increase.
摘要：高质量内核对于降低大型语言模型 (LLM) 的训练和推理成本至关重要，但传统上它们需要硬件架构和软件优化方面的丰富专业知识。虽然基于 LLM 的代码生成的最新进展显示出复杂优化的前景，但由于硬件领域知识不足，现有方法在巨大的优化空间中苦苦挣扎，无法有效平衡探索和利用。我们提出了 KernelBand，这是一种新颖的框架，它将内核优化表述为分层多臂老虎机问题，使 LLM 代理能够通过将内核选择和优化策略应用视为顺序决策过程来战略性地导航优化空间。我们的方法利用硬件分析信息来识别有前途的优化策略，并采用运行时行为集群来减少跨内核候选者的探索开销。 TritonBench 上的大量实验表明，KernelBand 的性能显着优于最先进的方法，用更少的令牌实现了卓越的性能，同时随着计算资源的增加而表现出持续的改进，而不会出现饱和。
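
Illustration (not from the paper): the exploration and exploitation trade-off mentioned above is the classic multi-armed bandit setting. The sketch below is plain, flat UCB1 rather than KernelBand's hierarchical, hardware-aware variant. The function name `ucb1` and the reading of arms as optimization strategies with rewards in [0, 1] are assumptions for illustration only.

```python
import math
import random

def ucb1(reward_fns, rounds, c=2.0, seed=0):
    """Flat UCB1: each arm is a callable rng -> reward in [0, 1].

    Returns how many times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if 0 in counts:
            arm = counts.index(0)              # play every arm once first
        else:
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(c * math.log(t) / counts[a]),
            )
        totals[arm] += reward_fns[arm](rng)
        counts[arm] += 1
    return counts
```

In a kernel-tuning reading, an arm could be a strategy such as tiling or vectorization, and the reward a normalized speedup. The square-root bonus shrinks as an arm is tried more often, so poorly performing strategies are sampled less over time.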

Title: HunyuanVideo 1.5 Technical Report

Authors: Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, Linus, Patrol, Peizhen Zhang, Peng Chen, Penghao Zhao, Qi Tian, Songtao Liu, Weijie Kong, Weiyan Wang, Xiao He, Xin Li, Xinchi Deng, Xuefei Zhe, Yang Li, Yanxin Long, Yuanbo Peng, Yue Wu, Yuhong Liu, Zhenyu Wang, Zuozhuo Dai, Bo Peng, Coopers Li, Gu Gong, Guojian Xiao, Jiahe Tian, Jiaxin Lin, Jie Liu, Jihong Zhang, Jiesong Lian, Kaihang Pan, Lei Wang, Lin Niu, Mingtao Chen, Mingyang Chen, Mingzhe Zheng, Miles Yang, Qiangqiang Hu, Qi Yang, Qiuyong Xiao, Runzhou Wu, Ryan Xu, Rui Yuan, Shanshan Sang, Shisheng Huang, Siruis Gong, Shuo Huang, Weiting Guo, Xiang Yuan, Xiaojia Chen, Xiawei Hu, Wenzhi Sun, Xiele Wu, Xianshun Ren, Xiaoyan Yuan, Xiaoyue Mi, Yepeng Zhang, Yifu Sun, Yiting Lu, Yitong Li, You Huang, Yu Tang, Yixuan Li, Yuhang Deng, Yuan Zhou, Zhichao Hu, Zhiguang Liu, Zhihe Yang, Zilin Yang, Zhenzhi Lu, Zixiang Zhou, Zhao Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18870
Pdf URL: https://arxiv.org/pdf/2511.18870
Copy Paste: [[2511.18870]] HunyuanVideo 1.5 Technical Report(https://arxiv.org/abs/2511.18870)
Keywords: super-resolution, generation
Abstract: We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at this https URL.
摘要：我们推出了HunyuanVideo 1.5，这是一种轻量级但功能强大的开源视频生成模型，仅用 83 亿个参数即可实现最先进的视觉质量和运动连贯性，从而能够在消费级 GPU 上进行高效推理。这一成就建立在几个关键组成部分的基础上，包括细致的数据管理、具有选择性和滑动平铺注意力 (SSTA) 功能的先进 DiT 架构、通过字形感知文本编码增强双语理解、渐进式预训练和后训练以及高效的视频超分辨率网络。利用这些设计，我们开发了一个统一的框架，能够跨多个持续时间生成高质量的文本到视频和图像到视频，并且这个 http URL 实验表明，这种紧凑而熟练的模型在开源视频生成模型中建立了新的最先进的模型。通过发布代码和模型权重，我们为社区提供了高性能基础，降低了视频创作和研究的障碍，使更广泛的受众可以使用高级视频生成。所有开源资产均可通过此 https URL 公开获取。

Title: MagicWorld: Interactive Geometry-driven Video World Exploration

Authors: Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, Peng-Tao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18886
Pdf URL: https://arxiv.org/pdf/2511.18886
Copy Paste: [[2511.18886]] MagicWorld: Interactive Geometry-driven Video World Exploration(https://arxiv.org/abs/2511.18886)
Keywords: generation
Abstract: Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which results in structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, employs user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose a History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model utilize past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.
摘要：最近的交互式视频世界模型方法根据用户指令生成场景演化。尽管他们取得了令人印象深刻的成果，但仍然存在两个关键限制。首先，他们未能充分利用指令驱动的场景运动和底层 3D 几何之间的对应关系，从而导致视点变化下的结构不稳定。其次，他们在多步交互过程中很容易忘记历史信息，导致错误累积以及场景语义和结构的渐进漂移。为了解决这些问题，我们提出了 MagicWorld，这是一种集成了 3D 几何先验和历史检索的交互式视频世界模型。 MagicWorld从单个场景图像开始，利用用户动作驱动动态场景演化，并自回归合成连续场景。我们引入了动作引导的 3D 几何模块（AG3D），它从每个交互的第一帧和相应的动作构建点云，为视点转换提供明确的几何约束，从而提高结构一致性。我们进一步提出历史缓存检索（HCR）机制，该机制在生成过程中检索相关的历史帧并将其作为条件信号注入，帮助模型利用过去的场景信息并减轻错误累积。实验结果表明，MagicWorld 在交互迭代中的场景稳定性和连续性方面取得了显着改进。
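
Illustration (not from the paper): retrieving "relevant historical frames" can be sketched as a top-k nearest-neighbour lookup over cached frame embeddings. The function name `retrieve_history`, the use of cosine similarity, and the embedding representation are assumptions; the abstract does not say how HCR scores relevance.

```python
import numpy as np

def retrieve_history(query: np.ndarray, cache: np.ndarray, k: int = 2):
    """Indices of the k cached frame embeddings most cosine-similar to the query.

    query: (dim,) embedding of the current step (hypothetical).
    cache: (num_frames, dim) embeddings of past frames (hypothetical).
    """
    q = query / np.linalg.norm(query)
    c = cache / np.linalg.norm(cache, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k].tolist()
```

The returned indices would select which past frames to inject as conditioning signals for the next generation step.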

Title: MFmamba: A Multi-function Network for Panchromatic Image Resolution Restoration Based on State-Space Model

Authors: Qian Jiang, Qianqian Wang, Xin Jin, Michal Wozniak, Shaowen Yao, Wei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18888
Pdf URL: https://arxiv.org/pdf/2511.18888
Copy Paste: [[2511.18888]] MFmamba: A Multi-function Network for Panchromatic Image Resolution Restoration Based on State-Space Model(https://arxiv.org/abs/2511.18888)
Keywords: restoration, super-resolution
Abstract: Remote sensing images are becoming increasingly widespread in military applications and earth resource exploration. Because of the limitation of a single sensor, we can obtain high spatial resolution grayscale panchromatic (PAN) images and low spatial resolution color multispectral (MS) images. Therefore, an important issue is to obtain a color image with high spatial resolution when there is only a PAN image at the input. The existing methods improve spatial resolution using super-resolution (SR) technology and spectral recovery using colorization technology. However, the SR technique cannot improve the spectral resolution, and the colorization technique cannot improve the spatial resolution. Moreover, the pansharpening method needs two registered inputs and cannot achieve SR. As a result, an integrated approach is expected. To solve the above problems, we designed a novel multi-function model (MFmamba) to realize the tasks of SR, spectral recovery, and joint SR and spectral recovery through three different inputs. Firstly, MFmamba utilizes UNet++ as the backbone, and a Mamba Upsample Block (MUB) is combined with UNet++. Secondly, a Dual Pool Attention (DPA) is designed to replace the skip connection in UNet++. Finally, a Multi-scale Hybrid Cross Block (MHCB) is proposed for initial feature extraction. Many experiments show that MFmamba is competitive in evaluation metrics and visual results and performs well in the three tasks when only the input PAN image is used.
摘要：遥感图像在军事、地球资源勘探中变得越来越广泛。由于单个传感器的限制，我们可以获得高空间分辨率的灰度全色（PAN）图像和低空间分辨率的彩色多光谱（MS）图像。因此，当输入只有PAN图像时，如何获得具有高空间分辨率的彩色图像是一个重要的问题。现有方法使用超分辨率（SR）技术提高空间分辨率，并使用彩色化技术提高光谱恢复。然而，SR技术无法提高光谱分辨率，并且彩色化技术无法提高空间分辨率。而且，全色锐化方法需要两个注册输入并且无法实现SR。因此，预计将采用综合方法。为了解决上述问题，我们设计了一种新颖的多功能模型（MFmamba），通过三个不同的输入实现SR、光谱恢复、联合SR和光谱恢复的任务。首先，MFmamba利用UNet++作为骨干，并将Mamba Upsample Block（MUB）与UNet++相结合。其次，设计了双池注意力（DPA）来取代 UNet++ 中的跳跃连接。最后，提出了一种多尺度混合交叉块（MHCB）用于初始特征提取。许多实验表明，当仅使用输入 PAN 图像时，MFmamba 在评估指标和视觉结果方面具有竞争力，并且在三个任务中表现良好。

Title: Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation

Authors: Ruiying Liu, Yuanzhi Liang, Haibin Huang, Tianshu Yu, Chi Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.18919
Pdf URL: https://arxiv.org/pdf/2511.18919
Copy Paste: [[2511.18919]] Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation(https://arxiv.org/abs/2511.18919)
Keywords: generation, generative
Abstract: Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many to many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
摘要：组相对策略优化（GRPO）已成为训练后视觉生成模型的有效且轻量级的框架。然而，它的性能从根本上受到文本视觉对应的模糊性的限制：单个提示可以有效地描述不同的视觉输出，并且单个图像或视频可以支持多个同样正确的解释。这种多对多的关系导致奖励模型产生不确定且弱区分的信号，导致 GRPO 未充分利用可靠的反馈并过度拟合噪声反馈。我们引入贝叶斯先验引导优化（BPGO），这是 GRPO 的一种新颖扩展，它通过语义先验锚明确地模拟奖励不确定性。 BPGO在两个层面上自适应地调节优化信任：组间贝叶斯信任分配强调与先前一致的组的更新，同时降低模糊性的权重，组内先验锚定重正化通过扩大置信偏差和压缩不确定分数来锐化样本区别。在图像和视频生成任务中，BPGO 始终提供比标准 GRPO 和最新变体更强的语义对齐、增强的感知保真度和更快的收敛速度。
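
Illustration (not from the paper): GRPO scores each sample relative to its own group by standardizing rewards within the group. The sketch below shows that step and then a hypothetical group-level trust weight that shrinks updates from groups whose mean reward sits far from a prior score. The exponential form, `prior_score`, and `temperature` are assumptions for illustration; BPGO's actual Bayesian allocation and renormalization are not given in the abstract.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def trust_weighted_advantages(rewards, prior_score, temperature=1.0):
    """Scale a group's advantages by a hypothetical trust weight in (0, 1].

    Trust is 1 when the group's mean reward matches the prior score and
    decays exponentially with the gap (an assumed form).
    """
    r = np.asarray(rewards, dtype=float)
    trust = float(np.exp(-abs(r.mean() - prior_score) / temperature))
    return trust * group_relative_advantages(r)
```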

Title: One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control

Authors: Zhenxing Mi, Yuxin Wang, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18922
Pdf URL: https://arxiv.org/pdf/2511.18922
Copy Paste: [[2511.18922]] One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control(https://arxiv.org/abs/2511.18922)
Keywords: generation
Abstract: We present One4D, a unified framework for 4D generation and reconstruction that produces dynamic 4D content as synchronized RGB frames and pointmaps. By consistently handling varying sparsities of conditioning frames through a Unified Masked Conditioning (UMC) mechanism, One4D can seamlessly transition between 4D generation from a single image, 4D reconstruction from a full video, and mixed generation and reconstruction from sparse frames. Our framework adapts a powerful video generation model for joint RGB and pointmap generation, with carefully designed network architectures. The commonly used diffusion finetuning strategies for depthmap or pointmap reconstruction often fail on joint RGB and pointmap generation, quickly degrading the base video model. To address this challenge, we introduce Decoupled LoRA Control (DLC), which employs two modality-specific LoRA adapters to form decoupled computation branches for RGB frames and pointmaps, connected by lightweight, zero-initialized control links that gradually learn mutual pixel-level consistency. Trained on a mixture of synthetic and real 4D datasets under modest computational budgets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks. This work represents a step toward general, high-quality geometry-based 4D world modeling using video diffusion models. Project page: this https URL
摘要：我们推出 One4D，这是一个用于 4D 生成和重建的统一框架，可将动态 4D 内容生成为同步 RGB 帧和点图。通过统一蒙版调节 (UMC) 机制一致地处理调节帧的不同稀疏度，One4D 可以在单个图像的 4D 生成、完整视频的 4D 重建以及稀疏帧的混合生成和重建之间无缝过渡。我们的框架采用强大的视频生成模型来联合 RGB 和点图生成，并具有精心设计的网络架构。用于深度图或点图重建的常用扩散微调策略通常在联合 RGB 和点图生成上失败，从而快速降低基础视频模型的性能。为了应对这一挑战，我们引入了解耦 LoRA 控制 (DLC)，它采用两个特定于模态的 LoRA 适配器来形成 RGB 帧和点图的解耦计算分支，通过轻量级、零初始化的控制链接连接，逐渐学习相互像素级一致性。 One4D 在适度的计算预算下对合成和真实 4D 数据集的混合进行训练，在生成和重建任务中生成高质量的 RGB 帧和准确的点图。这项工作代表了使用视频扩散模型向通用、高质量的基于几何的 4D 世界建模迈出了一步。项目页面：此 https URL
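
Illustration (not from the paper): a LoRA adapter adds a low-rank update to a frozen weight, and zero-initializing one factor means the adapted layer starts out identical to the base layer. The "zero-initialized control links" above follow the same idea of starting from no effect. The class name `LoRALinear` and the hyperparameters are assumptions; this is a generic sketch, not One4D's DLC architecture.

```python
import numpy as np

class LoRALinear:
    """y = x W^T + scale * (x A^T) B^T with W frozen and B zero-initialized."""

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 4.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                # frozen base weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable down-projection
        self.B = np.zeros((out_dim, rank))                  # trainable, starts at zero
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because `B` is zero at the start, training begins from the base model's exact behaviour. That is one plausible reason such initializations help avoid quickly degrading a pretrained video model.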

Title: FineXtrol: Controllable Motion Generation via Fine-Grained Text

Authors: Keming Shen, Bizhu Wu, Junliang Chen, Xiaoqin Wang, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18927
Pdf URL: https://arxiv.org/pdf/2511.18927
Copy Paste: [[2511.18927]] FineXtrol: Controllable Motion Generation via Fine-Grained Text(https://arxiv.org/abs/2511.18927)
Keywords: generation
Abstract: Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.
摘要：最近的工作试图增强文本驱动运动生成的可控性和精度。一些方法利用大型语言模型 (LLM) 来生成更详细的文本，而另一些方法则将全局 3D 坐标序列作为附加控制信号。然而，前者经常引入未对齐的细节并且缺乏明确的时间线索，而后者在将坐标转换为标准运动表示时会产生大量的计算成本。为了解决这些问题，我们提出了 FineXtrol，这是一种新颖的控制框架，用于高效运动生成，由时间感知、精确、用户友好和细粒度的文本控制信号引导，这些文本控制信号描述特定身体部位随时间的运动。为了支持这个框架，我们设计了一个分层对比学习模块，鼓励文本编码器为我们的新颖控制信号产生更具辨别力的嵌入，从而提高运动可控性。定量结果表明，FineXtrol 在可控运动生成方面取得了强大的性能，而定性分析则证明了其在指导特定身体部位运动方面的灵活性。

Title: VeCoR - Velocity Contrastive Regularization for Flow Matching

Authors: Zong-Wei Hong, Jing-lun Li, Lin-Ze Li, Shen Zhang, Yao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18942
Pdf URL: https://arxiv.org/pdf/2511.18942
Copy Paste: [[2511.18942]] VeCoR - Velocity Contrastive Regularization for Flow Matching(https://arxiv.org/abs/2511.18942)
Keywords: generation, generative
Abstract: Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations. To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both "where to go" and "where not to go." To be formal, we propose \textbf{Velocity Contrastive Regularization (VeCoR)}, a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones. On ImageNet-1K 256$\times$256, VeCoR yields 22\% and 35\% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32\% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: this https URL
摘要：流量匹配 (FM) 最近已成为扩散模型的一种原则性且有效的替代方案。标准 FM 鼓励学习的速度场遵循目标方向；然而，它可能会沿着轨迹累积误差，并将样本驱离数据流形，导致感知退化，特别是在轻量级或低步配置中。为了增强稳定性和泛化性，我们将 FM 扩展为平衡的吸引-排斥方案，为“去哪里”和“不去哪里”提供明确的指导。为了正式起见，我们提出了 \textbf{速度对比正则化（VeCoR）}，这是一种基于流的生成模型的补充训练方案，通过对比、双边监督增强了标准 FM 目标。 VeCoR 不仅将预测速度与稳定的参考方向对齐（正监督），而且将其推离不一致的、偏离流形的方向（负监督）。这种对比公式将 FM 从纯粹有吸引力的片面目标转变为双面训练信号，规范轨迹演化并提高跨数据集和骨干网的感知保真度。在 ImageNet-1K 256$\times$256 上，VeCoR 在 SiT-XL/2 和 REPA-SiT-XL/2 主干上分别产生 22\% 和 35\% 的相对 FID 减少，并在 MS-COCO 文本到图像生成上实现进一步的 FID 增益（相对 32\%），证明了稳定性、收敛性和图像质量的持续改进，特别是在低步和轻量级设置中。项目页面：此 https URL
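
Illustration (not from the paper): with a linear interpolant, the flow-matching target velocity is simply `x1 - x0`. An attract-repel objective can then be sketched as the usual squared error to that target minus a small multiple of the squared error to a negative direction. The function names, the weight `lam`, and how the negative velocity `v_neg` is chosen are all assumptions; VeCoR's actual negatives and loss are not specified in the abstract.

```python
import numpy as np

def fm_target(x0: np.ndarray, x1: np.ndarray, t):
    """Linear interpolant x_t = (1 - t) x0 + t x1 and its velocity x1 - x0."""
    xt = (1 - t) * x0 + t * x1
    return xt, x1 - x0

def attract_repel_loss(v_pred, v_pos, v_neg, lam: float = 0.05) -> float:
    """Pull v_pred toward v_pos and push it away from v_neg (assumed form)."""
    attract = np.mean(np.sum((v_pred - v_pos) ** 2, axis=-1))
    repel = np.mean(np.sum((v_pred - v_neg) ** 2, axis=-1))
    return float(attract - lam * repel)
```

Setting `lam=0` recovers the plain flow-matching regression loss. A small positive `lam` adds the "where not to go" signal while keeping the attractive term dominant.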

Title: Leveraging Adversarial Learning for Pathological Fidelity in Virtual Staining

Authors: José Teixeira, Pascal Klöckner, Diana Montezuma, Melis Erdal Cesur, João Fraga, Hugo M. Horlings, Jaime S. Cardoso, Sara P. Oliveira
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18946
Pdf URL: https://arxiv.org/pdf/2511.18946
Copy Paste: [[2511.18946]] Leveraging Adversarial Learning for Pathological Fidelity in Virtual Staining(https://arxiv.org/abs/2511.18946)
Keywords: generative
Abstract: In addition to evaluating tumor morphology using H&E staining, immunohistochemistry is used to assess the presence of specific proteins within the tissue. However, this is a costly and labor-intensive technique, for which virtual staining, as an image-to-image translation task, offers a promising alternative. Although recent, this is an emerging field of research, with 64% of published studies appearing in 2024 alone. Most studies use publicly available datasets of H&E-IHC pairs from consecutive tissue sections. Recognizing the training challenges, many authors develop complex virtual staining models based on conditional Generative Adversarial Networks, but ignore the impact of adversarial loss on the quality of virtual staining. Furthermore, overlooking the issues of model evaluation, they claim improved performance based on metrics such as SSIM and PSNR, which are not sufficiently robust to evaluate the quality of virtually stained images. In this paper, we developed CSSP2P GAN, which we demonstrate to achieve heightened pathological fidelity through a blind pathological expert evaluation. Furthermore, while iteratively developing our model, we study the impact of the adversarial loss and demonstrate its crucial role in the quality of virtually stained images. Finally, while comparing our model with reference works in the field, we underscore the limitations of the currently used evaluation metrics and demonstrate the superior performance of CSSP2P GAN.
摘要：除了使用 H&E 染色评估肿瘤形态外，免疫组织化学还用于评估组织内特定蛋白质的存在。然而，这是一种成本高昂且劳动密集型的技术，虚拟染色作为图像到图像的转换任务，提供了一种有前途的替代方案。虽然这是一个新兴的研究领域，但到 2024 年已发表的研究中有 64% 是最近才出现的。大多数研究使用来自连续组织切片的 H&E-IHC 对的公开数据集。认识到训练的挑战，许多作者基于条件生成对抗网络开发了复杂的虚拟染色模型，但忽略了对抗性损失对虚拟染色质量的影响。此外，他们忽略了模型评估的问题，声称基于 SSIM 和 PSNR 等指标提高了性能，但这些指标不足以评估虚拟染色图像的质量。在本文中，我们开发了 CSSP2P GAN，并证明它可以通过盲目的病理专家评估来实现更高的病理保真度。此外，在迭代开发我们的模型时，我们研究了对抗性损失的影响，并证明了它对虚拟染色图像质量的关键作用。最后，在将我们的模型与该领域的参考作品进行比较时，我们强调了当前使用的评估指标的局限性，并展示了 CSSP2P GAN 的优越性能。

Title: Eevee: Towards Close-up High-resolution Video-based Virtual Try-on

Authors: Jianhao Zeng, Yancheng Bai, Ruidong Chen, Xuanpu Zhang, Lei Sun, Dongyang Jin, Ryan Xu, Nannan Zhang, Dan Song, Xiangxiang Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18957
Pdf URL: https://arxiv.org/pdf/2511.18957
Copy Paste: [[2511.18957]] Eevee: Towards Close-up High-resolution Video-based Virtual Try-on(https://arxiv.org/abs/2511.18957)
Keywords: generation
Abstract: Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business's demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, which includes high-fidelity images with detailed close-ups and textual descriptions. Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric, VGID (Video Garment Inception Distance), that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.
摘要：视频虚拟试穿技术为时尚电商营销视频制作提供了一种经济高效的解决方案。然而，它的实际采用受到两个关键限制的阻碍。首先，当前虚拟试穿数据集中对单个服装图像作为输入的依赖限制了对真实纹理细节的准确捕捉。其次，大多数现有方法仅专注于生成全景虚拟试穿视频，而忽略了企业对还提供详细特写镜头的视频的需求。为了解决这些挑战，我们引入了用于基于视频的虚拟试穿的高分辨率数据集。该数据集提供了两个关键特征。首先，它提供了有关服装的更多详细信息，其中包括带有详细特写镜头和文字描述的高保真图像；其次，它独特地包括真人模特的全景和特写试穿视频。此外，准确评估一致性对于特写视频来说变得更加重要，因为特写视频需要高保真地保留服装细节。为了促进这种细粒度的评估，我们提出了一种新的服装一致性指标 VGID（视频服装起始距离），它可以量化纹理和结构的保留。我们的实验验证了这些贡献。我们证明，通过利用数据集中的详细图像，现有的视频生成模型可以提取并合并纹理特征，从而显着增强虚拟试穿结果的真实性和细节保真度。此外，我们对最新模型进行了全面的基准测试。该基准有效地识别了当前方法中的纹理和结构保存问题。

Title: AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Authors: Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu
Subjects: cs.LG, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.18960
Pdf URL: https://arxiv.org/pdf/2511.18960
Copy Paste: [[2511.18960]] AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention(https://arxiv.org/abs/2511.18960)
Keywords: generation
Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights that actively process task-relevant visual tokens based on their historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
摘要：视觉-语言-动作（VLA）模型在具体的人工智能任务中表现出了卓越的能力。然而，现有的 VLA 模型通常基于视觉语言模型 (VLM) 构建，通常在每个时间步独立处理密集的视觉输入。这种方法将任务隐式建模为马尔可夫决策过程 (MDP)。然而，这种与历史无关的设计对于动态顺序决策中的有效视觉标记处理而言并不是最佳的，因为它无法利用历史背景。为了解决这个限制，我们从部分可观察马尔可夫决策过程（POMDP）的角度重新表述了这个问题，并提出了一个名为 AVA-VLA 的新框架。受到 POMDP 的启发，行动的生成应该以信念状态为条件。 AVA-VLA 引入主动视觉注意（AVA）来动态调节视觉处理。它通过利用循环状态来实现这一点，循环状态是从先前决策步骤得出的代理信念状态的神经近似。具体来说，AVA 模块使用循环状态来计算软权重，以根据其历史上下文主动处理与任务相关的视觉标记。综合评估表明，AVA-VLA 在流行的机器人基准测试中实现了最先进的性能，包括 LIBERO 和 CALVIN。此外，双臂机器人平台上的实际部署验证了该框架的实际适用性和强大的模拟到真实的可迁移性。
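
Illustration (not from the paper): computing soft weights over visual tokens from a recurrent state can be sketched as a softmax over scaled dot products, with each token then rescaled by its weight. The function names and the dot-product scoring are assumptions; the abstract does not describe how the AVA module actually derives or applies its weights.

```python
import numpy as np

def soft_token_weights(recurrent_state: np.ndarray, visual_tokens: np.ndarray) -> np.ndarray:
    """Softmax over scaled dot products between the state and each token.

    recurrent_state: (dim,) summary of the previous step (hypothetical).
    visual_tokens: (num_tokens, dim) current-frame tokens (hypothetical).
    """
    dim = visual_tokens.shape[-1]
    scores = visual_tokens @ recurrent_state / np.sqrt(dim)
    scores = scores - scores.max()             # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def modulate(recurrent_state: np.ndarray, visual_tokens: np.ndarray) -> np.ndarray:
    """Rescale every token by its soft weight."""
    w = soft_token_weights(recurrent_state, visual_tokens)
    return w[:, None] * visual_tokens
```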

Title: View-Consistent Diffusion Representations for 3D-Consistent Video Generation

Authors: Duolikun Danier, Ge Gao, Steven McDonagh, Changjian Li, Hakan Bilen, Oisin Mac Aodha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.18991
Pdf URL: https://arxiv.org/pdf/2511.18991
Copy Paste: [[2511.18991]] View-Consistent Diffusion Representations for 3D-Consistent Video Generation(https://arxiv.org/abs/2511.18991)
Keywords: generation
Abstract: Video generation models have made significant progress in generating realistic content, enabling applications in simulation, gaming, and film making. However, current generated videos still contain visual artifacts arising from 3D inconsistencies, e.g., objects and structures deforming under changes in camera pose, which can undermine user experience and simulation fidelity. Motivated by recent findings on representation alignment for diffusion models, we hypothesize that improving the multi-view consistency of video diffusion representations will yield more 3D-consistent video generation. Through detailed analysis on multiple recent camera-controlled video diffusion models we reveal strong correlations between 3D-consistent representations and videos. We also propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations. We evaluate ViCoDR on camera controlled image-to-video, text-to-video, and multi-view generation models, demonstrating significant improvements in the 3D consistency of the generated videos. Project page: this https URL.
摘要：视频生成模型在生成真实内容方面取得了重大进展，从而实现了模拟、游戏和电影制作中的应用。然而，当前生成的视频仍然包含因 3D 不一致而产生的视觉伪影，例如，物体和结构在相机姿势变化下变形，这可能会损害用户体验和模拟保真度。受最近关于扩散模型表示对齐的研究结果的启发，我们假设提高视频扩散表示的多视图一致性将产生更加一致的 3D 视频生成。通过对多个最新摄像机控制的视频扩散模型的详细分析，我们揭示了 3D 一致表示和视频之间的强相关性。我们还提出了 ViCoDR，这是一种通过学习多视图一致扩散表示来提高视频模型 3D 一致性的新方法。我们在相机控制的图像到视频、文本到视频和多视图生成模型上评估 ViCoDR，证明生成视频的 3D 一致性有显着改进。项目页面：此 https URL。

Title: A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation

Authors: Wentao Qu, Guofeng Mei, Yang Wu, Yongshun Gong, Xiaoshui Huang, Liang Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19004
Pdf URL: https://arxiv.org/pdf/2511.19004
Copy Paste: [[2511.19004]] A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation(https://arxiv.org/abs/2511.19004)
Keywords: generation
Abstract: Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides the soft supervision with reconstruction details for the Denoising Network (DN) in training, while decoupled in inference. In this way, T2LDM can perceive rich geometric structures from data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts for LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.
摘要：文本到激光雷达的生成可以为下游任务定制具有丰富结构和多样化场景的3D数据。然而，文本-LiDAR 对的稀缺通常会导致训练先验不足，从而生成过于平滑的 3D 场景。此外，低质量的文本描述可能会降低生成质量和可控性。在本文中，我们提出了一种用于场景生成的文本到激光雷达扩散模型，称为 T2LDM，具有自条件表示指导（SCRG）。具体来说，SCRG 通过与真实表示对齐，为训练中的去噪网络 (DN) 提供具有重建细节的软监督，同时在推理中解耦。这样，T2LDM就可以从数据分布中感知丰富的几何结构，生成场景中的详细物体。同时，我们构建了一个内容可组合的文本激光雷达基准 T2nuScenes 以及一个可控性指标。基于此，我们分析了不同文本提示对激光雷达生成质量和可控性的影响，提供实用的提示范式和见解。此外，方向位置先验旨在减轻街道失真，进一步提高场景保真度。此外，通过冻结 DN 学习条件编码器，T2LDM 可以支持多种条件任务，包括稀疏到密集、密集到稀疏和语义到激光雷达生成。无条件和条件生成方面的大量实验表明，T2LDM 优于现有方法，实现了最先进的场景生成。

Title: Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling

Authors: Long Tang, Guoquan Zhen, Jie Hao, Jianbo Zhang, Huiyu Duan, Liang Yuan, Guangtao Zhai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19024
Pdf URL: https://arxiv.org/pdf/2511.19024
Copy Paste: [[2511.19024]] Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling(https://arxiv.org/abs/2511.19024)
Keywords: quality assessment
Abstract: Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking their unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes an effective quality feature decoding framework via GCN-enhanced \underline{l}ayer \underline{i}nteraction and MoE-based \underline{f}eature d\underline{e}coupling, termed \textbf{(Life-IQA)}. Specifically, the GCN-enhanced layer interaction module uses the GCN-enhanced deepest-layer features as the query and the penultimate-layer features as the key and value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations through different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows a more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks. The code is available at: \href{this https URL}{\texttt{Life-IQA}}.
摘要：盲图像质量评估（BIQA）在评估和优化视觉体验方面发挥着至关重要的作用。大多数现有的 BIQA 方法融合了从骨干网络提取的浅层和深层特征，同时忽视了对质量预测的不平等贡献。此外，虽然 BIQA 中广泛采用了各种视觉编码器主干，但有效的质量解码架构仍未得到充分探索。为了解决这些限制，本文研究了浅层和深层特征对 BIQA 的贡献，并通过 GCN 增强的 \underline{l}ayer\underline{i} 交互和基于 MoE 的 \underline{f}feature d\underline{e} 耦合提出了一种有效的质量特征解码框架，称为 \textbf{(Life-IQA)}。具体来说，GCN增强层交互模块利用GCN增强的最深层特征作为查询，倒数第二层特征作为key、value，然后进行交叉注意力来实现特征交互。此外，提出了一种基于 MoE 的特征解耦模块，通过专门针对特定失真类型或质量维度的不同专家来解耦融合表示。大量实验表明，Life-IQA 在准确性和成本之间显示出比普通 Transformer 解码器更有利的平衡，并且在多个 BIQA 上实现了最先进的性能，此 http URL 代码可在以下位置获取：\href{this https URL}{\texttt{Life-IQA}}。
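
Sketch (illustrative, not the authors' code): the abstract describes cross-attention in which deepest-layer features act as the query and penultimate-layer features act as key and value. The snippet below is a minimal single-head scaled dot-product version of that idea. The token values, the absence of learned projections, and the omission of the GCN enhancement are all simplifying assumptions made here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends over all keys."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Hypothetical toy tokens: 2 deepest-layer tokens, 3 penultimate-layer tokens.
deepest = [[1.0, 0.0], [0.0, 1.0]]
penultimate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Deepest layer supplies the queries; penultimate layer supplies keys and values.
fused = cross_attention(deepest, penultimate, penultimate)
```

The output has one fused vector per query token, each a convex combination of the penultimate-layer tokens.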

Title: Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation

Authors: Ruojun Xu, Yu Kai, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Tianxiang Zheng, Qinhlin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19049
Pdf URL: https://arxiv.org/pdf/2511.19049
Copy Paste: [[2511.19049]] Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation(https://arxiv.org/abs/2511.19049)
Keywords: generation, generative
Abstract: Direct Preference Optimization (DPO) has shown promising results in aligning generative outputs with human preferences by distinguishing between chosen and rejected samples. However, a critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training, undermining the quality of generation. Although this issue has been investigated in autoregressive models, its impact within diffusion-based models remains largely unexplored. This gap leads to suboptimal performance in tasks involving video generation. To address this, we conduct a formal analysis of DPO loss through updating policy within the diffusion framework, which describes how the updating of specific training samples influences the model's predictions on other samples. Using this tool, we identify two main failure modes: (1) Optimization Conflict, which arises from small reward margins between chosen and rejected samples, and (2) Suboptimal Maximization, caused by large reward margins. Informed by these insights, we introduce a novel solution named Policy-Guided DPO (PG-DPO), combining Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR) to effectively mitigate likelihood displacement. Experiments show that PG-DPO outperforms existing methods in both quantitative metrics and qualitative evaluations, offering a robust solution for improving preference alignment in video generation tasks.
摘要：直接偏好优化（DPO）通过区分选择和拒绝的样本，在将生成输出与人类偏好保持一致方面显示出了有希望的结果。然而，DPO 的一个关键限制是似然位移，即所选样本的概率在训练过程中矛盾地下降，从而损害了生成的质量。尽管这个问题已经在自回归模型中得到了研究，但它在基于扩散的模型中的影响仍然很大程度上未被探索。这种差距导致涉及视频生成的任务性能不佳。为了解决这个问题，我们通过更新扩散框架内的策略对 DPO 损失进行了正式分析，该分析描述了特定训练样本的更新如何影响模型对其他样本的预测。使用此工具，我们确定了两种主要的失败模式：（1）优化冲突，由所选样本和拒绝样本之间的较小奖励裕度引起，以及（2）次优最大化，由较大奖励裕度引起。根据这些见解，我们引入了一种名为策略引导 DPO (PG-DPO) 的新颖解决方案，它将自适应拒绝缩放 (ARS) 和隐式偏好正则化 (IPR) 相结合，以有效减轻似然位移。实验表明，PG-DPO 在定量指标和定性评估方面均优于现有方法，为改善视频生成任务中的偏好对齐提供了强大的解决方案。
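
Sketch (illustrative, not the paper's method): likelihood displacement can be seen directly in the generic DPO loss, which depends only on the margin between chosen and rejected log-probability ratios. The snippet below uses the standard per-pair DPO loss with made-up log-probabilities; it does not reproduce the diffusion-specific formulation or PG-DPO, and `beta` and all numbers are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given log-probabilities."""
    # Implicit reward margin relative to the frozen reference policy.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical values: the policy starts equal to the reference.
before = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# After an update the chosen sample became LESS likely (-10 -> -12),
# but the rejected sample dropped even more (-10 -> -20).
after = dpo_loss(-12.0, -20.0, -10.0, -10.0)
```

Here `after` is smaller than `before` even though the chosen sample's log-probability fell, which is the displacement effect the abstract refers to: the loss rewards a wider margin, not a higher chosen likelihood.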

Title: Understanding, Accelerating, and Improving MeanFlow Training

Authors: Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19065
Pdf URL: https://arxiv.org/pdf/2511.19065
Copy Paste: [[2511.19065]] Understanding, Accelerating, and Improving MeanFlow Training(https://arxiv.org/abs/2511.19065)
Keywords: generation, generative
Abstract: MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
摘要：MeanFlow 通过共同学习瞬时速度场和平均速度场，承诺只需几个步骤即可实现高质量的生成建模。然而，潜在的训练动态仍不清楚。我们分析了两个速度之间的相互作用，发现：（i）确定的瞬时速度是学习平均速度的先决条件； (ii) 当时间间隙很小时，瞬时速度的学习受益于平均速度，但随着间隙的增加而降低； (iii)任务亲和力分析表明，大间隙平均速度的顺利学习对于一步生成至关重要，取决于事先形成准确的瞬时和小间隙平均速度。在这些观察的指导下，我们设计了一种有效的训练方案，加速瞬时速度的形成，然后将重点从短间隔平均速度转移到长间隔平均速度。我们增强的 MeanFlow 训练可实现更快的收敛和明显更好的几步生成：使用相同的 DiT-XL 主干，我们的方法在 1-NFE ImageNet 256x256 上达到令人印象深刻的 2.87 FID，而传统 MeanFlow 基线的 FID 为 3.43。或者，我们的方法将 MeanFlow 基线的性能与缩短 2.5 倍的训练时间或较小的 DiT-L 主干网络相匹配。
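
Sketch (illustrative, not the authors' code): the distinction between instantaneous and average velocity can be made concrete on a toy 1-D flow. The snippet assumes a hypothetical velocity v(tau) = 2 * tau that depends only on time, computes its average over an interval [r, t] numerically, and shows that a single step with the average velocity reproduces the exact endpoint. The time direction and the field itself are assumptions chosen for simplicity.

```python
def instantaneous_velocity(tau):
    """Toy time-dependent velocity v(tau) = 2 * tau (hypothetical)."""
    return 2.0 * tau


def average_velocity(r, t, n_steps=1000):
    """Average of v over [r, t], approximated with the midpoint rule."""
    h = (t - r) / n_steps
    total = sum(instantaneous_velocity(r + (i + 0.5) * h) for i in range(n_steps))
    return total * h / (t - r)


r, t = 0.0, 1.0
u = average_velocity(r, t)

# One-step transport with the average velocity over the whole interval.
z_r = 3.0
z_t_one_step = z_r + (t - r) * u

# Exact endpoint: integrating dz/dtau = 2 * tau gives z_t = z_r + t^2 - r^2.
z_t_exact = z_r + (t ** 2 - r ** 2)
```

A large gap t - r corresponds to the one-step setting discussed in the abstract; a small gap makes the average velocity approach the instantaneous one.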

Title: EnfoPath: Energy-Informed Analysis of Generative Trajectories in Flow Matching

Authors: Ziyun Li, Ben Dai, Huancheng Hu, Henrik Boström, Soon Hoe Lim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19087
Pdf URL: https://arxiv.org/pdf/2511.19087
Copy Paste: [[2511.19087]] EnfoPath: Energy-Informed Analysis of Generative Trajectories in Flow Matching(https://arxiv.org/abs/2511.19087)
Keywords: generation, generative
Abstract: Flow-based generative models synthesize data by integrating a learned velocity field from a reference distribution to the target data distribution. Prior work has focused on endpoint metrics (e.g., fidelity, likelihood, perceptual quality) while overlooking a deeper question: what do the sampling trajectories reveal? Motivated by classical mechanics, we introduce kinetic path energy (KPE), a simple yet powerful diagnostic that quantifies the total kinetic effort along each generation path of ODE-based samplers. Through comprehensive experiments on CIFAR-10 and ImageNet-256, we uncover two key phenomena: (i) higher KPE predicts stronger semantic quality, indicating that semantically richer samples require greater kinetic effort, and (ii) higher KPE inversely correlates with data density, with informative samples residing in sparse, low-density regions. Together, these findings reveal that semantically informative samples naturally reside on the sparse frontier of the data distribution, demanding greater generative effort. Our results suggest that trajectory-level analysis offers a physics-inspired and interpretable framework for understanding generation difficulty and sample characteristics.
摘要：基于流的生成模型通过将学习到的速度场从参考分布集成到目标数据分布来合成数据。之前的工作重点关注端点指标（例如保真度、可能性、感知质量），而忽略了更深层次的问题：采样轨迹揭示了什么？受经典力学的启发，我们引入了动路径能 (KPE)，这是一种简单而强大的诊断方法，可量化基于 ODE 的采样器的每个生成路径上的总动能。通过对 CIFAR-10 和 ImageNet-256 的综合实验，我们发现了两个关键现象：({i}) 较高的 KPE 预测更强的语义质量，表明语义更丰富的样本需要更大的动力努力；({ii}) 较高的 KPE 与数据密度成反比，信息丰富的样本位于稀疏、低密度的区域。总之，这些发现表明，语义信息样本自然位于数据分布的稀疏前沿，需要更大的生成工作。我们的结果表明，轨迹级分析为理解生成难度和样本特征提供了一个受物理启发且可解释的框架。
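
Sketch (illustrative, not the authors' definition): the abstract does not give the exact formula for kinetic path energy. The snippet below assumes the classical-mechanics form, the time integral of 0.5 * ||v||^2 along an ODE sampling path, and evaluates it with Euler steps on a hypothetical velocity field that pulls every point toward (1, 1). The field, the factor 0.5, and the step count are all assumptions.

```python
def velocity(x, t):
    """Hypothetical velocity field: linear pull toward the point (1, 1)."""
    return [1.0 - xi for xi in x]


def kinetic_path_energy(x0, n_steps=1000):
    """Euler-integrate dx/dt = v(x, t) on [0, 1], accumulating 0.5 * ||v||^2 * dt."""
    x = list(x0)
    dt = 1.0 / n_steps
    energy = 0.0
    for k in range(n_steps):
        v = velocity(x, k * dt)
        energy += 0.5 * sum(vi * vi for vi in v) * dt
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return energy, x


# A start close to the target needs little kinetic effort; a distant one needs more.
near, _ = kinetic_path_energy([0.9, 0.9])
far, _ = kinetic_path_energy([-1.0, -1.0])
```

In this toy setting the energy grows with the squared distance travelled, so `far` is much larger than `near`; per-sample values like these are what a trajectory-level diagnostic would compare.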

Title: HABIT: Human Action Benchmark for Interactive Traffic in CARLA

Authors: Mohan Ramesh, Mark Azer, Fabian B. Flohr
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19109
Pdf URL: https://arxiv.org/pdf/2511.19109
Copy Paste: [[2511.19109]] HABIT: Human Action Benchmark for Interactive Traffic in CARLA(https://arxiv.org/abs/2511.19109)
Keywords: generation
Abstract: Current autonomous driving (AD) simulations are critically limited by their inadequate representation of realistic and diverse human behavior, which is essential for ensuring safety and reliability. Existing benchmarks often simplify pedestrian interactions, failing to capture complex, dynamic intentions and varied responses critical for robust system deployment. To overcome this, we introduce HABIT (Human Action Benchmark for Interactive Traffic), a high-fidelity simulation benchmark. HABIT integrates real-world human motion, sourced from mocap and videos, into CARLA (Car Learning to Act, a full autonomous driving simulator) via a modular, extensible, and physically consistent motion retargeting pipeline. From an initial pool of approximately 30,000 retargeted motions, we curate 4,730 traffic-compatible pedestrian motions, standardized in SMPL format for physically consistent trajectories. HABIT seamlessly integrates with CARLA's Leaderboard, enabling automated scenario generation and rigorous agent evaluation. Our safety metrics, including Abbreviated Injury Scale (AIS) and False Positive Braking Rate (FPBR), reveal critical failure modes in state-of-the-art AD agents missed by prior evaluations. Evaluating three state-of-the-art autonomous driving agents, InterFuser, TransFuser, and BEVDriver, demonstrates how HABIT exposes planner weaknesses that remain hidden in scripted simulations. Despite achieving close to or exactly zero collisions per kilometer on the CARLA Leaderboard, the autonomous agents perform notably worse on HABIT, with up to 7.43 collisions/km and a 12.94% AIS 3+ injury risk, and they brake unnecessarily in up to 33% of cases. All components are publicly released to support reproducible, pedestrian-aware AI research.
摘要：当前的自动驾驶（AD）模拟由于无法充分体现真实且多样化的人类行为而受到严重限制，而这对于确保安全性和可靠性至关重要。现有的基准通常简化行人交互，无法捕获对于稳健系统部署至关重要的复杂、动态意图和变化的响应。为了克服这个问题，我们引入了 HABIT（交互式流量的人类行为基准），这是一种高保真模拟基准。 HABIT 通过模块化、可扩展且物理一致的运动重定向管道，将来自动作捕捉和视频的现实世界人体运动集成到 CARLA（汽车学习行动，完全自动驾驶模拟器）中。从大约 30,000 个重定向运动的初始池中，我们策划了 4,730 个与交通兼容的行人运动，并以 SMPL 格式进行标准化，以实现物理上一致的轨迹。 HABIT 与 CARLA 排行榜无缝集成，实现自动场景生成和严格的代理评估。我们的安全指标，包括简明伤害量表 (AIS) 和误报制动率 (FPBR)，揭示了先前评估中遗漏的最先进 AD 代理的关键故障模式。通过评估三种最先进的自动驾驶代理 InterFuser、TransFuser 和 BEVDriver，展示了 HABIT 如何揭露隐藏在脚本模拟中的规划器弱点。尽管在 CARLA 排行榜上实现了接近或等于零碰撞，但自主代理在 HABIT 上的表现明显较差，碰撞次数高达 7.43 次/公里，AIS 3+ 受伤风险为 12.94%，并且在高达 33% 的情况下会不必要地制动。所有组件均公开发布，以支持可重复的、行人感知的人工智能研究。

Title: 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion

Authors: Minchong Chen, Xiaoyun Yuan, Junzhe Wan, Jianing Zhang, Jun Zhang
Subjects: cs.CV, physics.optics
Abstract URL: https://arxiv.org/abs/2511.19117
Pdf URL: https://arxiv.org/pdf/2511.19117
Copy Paste: [[2511.19117]] 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion(https://arxiv.org/abs/2511.19117)
Keywords: super-resolution, generative
Abstract: The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: this https URL.
摘要：移动平台热传感器的小型化本质上限制了其空间分辨率和纹理保真度，导致图像模糊且信息量较少。现有的热超分辨率（SR）方法可以分为单图像和RGB引导方法：前者难以从有限的信息中恢复精细结构，而后者依赖于准确且费力的跨相机校准，这阻碍了实际部署和鲁棒性。在这里，我们提出了 3M-TI，一种用于移动热成像的免校准多相机跨模态扩散框架。 3M-TI 的核心是将跨模态自注意力模块 (CSM) 集成到扩散 UNet 中，取代原来的自注意力层，在整个去噪过程中自适应地对齐热和 RGB 特征，而不需要显式的相机校准。这种设计使扩散网络能够利用其生成先验来增强超分辨率热图像中的空间分辨率、结构保真度和纹理细节。对现实世界移动热像仪和公共基准的广泛评估验证了我们的卓越性能，在视觉质量和定量指标方面均取得了最先进的结果。更重要的是，3M-TI 增强的热图像在目标检测和分割等关键下游任务中带来了巨大的收益，凸显了其对于强大的移动热感知系统的实用价值。更多材料：此https URL。

Title: FilmSceneDesigner: Chaining Set Design for Procedural Film Scene Generation

Authors: Zhifeng Xie, Keyi Zhang, Yiye Yan, Yuling Guo, Fan Yang, Jiting Zhou, Mengtian Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19137
Pdf URL: https://arxiv.org/pdf/2511.19137
Copy Paste: [[2511.19137]] FilmSceneDesigner: Chaining Set Design for Procedural Film Scene Generation(https://arxiv.org/abs/2511.19137)
Keywords: generation
Abstract: Film set design plays a pivotal role in cinematic storytelling and shaping the visual atmosphere. However, the traditional process depends on expert-driven manual modeling, which is labor-intensive and time-consuming. To address this issue, we introduce FilmSceneDesigner, an automated scene generation system that emulates professional film set design workflow. Given a natural language description, including scene type, historical period, and style, we design an agent-based chaining framework to generate structured parameters aligned with film set design workflow, guided by prompt strategies that ensure parameter accuracy and coherence. On the other hand, we propose a procedural generation pipeline which executes a series of dedicated functions with the structured parameters for floorplan and structure generation, material assignment, door and window placement, and object retrieval and layout, ultimately constructing a complete film scene from scratch. Moreover, to enhance cinematic realism and asset diversity, we construct SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations demonstrate that our system produces structurally sound scenes with strong cinematic fidelity, supporting downstream tasks such as virtual previs, construction drawing and mood board creation.
摘要：电影布景设计对于电影叙事和塑造视觉氛围起着至关重要的作用。然而，传统过程依赖于专家驱动的手动建模，既费力又耗时。为了解决这个问题，我们引入了 FilmSceneDesigner，这是一个模拟专业电影布景设计工作流程的自动场景生成系统。给定自然语言描述，包括场景类型、历史时期和风格，我们设计了一个基于代理的链接框架，以生成与电影布景设计工作流程一致的结构化参数，并以确保参数准确性和连贯性的提示策略为指导。另一方面，我们提出了一个程序生成管道，它使用结构化参数执行一系列专用函数，用于平面图和结构生成、材质分配、门窗放置以及对象检索和布局，最终从头开始构建完整的电影场景。此外，为了增强电影真实感和资产多样性，我们构建了 SetDepot-Pro，这是一个包含 6,862 个电影特定 3D 资产和 733 种材质的精选数据集。实验结果和人工评估表明，我们的系统可以生成结构合理的场景，具有很强的电影保真度，支持虚拟预览、施工图和情绪板创建等下游任务。

Title: ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation

Authors: Dongha Lee, Jinhee Park, Minjun Kim, Junseok Kwon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19145
Pdf URL: https://arxiv.org/pdf/2511.19145
Copy Paste: [[2511.19145]] ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation(https://arxiv.org/abs/2511.19145)
Keywords: generation
Abstract: We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter's activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA's effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.
摘要：我们提出了低秩适应的激活边界匹配（ABM-LoRA），这是一种有原则的初始化策略，可以大大加速低秩适配器的收敛。虽然 LoRA 提供高参数效率，但其随机初始化将梯度更新限制在不匹配的切线空间，导致严重的信息丢失并阻碍早期收敛。我们的 ABM-LoRA 通过在下游训练之前将适配器的激活边界与预训练模型的激活边界对齐来解决这个问题，从而最大化全参数梯度到适配器子空间的投影。这种对齐方式大大减少了初始化时的信息损失，产生较低的起始损失，并加速收敛。我们展示了 ABM-LoRA 在不同架构和任务中的有效性：语言理解（基于 GLUE 的 T5-Base）、对话生成（基于 WizardLM 的 LLaMA2-7B）和视觉识别（基于 VTAB-1K 的 ViT-B/16）。在 VTAB-1K 上，它实现了所有方法中最高的准确度，在需要几何理解的结构化推理任务上取得了巨大的进步。

Title: From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation

Authors: Moazzam Umer Gondal, Hamad Ul Qudous, Daniya Siddiqui, Asma Ahmad Farhan
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.19149
Pdf URL: https://arxiv.org/pdf/2511.19149
Copy Paste: [[2511.19149]] From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation(https://arxiv.org/abs/2511.19149)
Keywords: generation
Abstract: This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that struggle with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline for comparison. Experimental results show that the YOLO detector obtains a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
摘要：本文介绍了自动时尚标题和主题标签生成的检索增强框架，结合了多服装检测、属性推理和大型语言模型（LLM）提示。该系统旨在为时尚图像生成视觉基础、描述性且风格有趣的文本，克服端到端字幕人员在属性保真度和领域泛化方面存在问题的局限性。该管道结合了用于多服装定位的基于 YOLO 的检测器、用于主色提取的 k 均值聚类，以及用于基于结构化产品索引进行织物和性别属性推断的 CLIP-FAISS 检索模块。这些属性与检索到的风格示例一起创建了一个事实证据包，用于指导法学硕士生成类似人类的标题和上下文丰富的主题标签。使用微调的 BLIP 模型作为监督基线模型进行比较。实验结果表明，YOLO检测器能够获得九类服装的平均精度（mAP@0.5）为0.71。 RAG-LLM 管道生成富有表现力的属性对齐标题，并在主题标签生成中实现 0.80 的平均属性覆盖率，在 50% 阈值下实现完全覆盖，而 BLIP 提供更高的词汇重叠和更低的泛化。检索增强方法表现出更好的事实基础、更少的幻觉以及在各种服装领域可扩展部署的巨大潜力。这些结果证明了使用检索增强生成作为自动化和视觉基础的时尚内容生成的有效且可解释的范例。
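
Sketch (illustrative, not the authors' pipeline): the abstract mentions k-means clustering for dominant color extraction. The snippet below runs plain Lloyd's k-means over a handful of hypothetical RGB pixels and returns the centroids ordered by cluster size. The pixel values, the deterministic initialisation, and the function name `dominant_colors` are assumptions made for this example.

```python
def dominant_colors(pixels, k=2, n_iters=10):
    """Plain k-means over RGB tuples; returns centroids sorted by cluster size."""
    # Deterministic initialisation for the sketch: the first k distinct colors.
    # Assumes the input contains at least k distinct colors.
    centroids = list(dict.fromkeys(pixels))[:k]
    for _ in range(n_iters):
        # Assignment step: each pixel joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in pixels:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    order = sorted(range(k), key=lambda i: len(clusters[i]), reverse=True)
    return [centroids[i] for i in order]


# Hypothetical garment crop: mostly red pixels with a smaller blue region.
pixels = [(250, 10, 10)] * 6 + [(240, 20, 15)] * 2 + [(10, 10, 200)] * 3
palette = dominant_colors(pixels, k=2)
```

The first palette entry is the mean of the red-ish pixels and the second is the blue centroid; a real system would map these centroids to color names before passing them to the LLM prompt.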

Title: Masked Diffusion Models are Secretly Learned-Order Autoregressive Models

Authors: Prateek Garg, Bhavya Kohli, Sunita Sarawagi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19152
Pdf URL: https://arxiv.org/pdf/2511.19152
Copy Paste: [[2511.19152]] Masked Diffusion Models are Secretly Learned-Order Autoregressive Models(https://arxiv.org/abs/2511.19152)
Keywords: generative
Abstract: Masked Diffusion Models (MDMs) have emerged as one of the most promising paradigms for generative modeling over discrete domains. It is known that MDMs effectively train to decode tokens in a random order, and that this ordering has significant performance implications in practice. This observation raises a fundamental question: can we design a training framework that optimizes for a favorable decoding order? We answer this in the affirmative, showing that the continuous-time variational objective of MDMs, when equipped with multivariate noise schedules, can identify and optimize for a decoding order during training. We establish a direct correspondence between decoding order and the multivariate noise schedule and show that this setting breaks the invariance of the MDM objective to the noise schedule. Furthermore, we prove that the MDM objective decomposes precisely into weighted auto-regressive losses over these orders, which establishes MDMs as auto-regressive models with learnable orders.
摘要：掩蔽扩散模型 (MDM) 已成为离散域生成建模最有前途的范例之一。众所周知，MDM 可以有效地训练以随机顺序解码令牌，并且这种顺序在实践中具有显着的性能影响。这一观察提出了一个基本问题：我们能否设计一个针对有利解码顺序进行优化的训练框架？我们对此的回答是肯定的，表明 MDM 的连续时间变分目标在配备多元噪声计划时可以在训练期间识别和优化解码顺序。我们在解码顺序和多元噪声表之间建立了直接对应关系，并表明这种设置打破了 MDM 目标对噪声表的不变性。此外，我们证明 MDM 目标精确地分解为这些阶数的加权自回归损失，这将它们建立为具有可学习阶数的自回归模型。

Title: Test-Time Preference Optimization for Image Restoration

Authors: Bingchen Li, Xin Li, Jiaqi Xu, Jiaming Guo, Wenbo Li, Renjing Pei, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19169
Pdf URL: https://arxiv.org/pdf/2511.19169
Copy Paste: [[2511.19169]] Test-Time Preference Optimization for Image Restoration(https://arxiv.org/abs/2511.19169)
Keywords: restoration
Abstract: Image restoration (IR) models are typically trained to recover high-quality images using L1 or LPIPS loss. To handle diverse unknown degradations, zero-shot IR methods have also been introduced. However, existing pre-trained and zero-shot IR approaches often fail to align with human preferences, resulting in restored images that may not be favored. This highlights the critical need to enhance restoration quality and adapt flexibly to various image restoration tasks or backbones without requiring model retraining and ideally without labor-intensive preference data collection. In this paper, we propose the first Test-Time Preference Optimization (TTPO) paradigm for image restoration, which enhances perceptual quality, generates preference data on-the-fly, and is compatible with any IR model backbone. Specifically, we design a training-free, three-stage pipeline: (i) generate candidate preference images online using diffusion inversion and denoising based on the initially restored image; (ii) select preferred and dispreferred images using automated preference-aligned metrics or human feedback; and (iii) use the selected preference images as reward signals to guide the diffusion denoising process, optimizing the restored image to better align with human preferences. Extensive experiments across various image restoration tasks and models demonstrate the effectiveness and flexibility of the proposed pipeline.
摘要：图像恢复 (IR) 模型通常经过训练，可以使用 L1 或 LPIPS 损失来恢复高质量图像。为了处理各种未知的退化，还引入了零样本红外方法。然而，现有的预训练和零样本红外方法通常无法符合人类的偏好，导致恢复的图像可能不受欢迎。这凸显了提高恢复质量并灵活适应各种图像恢复任务或主干的迫切需要，而无需模型重新训练，并且理想情况下无需劳动密集型偏好数据收集。在本文中，我们提出了第一个用于图像恢复的测试时偏好优化（TTPO）范例，它增强了感知质量，动态生成偏好数据，并且与任何 IR 模型主干兼容。具体来说，我们设计了一个免训练的三阶段流程：（i）基于初始恢复的图像，使用扩散反演和去噪在线生成候选偏好图像； (ii) 使用自动偏好对齐指标或人工反馈来选择首选和不首选图像；（iii）使用所选的偏好图像作为奖励信号来指导扩散去噪过程，优化恢复的图像以更好地符合人类偏好。跨各种图像恢复任务和模型的广泛实验证明了所提出的流程的有效性和灵活性。

Title: Three-Dimensional Anatomical Data Generation Based on Artificial Neural Networks

Authors: Ann-Sophia Müller, Moonkwang Jeong, Meng Zhang, Jiyuan Tian, Arkadiusz Miernik, Stefanie Speidel, Tian Qiu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.19198
Pdf URL: https://arxiv.org/pdf/2511.19198
Copy Paste: [[2511.19198]] Three-Dimensional Anatomical Data Generation Based on Artificial Neural Networks(https://arxiv.org/abs/2511.19198)
Keywords: generation, generative
Abstract: Surgical planning and training based on machine learning requires a large number of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if even possible, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.
摘要：基于机器学习的手术规划和训练需要大量根据医学影像重建的3D解剖模型，这是目前的主要瓶颈之一。由于法律、道德和技术方面的挑战，从真实患者和手术过程中获取这些数据即使可能，也是非常困难的。对于成像对比度较差的软组织器官（例如前列腺）来说尤其困难。为了克服这些挑战，我们提出了一种新颖的工作流程，使用从物理器官模型获得的数据自动生成 3D 解剖数据。我们还使用 3D 生成对抗网络 (GAN) 来获取多种 3D 模型，这些模型可用于依赖 3D 数据的其他下游机器学习任务。我们使用仿生水凝胶制成的人工前列腺模型演示了我们的工作流程，在多个区域具有成像对比度。这用于物理模拟内窥镜手术。为了进行评估和生成 3D 数据，我们将其放入定制的超声波扫描仪中，记录手术前后的前列腺情况。训练神经网络对记录的超声图像进行分割，该网络在交并集 (IoU) 方面优于传统的、非基于学习的计算机视觉技术。基于分割，重建 3D 网格模型，并提供性能反馈。
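
Sketch (illustrative): the abstract compares segmentation methods by intersection over union (IoU). The snippet below computes IoU for two small binary masks; the mask values are hypothetical and stand in for a predicted and a reference segmentation of one ultrasound slice.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two same-shaped binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks are treated as a perfect match by convention here.
    return inter / union if union else 1.0


# Hypothetical 3x3 masks: 3 pixels overlap, 5 pixels are covered by either mask.
predicted = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
reference = [[0, 1, 1],
             [0, 1, 0],
             [0, 1, 0]]

score = iou(predicted, reference)
```

For these masks the score is 3 / 5 = 0.6; identical masks give 1.0 and disjoint masks give 0.0.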

Title: ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Authors: Wanjiang Weng, Xiaofeng Tan, Junbo Wang, Guo-Sen Xie, Pan Zhou, Hongsong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19217
Pdf URL: https://arxiv.org/pdf/2511.19217
Copy Paste: [[2511.19217]] ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment(https://arxiv.org/abs/2511.19217)
Keywords: generation
Abstract: Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diverse and realistic motion. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model to assess alignment quality during the denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments on both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
摘要：文本到动作生成可根据文本输入合成 3D 人体动作，在游戏、电影和机器人领域具有巨大的应用潜力。最近，基于扩散的方法已被证明可以生成更多多样性和真实的运动。然而，扩散模型中的文本和运动分布之间存在不对齐，这导致语义不一致或低质量的运动。为了解决这个限制，我们提出了奖励引导采样对齐（ReAlign），包括用于评估去噪采样期间对齐质量的逐步感知奖励模型和将扩散过程引导至最佳对齐分布的奖励引导策略。该奖励模型集成了步骤感知令牌，并结合了用于语义一致性的文本对齐模块和用于真实感的运动对齐模块，在每个时间步细化噪声运动以平衡概率密度和对齐。运动生成和检索任务的大量实验表明，与现有的最先进方法相比，我们的方法显着提高了文本运动对齐和运动质量。

Title: Learning Plug-and-play Memory for Guiding Video Diffusion Models

Authors: Selena Song, Ziming Xu, Zijun Zhang, Kun Zhou, Jiaxian Guo, Lianhui Qin, Biwei Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19229
Pdf URL: https://arxiv.org/pdf/2511.19229
Copy Paste: [[2511.19229]] Learning Plug-and-play Memory for Guiding Video Diffusion Models(https://arxiv.org/abs/2511.19229)
Keywords: generation
Abstract: Diffusion Transformer (DiT)-based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder, DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen and only optimize the memory encoder. This yields an efficient training process with few trainable parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: this https URL.
摘要：基于扩散变压器（DiT）的视频生成模型最近取得了令人印象深刻的视觉质量和时间连贯性，但它们仍然经常违反基本物理定律和常识动态，揭示了明确的世界知识的缺乏。在这项工作中，我们探索如何为它们配备即插即用的存储器，以注入有用的世界知识。受基于 Transformer 的 LLM 中上下文记忆的启发，我们进行了实证研究，表明 DiT 可以通过对其隐藏状态的干预来引导，并且嵌入空间中的简单低通和高通滤波器可以自然地解开低级外观和高级物理/语义线索，从而实现有针对性的指导。基于这些观察，我们提出了一种可学习的记忆编码器 DiT-Mem，由堆叠的 3D CNN、低/高通滤波器和自注意力层组成。编码器将参考视频映射到一组紧凑的内存标记中，这些标记作为 DiT 自注意力层中的内存连接起来。在训练过程中，我们保持扩散主干冻结，并且只优化内存编码器。它在少量训练参数（150M）和 10K 数据样本上产生了相当高效的训练过程，并在推理时实现了即插即用。对最先进模型的大量实验证明了我们的方法在提高物理规则遵循和视频保真度方面的有效性。我们的代码和数据在这里公开发布：这个 https URL。

Title: MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

Authors: Boyuan Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19253
Pdf URL: https://arxiv.org/pdf/2511.19253
Copy Paste: [[2511.19253]] MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization(https://arxiv.org/abs/2511.19253)
Keywords: generative
Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.
摘要：协作多智能体强化学习（MARL）面临两个主要设计瓶颈：精心设计密集的奖励函数和构建避免在高维、非平稳环境中出现局部最优的课程。现有方法依赖于固定启发式或直接在控制环中使用大型语言模型（LLM），这成本高昂且不适合实时系统。我们提出了 MAESTRO（通过任务和奖励优化来塑造多代理环境），这是一个将 LLM 移出执行循环并将其用作离线培训架构师的框架。 MAESTRO 引入了两个生成组件：(i) 语义课程生成器，可创建多样化的、性能驱动的流量场景；(ii) 自动奖励合成器，可生成适应不断变化的课程难度的可执行 Python 奖励函数。这些组件指导标准 MARL 主干 (MADDPG)，而不会增加部署时的推理成本。我们对MAESTRO在大规模交通信号控制（杭州，16个十字路口）上进行评估，并进行受控消融。结果表明，将法学硕士生成的课程与法学硕士生成的奖励塑造相结合，可以提高绩效和稳定性。在四个种子中，整个系统比强大的课程基线实现了 4.0% 高的平均回报（163.26 比 156.93）和 2.2% 的风险调整表现（夏普 1.53 比 0.70）。这些发现凸显了法学硕士作为 MARL 合作培训的有效高级设计师。
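
The reported MAESTRO numbers can be sanity-checked with a little arithmetic. The sketch below only reuses the four values quoted in the abstract; the helper name `relative_gain` is introduced here for illustration. The mean returns give a gain of about 4.0%, while the Sharpe values 1.53 and 0.70 differ by a factor of about 2.2 (a relative gain of roughly 119%).

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Values quoted in the abstract: full system vs. curriculum baseline.
mean_return_gain = relative_gain(163.26, 156.93)
sharpe_gain = relative_gain(1.53, 0.70)
sharpe_factor = 1.53 / 0.70

print(f"mean return gain: {mean_return_gain:.1%}")
print(f"Sharpe gain:      {sharpe_gain:.1%}")
print(f"Sharpe factor:    {sharpe_factor:.2f}x")
```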

Title: Solar-GECO: Perovskite Solar Cell Property Prediction with Geometric-Aware Co-Attention

Authors: Lucas Li, Jean-Baptiste Puel, Florence Carton, Dounya Barrit, Jhony H. Giraldo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19263
Pdf URL: https://arxiv.org/pdf/2511.19263
Copy Paste: [[2511.19263]] Solar-GECO: Perovskite Solar Cell Property Prediction with Geometric-Aware Co-Attention(https://arxiv.org/abs/2511.19263)
Keywords: generation
Abstract: Perovskite solar cells are promising candidates for next-generation photovoltaics. However, their performance as multi-scale devices is determined by complex interactions between their constituent layers. This creates a vast combinatorial space of possible materials and device architectures, making the conventional experiment-based screening process slow and expensive. Machine learning models try to address this problem, but they either focus only on individual material properties or neglect the important geometric information of the perovskite crystal. To close this gap, we propose to predict perovskite solar cell power conversion efficiency (PCE) with a geometric-aware co-attention (Solar-GECO) model. Solar-GECO combines a geometric graph neural network (GNN), which directly encodes the atomic structure of the perovskite absorber, with language model embeddings that process the textual strings representing the chemical compounds of the transport layers and other device components. Solar-GECO also integrates a co-attention module to capture intra-layer dependencies and inter-layer interactions, while a probabilistic regression head predicts both PCE and its associated uncertainty. Solar-GECO achieves state-of-the-art performance, outperforming several baselines and reducing the mean absolute error (MAE) for PCE prediction from 3.066 to 2.936 compared to semantic GNN (the previous state-of-the-art model). Solar-GECO demonstrates that integrating geometric and textual information provides a more powerful and accurate framework for PCE prediction.
摘要：钙钛矿太阳能电池是下一代光伏发电的有希望的候选者。然而，它们作为多尺度设备的性能是由其组成层之间复杂的相互作用决定的。这创造了可能的材料和器件架构的巨大组合空间，使得传统的基于实验的筛选过程缓慢且昂贵。机器学习模型试图解决这个问题，但它们只关注单个材料特性或忽略钙钛矿晶体的重要几何信息。为了解决这个问题，我们建议使用几何感知共同注意（Solar-GECO）模型来预测钙钛矿太阳能电池的功率转换效率。 Solar-GECO 将几何图神经网络 (GNN)（直接编码钙钛矿吸收体的原子结构）与语言模型嵌入相结合，该语言模型嵌入处理代表传输层和其他设备组件的化合物的文本字符串。 Solar-GECO 还集成了一个共同注意力模块来捕获层内依赖性和层间交互，而概率回归头则可以预测功率转换效率 (PCE) 及其相关的不确定性。 Solar-GECO 实现了最先进的性能，显着优于多个基线，与语义 GNN（之前最先进的模型）相比，PCE 预测的平均绝对误差 (MAE) 从 3.066 降低到 2.936。 Solar-GECO 证明，整合几何和文本信息为 PCE 预测提供了更强大、更准确的框架。
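
The abstract says the regression head predicts both PCE and its uncertainty but does not state the loss. One common way to train such a head is a Gaussian negative log-likelihood; the sketch below assumes that choice, and the function name and the PCE values are hypothetical, not taken from the paper.

```python
import math


def gaussian_nll(y_true: float, mean: float, log_var: float) -> float:
    """Negative log-likelihood of y_true under N(mean, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (y_true - mean) ** 2 / var)


# Hypothetical PCE values in percent, for illustration only.
confident_and_right = gaussian_nll(y_true=18.0, mean=18.2, log_var=math.log(0.25))
confident_and_wrong = gaussian_nll(y_true=18.0, mean=21.0, log_var=math.log(0.25))
unsure_and_wrong = gaussian_nll(y_true=18.0, mean=21.0, log_var=math.log(4.0))

print(confident_and_right, unsure_and_wrong, confident_and_wrong)
```

Under this loss, a wrong prediction is penalized less when the head also reports a large variance, which is what makes the predicted uncertainty informative.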

Title: Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry

Authors: Amirtha Varshini A S, Duminda S. Ranasinghe, Hok Hei Tam
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2511.19264
Pdf URL: https://arxiv.org/pdf/2511.19264
Copy Paste: [[2511.19264]] Interpreting GFlowNets for Drug Discovery: Extracting Actionable Insights for Medicinal Chemistry(https://arxiv.org/abs/2511.19264)
Keywords: generative
Abstract: Generative Flow Networks, or GFlowNets, offer a promising framework for molecular design, but their internal decision policies remain opaque. This limits adoption in drug discovery, where chemists require clear and interpretable rationales for proposed structures. We present an interpretability framework for SynFlowNet, a GFlowNet trained on documented chemical reactions and purchasable starting materials that generates both molecules and the synthetic routes that produce them. Our approach integrates three complementary components. Gradient based saliency combined with counterfactual perturbations identifies which atomic environments influence reward and how structural edits change molecular outcomes. Sparse autoencoders reveal axis aligned latent factors that correspond to physicochemical properties such as polarity, lipophilicity, and molecular size. Motif probes show that functional groups including aromatic rings and halogens are explicitly encoded and linearly decodable from the internal embeddings. Together, these results expose the chemical logic inside SynFlowNet and provide actionable and mechanistic insight that supports transparent and controllable molecular design.
摘要：生成流网络（GFlowNets）为分子设计提供了一个有前途的框架，但其内部决策策略仍然不透明。这限制了药物发现的采用，化学家需要对提议的结构有清晰且可解释的基本原理。我们提出了 SynFlowNet 的可解释性框架，这是一个经过记录的化学反应和可购买的起始材料进行训练的 GFlowNet，这些起始材料生成分子以及产生分子的合成路线。我们的方法集成了三个互补的组件。基于梯度的显着性与反事实扰动相结合，确定了哪些原子环境影响奖励以及结构编辑如何改变分子结果。稀疏自动编码器揭示了与极性、亲脂性和分子大小等物理化学特性相对应的轴对齐潜在因子。基序探针表明，包括芳环和卤素在内的官能团是明确编码的，并且可从内部嵌入线性解码。总之，这些结果揭示了 SynFlowNet 内部的化学逻辑，并提供可操作的机制见解，支持透明且可控的分子设计。

Title: BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment

Authors: Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19268
Pdf URL: https://arxiv.org/pdf/2511.19268
Copy Paste: [[2511.19268]] BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment(https://arxiv.org/abs/2511.19268)
Keywords: generation, generative
Abstract: Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors, but current methods face challenges in handling conflicts between sources. These include 1) input-level conflicts, where the conditioning image contradicts the text prompt, and 2) model-bias conflicts, where generative biases disrupt alignment even when conditions match the text. Addressing these conflicts requires nuanced solutions, which standard supervised fine-tuning struggles to provide. Preference-based optimization techniques like Direct Preference Optimization (DPO) show promise but are limited by gradient entanglement between text and condition signals and lack disentangled training data for multi-constraint tasks. To overcome this, we propose a bidirectionally decoupled DPO framework (BideDPO). Our method creates two disentangled preference pairs, one for the condition and one for the text, to reduce gradient entanglement. The influence of pairs is managed using an Adaptive Loss Balancing strategy for balanced optimization. We introduce an automated data pipeline to sample model outputs and generate conflict-aware data. This process is embedded in an iterative optimization strategy that refines both the model and the data. We construct a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments show BideDPO significantly improves text success rates (e.g., +35%) and condition adherence. We also validate our approach using the COCO dataset. Project Pages: this https URL.
摘要：条件图像生成通过结构、空间或风格先验增强了文本到图像的合成，但当前的方法在处理源之间的冲突方面面临挑战。其中包括 1) 输入级冲突，即调节图像与文本提示相矛盾；2) 模型偏差冲突，即即使条件与文本匹配，生成偏差也会破坏对齐。解决这些冲突需要细致入微的解决方案，而标准监督微调很难提供这些解决方案。基于偏好的优化技术，如直接偏好优化 (DPO)，显示出了良好的前景，但受到文本和条件信号之间的梯度纠缠的限制，并且缺乏用于多约束任务的解纠缠训练数据。为了克服这个问题，我们提出了一个双向解耦的 DPO 框架（BideDPO）。我们的方法创建两个解开的偏好对（一对用于条件，一对用于文本）以减少梯度纠缠。使用自适应损失平衡策略来管理对的影响以实现平衡优化。我们引入了自动化数据管道来对模型输出进行采样并生成冲突感知数据。此过程嵌入到迭代优化策略中，可细化模型和数据。我们构建了 DualAlign 基准来评估文本和条件之间的冲突解决。实验表明，BideDPO 显着提高了文本成功率（例如 +35%）和条件遵守率。我们还使用 COCO 数据集验证了我们的方法。项目页面：此 https URL。

Title: CDLM: Consistency Diffusion Language Models For Faster Sampling

Authors: Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2511.19269
Pdf URL: https://arxiv.org/pdf/2511.19269
Copy Paste: [[2511.19269]] CDLM: Consistency Diffusion Language Models For Faster Sampling(https://arxiv.org/abs/2511.19269)
Keywords: generation
Abstract: Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at this https URL.
摘要：扩散语言模型 (DLM) 提供了一种很有前途的并行生成范例，但由于大量的细化步骤和无法使用标准 KV 缓存，推理速度很慢。我们引入了 CDLM（一致性扩散语言模型），这是一种基于训练的加速方法，可以同时解决这两个瓶颈。 CDLM 集成了一致性建模，通过启用多令牌最终确定来大幅减少所需采样步骤的数量。此外，我们在微调期间强制执行块级因果注意掩模，使模型与 KV 缓存完全兼容。实验表明，CDLM 的延迟降低了 3.6 倍至 14.5 倍，同时在数学和编码任务上保持有竞争力的准确性。完整的培训和评估代码可从此 https URL 获取。
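
A block-wise causal attention mask can be sketched as follows. This is one plausible reading of the abstract (full attention inside a block, causal attention across blocks), not CDLM's confirmed implementation; `block_causal_mask` and its parameters are names chosen here for illustration.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Positions attend to every position in their own block and in all
    earlier blocks, never to later blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Because no position ever attends to a later block, the keys and values of finished blocks do not change and can be cached, which is the property that makes KV caching applicable.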

Title: Tiny-TSM: Efficiently Training a Lightweight SOTA Time Series Foundation Model

Authors: Felix Birkel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19272
Pdf URL: https://arxiv.org/pdf/2511.19272
Copy Paste: [[2511.19272]] Tiny-TSM: Efficiently Training a Lightweight SOTA Time Series Foundation Model(https://arxiv.org/abs/2511.19272)
Keywords: generation
Abstract: We present Tiny-TSM, a time series foundation model characterized by small scale, economical training, and state-of-the-art performance. It comprises 23M total parameters, trained on a single A100 GPU in less than a week using a new synthetic data generation and data augmentation pipeline (SynthTS). Without any neural architecture search, hyperparameter tuning, or scaling up model size, Tiny-TSM achieves state-of-the-art performance on a wide range of time series benchmark datasets, often outperforming much larger models and even matching the performance of much larger, industrial-scale, likely highly tuned foundation models. Specifically, Tiny-TSM outperforms all other time series foundation models we evaluated on medium- and long-term forecasting tasks under MSE loss, while short-term accuracy is still competitive with state-of-the-art models. We also introduce a causal input normalization scheme that enables time series models to be trained with dense next-token prediction loss, significantly accelerating convergence speed and reducing training time. All experiments were conducted on a single A100 GPU, illustrating the practicality of the proposed approach in a resource-constrained setting.
摘要：我们提出了 Tiny-TSM，一种时间序列基础模型，其特点是规模小、训练经济、性能先进。它包含 2300 万个总参数，使用新的合成数据生成和数据增强管道 (SynthTS) 在不到一周的时间内在单个 A100 GPU 上进行训练。无需任何神经架构搜索、超参数调整或扩大模型大小，Tiny-TSM 在各种时间序列基准数据集上实现了最先进的性能，通常优于更大的模型，甚至与更大的工业规模、可能高度调整的基础模型的性能相匹配。具体来说，Tiny-TSM 优于我们在 MSE 损失下的中长期预测任务上评估的所有其他时间序列基础模型，而短期准确性仍然与最先进的模型具有竞争力。我们还引入了一种因果输入归一化方案，该方案使时间序列模型能够通过密集的下一个标记预测损失进行训练，从而显着加快收敛速度并减少训练时间。所有实验均在单个 A100 GPU 上进行，说明了所提出方法在资源受限环境中的实用性。

Title: ReMatch: Boosting Representation through Matching for Multimodal Retrieval

Authors: Qianying Liu, Xiao Liang, Zhiqiang Zhang, Yibo Chen, Xu Tang, Zhongfei Qing, Fengfan Zhou, Yao Hu, Paul Henderson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19278
Pdf URL: https://arxiv.org/pdf/2511.19278
Copy Paste: [[2511.19278]] ReMatch: Boosting Representation through Matching for Multimodal Retrieval(https://arxiv.org/abs/2511.19278)
Keywords: generative
Abstract: We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline, we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.
摘要：我们提出了 ReMatch，一个利用 MLLM 的生成能力进行多模态检索的框架。以前的方法将 MLLM 视为简单的编码器，忽略了它的生成性质，并且没有充分利用它的组合推理和世界知识。相反，我们使用聊天式的生成匹配阶段来端到端地训练嵌入 MLLM。匹配阶段使用相同的 MLLM 从多视图输入中自动回归确定相关性，包括原始数据及其自己针对每个查询和文档的投影嵌入。它提供了实例区分监督，补充了标准对比损失，在硬负片上提供了更强的梯度，并保留了原始 MLLM 的组成优势。为了获得语义上更丰富的多模态嵌入，我们使用多个可学习的标记来增强每个输入，以较低的推理成本生成细粒度的上下文、相互正交的嵌入。利用我们既定的高性能基准，我们将上述想法组合成强大的训练方案，并在大规模多模态嵌入基准（MMEB）上实现了新的最先进水平。我们的实验在五个数据集上显示了特别强大的零样本泛化结果，突出了 ReMatch 的鲁棒性和可转移性。

Title: Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning

Authors: James R. M. Black, Moritz S. Hanke, Aaron Maiwald, Tina Hernandez-Boussard, Oliver M. Crook, Jaspreet Pannu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19299
Pdf URL: https://arxiv.org/pdf/2511.19299
Copy Paste: [[2511.19299]] Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning(https://arxiv.org/abs/2511.19299)
Keywords: generation, generative
Abstract: Novel deep learning architectures are increasingly being applied to biological data, including genetic sequences. These models, referred to as genomic language models (gLMs), have demonstrated impressive predictive and generative capabilities, raising concerns that such models may also enable misuse, for instance via the generation of genomes for human-infecting viruses. These concerns have catalyzed calls for risk mitigation measures. The de facto mitigation of choice is filtering of pretraining data (i.e., removing viral genomic sequences from training datasets) in order to limit gLM performance on virus-related tasks. However, it is not currently known how robust this approach is for securing open-source models that can be fine-tuned using sensitive pathogen data. Here, we evaluate a state-of-the-art gLM, Evo 2, and perform fine-tuning using sequences from 110 harmful human-infecting viruses to assess the rescue of misuse-relevant predictive capabilities. The fine-tuned model exhibited reduced perplexity on unseen viral sequences relative to 1) the pretrained model and 2) a version fine-tuned on bacteriophage sequences. The model fine-tuned on human-infecting viruses also identified immune escape variants from SARS-CoV-2 (achieving an AUROC of 0.6), despite having no exposure to SARS-CoV-2 sequences during fine-tuning. This work demonstrates that data exclusion might be circumvented by fine-tuning approaches that can, to some degree, rescue misuse-relevant capabilities of gLMs. We highlight the need for safety frameworks for gLMs and outline further work needed on evaluations and mitigation measures to enable the safe deployment of gLMs.
摘要：新颖的深度学习架构越来越多地应用于生物数据，包括基因序列。这些模型被称为基因组语言模型（gLM），已经展示了令人印象深刻的预测和生成能力，引发了人们的担忧，即这些模型也可能导致滥用，例如通过生成人类感染病毒的基因组。这些担忧促使人们呼吁采取风险缓解措施。事实上的缓解措施是过滤预训练数据（即从训练数据集中删除病毒基因组序列），以限制 gLM 在病毒相关任务上的性能。然而，目前尚不清楚这种方法对于保护可以使用敏感病原体数据进行微调的开源模型有多稳健。在这里，我们评估了最先进的 gLM Evo 2，并使用 110 种有害的人类感染病毒的序列进行微调，以评估对误用相关预测能力的挽救。相对于 1）预训练模型和 2）针对噬菌体序列进行微调的版本，微调模型表现出对看不见的病毒序列的困惑度降低。尽管在微调过程中没有暴露于 SARS-CoV-2 序列，但针对人类感染病毒进行微调的模型还识别出了 SARS-CoV-2 的免疫逃逸变体（AUROC 为 0.6）。这项工作表明，可以通过微调方法来规避数据排除，这些方法可以在某种程度上挽救 gLM 的误用相关功能。我们强调对 gLM 安全框架的需求，并概述了评估和缓解措施所需的进一步工作，以实现 gLM 的安全部署。
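
To put an AUROC of 0.6 in context: AUROC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, so 0.5 is chance level and 1.0 is perfect ranking. The sketch below computes it by direct pairwise comparison; the scores and labels are made up for illustration and are not data from the paper.

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 6 of the 9 positive/negative pairs are ranked correctly.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.5, 0.2, 0.7]
print(round(auroc(scores, labels), 3))  # 0.667
```

The pairwise loop is quadratic in the number of examples; it is written this way for clarity, not for use on large evaluation sets.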

Title: SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis

Authors: Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An, Yebin Liu, Qingyao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19319
Pdf URL: https://arxiv.org/pdf/2511.19319
Copy Paste: [[2511.19319]] SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis(https://arxiv.org/abs/2511.19319)
Keywords: generation
Abstract: Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.
摘要：手-物体交互 (HOI) 生成在推进动画和机器人技术的应用方面发挥着至关重要的作用。当前基于视频的方法主要是单视图，这阻碍了全面的 3D 几何感知，并且经常导致几何扭曲或不切实际的运动模式。虽然 3D HOI 方法可以生成动态的合理运动，但它们对受控实验室环境中捕获的高质量 3D 数据的依赖严重限制了它们对现实世界场景的推广。为了克服这些限制，我们引入了 SyncMV4D，这是第一个通过统一视觉先验、运动动力学和多视图几何来联合生成同步多视图 HOI 视频和 4D 运动的模型。我们的框架具有两项核心创新：(1) 多视图联合扩散 (MJD) 模型，可共同生成 HOI 视频和中间运动；(2) 扩散点对齐器 (DPA)，可将粗略的中间运动细化为全局对齐的 4D 度量点轨迹。为了将 2D 外观与 4D 动态紧密结合，我们建立了一个闭环、相互增强的循环。在扩散去噪过程中，生成的视频调节 4D 运动的细化，同时重新投影对齐的 4D 点轨迹以指导下一步的联合生成。通过实验，我们的方法在视觉真实感、运动合理性和多视图一致性方面表现出了优于最先进替代方法的性能。

Title: Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning

Authors: Qihan Huang, Haofei Zhang, Rong Wei, Yi Wang, Rui Tang, Mingli Song, Jie Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19343
Pdf URL: https://arxiv.org/pdf/2511.19343
Copy Paste: [[2511.19343]] Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning(https://arxiv.org/abs/2511.19343)
Keywords: generation
Abstract: Reinforcement learning (RL) methods (e.g., GRPO) for improving the perception ability of multimodal LLMs (MLLMs) have attracted wide research interest owing to their remarkable generalization ability. Nevertheless, existing reinforcement learning methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. Therefore, to tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) a data server and (2) a GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experimental results across three visual perception tasks demonstrate that Syn-GRPO improves the data quality by a large margin, achieving significantly better performance than existing MLLM perception methods, and that Syn-GRPO presents promising potential for scaling long-term self-evolving RL. Our code is available at this https URL.
摘要：用于MLLM（多模态LLM）感知能力的RL（强化学习）方法（例如GRPO）由于其卓越的泛化能力而引起了广泛的研究兴趣。然而，现有的强化学习方法仍然面临数据质量低的问题，数据样本无法引起 MLLM 的多样化响应，从而限制了 MLLM 强化学习的探索范围。一些方法试图通过对熵施加约束来缓解这个问题，但没有一个方法从根本上解决这个问题。因此，为了解决这个问题，本文提出了Syn-GRPO（Synthesis-GRPO），它采用在线数据生成器在GRPO训练中合成具有不同响应的高质量训练数据。具体来说，Syn-GRPO由两个部分组成：（1）数据服务器； (2)GRPO工作流程。数据服务器使用图像生成模型从现有样本中合成新样本，采用解耦和异步方案以实现高生成效率。 GRPO 工作流程为数据服务器提供新的图像描述，并利用多样性奖励来监督 MLLM 预测图像描述，以合成具有不同响应的样本。三个视觉感知任务的实验结果表明，Syn-GRPO 大幅提高了数据质量，实现了比现有 MLLM 感知方法显着优越的性能，并且 Syn-GRPO 为扩展长期自进化 RL 提供了广阔的前景。我们的代码可以在这个 https URL 上找到。

Title: Leveraging LLMs for reward function design in reinforcement learning control tasks

Authors: Franklin Cardenoso, Wouter Caarls
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2511.19355
Pdf URL: https://arxiv.org/pdf/2511.19355
Copy Paste: [[2511.19355]] Leveraging LLMs for reward function design in reinforcement learning control tasks(https://arxiv.org/abs/2511.19355)
Keywords: generation
Abstract: The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt's main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better than that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.
摘要：在强化学习（RL）中设计有效奖励函数的挑战是一个重大瓶颈，通常需要广泛的人类专业知识并且非常耗时。大型语言模型 (LLM) 的先前工作和最新进展已经证明了它们自动生成奖励函数的潜力。然而，现有的方法通常需要初步的评估指标、针对细化过程的人为设计反馈或使用环境源代码作为上下文。为了解决这些限制，本文引入了 LEARN-Opt（基于 LLM 的奖励函数优化评估器和分析器）。这种基于法学硕士的、完全自主的、与模型无关的框架消除了对初步指标和环境源代码作为上下文的需要，从系统和任务目标的文本描述中生成、执行和评估候选奖励函数。 LEARN-Opt 的主要贡献在于它能够直接从系统描述和任务目标自主导出绩效指标，从而实现奖励函数的无监督评估和选择。我们的实验表明，LEARN-Opt 的性能与最先进的方法（例如 EUREKA）相当或更好，同时需要较少的先验知识。我们发现自动奖励设计是一个高方差问题，平均情况下的候选者会失败，需要采用多次运行的方法来找到最佳候选者。最后，我们表明 LEARN-Opt 可以释放低成本法学硕士的潜力，找到与大型模型相当甚至更好的高性能候选人。这一表现证明了其在不需要任何初步的人类定义指标的情况下生成高质量奖励函数的潜力，从而减少了工程开销并增强了通用性。

Title: Growing with the Generator: Self-paced GRPO for Video Generation

Authors: Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19356
Pdf URL: https://arxiv.org/pdf/2511.19356
Copy Paste: [[2511.19356]] Growing with the Generator: Self-paced GRPO for Video Generation(https://arxiv.org/abs/2511.19356)
Keywords: generation
Abstract: Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
摘要：组相对策略优化（GRPO）已成为训练后视频生成模型的强大强化学习范例。然而，现有的 GRPO 管道依赖于静态、固定容量的奖励模型，其评估行为在训练期间被冻结。这种严格的奖励会引入分配偏差，随着生成器的改进而迅速饱和，并最终限制基于强化的对齐的稳定性和有效性。我们提出了 Self-Paced GRPO，这是一种能力感知的 GRPO 框架，其中奖励反馈与生成器共同进化。我们的方法引入了一种渐进式奖励机制，随着生成质量的提高，该机制会自动将其重点从粗略的视觉保真度转移到时间连贯性和细粒度的文本视频语义对齐。这种自定进度的课程减轻了奖励政策的不匹配，减少了奖励剥削，并产生了更稳定的优化。跨多个视频生成主干的 VBench 实验表明，在具有静态奖励的 GRPO 基线上，视觉质量和语义对齐均得到了持续改进，验证了自定进度 GRPO 的有效性和通用性。
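
The two ideas in this abstract can be sketched in a few lines. `group_relative_advantages` follows the usual GRPO recipe of normalizing each sample's reward against its own group. `blended_reward` is a purely hypothetical linear schedule that shifts weight from visual fidelity toward temporal coherence and text-video alignment; the abstract does not describe the paper's actual mechanism, so the weights and the `progress` signal are assumptions made here.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def blended_reward(fidelity: float, coherence: float, alignment: float,
                   progress: float) -> float:
    """Hypothetical schedule: progress=0 rewards fidelity only, progress=1
    rewards coherence and alignment equally. Not the paper's formula."""
    p = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return (1.0 - p) * fidelity + 0.5 * p * (coherence + alignment)


# Three videos sampled for the same prompt, scored early in training.
group = [blended_reward(f, c, a, progress=0.2)
         for f, c, a in [(0.9, 0.3, 0.4), (0.6, 0.6, 0.5), (0.4, 0.8, 0.9)]]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])
```

Samples above the group mean receive positive advantages and those below receive negative ones, so only the relative ordering within a group drives the policy update.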

Title: DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Authors: Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19365
Pdf URL: https://arxiv.org/pdf/2511.19365
Copy Paste: [[2511.19365]] DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation(https://arxiv.org/abs/2511.19365)
Keywords: generation
Abstract: Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at this https URL.
摘要：像素扩散旨在以端到端的方式直接在像素空间中生成图像。这种方法避免了 VAE 在两阶段潜在扩散中的局限性，提供了更高的模型容量。现有的像素扩散模型训练和推理速度缓慢，因为它们通常在单个扩散变压器 (DiT) 内对高频信号和低频语义进行建模。为了追求更有效的像素扩散范式，我们提出了频率解耦像素扩散框架。凭借将高频和低频分量的生成解耦的直觉，我们利用轻量级像素解码器来生成基于 DiT 语义指导的高频细节。因此，这使得 DiT 能够专注于低频语义建模。此外，我们引入了一种频率感知的流量匹配损失，它强调视觉上显着的频率，同时抑制不重要的频率。大量实验表明，DeCo 在像素扩散模型中实现了优越的性能，在 ImageNet 上获得了 1.62 (256x256) 和 2.22 (512x512) 的 FID，缩小了与潜在扩散方法的差距。此外，我们的预训练文本到图像模型在 GenEval 的系统级比较中取得了 0.86 的领先总分。代码可通过此 https URL 公开获取。

Title: Efficiency vs. Fidelity: A Comparative Analysis of Diffusion Probabilistic Models and Flow Matching on Low-Resource Hardware

Authors: Srishti Gupta, Yashasvee Taiwade
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19379
Pdf URL: https://arxiv.org/pdf/2511.19379
Copy Paste: [[2511.19379]] Efficiency vs. Fidelity: A Comparative Analysis of Diffusion Probabilistic Models and Flow Matching on Low-Resource Hardware(https://arxiv.org/abs/2511.19379)
Keywords: generative
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have established a new state-of-the-art in generative image synthesis, yet their deployment is hindered by significant computational overhead during inference, often requiring up to 1,000 iterative steps. This study presents a rigorous comparative analysis of DDPMs against the emerging Flow Matching (Rectified Flow) paradigm, specifically isolating their geometric and efficiency properties on low-resource hardware. By implementing both frameworks on a shared Time-Conditioned U-Net backbone using the MNIST dataset, we demonstrate that Flow Matching significantly outperforms Diffusion in efficiency. Our geometric analysis reveals that Flow Matching learns a highly rectified transport path (Curvature $\mathcal{C} \approx 1.02$), which is near-optimal, whereas Diffusion trajectories remain stochastic and tortuous ($\mathcal{C} \approx 3.45$). Furthermore, we establish an "efficiency frontier" at $N=10$ function evaluations, where Flow Matching retains high fidelity while Diffusion collapses. Finally, we show via numerical sensitivity analysis that the learned vector field is sufficiently linear to render high-order ODE solvers (Runge-Kutta 4) unnecessary, validating the use of lightweight Euler solvers for edge deployment. This work concludes that Flow Matching is the superior algorithmic choice for real-time, resource-constrained generative tasks.
摘要：去噪扩散概率模型 (DDPM) 已经在生成图像合成方面建立了新的最先进技术，但其部署受到推理过程中大量计算开销的阻碍，通常需要多达 1,000 个迭代步骤。这项研究对 DDPM 与新兴的流匹配（整流流）范例进行了严格的比较分析，特别是在低资源硬件上隔离了它们的几何和效率属性。通过使用 MNIST 数据集在共享的时间条件 U-Net 主干上实现这两个框架，我们证明了流匹配在效率上显着优于扩散。我们的几何分析表明，流量匹配学习了高度校正的传输路径（曲率 $\mathcal{C} \approx 1.02$），接近最优，而扩散轨迹仍然是随机且曲折的（$\mathcal{C} \approx 3.45$）。此外，我们在 $N=10$ 函数评估时建立了一个“效率边界”，其中流匹配保持高保真度，而扩散则崩溃。最后，我们通过数值敏感性分析表明，学习的向量场足够线性，无需高阶 ODE 求解器 (Runge-Kutta 4)，从而验证了轻量级欧拉求解器在边缘部署中的使用。 \textbf{这项工作的结论是，流匹配是实时、资源受限的生成任务的最佳算法选择。}
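
Why straight paths make Euler integration cheap can be shown on a toy 2D example. The abstract does not define its curvature measure $\mathcal{C}$, so the sketch below uses tortuosity (path length divided by start-to-end distance) as an assumed stand-in that also equals 1 for a straight line. Both velocity fields are hand-written toys, not learned models.

```python
import math


def euler_integrate(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, path = list(x0), [list(x0)]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        path.append(list(x))
    return path


def tortuosity(path):
    """Length of the polyline divided by the distance between its endpoints."""
    length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    return length / math.dist(path[0], path[-1])


def straight_field(x, t):
    return [3.0, 4.0]        # constant velocity: exact path is a line to (3, 4)


def curved_field(x, t):
    return [6.0 * t, 4.0]    # exact path is a parabola that also ends at (3, 4)


print(euler_integrate(straight_field, [0.0, 0.0], 1)[-1])   # [3.0, 4.0]
print(euler_integrate(curved_field, [0.0, 0.0], 1)[-1])     # [0.0, 4.0]
print(tortuosity(euler_integrate(curved_field, [0.0, 0.0], 100)))
```

With a constant velocity field, a single Euler step already lands on the exact endpoint, whereas the curved field needs many steps to approach (3, 4). This is the same reason a nearly linear learned vector field leaves little for a higher-order solver to correct.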

Title: In-Video Instructions: Visual Signals as Generative Control

Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19401
Pdf URL: https://arxiv.org/pdf/2511.19401
Copy Paste: [[2511.19401]] In-Video Instructions: Visual Signals as Generative Control(https://arxiv.org/abs/2511.19401)
Keywords: generation, generative
Abstract: Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
摘要：大型视频生成模型最近表现出了强大的视觉功能，能够预测符合当前观察中的逻辑和物理线索的未来帧。在这项工作中，我们研究是否可以通过将帧中嵌入的视觉信号解释为指令（我们称之为视频内指令）来利用此类功能来生成可控的图像到视频。基于提示的控制提供本质上全局且粗略的文本描述，与此相反，视频内指令通过叠加文本、箭头或轨迹等元素将用户指导直接编码到视觉域中。通过向不同的对象分配不同的指令，可以在视觉主体与其预期动作之间实现明确的、空间感知的和明确的对应。对三种最先进的生成器（包括 Veo 3.1、Kling 2.5 和 Wan 2.2）的大量实验表明，视频模型可以可靠地解释和执行此类视觉嵌入指令，特别是在复杂的多对象场景中。

Title: UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Authors: Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2511.19413
Pdf URL: https://arxiv.org/pdf/2511.19413
Copy Paste: [[2511.19413]] UniGame: Turning a Unified Multimodal Model Into Its Own Adversary(https://arxiv.org/abs/2511.19413)
Keywords: generation
Abstract: Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: this https URL
摘要：统一多模态模型 (UMM) 在单一架构的理解和生成方面表现出了令人印象深刻的性能。然而，UMM 仍然表现出根本的不一致：理解有利于紧凑的嵌入，而生成有利于丰富的重建表示。这种结构性权衡会导致决策边界错位、跨模式一致性降低以及分配和对抗性变化下的脆弱性增加。在本文中，我们提出了 UniGame，一种直接针对不一致问题的自对抗式训练后框架。通过在共享代币接口上应用轻量级干扰器，UniGame 使生成分支能够主动寻求和挑战脆弱的理解，将模型本身变成自己的对手。实验表明 UniGame 显着提高了一致性（+4.6%）。此外，它还在理解（+3.6%）、生成（+0.02）、分布外和对抗鲁棒性（NaturalBench 和 AdVQA 上+4.8% 和 +6.2%）方面取得了显着的进步。该框架与架构无关，引入了不到 1% 的额外参数，并且是对现有后训练方法的补充。这些结果将对抗性自我博弈定位为增强未来多模式基础模型的连贯性、稳定性和统一能力的通用且有效的原则。官方代码位于：此 https URL

Title: SAM3-Adapter: Efficient Adaptation of Segment Anything 3 for Camouflage Object Segmentation, Shadow Detection, and Medical Image Segmentation

Authors: Tianrun Chen, Runlong Cao, Xinda Yu, Lanyun Zhu, Chaotao Ding, Deyi Ji, Cheng Chen, Qi Zhu, Chunyan Xu, Papa Mao, Ying Zang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19425
Pdf URL: https://arxiv.org/pdf/2511.19425
Copy Paste: [[2511.19425]] SAM3-Adapter: Efficient Adaptation of Segment Anything 3 for Camouflage Object Segmentation, Shadow Detection, and Medical Image Segmentation(https://arxiv.org/abs/2511.19425)
Keywords: generation
Abstract: The rapid rise of large-scale foundation models has reshaped the landscape of image segmentation, with models such as Segment Anything achieving unprecedented versatility across diverse vision tasks. However, previous generations, including SAM and its successor, still struggle with fine-grained, low-level segmentation challenges such as camouflaged object detection, medical image segmentation, cell image segmentation, and shadow detection. To address these limitations, we originally proposed SAM-Adapter in 2023, demonstrating substantial gains on these difficult scenarios. With the emergence of Segment Anything 3 (SAM3), a more efficient and higher-performing evolution with a redesigned architecture and improved training pipeline, we revisit these long-standing challenges. In this work, we present SAM3-Adapter, the first adapter framework tailored for SAM3 that unlocks its full segmentation capability. SAM3-Adapter not only reduces computational overhead but also consistently surpasses both SAM- and SAM2-based solutions, establishing new state-of-the-art results across multiple downstream tasks, including medical imaging, camouflaged (concealed) object segmentation, and shadow detection. Built upon the modular and composable design philosophy of the original SAM-Adapter, SAM3-Adapter provides stronger generalizability, richer task adaptability, and significantly improved segmentation precision. Extensive experiments confirm that integrating SAM3 with our adapter yields superior accuracy, robustness, and efficiency compared to all prior SAM-based adaptations. We hope SAM3-Adapter can serve as a foundation for future research and practical segmentation applications. Code, pre-trained models, and data processing pipelines are available.
摘要：大规模基础模型的迅速崛起重塑了图像分割的格局，Segment Anything 等模型在不同的视觉任务中实现了前所未有的多功能性。然而，前几代技术（包括 SAM 及其后继者）仍然在应对细粒度、低级别的分割挑战，例如伪装对象检测、医学图像分割、细胞图像分割和阴影检测。为了解决这些限制，我们最初于 2023 年提出了 SAM-Adapter，展示了在这些困难场景中取得的巨大成果。随着 Segment Anything 3 (SAM3) 的出现——一种更高效、性能更高的演进，具有重新设计的架构和改进的训练管道——我们重新审视这些长期存在的挑战。在这项工作中，我们提出了 SAM3-Adapter，这是第一个为 SAM3 量身定制的适配器框架，可释放其完整的分段功能。 SAM3-Adapter 不仅减少了计算开销，而且始终超越基于 SAM 和 SAM2 的解决方案，在多个下游任务中建立了新的最先进的结果，包括医学成像、伪装（隐藏）对象分割和阴影检测。 SAM3-Adapter基于原有SAM-Adapter的模块化、可组合设计理念，提供了更强的通用性、更丰富的任务适应性，并显着提高了分割精度。大量实验证实，与之前所有基于 SAM 的适配相比，将 SAM3 与我们的适配器集成可产生卓越的准确性、稳健性和效率。我们希望 SAM3-Adapter 能够作为未来研究和实际分割应用的基础。代码、预训练模型和数据处理管道均可用。

Title: Flow Map Distillation Without Data

Authors: Shangyuan Tong, Nanye Ma, Saining Xie, Tommi Jaakkola
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.19428
Pdf URL: https://arxiv.org/pdf/2511.19428
Copy Paste: [[2511.19428]] Flow Map Distillation Without Data(https://arxiv.org/abs/2511.19428)
Keywords: generative
Abstract: State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.
摘要：最先进的流动模型可实现卓越的质量，但需要缓慢的迭代采样。为了加速这一过程，可以从经过预先培训的教师那里提取流程图，这一过程通常需要从外部数据集中采样。我们认为，这种数据依赖性引入了教师数据不匹配的根本风险，因为静态数据集可能提供教师完整生成能力的不完整甚至不一致的表示。这让我们质疑这种对数据的依赖对于成功的流程图蒸馏是否真的是必要的。在这项工作中，我们探索了一种无数据的替代方案，仅从先验分布中采样，保证教师在构建时遵循该分布，从而完全规避不匹配风险。为了证明这一理念的实际可行性，我们引入了一个原则框架，该框架学习预测教师的采样路径，同时积极纠正其自身的复合错误以确保高保真度。我们的方法超越了所有基于数据的同行，并以显着的优势建立了新的最先进技术。具体来说，从 SiT-XL/2+REPA 中提取，我们的方法在 ImageNet 256x256 上达到了令人印象深刻的 FID 1.45，在 ImageNet 512x512 上达到了 1.49，两者都只需要 1 个采样步骤。我们希望我们的工作能够建立一个更强大的范例来加速生成模型，并推动更广泛地采用无数据的流程图蒸馏。

Title: Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts

Authors: Yasin Esfandiari, Stefan Bauer, Sebastian U. Stich, Andrea Dittadi
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19434
Pdf URL: https://arxiv.org/pdf/2511.19434
Copy Paste: [[2511.19434]] Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts(https://arxiv.org/abs/2511.19434)
Keywords: generation
Abstract: Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning: only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
摘要：用于图像生成的扩散模型通常表现出感知样本质量和数据可能性之间的权衡：强调高噪声去噪步骤的训练目标会产生逼真的图像，但可能性较差，而面向可能性的训练则过重低噪声步骤并损害视觉保真度。我们引入了一种简单的即插即用采样方法，该方法通过沿着去噪轨迹在两个预训练的扩散专家之间进行切换来组合它们。具体来说，我们应用高噪声水平的图像质量专家来塑造全局结构，然后切换到低噪声水平的似然专家来细化像素统计数据。该方法不需要重新训练或微调——只需要选择中间切换步骤。在 CIFAR-10 和 ImageNet32 上，合并模型始终匹配或优于其基本组件，相对于每位专家单独提高或保持可能性和样本质量。这些结果表明，专家在噪声级别之间进行切换是打破图像扩散模型中似然质量权衡的有效方法。
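
The switching rule described in the abstract can be sketched as a small sampling loop. This is a minimal illustration under stated assumptions: the function name, the `update` callback, the scalar state, and the strict `>` comparison at the switch point are all hypothetical, and the abstract does not specify the underlying sampler.

```python
def sample_with_expert_switch(x, noise_levels, quality_expert,
                              likelihood_expert, switch_level, update):
    """Run a denoising loop that swaps experts at a chosen noise level.

    noise_levels: noise levels visited in order, from high to low.
    quality_expert / likelihood_expert: callables (x, level) -> prediction.
    switch_level: levels strictly above it use the quality expert (assumed rule).
    update: callable (x, level, prediction) -> next x; placeholder for the
            actual sampler step, which is not specified here.
    """
    for level in noise_levels:
        expert = quality_expert if level > switch_level else likelihood_expert
        x = update(x, level, expert(x, level))
    return x


if __name__ == "__main__":
    used = []  # records which expert handled each step

    def quality(x, level):
        used.append("quality")
        return 0.0  # toy prediction

    def likelihood(x, level):
        used.append("likelihood")
        return 0.0  # toy prediction

    def toy_update(x, level, prediction):
        # Toy step: move halfway toward the expert's prediction.
        return x + 0.5 * (prediction - x)

    final = sample_with_expert_switch(
        8.0, [1.0, 0.75, 0.5, 0.25], quality, likelihood, 0.5, toy_update
    )
    print(used, final)
```

Neither expert is retrained or modified in the loop; the only free choice is `switch_level`, matching the abstract's statement that only an intermediate switching step must be chosen.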

Title: Are Image-to-Video Models Good Zero-Shot Image Editors?

Authors: Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19435
Pdf URL: https://arxiv.org/pdf/2511.19435
Copy Paste: [[2511.19435]] Are Image-to-Video Models Good Zero-Shot Image Editors?(https://arxiv.org/abs/2511.19435)
Keywords: generative
Abstract: Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.
摘要：大规模视频扩散模型显示出强大的世界模拟和时间推理能力，但它们作为零样本图像编辑器的用途仍未得到充分探索。我们引入了 IF-Edit，这是一个免调整框架，可将预训练的图像到视频扩散模型重新用于指令驱动的图像编辑。 IF-Edit 解决了三个关键挑战：提示错位、冗余时间潜伏和模糊的后期帧。它包括（1）思想链提示增强模块，将静态编辑指令转化为基于时间的推理提示；（2）时间潜在丢失策略，在专家切换点之后压缩帧潜在，加速去噪，同时保持语义和时间一致性； (3) 使用短静态视频轨迹锐化后期帧的自洽后细化步骤。对四个公共基准（涵盖非刚性编辑、物理和时间推理以及一般指令编辑）的实验表明，IF-Edit 在以推理为中心的任务上表现强劲，同时在通用编辑上保持竞争力。我们的研究提供了视频扩散模型作为图像编辑器的系统视图，并强调了统一视频图像生成推理的简单方法。

Title: VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection

Authors: Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li, Yuhang He, Yihong Gong
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2511.19436
Pdf URL: https://arxiv.org/pdf/2511.19436
Copy Paste: [[2511.19436]] VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection(https://arxiv.org/abs/2511.19436)
Keywords: generation
Abstract: We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with an average accuracy of 49.08% and a score of 2.50, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
摘要：我们提出了 VDC-Agent，这是一种用于视频详细字幕的自我进化框架，既不需要人工注释，也不需要更大的教师模型。代理形成字幕生成、原则引导评分（分数和文本建议）和提示细化的闭环。当字幕质量下降时，自我反思路径会利用之前的思路来修正更新。在未标记的视频上运行此过程会产生（标题、分数）对的轨迹。我们将轨迹转换为偏好元组，并过滤掉 JSON 解析错误的样本，得到 VDC-Agent-19K，其中包含 18,886 个自动构建的对。然后，我们使用从易到难的课程直接偏好优化来微调该数据集上的基本 MLLM。我们的 VDC-Agent-7B 基于 Qwen2.5-VL-7B-Instruct 构建，在 VDC 基准测试中实现了最先进的性能，平均准确度为 49.08%，得分为 2.50，超越了专用视频字幕器，并在类似的推理成本下比基本模型提高了 +5.13% 的准确度和 +0.27 分。
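
The data-construction step (trajectories of scored captions turned into preference tuples, with unparseable samples dropped) can be sketched roughly as follows. The abstract does not say how pairs are chosen or how the curriculum is ordered, so the best-versus-worst pairing, the score-gap ordering, and the JSON field names `caption` and `score` are all assumptions made for illustration.

```python
import json


def parse_scored_caption(raw):
    """Return (caption, score) from a JSON string, or None if it is malformed."""
    try:
        record = json.loads(raw)
        return str(record["caption"]), float(record["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None


def trajectory_to_preference(raw_steps):
    """Turn one trajectory of raw JSON strings into a preference tuple.

    Assumed rule: the highest-scoring caption is 'chosen', the lowest-scoring
    is 'rejected'. Trajectories with any parsing error or no score difference
    are discarded.
    """
    parsed = [parse_scored_caption(raw) for raw in raw_steps]
    if len(parsed) < 2 or any(p is None for p in parsed):
        return None
    chosen = max(parsed, key=lambda p: p[1])
    rejected = min(parsed, key=lambda p: p[1])
    if chosen[1] == rejected[1]:
        return None
    return {"chosen": chosen[0], "rejected": rejected[0],
            "gap": chosen[1] - rejected[1]}


def build_curriculum(trajectories):
    """Order pairs from easy to hard, assuming a larger score gap means easier."""
    pairs = [trajectory_to_preference(t) for t in trajectories]
    pairs = [p for p in pairs if p is not None]
    return sorted(pairs, key=lambda p: p["gap"], reverse=True)


if __name__ == "__main__":
    demo = [
        ['{"caption": "a dog", "score": 1.0}',
         '{"caption": "a brown dog runs on grass", "score": 3.0}'],
        ['{"caption": "a cat", "score": 2.0}', 'not valid json'],
    ]
    for pair in build_curriculum(demo):
        print(pair)
```

The resulting list could then be fed to a preference-optimization trainer in order; that stage is outside the scope of this sketch.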

Title: LumiTex: Towards High-Fidelity PBR Texture Generation with Illumination Context

Authors: Jingzhi Bao, Hongze Chen, Lingting Zhu, Chenyu Liu, Runze Zhang, Keyang Luo, Zeyu Hu, Weikai Chen, Yingda Yin, Xin Wang, Zehong Lin, Jun Zhang, Xiaoguang Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19437
Pdf URL: https://arxiv.org/pdf/2511.19437
Copy Paste: [[2511.19437]] LumiTex: Towards High-Fidelity PBR Texture Generation with Illumination Context(https://arxiv.org/abs/2511.19437)
Keywords: generation
Abstract: Physically-based rendering (PBR) provides a principled standard for realistic material-lighting interactions in computer graphics. Despite recent advances in generating PBR textures, existing methods fail to address two fundamental challenges: (1) material decomposition from image prompts under limited illumination cues, and (2) seamless and view-consistent texture completion. To this end, we propose LumiTex, an end-to-end framework that comprises three key components: (1) a multi-branch generation scheme that disentangles albedo and metallic-roughness under shared illumination priors for robust material understanding, (2) a lighting-aware material attention mechanism that injects illumination context into the decoding process for physically grounded generation of albedo, metallic, and roughness maps, and (3) a geometry-guided inpainting module based on a large view synthesis model that enriches texture coverage and ensures seamless, view-consistent UV completion. Extensive experiments demonstrate that LumiTex achieves state-of-the-art performance in texture quality, surpassing both existing open-source and commercial methods.
摘要：基于物理的渲染 (PBR) 为计算机图形学中真实的材质与光照交互提供了原则标准。尽管最近在生成 PBR 纹理方面取得了进展，但现有方法未能解决两个基本挑战：1）在有限的照明线索下根据图像提示进行材质分解；2）无缝且视图一致的纹理完成。为此，我们提出了 LumiTex，一个端到端框架，包含三个关键组件：(1) 多分支生成方案，在共享照明先验下解开反照率和金属粗糙度，以实现稳健的材质理解；(2) 照明感知材质注意机制，将照明上下文注入到解码过程中，以物理接地生成反照率、金属和粗糙度图；(3) 基于大视图合成模型的几何引导修复模块，丰富纹理覆盖范围并确保无缝、视图一致的 UV 完成。大量实验表明，LumiTex 在纹理质量方面实现了最先进的性能，超越了现有的开源和商业方法。