2025-12-01

Title: SO-Bench: A Structural Output Evaluation of Multimodal LLMs

Authors: Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan
Subjects: cs.CV, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2511.21750
Pdf URL: https://arxiv.org/pdf/2511.21750
Copy Paste: [[2511.21750]] SO-Bench: A Structural Output Evaluation of Multimodal LLMs(https://arxiv.org/abs/2511.21750)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of the visual structural output capabilities of MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains (UI screens, natural images, documents, and charts), SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-source and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model's structured output capability. We plan to make the benchmark available to the community.
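
Note (illustration, not from the paper): the schema-compliance requirement described in the SO-Bench abstract can be sketched as a check that a model's raw text parses as JSON and satisfies a schema. The validator below covers only a small subset of JSON Schema (`type`, `required`, `properties`, `items`, `enum`). The function names and the example schema are hypothetical, and nothing here reflects how SO-Bench itself scores outputs.

```python
import json

# Hypothetical helper: validates a small subset of JSON Schema only.
_TYPE_MAP = {
    "object": dict, "array": list, "string": str,
    "number": (int, float), "integer": int, "boolean": bool,
}

def validate(instance, schema, path="$"):
    """Return a list of violation messages (empty list means compliant)."""
    errors = []
    expected = schema.get("type")
    if expected in _TYPE_MAP:
        ok = isinstance(instance, _TYPE_MAP[expected])
        # bool is a subclass of int in Python, so reject it for numeric types
        if expected in ("number", "integer") and isinstance(instance, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    if "enum" in schema and instance not in schema["enum"]:
        errors.append(f"{path}: value not in enum")
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}.{key}: missing required property")
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], sub, f"{path}.{key}"))
    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors

def check_model_output(raw_text, schema):
    """Parse raw model text as JSON, then validate it against the schema."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"$: invalid JSON ({exc.msg})"]
    return validate(parsed, schema)

# Made-up schema for a UI button, for illustration only
BUTTON_SCHEMA = {
    "type": "object",
    "required": ["label", "enabled"],
    "properties": {"label": {"type": "string"}, "enabled": {"type": "boolean"}},
}
```

An empty list from `check_model_output` means the output is schema-compliant under this subset. Whether the extracted values are correct is a separate question that needs ground-truth annotations.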

Title: Physics-Informed Spiking Neural Networks via Conservative Flux Quantization

Authors: Chi Zhang, Lin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.21784
Pdf URL: https://arxiv.org/pdf/2511.21784
Copy Paste: [[2511.21784]] Physics-Informed Spiking Neural Networks via Conservative Flux Quantization(https://arxiv.org/abs/2511.21784)
Keywords: generation
Abstract: Real-time, physically consistent predictions on low-power edge devices are critical for next-generation embodied AI systems, yet they remain a major challenge. Physics-Informed Neural Networks (PINNs) combine data-driven learning with physics-based constraints to ensure that the model's predictions are consistent with the underlying physics. However, PINNs are energy-intensive and struggle to strictly enforce physical conservation laws. Brain-inspired spiking neural networks (SNNs) have emerged as a promising solution for edge computing and real-time processing. However, naively converting PINNs to SNNs degrades physical fidelity and fails to address long-term generalization issues. To this end, this paper introduces a novel Physics-Informed Spiking Neural Network (PISNN) framework. Importantly, to ensure strict physical conservation, we design the Conservative Leaky Integrate-and-Fire (C-LIF) neuron, whose dynamics structurally guarantee local mass preservation. To achieve robust temporal generalization, we introduce a novel Conservative Flux Quantization (CFQ) strategy, which redefines neural spikes as discrete packets of physical flux. Our CFQ learns a time-invariant physical evolution operator, enabling the PISNN to become a general-purpose solver -- conservative-by-construction. Extensive experiments show that our PISNN excels on diverse benchmarks. For both the canonical 1D heat equation and the more challenging 2D Laplace equation, it accurately simulates the system dynamics while maintaining perfect mass conservation by design -- a feat that is challenging for conventional PINNs. This work establishes a robust framework for fusing the rigor of scientific computing with the efficiency of neuromorphic engineering, paving the way for complex, long-term, and energy-efficient physics predictions for intelligent systems.
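
Note (illustration, not from the paper): the idea of treating spikes as "discrete packets of physical flux" can be illustrated with a plain flux-form diffusion update on a 1D grid. This is not the C-LIF neuron or the learned CFQ operator. It is a hand-written toy, assuming integer mass units, a hypothetical diffusion coefficient `alpha`, and closed (zero-flux) boundaries. It shows why exchanging whole packets across cell interfaces conserves total mass by construction.

```python
def step(u, alpha=0.25):
    """One explicit diffusion step with integer-quantized interface fluxes.

    u     : list of integer cell masses
    alpha : hypothetical diffusion number (assumed <= 0.5 for stability)
    """
    # flux[i] = whole packets moving from cell i to cell i + 1 this step
    flux = [int(round(alpha * (u[i] - u[i + 1]))) for i in range(len(u) - 1)]
    new_u = list(u)
    for i, f in enumerate(flux):
        # Every packet that leaves cell i arrives in cell i + 1,
        # so the sum over all cells cannot change.
        new_u[i] -= f
        new_u[i + 1] += f
    return new_u

def simulate(u0, steps, alpha=0.25):
    """Apply `step` repeatedly, starting from the initial masses u0."""
    u = list(u0)
    for _ in range(steps):
        u = step(u, alpha)
    return u
```

Because each interface flux is subtracted from one cell and added to its neighbour, `sum(simulate(u0, n)) == sum(u0)` holds exactly for any `n`, whatever rounding does to the individual fluxes.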

Title: Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training

Authors: Eric Yeats, Darryl Hannan, Wilson Fearn, Timothy Doster, Henry Kvinge, Scott Mahan
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.21863
Pdf URL: https://arxiv.org/pdf/2511.21863
Copy Paste: [[2511.21863]] Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training(https://arxiv.org/abs/2511.21863)
Keywords: generation, generative
Abstract: Score-based generative models require guidance in order to generate plausible, on-manifold samples. The most popular guidance method, Classifier-Free Guidance (CFG), is only applicable in settings with labeled data and requires training an additional unconditional score-based model. More recently, Auto-Guidance adopts a smaller, less capable version of the original model to guide generation. While each method effectively promotes the fidelity of generated data, each requires labeled data or the training of additional models, making it challenging to guide score-based models when (labeled) training data are not available or training new models is not feasible. We make the surprising discovery that the positive curvature of log density estimates in saddle regions provides strong guidance for score-based models. Motivated by this, we develop saddle-free guidance (SFG), which maintains estimates of maximal positive curvature of the log density to guide individual score-based models. SFG has the same computational cost as classifier-free guidance, does not require additional training, and works with off-the-shelf diffusion and flow matching models. Our experiments indicate that SFG achieves state-of-the-art FID and FD-DINOv2 metrics in single-model unconditional ImageNet-512 generation. When SFG is combined with Auto-Guidance, its unconditional samples achieve the overall state of the art in FD-DINOv2 score. Our experiments with FLUX.1-dev and Stable Diffusion v3.5 indicate that SFG boosts the diversity of output images compared to CFG while maintaining excellent prompt adherence and image fidelity.

Title: Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

Authors: Fatemeh Akbarian, Anahita Baninajjar, Yingyi Zhang, Ananth Balashankar, Amir Aminifar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.21893
Pdf URL: https://arxiv.org/pdf/2511.21893
Copy Paste: [[2511.21893]] Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings(https://arxiv.org/abs/2511.21893)
Keywords: generative
Abstract: Multi-modal foundation models align images, text, and other modalities in a shared embedding space but remain vulnerable to adversarial illusions (Zhang et al., 2025), where imperceptible perturbations disrupt cross-modal alignment and mislead downstream tasks. To counteract the effects of adversarial illusions, we propose a task-agnostic mitigation mechanism that reconstructs the input from the attacker's perturbed input through generative models, e.g., Variational Autoencoders (VAEs), to maintain natural alignment. To further enhance our proposed defense mechanism, we adopt a generative sampling strategy combined with a consensus-based aggregation scheme over the outcomes of the generated samples. Our experiments on state-of-the-art multi-modal encoders show that our approach reduces illusion attack success rates to near zero and improves cross-modal alignment by 4 points (42 to 46) and 11 points (32 to 43) in unperturbed and perturbed input settings, respectively, providing an effective and model-agnostic defense against adversarial illusions.
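
Note (illustration, not from the paper): the "generative sampling plus consensus-based aggregation" step can be sketched as drawing several reconstructions of the same input and taking a majority vote over the downstream outcomes. The abstract does not specify the aggregation rule, so majority voting is an assumption here. `reconstruct` and `classify` are hypothetical callables standing in for a stochastic generative model (e.g., a VAE) and a downstream task head.

```python
from collections import Counter

def consensus_label(predictions):
    """Majority vote; on a tie, the label seen first wins."""
    return Counter(predictions).most_common(1)[0][0]

def defended_predict(x, reconstruct, classify, n_samples=5):
    """Classify n_samples stochastic reconstructions of x and aggregate.

    reconstruct : hypothetical callable, x -> one sampled reconstruction
    classify    : hypothetical callable, reconstruction -> label
    """
    votes = [classify(reconstruct(x)) for _ in range(n_samples)]
    return consensus_label(votes), votes
```

The intuition behind a rule like this is that one reconstruction that still carries the adversarial signal is outvoted by the others. This sketch makes no claim about how well that holds in practice.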

Title: PathReasoning: A multimodal reasoning agent for query-based ROI navigation on whole-slide images

Authors: Kunpeng Zhang, Hanwen Xu, Sheng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21902
Pdf URL: https://arxiv.org/pdf/2511.21902
Copy Paste: [[2511.21902]] PathReasoning: A multimodal reasoning agent for query-based ROI navigation on whole-slide images(https://arxiv.org/abs/2511.21902)
Keywords: generation
Abstract: Deciphering the tumor microenvironment from Whole Slide Images (WSIs) is intriguing because it is key to cancer diagnosis, prognosis, and treatment response. While these gigapixel images offer a comprehensive portrait of cancer, their extremely large size, often more than 10 billion pixels, makes it challenging and time-consuming to navigate to the relevant regions to support diverse clinical inspection. Inspired by pathologists who navigate WSIs with a combination of sampling, reasoning, and self-reflection, we propose "PathReasoning", a multi-modal reasoning agent that iteratively navigates across WSIs through multiple rounds of reasoning and refinement. Specifically, starting with randomly sampled candidate regions, PathReasoning reviews current selections with self-reflection, reasons over the correspondence between visual observations and clinical questions, and concludes by proposing new regions to explore. Across rounds, PathReasoning builds a reasoning chain that gradually directs attention to diagnostically relevant areas. PathReasoning turns each whole slide into a sequence of question-guided views, allowing the model to efficiently find informative ROIs within a fixed number of steps, without the need for dense pixel-level annotations. PathReasoning outperforms strong ROI-selection approaches by 6.7% and 3.1% in AUROC on subtyping and longitudinal analysis tasks. The high-quality ROIs further support accurate report generation on breast cancer, significantly outperforming the standard GPT-4o by 10% in accuracy. PathReasoning prioritizes question-specific regions and constructs interpretable reasoning chains, supporting efficient slide review, consistent diagnostic interpretations, comprehensive reporting, and evidence traceability in digital pathology.

Title: Adaptive Parameter Optimization for Robust Remote Photoplethysmography

Authors: Cecilia G. Morales, Fanurs Chi En Teh, Kai Li, Pushpak Agrawal, Artur Dubrawski
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21903
Pdf URL: https://arxiv.org/pdf/2511.21903
Copy Paste: [[2511.21903]] Adaptive Parameter Optimization for Robust Remote Photoplethysmography(https://arxiv.org/abs/2511.21903)
Keywords: quality assessment
Abstract: Remote photoplethysmography (rPPG) enables contactless vital sign monitoring using standard RGB cameras. However, existing methods rely on fixed parameters optimized for particular lighting conditions and camera setups, limiting adaptability to diverse deployment environments. This paper introduces the Projection-based Robust Signal Mixing (PRISM) algorithm, a training-free method that jointly optimizes photometric detrending and color mixing through online parameter adaptation based on signal quality assessment. PRISM achieves state-of-the-art performance among unsupervised methods, with an MAE of 0.77 bpm on PURE and 0.66 bpm on UBFC-rPPG, and accuracies of 97.3% and 97.5%, respectively, at a 5 bpm threshold. Statistical analysis confirms that PRISM performs equivalently to leading supervised methods ($p > 0.2$) while maintaining real-time CPU performance without training. This validates that adaptive time series optimization significantly improves rPPG across diverse conditions.
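
Note (illustration, not from the paper): the two metrics quoted in the PRISM abstract, MAE in bpm and accuracy at a 5 bpm threshold, can be computed as follows. The function names are hypothetical, and counting an error of exactly 5 bpm as a hit (`<=`) is an assumption, since the abstract does not state the convention.

```python
def mae_bpm(pred, true):
    """Mean absolute heart-rate error in beats per minute."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def accuracy_within(pred, true, threshold_bpm=5.0):
    """Fraction of estimates whose absolute error is within the threshold.

    Assumption: an error exactly equal to the threshold counts as correct.
    """
    hits = sum(1 for p, t in zip(pred, true) if abs(p - t) <= threshold_bpm)
    return hits / len(true)
```

For example, estimates `[60, 72, 80, 95]` against references `[61, 70, 80, 88]` have absolute errors of 1, 2, 0, and 7 bpm, giving an MAE of 2.5 bpm and an accuracy of 0.75 at the 5 bpm threshold.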

Title: AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views

Authors: Junwei Zhou, Yu-Wing Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.21945
Pdf URL: https://arxiv.org/pdf/2511.21945
Copy Paste: [[2511.21945]] AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views(https://arxiv.org/abs/2511.21945)
Keywords: generative
Abstract: Reconstructing 3D objects from a few unposed and partially occluded views is a common yet challenging problem in real-world scenarios, where many object surfaces are never directly observed. Traditional multi-view or inpainting-based approaches struggle under such conditions, often yielding incomplete or geometrically inconsistent reconstructions. We introduce AmodalGen3D, a generative framework for amodal 3D object reconstruction that infers complete, occlusion-free geometry and appearance from arbitrary sparse inputs. The model integrates 2D amodal completion priors with multi-view stereo geometry conditioning, supported by a View-Wise Cross Attention mechanism for sparse-view feature fusion and a Stereo-Conditioned Cross Attention module for unobserved structure inference. By jointly modeling visible and hidden regions, AmodalGen3D faithfully reconstructs 3D objects that are consistent with sparse-view constraints while plausibly hallucinating unseen parts. Experiments on both synthetic and real-world datasets demonstrate that AmodalGen3D achieves superior fidelity and completeness under occlusion-heavy sparse-view settings, addressing a pressing need for object-level 3D scene reconstruction in robotics, AR/VR, and embodied AI applications.

Title: PAT3D: Physics-Augmented Text-to-3D Scene Generation

Authors: Guying Lin, Kemeng Huang, Michael Liu, Ruihan Gao, Hanke Chen, Lyuhao Chen, Beijia Lu, Taku Komura, Yuan Liu, Jun-Yan Zhu, Minchen Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.21978
Pdf URL: https://arxiv.org/pdf/2511.21978
Copy Paste: [[2511.21978]] PAT3D: Physics-Augmented Text-to-3D Scene Generation(https://arxiv.org/abs/2511.21978)
Keywords: generation
Abstract: We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data will be released upon acceptance.

Title: StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation

Authors: Sen Fang, Hongbin Zhong, Yalin Feng, Dimitris N. Metaxas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22009
Pdf URL: https://arxiv.org/pdf/2511.22009
Copy Paste: [[2511.22009]] StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation(https://arxiv.org/abs/2511.22009)
Keywords: generation, generative
Abstract: New techniques such as Rectified Flow and Flow Matching have significantly improved the performance of generative models over the past two years, especially in control accuracy, generation quality, and generation efficiency. However, because Rectified Flow differs from existing diffusion models in theory and design, existing acceleration methods cannot be applied to it directly. In this article, we implement an end-to-end acceleration pipeline that spans theory, design, and inference strategy. The pipeline combines batch processing with a new velocity field, vectorized batch processing of heterogeneous time steps, and dynamic TensorRT compilation to accelerate flow-based models. Existing public methods usually achieve an acceleration of 18%, whereas our experiments show that the new method can accelerate 512×512 image generation to up to 611%, far beyond current non-generalized acceleration methods.
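
Note (illustration, not from the paper): flow-based samplers integrate an ODE dx/dt = v(x, t) from noise (t = 0) to data (t = 1). The toy below uses plain Euler steps and a hand-written straight-line velocity field toward a single hypothetical target value. The batched step takes a separate time and step size per sample, which is only a loose stand-in for the "heterogeneous time-step batch processing" named in the abstract. It is not the StreamFlow pipeline and involves no TensorRT.

```python
def velocity(x, t, target=1.0):
    """Toy straight-line field: the exact velocity when all data sit at `target`.

    On the path x_t = (1 - t) * x0 + t * target the velocity is target - x0,
    which equals (target - x_t) / (1 - t). Valid for t < 1 only.
    """
    return (target - x) / (1.0 - t)

def euler_step_batch(xs, ts, dts):
    """Advance every sample by its own time step (heterogeneous batch)."""
    return [x + dt * velocity(x, t) for x, t, dt in zip(xs, ts, dts)]

def sample(x0, num_steps):
    """Integrate one sample from t = 0 to t = 1 with uniform Euler steps."""
    x = x0
    for i in range(num_steps):
        t = i / num_steps
        x = euler_step_batch([x], [t], [1.0 / num_steps])[0]
    return x
```

Because the toy trajectories are straight lines, Euler integration reaches the target for any number of steps. This is the property that makes few-step sampling attractive for rectified flows, though learned velocity fields are only approximately straight.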

Title: PAGen: Phase-guided Amplitude Generation for Domain-adaptive Object Detection

Authors: Shuchen Du, Shuo Lei, Feiran Li, Jiacheng Li, Daisuke Iso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22029
Pdf URL: https://arxiv.org/pdf/2511.22029
Copy Paste: [[2511.22029]] PAGen: Phase-guided Amplitude Generation for Domain-adaptive Object Detection(https://arxiv.org/abs/2511.22029)
Keywords: generation
Abstract: Unsupervised domain adaptation (UDA) greatly facilitates the deployment of neural networks across diverse environments. However, most state-of-the-art approaches are overly complex, relying on challenging adversarial training strategies, or on elaborate architectural designs with auxiliary models for feature distillation and pseudo-label generation. In this work, we present a simple yet effective UDA method that learns to adapt image styles in the frequency domain to reduce the discrepancy between source and target domains. The proposed approach introduces only a lightweight pre-processing module during training and entirely discards it at inference time, thus incurring no additional computational overhead. We validate our method on domain-adaptive object detection (DAOD) tasks, where ground-truth annotations are easily accessible in source domains (e.g., normal-weather or synthetic conditions) but challenging to obtain in target domains (e.g., adverse weather or low-light scenes). Extensive experiments demonstrate that our method achieves substantial performance gains on multiple benchmarks, highlighting its practicality and effectiveness.
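
Note (illustration, not from the paper): adapting style "in the frequency domain" rests on splitting a signal's Fourier transform into amplitude and phase. Amplitude is commonly associated with style and phase with structure. The 1D sketch below uses a naive DFT from the standard library and swaps in a fixed amplitude spectrum from a second signal. PAGen instead learns to generate the amplitude, so this fixed swap is a deliberately simpler substitute, and all function names are hypothetical.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex list."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)).real / n
            for k in range(n)]

def swap_amplitude(source, target):
    """Keep the phase of `source` and take the amplitude of `target`."""
    src, tgt = dft(source), dft(target)
    mixed = [cmath.rect(abs(t), cmath.phase(s)) for s, t in zip(src, tgt)]
    return idft(mixed)
```

Swapping a signal's amplitude with its own returns the signal unchanged, and using a target that is a scaled copy of the source returns the scaled copy. Both make quick sanity checks before trying real image data with a 2D FFT.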

Title: Predicting Public Health Impacts of Electricity Usage

Authors: Yejia Liu, Zhifeng Wu, Pengfei Li, Shaolei Ren
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22031
Pdf URL: https://arxiv.org/pdf/2511.22031
Copy Paste: [[2511.22031]] Predicting Public Health Impacts of Electricity Usage(https://arxiv.org/abs/2511.22031)
Keywords: generation
Abstract: The electric power sector is a leading source of air pollutant emissions, impacting the public health of nearly every community. Although regulatory measures have reduced air pollutants, fossil fuels remain a significant component of the energy supply, highlighting the need for more advanced demand-side approaches to reduce the public health impacts. To enable health-informed demand-side management, we introduce HealthPredictor, a domain-specific AI model that provides an end-to-end pipeline linking electricity use to public health outcomes. The model comprises three components: a fuel mix predictor that estimates the contribution of different generation sources, an air quality converter that models pollutant emissions and atmospheric dispersion, and a health impact assessor that translates resulting pollutant changes into monetized health damages. Across multiple regions in the United States, our health-driven optimization framework yields substantially lower prediction errors in terms of public health impacts than fuel mix-driven baselines. A case study on electric vehicle charging schedules illustrates the public health gains enabled by our method and the actionable guidance it can offer for health-informed energy management. Overall, this work shows how AI models can be explicitly designed to enable health-informed energy management for advancing public health and broader societal well-being. Our datasets and code are released at: this https URL.

Title: ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution

Authors: Junoh Kang, Donghun Ryu, Bohyung Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22048
Pdf URL: https://arxiv.org/pdf/2511.22048
Copy Paste: [[2511.22048]] ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution(https://arxiv.org/abs/2511.22048)
Keywords: super-resolution, generative
Abstract: Real-world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high-quality (HQ) images directly tied to the low-quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information density makes the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.

Title: DNA: Dual-branch Network with Adaptation for Open-Set Online Handwriting Generation

Authors: Tsai-Ling Huang, Nhat-Tuong Do-Tran, Ngoc-Hoang-Lam Le, Hong-Han Shuai, Ching-Chun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22064
Pdf URL: https://arxiv.org/pdf/2511.22064
Copy Paste: [[2511.22064]] DNA: Dual-branch Network with Adaptation for Open-Set Online Handwriting Generation(https://arxiv.org/abs/2511.22064)
Keywords: generation
Abstract: Online handwriting generation (OHG) enhances handwriting recognition models by synthesizing diverse, human-like samples. However, existing OHG methods struggle to generate unseen characters, particularly in glyph-based languages like Chinese, limiting their real-world applicability. In this paper, we introduce our method for OHG, where the writer's style and the characters generated during testing are unseen during training. To tackle this challenge, we propose a Dual-branch Network with Adaptation (DNA), which comprises an adaptive style branch and an adaptive content branch. The style branch learns stroke attributes such as writing direction, spacing, placement, and flow to generate realistic handwriting. Meanwhile, the content branch is designed to generalize effectively to unseen characters by decomposing character content into structural information and texture details, extracted via local and global encoders, respectively. Extensive experiments demonstrate that our DNA model is well-suited for the unseen OHG setting, achieving state-of-the-art performance.

Title: Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian

Authors: Yiran Zhang, Weihang Xu, Mo Zhou, Maryam Fazel, Simon Shaolei Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22069
Pdf URL: https://arxiv.org/pdf/2511.22069
Copy Paste: [[2511.22069]] Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian(https://arxiv.org/abs/2511.22069)
Keywords: generative
Abstract: Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with $n$ learnable parameters and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further prove that without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case where parameters are randomly initialized from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge, yet the loss still converges to zero with a $1/\tau$ rate, where $\tau$ is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.
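
Note (illustration, not from the paper): the objects in the score-matching abstract can be made concrete in the simplest possible case. Assume, purely for illustration, an identity-covariance ground truth and a student with a single mean parameter $\theta$. This is not the over-parameterized $n$-parameter student analyzed in the paper. The population score matching loss is then a quadratic, and gradient descent with step size $\eta$ contracts linearly.

```latex
\begin{align*}
p(x) &= \mathcal{N}(x;\, \mu, I), &
\nabla_x \log p(x) &= -(x - \mu), \\
s_\theta(x) &= -(x - \theta), &
\mathcal{L}(\theta) &= \mathbb{E}_{x \sim p}\,\bigl\| s_\theta(x) - \nabla_x \log p(x) \bigr\|^2
                      = \|\theta - \mu\|^2, \\
\theta_{k+1} &= \theta_k - 2\eta\,(\theta_k - \mu), &
\theta_k - \mu &= (1 - 2\eta)^k\,(\theta_0 - \mu).
\end{align*}
```

In this toy case gradient descent converges for any $0 < \eta < 1$. The non-trivial behavior described in the abstract (stationary points, diverging parameters, a $1/\tau$ rate) arises only once the student is over-parameterized.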

Title: WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

Authors: Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22098
Pdf URL: https://arxiv.org/pdf/2511.22098
Copy Paste: [[2511.22098]] WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation(https://arxiv.org/abs/2511.22098)
Keywords: generation
Abstract: Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.
摘要：视频扩散模型最近在真实性和可控性方面取得了显着的进步。然而，实现跨不同视角的无缝视频翻译，例如第一人称（以自我为中心）和第三人称（以外为中心），仍有待探索。弥合这些观点对于电影制作、具体人工智能和世界模型至关重要。受此启发，我们推出了 WorldWander，这是一个上下文学习框架，专为在视频生成中的自我中心世界和外中心世界之间进行转换而量身定制。 WorldWander 以先进的视频扩散变压器为基础，集成了 (i) 上下文视角对齐和 (ii) 协作位置编码，以有效地模拟跨视图同步。为了进一步支持我们的任务，我们策划了 EgoExo-8K，这是一个大型数据集，包含来自合成场景和现实场景的同步自我中心-外中心三元组。实验表明，WorldWander 实现了卓越的视角同步、字符一致性和泛化能力，为自我中心-外中心视频翻译树立了新的基准。

Title: MRI-Based Brain Age Estimation with Supervised Contrastive Learning of Continuous Representation

Authors: Simon Joseph Clément Crête, Marta Kersten-Oertel, Yiming Xiao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22102
Pdf URL: https://arxiv.org/pdf/2511.22102
Copy Paste: [[2511.22102]] MRI-Based Brain Age Estimation with Supervised Contrastive Learning of Continuous Representation(https://arxiv.org/abs/2511.22102)
Keywords: generative
Abstract: MRI-based brain age estimation models aim to assess a subject's biological brain age based on information such as neuroanatomical features. Various factors, including neurodegenerative diseases, can accelerate brain aging, and measuring this phenomenon could serve as a potential biomarker for clinical applications. While deep learning (DL)-based regression has recently attracted major attention, existing approaches often fail to capture the continuous nature of neuromorphological changes, potentially resulting in sub-optimal feature representation and results. To address this, we propose, for the first time, to use supervised contrastive learning with the recent Rank-N-Contrast (RNC) loss to estimate brain age from widely used T1w structural MRI, and we leverage Grad-RAM to visually explain regression results. Experiments show that our proposed method achieves a mean absolute error (MAE) of 4.27 years and an $R^2$ of 0.93 with a limited dataset of training samples, significantly outperforming conventional deep regression with the same ResNet backbone while performing better than or comparably to state-of-the-art methods trained on significantly larger datasets. Furthermore, Grad-RAM revealed more nuanced features related to age regression with the RNC loss than with conventional deep regression. As an exploratory study, we employed the proposed method to estimate the gap between the biological and chronological brain ages in Alzheimer's disease and Parkinson's disease patients, and revealed the correlation between the brain age gap and disease severity, demonstrating its potential as a biomarker in neurodegenerative disorders.
摘要：基于 MRI 的大脑年龄估计模型旨在根据神经解剖学特征等信息评估受试者的生物大脑年龄。包括神经退行性疾病在内的各种因素都会加速大脑衰老，测量这种现象可以作为临床应用的潜在生物标志物。虽然基于深度学习 (DL) 的回归最近引起了广泛关注，但现有方法往往无法捕获神经形态变化的连续性质，可能导致次优的特征表示和结果。为了解决这个问题，我们建议首次使用监督对比学习和最近的 Rank-N-Contrast (RNC) 损失来基于广泛使用的 T1w 结构 MRI 来估计大脑年龄，并利用 Grad-RAM 来直观地解释回归结果。实验表明，我们提出的方法在有限的训练样本数据集上实现了 4.27 年的平均绝对误差 (MAE) 和 0.93 的 $R^2$，显着优于具有相同 ResNet 主干的传统深度回归，同时在具有更大训练数据的情况下表现更好或与最先进的方法相当。此外，与传统的深度回归相比，Grad-RAM 通过 RNC 损失揭示了更多与年龄回归相关的细微特征。作为一项探索性研究，我们采用所提出的方法来估计阿尔茨海默病和帕金森病患者的生物学和实际大脑年龄之间的差距，并揭示了大脑年龄差距与疾病严重程度之间的相关性，证明了其作为神经退行性疾病生物标志物的潜力。
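
The two metrics reported in this abstract, MAE and $R^2$, and the brain age gap can be computed as in the minimal sketch below. The ages and predictions are made-up illustrative numbers, not data from the paper, and the gap is assumed to follow the common convention of predicted minus chronological age.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical chronological ages and model predictions (illustrative only).
ages = [20.0, 40.0, 60.0, 80.0]
preds = [22.0, 38.0, 63.0, 79.0]

mae = mean_absolute_error(ages, preds)
r2 = r_squared(ages, preds)

# Brain age gap per subject, assuming gap = predicted - chronological.
gaps = [p - t for t, p in zip(ages, preds)]
```

A positive gap in this convention means the model judges the brain to look older than the subject's chronological age.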

Title: PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization

Authors: Mingzhe Li, Renhao Zhang, Zhiyang Wen, Siqi Pan, Bruno Castro da Silva, Juan Zhai, Shiqing Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22119
Pdf URL: https://arxiv.org/pdf/2511.22119
Copy Paste: [[2511.22119]] PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization(https://arxiv.org/abs/2511.22119)
Keywords: generative
Abstract: Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5 percent in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: this https URL
摘要：文本到图像 (T2I) 生成模型（例如稳定扩散和通量）可以直接根据文本提示合成逼真的高质量图像。由此产生的图像质量在很大程度上取决于精心设计的提示，这些提示指定了主题和风格修饰符，这些提示已成为宝贵的数字资产。然而，随着高质量提示的价值不断上升和普遍存在，它们也面临着安全和知识产权风险。一个关键威胁是提示窃取攻击，即恢复生成给定图像的文本提示的任务。提示窃取可以在未经授权的情况下提取和重复使用精心设计的提示，但它还可以支持有益的应用程序，例如数据归属、模型来源分析和水印验证。现有方法通常假设白盒梯度访问，需要大规模标记数据集进行监督训练，或者仅依赖字幕而没有显式优化，从而限制了其实用性和适应性。为了应对这些挑战，我们提出了 PROMPTMINER，这是一个黑盒提示窃取框架，它将任务分为两个阶段：（1）基于强化学习的优化阶段来重建主要主题，以及（2）模糊驱动的搜索阶段来恢复文体修饰语。跨多个数据集和扩散主干的实验表明，PROMPTMINER 取得了优异的结果，CLIP 相似度高达 0.958，与 SBERT 的文本对齐高达 0.751，超越了所有基线。即使应用于具有未知生成器的野外图像，它的 CLIP 相似度也比最强基线高 7.5%，表现出更好的泛化能力。最后，PROMPTMINER 在防御扰动下仍保持强劲表现，凸显了非凡的稳健性。代码：这个https URL
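
The similarity scores quoted in this abstract are typically cosine similarities between embedding vectors. The sketch below assumes that reading; the vectors are small hand-written stand-ins for real CLIP or SBERT embeddings, and `pick_best` is a hypothetical helper showing how a black-box search could rank candidates by score alone, not PROMPTMINER's actual procedure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pick_best(target, candidates):
    """Return the index of the candidate most similar to the target."""
    scores = [cosine_similarity(target, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


# Stand-in embeddings (illustrative only).
target = [1.0, 2.0, 2.0]
candidates = [
    [0.0, 1.0, 0.0],  # partially aligned
    [2.0, 4.0, 4.0],  # same direction as the target
    [2.0, -1.0, 0.0],  # orthogonal to the target
]

best_index = pick_best(target, candidates)
same_direction = cosine_similarity(target, candidates[1])
orthogonal = cosine_similarity(target, candidates[2])
```

Because only scores are compared, this kind of ranking needs no gradients from the generator, which is what makes a black-box setting workable.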

Title: Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation

Authors: Xiang Li, Zirui Wang, Zixuan Huang, James M. Rehg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22121
Pdf URL: https://arxiv.org/pdf/2511.22121
Copy Paste: [[2511.22121]] Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation(https://arxiv.org/abs/2511.22121)
Keywords: generation, generative
Abstract: Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.
摘要：人类和传统的计算机视觉方法依赖于一组不同的单目线索来从单个图像推断 3D 结构，例如阴影、纹理、轮廓等。虽然最近的深度生成模型极大地改进了单图像 3D 生成，但仍不清楚这些方法实际利用了哪些图像线索。我们推出了 Cue3D，这是第一个与模型无关的综合框架，用于量化单图像 3D 生成中单个图像线索的影响。我们的统一基准评估七种最先进的方法，涵盖基于回归、多视图和原生 3D 生成范例。通过系统地扰动阴影、纹理、轮廓、透视、边缘和局部连续性等线索，我们测量它们对 3D 输出质量的影响。我们的分析表明，形状意义而非纹理决定了概括性。几何线索，尤其是阴影，对于 3D 生成至关重要。我们进一步确定了对提供的轮廓的过度依赖以及对模型系列中视角和局部连续性等线索的不同敏感性。通过剖析这些依赖关系，Cue3D 增进了我们对现代 3D 网络如何利用经典视觉线索的理解，并为开发更透明、稳健和可控的单图像 3D 生成模型提供了方向。

Title: DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

Authors: Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.22134
Pdf URL: https://arxiv.org/pdf/2511.22134
Copy Paste: [[2511.22134]] DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action(https://arxiv.org/abs/2511.22134)
Keywords: generation
Abstract: To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: this https URL.
摘要：为了构建具有强大推理能力的通用视觉-语言-动作（VLA）模型，常见的策略是首先对专业的 VLA 进行机器人演示训练，以获得可靠的操作技能，然后将混合注释的机器人数据与多模态数据结合起来，以恢复更广泛的推理能力。然而，我们观察到，与微调前的专业模型相比，由此产生的推理 VLA 经常会出现动作性能下降的情况，我们将这种现象称为动作退化。为了解决这个问题，我们提出了 DualVLA，它通过精心设计的后训练来增强动作性能，同时仍然保留推理能力。我们首先引入一种双层数据修剪方法，该方法可以消除冗余的体现推理，防止其对动作学习产生不利影响。为了进一步加强动作生成，我们设计了一种双教师自适应蒸馏策略，将不同的监督信号分配给不同的数据域，同时保持推理能力。为了填补通才 VLA 的评估空白，我们还提出了 VLA 评分，它将 VLA 能力解耦为推理、意图、行动和对齐维度，以进行更细粒度的评估。实验表明，DualVLA 在 SimplerEnv 中的平均成功率为 61.0，在八个竞争性多模态基准测试中平均得分为 65.4，展示了精确动作执行和多模态理解之间更强的平衡。项目网站：此 https URL。

Title: EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation

Authors: Yanchao Zhao, Jihao Zhu, Yu Liu, Weizhuo Chen, Yuling Yang, Kun Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22135
Pdf URL: https://arxiv.org/pdf/2511.22135
Copy Paste: [[2511.22135]] EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation(https://arxiv.org/abs/2511.22135)
Keywords: generation
Abstract: Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expressions, resulting in outputs that lack naturalness and expressiveness. We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules with progressive training to separately extract semantic and affective features. During pose decoding, the emotional representations guide semantic interaction to generate sign poses with 7-class emotion confidence scores, enabling emotional expression recognition. Experimental results demonstrate that EASL achieves pose accuracy superior to all compared baselines by integrating multi-emotion information and effectively adapts to diffusion models to generate expressive sign language videos.
摘要：大型语言模型通过自动将文本转换为高质量的手语视频，彻底改变了手语的生成，为聋人社区提供了无障碍的交流。然而，现有的基于法学硕士的方法优先考虑语义准确性，而忽视情感表达，导致输出缺乏自然性和表现力。我们提出了 EASL（情感感知手语），这是一种用于细粒度情感整合的多情感引导生成架构。我们引入了带有渐进式训练的情感语义分离模块，以分别提取语义和情感特征。在姿势解码过程中，情感表征引导语义交互生成具有 7 级情感置信度分数的手势姿势，从而实现情感表达识别。实验结果表明，EASL 通过集成多情感信息，实现了优于所有比较基线的姿势准确性，并有效地适应扩散模型以生成富有表现力的手语视频。

Title: IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer

Authors: Bo Chen, Tao Liu, Qi Chen, Xie Chen, Zilong Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22167
Pdf URL: https://arxiv.org/pdf/2511.22167
Copy Paste: [[2511.22167]] IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer(https://arxiv.org/abs/2511.22167)
Keywords: generation
Abstract: Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues. Extensive experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, achieving state-of-the-art quality with superior efficiency, operating at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU. We will release our code and pre-trained models to facilitate applications and future research.
摘要：说话人脸生成旨在从单个图像合成逼真的说话肖像，但现有方法通常依赖于显式光流和局部扭曲，这无法模拟复杂的全局运动并导致身份漂移。我们提出了 IMTalker，这是一种新颖的框架，可通过隐式运动传输实现高效、高保真的说话脸部生成。核心思想是用交叉注意机制取代传统的基于流的扭曲，该机制隐式地模拟统一潜在空间内的运动差异和身份对齐，从而实现鲁棒的全局运动渲染。为了在跨身份重演过程中进一步保留说话者的身份，我们引入了一个身份自适应模块，该模块将潜在的运动投射到个性化空间中，确保运动和身份之间的清晰分离。此外，轻量级的流匹配运动生成器可根据音频、姿势和注视线索生成生动且可控的隐式运动向量。大量实验表明，IMTalker 在运动准确性、身份保留和音频口型同步方面超越了先前的方法，以卓越的效率实现了最先进的质量，在 RTX 4090 GPU 上以 40 FPS 的视频驱动速度和 42 FPS 的音频驱动生成速度运行。我们将发布我们的代码和预训练模型，以促进应用和未来的研究。

Title: Partially Shared Concept Bottleneck Models

Authors: Delong Zhao, Qiang Huang, Di Yan, Yiqun Sun, Jun Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22170
Pdf URL: https://arxiv.org/pdf/2511.22170
Copy Paste: [[2511.22170]] Partially Shared Concept Bottleneck Models(https://arxiv.org/abs/2511.22170)
Keywords: generation
Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by introducing a layer of human-understandable concepts between inputs and predictions. While recent methods automate concept generation using Large Language Models (LLMs) and Vision-Language Models (VLMs), they still face three fundamental challenges: poor visual grounding, concept redundancy, and the absence of principled metrics to balance predictive accuracy and concept compactness. We introduce PS-CBM, a Partially Shared CBM framework that addresses these limitations through three core components: (1) a multimodal concept generator that integrates LLM-derived semantics with exemplar-based visual cues; (2) a Partially Shared Concept Strategy that merges concepts based on activation patterns to balance specificity and compactness; and (3) Concept-Efficient Accuracy (CEA), a post-hoc metric that jointly captures both predictive accuracy and concept compactness. Extensive experiments on eleven diverse datasets show that PS-CBM consistently outperforms state-of-the-art CBMs, improving classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5%, while requiring significantly fewer concepts. These results underscore PS-CBM's effectiveness in achieving both high accuracy and strong interpretability.
摘要：概念瓶颈模型 (CBM) 通过在输入和预测之间引入一层人类可理解的概念来增强可解释性。虽然最近的方法使用大型语言模型（LLM）和视觉语言模型（VLM）自动生成概念，但它们仍然面临三个基本挑战：视觉基础差、概念冗余以及缺乏平衡预测准确性和概念紧凑性的原则性指标。我们引入了 PS-CBM，这是一个部分共享的 CBM 框架，它通过三个核心组件解决了这些限制：（1）一个多模态概念生成器，它将 LLM 派生的语义与基于示例的视觉提示相集成；（2）部分共享概念策略，根据激活模式合并概念以平衡特异性和紧凑性； (3) 概念高效准确性 (CEA)，这是一种事后指标，可共同捕获预测准确性和概念紧凑性。对 11 个不同数据集进行的大量实验表明，PS-CBM 始终优于最先进的 CBM，将分类精度提高 1.0%-7.4%，将 CEA 提高 2.0%-9.5%，同时需要的概念显着减少。这些结果强调了 PS-CBM 在实现高精度和强可解释性方面的有效性。

Title: BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch

Authors: Pu Li, Wenhao Zhang, Weize Quan, Biao Zhang, Peter Wonka, Dong-Ming Yan
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2511.22171
Pdf URL: https://arxiv.org/pdf/2511.22171
Copy Paste: [[2511.22171]] BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch(https://arxiv.org/abs/2511.22171)
Keywords: generation, generative
Abstract: Boundary representation (B-rep) is the de facto standard for CAD model representation in modern industrial design. The intricate coupling between geometric and topological elements in B-rep structures has forced existing generative methods to rely on cascaded multi-stage networks, resulting in error accumulation and computational inefficiency. We present BrepGPT, a single-stage autoregressive framework for B-rep generation. Our key innovation lies in the Voronoi Half-Patch (VHP) representation, which decomposes B-reps into unified local units by assigning geometry to nearest half-edges and sampling their next pointers. Unlike hierarchical representations that require multiple distinct encodings for different structural levels, our VHP representation facilitates unifying geometric attributes and topological relations in a single, coherent format. We further leverage dual VQ-VAEs to encode both vertex topology and Voronoi Half-Patches into vertex-based tokens, achieving a more compact sequential encoding. A decoder-only Transformer is then trained to autoregressively predict these tokens, which are subsequently mapped to vertex-based features and decoded into complete B-rep models. Experiments demonstrate that BrepGPT achieves state-of-the-art performance in unconditional B-rep generation. The framework also exhibits versatility in various applications, including conditional generation from category labels, point clouds, text descriptions, and images, as well as B-rep autocompletion and interpolation.
摘要：边界表示 (B-rep) 是现代工业设计中 CAD 模型表示的事实标准。 B-rep结构中几何和拓扑元素之间复杂的耦合迫使现有的生成方法依赖于级联多级网络，导致误差累积和计算效率低下。我们提出了 BrepGPT，一种用于 B-rep 生成的单阶段自回归框架。我们的关键创新在于 Voronoi Half-Patch (VHP) 表示，它通过将几何体分配给最近的半边并对其下一个指针进行采样，将 B-reps 分解为统一的局部单元。与需要针对不同结构级别进行多种不同编码的分层表示不同，我们的 VHP 表示有助于以单一、连贯的格式统一几何属性和拓扑关系。我们进一步利用双 VQ-VAE 将顶点拓扑和 Voronoi Half-Patches 编码为基于顶点的标记，从而实现更紧凑的顺序编码。然后训练仅解码器的 Transformer 以自回归方式预测这些标记，随后将其映射到基于顶点的特征并解码为完整的 B-rep 模型。实验表明，BrepGPT 在无条件 B-rep 生成中实现了最先进的性能。该框架还在各种应用中展现了多功能性，包括从类别标签、点云、文本描述和图像进行条件生成，以及 B-rep 自动完成和插值。

Title: Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage

Authors: Peiyu Yu, Suraj Kothawade, Sirui Xie, Ying Nian Wu, Hongliang Fei
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.22177
Pdf URL: https://arxiv.org/pdf/2511.22177
Copy Paste: [[2511.22177]] Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage(https://arxiv.org/abs/2511.22177)
Keywords: generation, generative
Abstract: Most post-training methods for text-to-image samplers focus on model weights: either fine-tuning the backbone for alignment or distilling it for few-step efficiency. We take a different route: rescheduling the sampling timeline of a frozen sampler. Instead of a fixed, global schedule, we learn instance-level (prompt- and noise-conditioned) schedules through a single-pass Dirichlet policy. To ensure accurate gradient estimates in high-dimensional policy learning, we introduce a novel reward baseline based on a principled James-Stein estimator; it provably achieves lower estimation errors than commonly used variants and leads to superior performance. Our rescheduled samplers consistently improve text-image alignment including text rendering and compositional control across modern Stable Diffusion and Flux model families. Additionally, a 5-step Flux-Dev sampler with our schedules can attain generation quality comparable to deliberately distilled samplers like Flux-Schnell. We thus position our scheduling framework as an emerging model-agnostic post-training lever that unlocks additional generative potential in pretrained samplers.
摘要：大多数文本到图像采样器的后训练方法都侧重于模型权重：要么微调主干以进行对齐，要么对其进行提炼以提高几步效率。我们采取不同的路线：重新安排冻结采样器的采样时间线。我们不是通过固定的全局调度，而是通过单遍狄利克雷策略来学习实例级（提示条件和噪声条件）调度。为了确保高维策略学习中准确的梯度估计，我们引入了一种基于有原则的 James-Stein 估计器的新颖奖励基线；事实证明，它比常用的变体实现了更低的估计误差，并带来了卓越的性能。我们重新安排的采样器不断改进文本图像对齐，包括现代稳定扩散和通量模型系列中的文本渲染和构图控制。此外，采用我们的时间表的 5 步 Flux-Dev 采样器可以达到与 Flux-Schnell 等故意蒸馏采样器相当的生成质量。因此，我们将调度框架定位为一种新兴的与模型无关的训练后杠杆，可以释放预训练采样器的额外生成潜力。
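
To make the shrinkage idea concrete, the sketch below applies the classical positive-part James-Stein estimator, shrinking per-group mean rewards toward their grand mean, and then uses the result as a REINFORCE baseline. It is an illustration of the textbook estimator under assumed inputs (`group_means`, `noise_var`, and the reward values are hypothetical); the abstract does not give the paper's exact baseline.

```python
def james_stein_shrink(group_means, noise_var):
    """Positive-part James-Stein shrinkage toward the grand mean.

    group_means: noisy mean reward per group (e.g. per prompt).
    noise_var:   assumed known variance of each group mean.
    """
    k = len(group_means)
    if k < 4:
        raise ValueError("shrinking toward the grand mean needs at least 4 groups")
    grand_mean = sum(group_means) / k
    spread = sum((m - grand_mean) ** 2 for m in group_means)
    if spread == 0.0:
        return [grand_mean] * k
    # Shrinkage factor, clipped at zero (the "positive-part" variant).
    factor = max(0.0, 1.0 - (k - 3) * noise_var / spread)
    return [grand_mean + factor * (m - grand_mean) for m in group_means]


# Hypothetical per-prompt mean rewards and noise level (illustrative only).
group_means = [0.0, 1.0, 2.0, 3.0, 4.0]
baselines = james_stein_shrink(group_means, noise_var=1.0)

# REINFORCE weights each sample's log-probability gradient by reward - baseline.
rewards = [0.5, 1.0, 2.5, 3.0, 3.0]
advantages = [r - b for r, b in zip(rewards, baselines)]
```

Here the spread around the grand mean of 2.0 is 10, so the factor is 1 - 2/10 = 0.8 and each baseline moves 20% of the way toward the grand mean.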

Title: HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving

Authors: Qiang Li, Yingwenqi Jiang, Tuoxi Li, Duyu Chen, Xiang Feng, Yucheng Ao, Shangyue Liu, Xingchen Yu, Youcheng Cai, Yumeng Liu, Yuexin Ma, Xin Hu, Li Liu, Yu Zhang, Linkun Xu, Bingtao Gao, Xueyuan Wang, Shuchang Zhou, Xianming Liu, Ligang Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.22187
Pdf URL: https://arxiv.org/pdf/2511.22187
Copy Paste: [[2511.22187]] HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving(https://arxiv.org/abs/2511.22187)
Keywords: generative
Abstract: Realistic and controllable simulation is critical for advancing end-to-end autonomous driving, yet existing approaches often struggle to support novel view synthesis under large viewpoint changes or to ensure geometric consistency. We introduce HybridWorldSim, a hybrid simulation framework that integrates multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents. This unified design addresses key limitations of previous methods, enabling the creation of diverse and high-fidelity driving scenarios with reliable visual and spatial consistency. To facilitate robust benchmarking, we further release a new multi-traversal dataset MIRROR that captures a wide range of routes and environmental conditions across different cities. Extensive experiments demonstrate that HybridWorldSim surpasses previous state-of-the-art methods, providing a practical and scalable solution for high-fidelity simulation and a valuable resource for research and development in autonomous driving.
摘要：真实且可控的模拟对于推进端到端自动驾驶至关重要，但现有方法往往难以支持大视点变化下的新颖视图合成或确保几何一致性。我们引入了 HybridWorldSim，这是一个混合模拟框架，它将静态背景的多重遍历神经重建与动态代理的生成建模集成在一起。这种统一的设计解决了以前方法的主要局限性，能够创建具有可靠视觉和空间一致性的多样化、高保真驾驶场景。为了促进稳健的基准测试，我们进一步发布了一个新的多重遍历数据集 MIRROR，该数据集捕获了不同城市的各种路线和环境条件。大量的实验表明，HybridWorldSim 超越了以前最先进的方法，为高保真模拟提供了实用且可扩展的解决方案，并为自动驾驶的研究和开发提供了宝贵的资源。

Title: Controllable 3D Object Generation with Single Image Prompt

Authors: Jaeseok Lee, Jaekoo Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22194
Pdf URL: https://arxiv.org/pdf/2511.22194
Copy Paste: [[2511.22194]] Controllable 3D Object Generation with Single Image Prompt(https://arxiv.org/abs/2511.22194)
Keywords: generation, generative
Abstract: Recently, the impressive generative capabilities of diffusion models have been demonstrated, producing images with remarkable fidelity. In particular, existing methods for 3D object generation, one of the fastest-growing areas in computer vision, predominantly use text-to-image diffusion models with textual inversion, which trains a pseudo text prompt to describe the given image. In practice, various text-to-image generative models employ textual inversion to learn concepts or styles of a target object in the pseudo text prompt embedding space, thereby generating sophisticated outputs. However, textual inversion requires additional training time and offers limited control. To tackle these issues, we propose two innovative methods: (1) using an off-the-shelf image adapter that generates 3D objects without textual inversion, offering enhanced control over conditions such as depth, pose, and text; and (2) a depth-conditioned warmup strategy to enhance 3D consistency. In experiments, our method shows qualitatively and quantitatively comparable performance and improved 3D consistency relative to existing text-inversion-based alternatives. Furthermore, we conduct a user study to assess (i) how well results match the input image and (ii) whether 3D consistency is maintained. User study results show that our model outperforms the alternatives, validating the effectiveness of our approaches. Our code is available at GitHub repository: this https URL
摘要：最近，扩散模型令人印象深刻的生成能力已经得到证明，可以生成具有极高保真度的图像。特别是，3D 对象生成任务的现有方法是计算机视觉中增长最快的领域之一，主要使用具有文本反转的文本到图像扩散模型，训练伪文本提示来描述给定图像。在实践中，各种文本到图像生成模型采用文本反转来学习伪文本提示嵌入空间中目标对象的概念或风格，从而生成复杂的输出。然而，文本倒置需要额外的训练时间并且缺乏控制能力。为了解决这个问题，我们提出了两种创新方法：(1) 使用现成的图像适配器来生成 3D 对象而无需文本反转，从而增强对深度、姿势和文本等条件的控制。 (2) 深度调节预热策略以增强 3D 一致性。在实验结果中，我们的结果显示了与现有基于文本反转的替代方案在定性和定量上可比较的性能以及改进的 3D 一致性。此外，我们还进行了一项用户研究来评估 (i) 结果与输入图像的匹配程度以及 (ii) 是否保持 3D 一致性。用户研究结果表明，我们的模型优于替代方案，验证了我们方法的有效性。我们的代码可在 GitHub 存储库中找到：此 https URL

Title: From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation

Authors: Zhen Chen, Yihang Fu, Gabriel Madera, Mauro Giuffre, Serina Applebaum, Hyunjae Kim, Hua Xu, Qingyu Chen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.22232
Pdf URL: https://arxiv.org/pdf/2511.22232
Copy Paste: [[2511.22232]] From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation(https://arxiv.org/abs/2511.22232)
Keywords: generation
Abstract: Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature, as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.
摘要：多模态大语言模型 (MLLM) 在推进医疗保健方面显示出了希望。然而，大多数现有模型仍然局限于单图像理解，这极大地限制了它们在临床工作流程中的适用性。在实践中，医学诊断和进展通常需要综合来自不同模式或时间点的多个图像的信息。由于缺乏大规模、高质量的带注释训练数据，具有这种多图像理解能力的医学 MLLM 的发展受到阻碍。为了解决这一限制，我们提出了一种新颖的框架，该框架利用生物医学文献中允许许可的复合图像，作为多图像分析的丰富但未充分利用的数据源。具体来说，我们设计了一个以分而治之策略为基础的五阶段上下文感知指令生成范例。通过将多图像分析分解为可管理的子任务，该范式使 MLLM 能够超越单面板分析，并通过学习这些复合图形中固有的复杂空间、时间和跨模式关系来提供复合理解。通过解析超过 237,000 个复合图形及其上下文文本以生成指令，我们开发了 M3LLM，一种医学多图像多模态大语言模型。对于基准测试，我们构建了用于综合理解的 PMC-MI-Bench，并由医学专家手动验证。大量实验表明，M3LLM 在多图像、单图像、纯文本和多选择场景中显着优于通用和专用医学 MLLM。值得注意的是，M3LLM 对使用 MIMIC 数据集的纵向胸部 X 射线分析表现出很强的泛化能力。这项工作为开发具有复合推理能力的医学 MLLM 建立了一个可扩展且高效的范例，弥合了生物医学文献与现实临床应用之间的差距。

Title: IE-SRGS: An Internal-External Knowledge Fusion Framework for High-Fidelity 3D Gaussian Splatting Super-Resolution

Authors: Xiang Feng, Tieshi Zhong, Shuo Chang, Weiliu Wang, Chengkai Wang, Yifei Chen, Yuhe Wang, Zhenzhong Kuang, Xuefei Yin, Yanming Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22233
Pdf URL: https://arxiv.org/pdf/2511.22233
Copy Paste: [[2511.22233]] IE-SRGS: An Internal-External Knowledge Fusion Framework for High-Fidelity 3D Gaussian Splatting Super-Resolution(https://arxiv.org/abs/2511.22233)
Keywords: super-resolution
Abstract: Reconstructing high-resolution (HR) 3D Gaussian Splatting (3DGS) models from low-resolution (LR) inputs remains challenging due to the lack of fine-grained textures and geometry. Existing methods typically rely on pre-trained 2D super-resolution (2DSR) models to enhance textures, but suffer from 3D Gaussian ambiguity arising from cross-view inconsistencies and domain gaps inherent in 2DSR models. We propose IE-SRGS, a novel 3DGS SR paradigm that addresses this issue by jointly leveraging the complementary strengths of external 2DSR priors and internal 3DGS features. Specifically, we use 2DSR and depth estimation models to generate HR images and depth maps as external knowledge, and employ multi-scale 3DGS models to produce cross-view consistent, domain-adaptive counterparts as internal knowledge. A mask-guided fusion strategy is introduced to integrate these two sources and synergistically exploit their complementary strengths, effectively guiding the 3D Gaussian optimization toward high-fidelity reconstruction. Extensive experiments on both synthetic and real-world benchmarks show that IE-SRGS consistently outperforms state-of-the-art methods in both quantitative accuracy and visual fidelity.
摘要：由于缺乏细粒度纹理和几何形状，从低分辨率 (LR) 输入重建高分辨率 (HR) 3D 高斯溅射 (3DGS) 模型仍然具有挑战性。现有方法通常依赖于预先训练的 2D 超分辨率 (2DSR) 模型来增强纹理，但会因 2DSR 模型中固有的跨视图不一致和域间隙而产生 3D 高斯模糊性。我们提出了 IE-SRGS，这是一种新颖的 3DGS SR 范例，它通过联合利用外部 2DSR 先验和内部 3DGS 特征的互补优势来解决这个问题。具体来说，我们使用 2DSR 和深度估计模型来生成 HR 图像和深度图作为外部知识，并使用多尺度 3DGS 模型来生成跨视图一致、领域自适应的对应物作为内部知识。引入掩模引导融合策略来集成这两个源并协同利用它们的互补优势，有效引导 3D 高斯优化实现高保真重建。对合成和现实世界基准的大量实验表明，IE-SRGS 在定量准确性和视觉保真度方面始终优于最先进的方法。
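
One simple way to picture a mask-guided fusion of two image sources is a per-pixel convex blend, sketched below. This is an assumption for illustration only: the abstract does not say how IE-SRGS builds or applies its masks, and the function name and toy 2x2 arrays are hypothetical.

```python
def mask_guided_fusion(external, internal, mask):
    """Blend two same-sized 2D arrays pixel by pixel.

    mask values lie in [0, 1]: 1 keeps the external (2DSR-style) value,
    0 keeps the internal (3DGS-rendered) value.
    """
    fused = []
    for ext_row, int_row, mask_row in zip(external, internal, mask):
        fused.append([
            m * e + (1.0 - m) * i
            for e, i, m in zip(ext_row, int_row, mask_row)
        ])
    return fused


# Toy 2x2 inputs (illustrative only).
external = [[1.0, 1.0], [1.0, 1.0]]
internal = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.5, 0.25]]

fused = mask_guided_fusion(external, internal, mask)
```

With these inputs the fused image equals the mask itself, which makes it easy to see which source dominates at each pixel.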

Title: Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective

Authors: Bolin Lai, Xudong Wang, Saketh Rambhatla, James M. Rehg, Zsolt Kira, Rohit Girdhar, Ishan Misra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22249
Pdf URL: https://arxiv.org/pdf/2511.22249
Copy Paste: [[2511.22249]] Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective(https://arxiv.org/abs/2511.22249)
Keywords: generation
Abstract: Latent diffusion has become the default paradigm for visual generation, yet we observe a persistent reconstruction-generation trade-off as latent dimensionality increases: higher-capacity autoencoders improve reconstruction fidelity but generation quality eventually declines. We trace this gap to the different behaviors in high-frequency encoding and decoding. Through controlled perturbations in both RGB and latent domains, we analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency contents, yielding insufficient exposure and underfitting in high-frequency bands for diffusion model training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training -- without modifying or retraining the autoencoder. Applied across several high-dimensional autoencoders, FreqWarm consistently improves generation quality: decreasing gFID by 14.11 on Wan2.2-VAE, 6.13 on LTX-VAE, and 4.42 on DC-AE-f32, while remaining architecture-agnostic and compatible with diverse backbones. Our study shows that explicitly managing frequency exposure can successfully turn high-dimensional latent spaces into more diffusible targets.

Title: TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

Authors: Henrijs Princis, Arindam Sharma, Cristina David
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22277
Pdf URL: https://arxiv.org/pdf/2511.22277
Copy Paste: [[2511.22277]] TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation(https://arxiv.org/abs/2511.22277)
Keywords: generation
Abstract: Large language models (LLMs) have shown remarkable ability to generate code, yet their outputs often violate syntactic or semantic constraints when guided only through natural language prompts. We introduce TreeCoder, the most general and flexible framework to date for exploring decoding strategies, constraints, and hyperparameters in LLMs, and use it in code generation to enforce correctness and structure during decoding rather than relying on prompt engineering. TreeCoder represents decoding as a tree search over candidate programs, where both decoding strategies and constraint functions - such as style, syntax, execution - are treated as first-class, optimisable components. This design enables systematic exploration and automatic tuning of decoding configurations using standard optimisation techniques. Experiments on the MBPP (Python) and SQL-Spider benchmarks show that TreeCoder consistently improves accuracy across open-source models such as CodeLlama, Mistral and DeepSeek, often outperforming their unconstrained baselines by considerable margins.
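Illustrative sketch (not the authors' implementation): the abstract describes decoding as a tree search in which constraint functions prune candidate programs. The Python below shows that idea as a constrained beam search. Everything in it is an assumption made for illustration: the four-token vocabulary, `toy_logprobs` standing in for an LLM, the bracket-balance constraint `balanced_prefix`, and the function name `constrained_beam_search`.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Tokens = Tuple[str, ...]
Constraint = Callable[[Tokens], bool]


def toy_logprobs(prefix: Tokens) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM: fixed next-token log-probabilities."""
    weights = {"(": 1.0, ")": 2.0, "x": 1.5, "<eos>": 1.0}
    total = sum(weights.values())
    return {tok: math.log(w / total) for tok, w in weights.items()}


def balanced_prefix(tokens: Tokens) -> bool:
    """Syntax-style constraint: never close an unopened bracket, and a
    finished sequence (ending in <eos>) must have all brackets closed."""
    depth = 0
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return False
    if tokens and tokens[-1] == "<eos>":
        return depth == 0
    return True


def constrained_beam_search(
    constraints: Sequence[Constraint], beam_width: int = 3, max_len: int = 6
) -> List[Tuple[float, Tokens]]:
    """Expand the decoding tree level by level, dropping any child that
    violates a constraint and keeping the best `beam_width` prefixes."""
    beams: List[Tuple[float, Tokens]] = [(0.0, ())]
    finished: List[Tuple[float, Tokens]] = []
    for _ in range(max_len):
        candidates: List[Tuple[float, Tokens]] = []
        for score, prefix in beams:
            for tok, logp in toy_logprobs(prefix).items():
                child = prefix + (tok,)
                if not all(check(child) for check in constraints):
                    continue  # prune this branch of the tree
                entry = (score + logp, child)
                if tok == "<eos>":
                    finished.append(entry)
                else:
                    candidates.append(entry)
        candidates.sort(key=lambda e: e[0], reverse=True)
        beams = candidates[:beam_width]
        if not beams:
            break
    return sorted(finished, key=lambda e: e[0], reverse=True)


if __name__ == "__main__":
    for score, tokens in constrained_beam_search([balanced_prefix])[:3]:
        print(round(score, 3), " ".join(tokens))
```

Passing an empty constraint list lets unbalanced sequences such as `) <eos>` through, which is the kind of output the constraint is there to prune.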

Title: The Collapse of Patches

Authors: Wei Guo, Shunqi Mao, Zhuonan Liang, Heng Wang, Weidong Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22281
Pdf URL: https://arxiv.org/pdf/2511.22281
Copy Paste: [[2511.22281]] The Collapse of Patches(https://arxiv.org/abs/2511.22281)
Keywords: generation
Abstract: Observing certain patches in an image reduces the uncertainty of others. Their realization lowers the distribution entropy of each remaining patch feature, analogous to collapsing a particle's wave function in quantum mechanics. This phenomenon can intuitively be called patch collapse. To identify which patches are most relied on during a target region's collapse, we learn an autoencoder that softly selects a subset of patches to reconstruct each target patch. Graphing these learned dependencies for each patch's PageRank score reveals the optimal patch order to realize an image. We show that respecting this order benefits various masked image modeling methods. First, autoregressive image generation can be boosted by retraining the state-of-the-art model MAR. Next, we introduce a new setup for image classification by exposing Vision Transformers only to high-rank patches in the collapse order. Seeing 22% of such patches is sufficient to achieve high accuracy. With these experiments, we propose patch collapse as a novel image modeling perspective that promotes vision efficiency. Our project is available at this https URL.
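Illustrative sketch (not the authors' implementation): the abstract ranks patches by PageRank over learned patch dependencies. The Python below runs plain power-iteration PageRank on a small hand-written dependency matrix and sorts patches by score. The matrix values, the edge direction (a target patch "votes" for the patches it relies on), the damping factor, and the names `pagerank` and `collapse_order` are all assumptions; the paper's dependencies come from a learned autoencoder that is not reproduced here.

```python
from typing import List


def pagerank(weights: List[List[float]], damping: float = 0.85,
             iters: int = 100) -> List[float]:
    """Power-iteration PageRank.

    weights[i][j] > 0 means target patch i relies on source patch j,
    so patch i passes rank to patch j in proportion to that weight.
    """
    n = len(weights)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new_rank = [(1.0 - damping) / n] * n
        for i, row in enumerate(weights):
            out = sum(row)
            if out == 0:
                # Dangling patch: spread its rank uniformly.
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
            else:
                for j, w in enumerate(row):
                    new_rank[j] += damping * rank[i] * w / out
        rank = new_rank
    return rank


def collapse_order(weights: List[List[float]]) -> List[int]:
    """Patch indices sorted from highest to lowest PageRank score."""
    scores = pagerank(weights)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


if __name__ == "__main__":
    # Hypothetical 4-patch example: patches 1-3 rely on patch 0,
    # and patch 0 relies on patch 1.
    deps = [
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
    ]
    print(collapse_order(deps))
```

In this toy matrix, patch 0 is relied on by every other patch and therefore comes first in the order, followed by patch 1.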

Title: Match-and-Fuse: Consistent Generation from Unstructured Image Sets

Authors: Kate Feingold, Omri Kaduri, Tali Dekel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22287
Pdf URL: https://arxiv.org/pdf/2511.22287
Copy Paste: [[2511.22287]] Match-and-Fuse: Consistent Generation from Unstructured Image Sets(https://arxiv.org/abs/2511.22287)
Keywords: generation
Abstract: We present Match-and-Fuse - a zero-shot, training-free method for consistent controlled generation of unstructured image sets - collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. Unlike existing methods that operate on individual images or densely sampled videos, our framework performs set-to-set generation: given a source set and user prompts, it produces a new set that preserves cross-image consistency of shared content. Our key idea is to model the task as a graph, where each node corresponds to an image and each edge triggers a joint generation of image pairs. This formulation consolidates all pairwise generations into a unified framework, enforcing their local consistency while ensuring global coherence across the entire set. This is achieved by fusing internal features across image pairs, guided by dense input correspondences, without requiring masks or manual supervision. It also allows us to leverage an emergent prior in text-to-image models that encourages coherent generation when multiple views share a single canvas. Match-and-Fuse achieves state-of-the-art consistency and visual quality, and unlocks new capabilities for content creation from image collections.

Title: Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

Authors: Yang Chen, Xiaowei Xu, Shuai Wang, Chenhui Zhu, Ruxue Wen, Xubin Li, Tiezheng Ge, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22345
Pdf URL: https://arxiv.org/pdf/2511.22345
Copy Paste: [[2511.22345]] Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment(https://arxiv.org/abs/2511.22345)
Keywords: generation, generative
Abstract: Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3×, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64×64 and 256×256. Our code is available at this https URL.

Title: INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts

Authors: Anshul Bagaria
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22351
Pdf URL: https://arxiv.org/pdf/2511.22351
Copy Paste: [[2511.22351]] INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts(https://arxiv.org/abs/2511.22351)
Keywords: super-resolution, generative
Abstract: The growing realism of AI-generated images produced by recent GAN and diffusion models has intensified concerns over the reliability of visual media. Yet, despite notable progress in deepfake detection, current forensic systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Moreover, most detectors operate as opaque classifiers, offering little insight into why an image is flagged as synthetic, undermining trust and hindering adoption in high-stakes settings. We introduce INSIGHT (Interpretable Neural Semantic and Image-based Generative-forensic Hallucination Tracing), a unified multimodal framework for robust detection and transparent explanation of AI-generated images, even at extremely low resolutions (16x16 - 64x64). INSIGHT combines hierarchical super-resolution for amplifying subtle forensic cues without inducing misleading artifacts, Grad-CAM driven multi-scale localization to reveal spatial regions indicative of generative patterns, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptors. A vision-language model is then prompted using a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality. Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves both detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines. Our results highlight a practical path toward transparent, reliable AI-generated image forensics and establish INSIGHT as a step forward in trustworthy multimodal content verification.

Title: DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention

Authors: Furkan Guzelant, Arda Goktogan, Tarık Kaya, Aysegul Dundar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22411
Pdf URL: https://arxiv.org/pdf/2511.22411
Copy Paste: [[2511.22411]] DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention(https://arxiv.org/abs/2511.22411)
Keywords: generation
Abstract: 3D head stylization has emerged as a key technique for reimagining realistic human heads in various artistic forms, enabling expressive character design and creative visual experiences in digital media. Despite the progress in 3D-aware generation, existing 3D head stylization methods often rely on computationally expensive optimization or domain-specific fine-tuning to adapt to new styles. To address these limitations, we propose DiffStyle360, a diffusion-based framework capable of producing multi-view consistent, identity-preserving 3D head stylizations across diverse artistic domains given a single style reference image, without requiring per-style training. Building upon the 3D-aware DiffPortrait360 architecture, our approach introduces two key components: the Style Appearance Module, which disentangles style from content, and the Style Fusion Attention mechanism, which adaptively balances structure preservation and stylization fidelity in the latent space. Furthermore, we employ a 3D GAN-generated multi-view dataset for robust fine-tuning and introduce a temperature-based key scaling strategy to control stylization intensity during inference. Extensive experiments on FFHQ and RenderMe360 demonstrate that DiffStyle360 achieves superior style quality, outperforming state-of-the-art GAN- and diffusion-based stylization methods across challenging style domains.

Title: Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models

Authors: Minghao Yin, Yukang Cao, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22425
Pdf URL: https://arxiv.org/pdf/2511.22425
Copy Paste: [[2511.22425]] Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models(https://arxiv.org/abs/2511.22425)
Keywords: generative
Abstract: We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.

Title: Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition

Authors: Maheswar Bora, Tashvik Dhamija, Shukesh Reddy, Baptiste Chopin, Pranav Balaji, Abhijit Das, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22443
Pdf URL: https://arxiv.org/pdf/2511.22443
Copy Paste: [[2511.22443]] Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition(https://arxiv.org/abs/2511.22443)
Keywords: generation
Abstract: Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context has to do with zero-shot detection, i.e., generalizable detection, which we focus on in this work. FauxNet consistently outperforms the state-of-the-art in this setting. In addition, FauxNet is able to perform attribution, i.e., to distinguish between the generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.

Title: Beyond Real versus Fake Towards Intent-Aware Video Analysis

Authors: Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22455
Pdf URL: https://arxiv.org/pdf/2511.22455
Copy Paste: [[2511.22455]] Beyond Real versus Fake Towards Intent-Aware Video Analysis(https://arxiv.org/abs/2511.22455)
Keywords: generative
Abstract: The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including "Financial fraud", "Indirect marketing", "Political propaganda", as well as "Fear mongering". We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.

Title: ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models

Authors: Zhenglin Zhou, Fan Ma, Xiaobo Xia, Hehe Fan, Yi Yang, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22456
Pdf URL: https://arxiv.org/pdf/2511.22456
Copy Paste: [[2511.22456]] ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models(https://arxiv.org/abs/2511.22456)
Keywords: generation, generative
Abstract: We explore inference-time scaling in text-guided 3D diffusion models to enhance generative quality without additional training. To this end, we introduce ITS3D, a framework that formulates the task as an optimization problem to identify the most effective Gaussian noise input. The framework is driven by a verifier-guided search algorithm, where the search algorithm iteratively refines noise candidates based on verifier feedback. To address the inherent challenges of 3D generation, we introduce three techniques for improved stability, efficiency, and exploration capability. 1) Gaussian normalization is applied to stabilize the search process. It corrects distribution shifts when noise candidates deviate from a standard Gaussian distribution during iterative updates. 2) The high-dimensional nature of the 3D search space increases computational complexity. To mitigate this, a singular value decomposition-based compression technique is employed to reduce dimensionality while preserving effective search directions. 3) To further prevent convergence to suboptimal local minima, a singular space reset mechanism dynamically updates the search space based on diversity measures. Extensive experiments demonstrate that ITS3D enhances text-to-3D generation quality, which shows the potential of computationally efficient search methods in generative processes. The source code is available at this https URL.
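Illustrative sketch (not the authors' implementation): the abstract describes a verifier-guided search over Gaussian noise inputs, with Gaussian normalization correcting candidates that drift from a standard Gaussian during iterative updates. The Python below shows only those two pieces as a simple hill-climbing loop. The verifier `toy_verifier`, the perturbation scheme, the hyperparameters, and all function names are assumptions. The SVD-based compression and singular space reset from the abstract are not modeled, and no 3D diffusion model is involved.

```python
import random
import statistics
from typing import Callable, List, Tuple


def gaussian_normalize(noise: List[float]) -> List[float]:
    """Re-standardize a candidate to zero mean and unit variance."""
    mean = statistics.fmean(noise)
    std = statistics.pstdev(noise)
    if std == 0:
        return list(noise)
    return [(x - mean) / std for x in noise]


def toy_verifier(noise: List[float]) -> float:
    """Hypothetical verifier: rewards noise with an alternating sign pattern.
    A real verifier would score the 3D output generated from the noise."""
    return sum(x if i % 2 == 0 else -x for i, x in enumerate(noise))


def verifier_guided_search(
    verifier: Callable[[List[float]], float],
    dim: int = 16,
    rounds: int = 50,
    candidates: int = 8,
    step: float = 0.3,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Perturb the current best noise, normalize each proposal, and keep
    a proposal only if the verifier scores it higher."""
    rng = random.Random(seed)
    best = gaussian_normalize([rng.gauss(0, 1) for _ in range(dim)])
    best_score = verifier(best)
    for _ in range(rounds):
        for _ in range(candidates):
            proposal = [x + step * rng.gauss(0, 1) for x in best]
            proposal = gaussian_normalize(proposal)  # undo distribution drift
            score = verifier(proposal)
            if score > best_score:
                best, best_score = proposal, score
    return best, best_score


if __name__ == "__main__":
    noise, score = verifier_guided_search(toy_verifier)
    print(round(score, 3))
```

Because every proposal is normalized before scoring, the search cannot raise the verifier score simply by inflating the noise magnitude; it has to change the direction of the noise vector.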

Title: Rethinking Cross-Generator Image Forgery Detection through DINOv3

Authors: Zhenglin Huang, Jason Li, Haiquan Wen, Tianxiao Li, Xi Yang, Lu Qi, Bei Peng, Xiaowei Huang, Ming-Hsuan Yang, Guangliang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22471
Pdf URL: https://arxiv.org/pdf/2511.22471
Copy Paste: [[2511.22471]] Rethinking Cross-Generator Image Forgery Detection through DINOv3(https://arxiv.org/abs/2511.22471)
Keywords: generative
Abstract: As generative models become increasingly diverse and powerful, cross-generator detection has emerged as a new challenge. Existing detection methods often memorize artifacts of specific generative models rather than learning transferable cues, leading to substantial failures on unseen generators. Surprisingly, this work finds that frozen visual foundation models, especially DINOv3, already exhibit strong cross-generator detection capability without any fine-tuning. Through systematic studies on frequency, spatial, and token perspectives, we observe that DINOv3 tends to rely on global, low-frequency structures as weak but transferable authenticity cues instead of high-frequency, generator-specific artifacts. Motivated by this insight, we introduce a simple, training-free token-ranking strategy followed by a lightweight linear probe to select a small subset of authenticity-relevant tokens. This token subset consistently improves detection accuracy across all evaluated datasets. Our study provides empirical evidence and a feasible hypothesis for understanding why foundation models generalize across diverse generators, offering a universal, efficient, and interpretable baseline for image forgery detection.

Title: Adversarial Flow Models

Authors: Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.22475
Pdf URL: https://arxiv.org/pdf/2511.22475
Copy Paste: [[2511.22475]] Adversarial Flow Models(https://arxiv.org/abs/2511.22475)
Keywords: generation, generative
Abstract: We present adversarial flow models, a class of generative models that unifies adversarial models and flow models. Our method supports native one-step or multi-step generation and is trained using the adversarial objective. Unlike traditional GANs, where the generator learns an arbitrary transport plan between the noise and the data distributions, our generator learns a deterministic noise-to-data mapping, which is the same optimal transport as in flow-matching models. This significantly stabilizes adversarial training. Also, unlike consistency-based methods, our model directly learns one-step or few-step generation without needing to learn the intermediate timesteps of the probability flow for propagation. This saves model capacity, reduces training iterations, and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model creates a new best FID of 2.38. We additionally show the possibility of end-to-end training of 56-layer and 112-layer models through depth repetition without any intermediate supervision, and achieve FIDs of 2.08 and 1.94 using a single forward pass, surpassing their 2NFE and 4NFE counterparts.

Title: Enhancing Trustworthiness with Mixed Precision: Benchmarks, Opportunities, and Challenges

Authors: Guanxi Lu, Hao Mark Chen, Zhiqiang Que, Wayne Luk, Hongxiang Fan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22483
Pdf URL: https://arxiv.org/pdf/2511.22483
Copy Paste: [[2511.22483]] Enhancing Trustworthiness with Mixed Precision: Benchmarks, Opportunities, and Challenges(https://arxiv.org/abs/2511.22483)
Keywords: generation
Abstract: Large language models (LLMs) have shown promising performance across various tasks. However, their autoregressive decoding process poses significant challenges for efficient deployment on existing AI hardware. Quantization alleviates memory and compute pressure by compressing weights, activations, and KV caches to low precisions while preserving generation quality. However, existing quantization frameworks typically focus on perplexity or classification accuracy, often omitting critical trustworthiness metrics. This gap introduces risks when applying quantized LLMs to downstream high-stakes domains such as finance and healthcare. In this work, we systematically investigate the impact of quantization on four trustworthiness metrics (adversarial robustness, fairness, machine ethics, and out-of-distribution robustness) and identify the instability across compression ratios and quantization methods. Building on these observations, we develop a novel precision-ensemble voting approach that leverages predictions from mixed-precision variants of the same model and consistently improves performance by up to 5.8% on trustworthiness metrics. Our results highlight the importance of considering trustworthiness when developing model compression techniques and point to research opportunities at the intersection of compression and trustworthiness for safety-critical applications.
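Illustrative sketch (not the authors' implementation): the abstract mentions a precision-ensemble voting approach that combines predictions from mixed-precision variants of the same model. The Python below shows the simplest form such a scheme could take, a per-example majority vote. The variant names (`fp16`, `int8`, `int4`), the labels, the tie-breaking rule, and the function name `precision_ensemble_vote` are assumptions; the abstract does not specify how the paper weights or aggregates votes.

```python
from collections import Counter
from typing import Dict, Hashable, List


def precision_ensemble_vote(
    predictions: Dict[str, List[Hashable]]
) -> List[Hashable]:
    """Majority vote per example across model variants.

    `predictions` maps a variant name to its list of predicted labels.
    Ties go to the label predicted by the variant listed first.
    """
    variants = list(predictions)
    n_examples = len(predictions[variants[0]])
    if any(len(predictions[v]) != n_examples for v in variants):
        raise ValueError("all variants must predict the same examples")

    voted = []
    for i in range(n_examples):
        votes = [predictions[v][i] for v in variants]
        counts = Counter(votes)
        top = max(counts.values())
        # First label, in variant order, that reaches the top count.
        voted.append(next(label for label in votes if counts[label] == top))
    return voted


if __name__ == "__main__":
    # Hypothetical predictions from three precision variants of one model.
    preds = {
        "fp16": ["safe", "unsafe", "safe"],
        "int8": ["safe", "safe", "unsafe"],
        "int4": ["unsafe", "unsafe", "unsafe"],
    }
    print(precision_ensemble_vote(preds))
```

Listing the highest-precision variant first makes it the tie-breaker, which is one reasonable default when the number of variants is even.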

Title: AI killed the video star. Audio-driven diffusion model for expressive talking head generation

Authors: Baptiste Chopin, Tashvik Dhamija, Pranav Balaji, Yaohui Wang, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22488
Pdf URL: https://arxiv.org/pdf/2511.22488
Copy Paste: [[2511.22488]] AI killed the video star. Audio-driven diffusion model for expressive talking head generation(https://arxiv.org/abs/2511.22488)
Keywords: generation
Abstract: We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.

Title: SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts

Authors: Shun Inadumi, Shohei Tanaka, Tosho Hirasawa, Atsushi Hashimoto, Koichiro Yoshino, Yoshitaka Ushiku
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2511.22490
Pdf URL: https://arxiv.org/pdf/2511.22490
Copy Paste: [[2511.22490]] SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts(https://arxiv.org/abs/2511.22490)
Keywords: generation
Abstract: As the number of scientific papers continues to grow, there is a demand for approaches that can effectively convey research findings, with posters serving as a key medium for presenting paper contents. Poster layouts determine how effectively research is communicated and understood, highlighting their growing importance. In particular, a gap remains in understanding how papers correspond to the layouts that present them, which calls for datasets with paired annotations at scale. To bridge this gap, we introduce SciPostGen, a large-scale dataset for understanding and generating poster layouts from scientific papers. Our analyses based on SciPostGen show that paper structures are associated with the number of layout elements in posters. Based on this insight, we explore a framework, Retrieval-Augmented Poster Layout Generation, which retrieves layouts consistent with a given paper and uses them as guidance for layout generation. We conducted experiments under two conditions: with and without layout constraints typically specified by poster creators. The results show that the retriever estimates layouts aligned with paper structures, and our framework generates layouts that also satisfy given constraints.
摘要：随着科学论文数量的不断增长，需要能够有效传达研究成果的方法，海报作为展示论文内容的关键媒介。海报布局决定了研究成果的传播和理解的有效性，凸显了它们日益增长的重要性。特别是，在理解论文如何与呈现它们的布局相对应方面仍然存在差距，这需要具有大规模成对注释的数据集。为了弥补这一差距，我们引入了 SciPostGen，这是一个用于理解科学论文并生成海报布局的大型数据集。我们基于 SciPostGen 的分析表明，论文结构与海报中布局元素的数量相关。基于这一见解，我们探索了一个框架，即检索增强海报布局生成，该框架检索与给定论文一致的布局，并将其用作布局生成的指导。我们在两种条件下进行了实验：有或没有海报创建者通常指定的布局约束。结果表明，检索器估计的布局与论文结构一致，并且我们的框架生成的布局也满足给定约束。
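
The retrieval step described above (finding layouts consistent with a given paper) might look roughly like the nearest-neighbor sketch below. The feature vectors, the cosine-similarity ranking, and all names are assumptions for illustration; the abstract does not describe the actual retriever.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_layouts(paper_vec, layout_bank, k=2):
    """Return the names of the k layouts whose paired-paper features are closest."""
    ranked = sorted(layout_bank, key=lambda item: cosine(paper_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical structure features per paper: (sections, figures, tables).
bank = [("layout_a", [5, 4, 1]), ("layout_b", [8, 1, 6]), ("layout_c", [5, 5, 1])]
print(retrieve_layouts([5, 4, 2], bank, k=2))
```

The retrieved layouts would then be passed to a generator as guidance, which is outside the scope of this sketch.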

Title: Space Explanations of Neural Network Classification

Authors: Faezeh Labbaf, Tomáš Kolárik, Martin Blicha, Grigory Fedyukovich, Michael Wand, Natasha Sharygina
Subjects: cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2511.22498
Pdf URL: https://arxiv.org/pdf/2511.22498
Copy Paste: [[2511.22498]] Space Explanations of Neural Network Classification(https://arxiv.org/abs/2511.22498)
Keywords: generation
Abstract: We present Space Explanations, a novel logic-based concept for neural network classifiers that gives provable guarantees on the behavior of the network in continuous areas of the input feature space. To automatically generate space explanations, we leverage a range of flexible Craig interpolation algorithms and unsatisfiable core generation. Based on real-life case studies ranging in size from small to medium to large, we demonstrate that the generated explanations are more meaningful than those computed by state-of-the-art methods.
摘要：我们提出了一种新的基于逻辑的概念，称为空间解释，用于对神经网络进行分类，为输入特征空间的连续区域中的网络行为提供可证明的保证。为了自动生成空间解释，我们利用了一系列灵活的 Craig 插值算法和不可满足的核心生成。基于从小到中到大尺寸的现实案例研究，我们证明生成的解释比最先进的计算结果更有意义。

Title: What Shape Is Optimal for Masks in Text Removal?

Authors: Hyakka Nakada, Marika Kubota
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22499
Pdf URL: https://arxiv.org/pdf/2511.22499
Copy Paste: [[2511.22499]] What Shape Is Optimal for Masks in Text Removal?(https://arxiv.org/abs/2511.22499)
Keywords: generative
Abstract: The advent of generative models has dramatically improved the accuracy of image inpainting. In particular, reconstructing original images by removing specific text from document images is extremely important for industrial applications. However, most existing text-removal methods focus on deleting simple scene text that appears in images captured by a camera in an outdoor environment. There is little research dedicated to complex, practical images with dense text. Therefore, we created benchmark data for text removal from images that contain a large amount of text. From the data, we found that text-removal performance is vulnerable to mask profile perturbation. Thus, for practical text-removal tasks, precise tuning of the mask shape is essential. This study developed a method to model highly flexible mask profiles and learn their parameters using Bayesian optimization. The resulting profiles were found to be character-wise masks. It was also found that the minimum cover of a text region is not optimal. Our research is expected to pave the way for a user-friendly guideline for manual masking.
摘要：生成模型的出现极大地提高了图像修复的准确性。特别是，通过从文档图像中删除特定文本，重建原始图像对于工业应用极为重要。然而，大多数现有的文本删除方法集中于删除出现在室外环境中相机捕获的图像中的简单场景文本。很少有专门针对带有密集文本的复杂实用图像的研究。因此，我们创建了用于从包含大量文本的图像中删除文本的基准数据。从数据中，我们发现文本删除性能容易受到掩模轮廓扰动的影响。因此，对于实际的文本删除任务，精确调整掩模形状至关重要。这项研究开发了一种对高度灵活的掩模轮廓进行建模并使用贝叶斯优化学习其参数的方法。结果发现，所得的配置文件是字符型掩码。还发现文本区域的最小覆盖并不是最佳的。我们的研究预计将为用户友好的手动屏蔽指南铺平道路。

Title: Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration

Authors: Mengyu Yang, Yanming Yang, Chenyi Xu, Chenxi Song, Yufan Zuo, Tong Zhao, Ruibo Li, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22533
Pdf URL: https://arxiv.org/pdf/2511.22533
Copy Paste: [[2511.22533]] Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration(https://arxiv.org/abs/2511.22533)
Keywords: generation, generative
Abstract: Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.8% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).
摘要：扩散模型在 2D 图像、视频和 3D 形状等模态中取得了令人印象深刻的生成质量，但由于迭代去噪过程，其推理的计算成本仍然很高。虽然最近基于缓存的方法有效地重用冗余计算来加速 2D 和视频生成，但将这些技术直接应用于 3D 扩散模型可能会严重破坏几何一致性。在 3D 合成中，即使缓存的潜在特征中的微小数值错误也会累积，导致结构伪影和拓扑不一致。为了克服这一限制，我们提出了 Fast3Dcache，这是一种免训练的几何感知缓存框架，可以加速 3D 扩散推理，同时保持几何保真度。我们的方法引入了预测缓存调度器约束（PCSC）来根据体素稳定模式动态确定缓存配额，并引入时空稳定性标准（SSC）来根据速度大小和加速度标准选择稳定特征以供重用。综合实验表明，Fast3Dcache 显着加速了推理，实现了 27.12% 的加速和 54.8% 的 FLOP 减少，并且通过 Chamfer Distance (2.48%) 和 F-Score (1.95%) 衡量的几何质量下降最小。
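
As a rough illustration of selecting stable features for reuse based on velocity magnitude and acceleration, the sketch below scores each feature from its last three values and caches the lowest-scoring ones up to a quota. The score formula, the weight `w_acc`, and the function name are assumptions; the abstract does not give the actual PCSC or SSC definitions.

```python
def select_cached(prev2, prev1, curr, quota, w_acc=1.0):
    """Pick up to `quota` feature indices that look stable enough to reuse.

    velocity     = change between the last two steps
    acceleration = change in velocity between the last two intervals
    The score |velocity| + w_acc * |acceleration| is an assumed combination.
    """
    scores = []
    for i, (a, b, c) in enumerate(zip(prev2, prev1, curr)):
        velocity = c - b
        acceleration = (c - b) - (b - a)
        scores.append((abs(velocity) + w_acc * abs(acceleration), i))
    scores.sort()  # most stable (lowest score) first
    return sorted(i for _, i in scores[:quota])


# Three scalar features tracked over three denoising steps (toy values).
print(select_cached([0.0, 1.0, 2.0], [0.5, 1.0, 2.1], [1.5, 1.0, 2.2], quota=2))
```

In a real 3D diffusion pipeline the features would be voxel-level latents and the quota would vary per step; here both are fixed toy inputs.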

Title: Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior

Authors: Ruoyu Feng, Yunpeng Qi, Jinming Liu, Yixin Gao, Xin Li, Xin Jin, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22549
Pdf URL: https://arxiv.org/pdf/2511.22549
Copy Paste: [[2511.22549]] Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior(https://arxiv.org/abs/2511.22549)
Keywords: generative
Abstract: Image compression methods are usually optimized in isolation for either human perception or machine analysis tasks. We reveal fundamental commonalities between these objectives: preserving accurate semantic information is paramount, as it directly dictates the integrity of critical information for intelligent tasks and aids human understanding. Concurrently, enhanced perceptual quality not only improves visual appeal but also, by ensuring realistic image distributions, benefits semantic feature extraction for machine tasks. Based on this insight, we propose Diff-ICMH, a generative image compression framework that aims to harmonize machine and human vision in image compression. It ensures perceptual realism by leveraging generative priors and simultaneously guarantees semantic fidelity through the incorporation of Semantic Consistency loss (SC loss) during training. Additionally, we introduce the Tag Guidance Module (TGM) that leverages highly semantic image-level tags to stimulate the pre-trained diffusion model's generative capabilities, requiring minimal additional bit rates. Consequently, Diff-ICMH supports multiple intelligent tasks through a single codec and bitstream without any task-specific adaptation, while preserving high-quality visual experience for human perception. Extensive experimental results demonstrate Diff-ICMH's superiority and generalizability across diverse tasks, while maintaining visual appeal for human perception. Code is available at: this https URL.
摘要：图像压缩方法通常针对人类感知或机器分析任务单独优化。我们揭示了这些目标之间的基本共性：保留准确的语义信息至关重要，因为它直接决定了智能任务关键信息的完整性并有助于人类理解。同时，增强的感知质量不仅可以提高视觉吸引力，而且通过确保真实的图像分布，有利于机器任务的语义特征提取。基于这一见解，我们提出了 Diff-ICMH，一种生成图像压缩框架，旨在协调图像压缩中的机器视觉和人类视觉。它通过利用生成先验来确保感知真实性，同时通过在训练期间结合语义一致性损失（SC 损失）来保证语义保真度。此外，我们还引入了标签指导模块（TGM），它利用高度语义的图像级标签来刺激预先训练的扩散模型的生成能力，并且需要最少的额外比特率。因此，Diff-ICMH 通过单个编解码器和比特流支持多个智能任务，无需任何特定于任务的适应，同时保留人类感知的高质量视觉体验。大量的实验结果证明了 Diff-ICMH 在不同任务中的优越性和普遍性，同时保持了对人类感知的视觉吸引力。代码可在以下位置获得：此 https URL。

Title: Bringing Your Portrait to 3D Presence

Authors: Jiawei Zhang, Lei Chu, Jiahao Li, Zhenyu Zang, Chong Li, Xiao Li, Xun Cao, Hao Zhu, Yan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22553
Pdf URL: https://arxiv.org/pdf/2511.22553
Copy Paste: [[2511.22553]] Bringing Your Portrait to 3D Presence(https://arxiv.org/abs/2511.22553)
Keywords: generative
Abstract: We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.
摘要：我们提出了一个统一的框架，用于根据头部、半身和全身输入的单个肖像重建可动画的 3D 人体化身。我们的方法解决了三个瓶颈：姿势和帧敏感的特征表示、有限的可扩展数据和不可靠的代理网格估计。我们引入了双 UV 表示，通过 Core-UV 和 Shell-UV 分支将图像特征映射到规范 UV 空间，从而消除姿势和框架引起的标记偏移。我们还构建了一个分解的合成数据流形，将 2D 生成多样性与几何一致的 3D 渲染相结合，并由提高真实性和身份一致性的训练方案支持。强大的代理网格跟踪器可在部分可见性下保持稳定性。这些组件共同实现了强大的野外泛化能力。我们的模型仅在半身合成数据上进行训练，实现了最先进的头部和上半身重建以及具有竞争力的全身结果。大量的实验和分析进一步验证了我们方法的有效性。

Title: GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing

Authors: Xiaoyin Yang
Subjects: cs.CV, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22607
Pdf URL: https://arxiv.org/pdf/2511.22607
Copy Paste: [[2511.22607]] GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing(https://arxiv.org/abs/2511.22607)
Keywords: generation
Abstract: Eye tracking has become increasingly important in virtual and augmented reality applications; however, the current gaze accuracy falls short of meeting the requirements for spatial computing. We designed a gaze collection framework and utilized high-precision equipment to gather the first precise benchmark dataset, GazeTrack, encompassing diverse ethnicities, ages, and visual acuity conditions for pupil localization and gaze tracking. We propose a novel shape error regularization method to constrain pupil ellipse fitting and train on open-source datasets, enhancing semantic segmentation and pupil position prediction accuracy. Additionally, we invent a novel coordinate transformation method similar to paper unfolding to accurately predict gaze vectors on the GazeTrack dataset. Finally, we built a gaze vector generation model that achieves reduced gaze angle error with lower computational complexity compared to other methods.
摘要：眼动追踪在虚拟和增强现实应用中变得越来越重要；然而，目前的注视精度还不能满足空间计算的要求。我们设计了一个视线收集框架，并利用高精度设备收集了第一个精确的基准数据集 GazeTrack，涵盖了不同种族、年龄和视力条件，用于瞳孔定位和视线跟踪。我们提出了一种新颖的形状误差正则化方法来约束瞳孔椭圆拟合并在开源数据集上进行训练，从而提高语义分割和瞳孔位置预测的准确性。此外，我们发明了一种类似于纸张展开的新颖坐标变换方法，以准确预测 GazeTrack 数据集上的注视向量。最后，我们建立了一个注视向量生成模型，与其他方法相比，该模型以较低的计算复杂度实现了减少注视角度误差。

Title: Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning

Authors: Riccardo De Santi, Marin Vlastelica, Ya-Ping Hsieh, Zebang Shen, Niao He, Andreas Krause
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22640
Pdf URL: https://arxiv.org/pdf/2511.22640
Copy Paste: [[2511.22640]] Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning(https://arxiv.org/abs/2511.22640)
Keywords: generation, generative
Abstract: Adapting large-scale foundation flow and diffusion generative models to optimize task-specific objectives while preserving prior information is crucial for real-world applications such as molecular design, protein docking, and creative image generation. Existing principled fine-tuning methods aim to maximize the expected reward of generated samples, while retaining knowledge from the pre-trained model via KL-divergence regularization. In this work, we tackle the significantly more general problem of optimizing general utilities beyond average rewards, including risk-averse and novelty-seeking reward maximization, diversity measures for exploration, and experiment design objectives among others. Likewise, we consider more general ways to preserve prior information beyond KL-divergence, such as optimal transport distances and Renyi divergences. To this end, we introduce Flow Density Control (FDC), a simple algorithm that reduces this complex problem to a specific sequence of simpler fine-tuning tasks, each solvable via scalable established methods. We derive convergence guarantees for the proposed scheme under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we validate our method on illustrative settings, text-to-image, and molecular design tasks, showing that it can steer pre-trained generative models to optimize objectives and solve practically relevant tasks beyond the reach of current fine-tuning schemes.
摘要：采用大规模基础流和扩散生成模型来优化特定任务目标，同时保留先验信息对于分子设计、蛋白质对接和创意图像生成等实际应用至关重要。现有的有原则的微调方法旨在最大化生成样本的预期奖励，同时通过 KL 散度正则化保留来自预训练模型的知识。在这项工作中，我们解决了超越平均奖励的优化一般效用的更普遍的问题，包括规避风险和寻求新奇的奖励最大化、探索的多样性措施以及实验设计目标等。同样，我们考虑更通用的方法来保留 KL 散度之外的先验信息，例如最佳传输距离和 Renyi 散度。为此，我们引入了流密度控制（FDC），这是一种简单的算法，可以将这个复杂的问题简化为一系列更简单的微调任务的特定序列，每个任务都可以通过可扩展的既定方法来解决。我们利用最近对镜像流的理解，在现实假设下得出了所提出方案的收敛保证。最后，我们在说明性设置、文本到图像和分子设计任务上验证了我们的方法，表明它可以引导预先训练的生成模型来优化目标并解决超出当前微调方案范围的实际相关任务。
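
The two problem statements contrasted in this abstract can be written side by side. The notation is an assumption made for illustration: $p^{\pi}$ is the distribution of samples from the fine-tuned model, $p^{\mathrm{pre}}$ that of the pre-trained model, $r$ the reward, and $\alpha > 0$ a regularization weight.

```latex
% KL-regularized fine-tuning (the existing setting described above):
\max_{\pi}\; \mathbb{E}_{x \sim p^{\pi}}\big[ r(x) \big]
  \;-\; \alpha\, D_{\mathrm{KL}}\big( p^{\pi} \,\|\, p^{\mathrm{pre}} \big)

% Generalized setting: a utility functional F and a divergence D
\max_{\pi}\; \mathcal{F}\big( p^{\pi} \big)
  \;-\; \alpha\, \mathcal{D}\big( p^{\pi},\, p^{\mathrm{pre}} \big)
```

The first problem is the special case where $\mathcal{F}$ is the expected reward and $\mathcal{D}$ is the KL divergence. The abstract lists risk-averse and novelty-seeking objectives as other choices of $\mathcal{F}$, and optimal transport distances and Renyi divergences as other choices of $\mathcal{D}$.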

Title: Architecture Decoupling Is Not All You Need For Unified Multimodal Model

Authors: Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22663
Pdf URL: https://arxiv.org/pdf/2511.22663
Copy Paste: [[2511.22663]] Architecture Decoupling Is Not All You Need For Unified Multimodal Model(https://arxiv.org/abs/2511.22663)
Keywords: generation
Abstract: Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty in establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., Double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleave generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns Task-Specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during SFT and post-training stage respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.
摘要：用于图像生成和理解的统一多模态模型代表了迈向 AGI 的重要一步，并引起了研究人员的广泛关注。这项任务的主要挑战在于，由于理解和生成任务中固有的目标冲突，难以建立最佳的训练范式。为了缓解这些冲突并追求更高的性能，许多研究人员采用不同程度的模型解耦（例如双图像编码器、MOE/MOT 架构或冻结 MLLM）。然而，过度的模型解耦会导致交错生成能力的丧失，破坏了统一模型的初衷。在这项工作中，我们的目标是探索如何在不诉诸模型解耦的情况下减轻任务冲突。首先，我们通过研究模型的跨模态注意力行为来分析为什么解耦可以缓解冲突。我们观察到，模型解耦本质上驱动模型朝着特定于任务的多模态交互模式发展，如 Qwen-VL 和 HunyuanImage 中所示，并且解耦越彻底，行为就越一致。受这一观察的启发，我们提出了注意力交互对齐（AIA）损失，它在训练期间明确学习任务特定的多模式交互模式。为了证明我们的 AIA 损失的普遍性，我们在 SFT 和训练后阶段分别将其应用于 Emu3 和 Janus-Pro。没有花里胡哨的东西，AIA 不仅改进了跨模式注意力模式，而且还提高了生成和理解性能。

Title: Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

Authors: Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, Steven Hoi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22677
Pdf URL: https://arxiv.org/pdf/2511.22677
Copy Paste: [[2511.22677]] Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield(https://arxiv.org/abs/2511.22677)
Keywords: generation
Abstract: Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. Among these, Distribution Matching Distillation (DMD) and its variants stand out for their impressive performance, which is widely attributed to their core mechanism of matching the student's output distribution to that of a pre-trained teacher model. In this work, we challenge this conventional understanding. Through a rigorous decomposition of the DMD training objective, we reveal that in complex tasks like text-to-image generation, where CFG is typically required for desirable few-step performance, the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We demonstrate that this term acts as the core "engine" of distillation, while the Distribution Matching (DM) term functions as a "regularizer" that ensures training stability and mitigates artifacts. We further validate this decoupling by demonstrating that while the DM term is a highly effective regularizer, it is not unique; simpler non-parametric constraints or GAN-based objectives can serve the same stabilizing function, albeit with different trade-offs. This decoupling of labor motivates a more principled analysis of the properties of both terms, leading to a more systematic and in-depth understanding. This new understanding further enables us to propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains. Notably, our method has been adopted by the Z-Image ( this https URL ) project to develop a top-tier 8-step image generation model, empirically validating the generalization and robustness of our findings.
摘要：扩散模型蒸馏已成为创建高效的少步和单步生成器的强大技术。其中，分布匹配蒸馏（DMD）及其变体因其令人印象深刻的性能而脱颖而出，这被广泛归因于其将学生的输出分布与预训练的教师模型的输出分布相匹配的核心机制。在这项工作中，我们挑战了这种传统的理解。通过对 DMD 训练目标的严格分解，我们揭示了在文本到图像生成等复杂任务中，通常需要 CFG 来实现理想的几步性能，而几步蒸馏的主要驱动力不是分布匹配，而是一个以前被忽视的组件，我们将其称为 CFG 增强 (CA)。我们证明该术语充当蒸馏的核心“引擎”，而分布匹配（DM）术语充当“正则化器”，确保训练稳定性并减少伪影。我们通过证明虽然 DM 项是一个高效的正则项，但它并不是唯一的，进一步验证了这种解耦。更简单的非参数约束或基于 GAN 的目标可以提供相同的稳定功能，尽管需要进行不同的权衡。这种劳动的解耦激发了对这两个术语的属性进行更原则性的分析，从而产生更系统和更深入的理解。这种新的理解进一步使我们能够对蒸馏过程提出原则性的修改，例如解耦发动机和正则器的噪声计划，从而进一步提高性能。值得注意的是，我们的方法已被 Z-Image（此 https URL）项目采用，以开发顶级的 8 步图像生成模型，从经验上验证了我们研究结果的泛化性和稳健性。

Title: Test-time scaling of diffusions with flow maps

Authors: Amirmojtaba Sabour, Michael S. Albergo, Carles Domingo-Enrich, Nicholas M. Boffi, Sanja Fidler, Karsten Kreis, Eric Vanden-Eijnden
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22688
Pdf URL: https://arxiv.org/pdf/2511.22688
Copy Paste: [[2511.22688]] Test-time scaling of diffusions with flow maps(https://arxiv.org/abs/2511.22688)
Keywords: generation
Abstract: A common recipe to improve diffusion models at test-time so that samples score highly against a user-specified reward is to introduce the gradient of the reward into the dynamics of the diffusion itself. This procedure is often ill posed, as user-specified rewards are usually only well defined on the data distribution at the end of generation. While common workarounds to this problem are to use a denoiser to estimate what a sample would have been at the end of generation, we propose a simple solution to this problem by working directly with a flow map. By exploiting a relationship between the flow map and velocity field governing the instantaneous transport, we construct an algorithm, Flow Map Trajectory Tilting (FMTT), which provably performs better ascent on the reward than standard test-time methods involving the gradient of the reward. The approach can be used to either perform exact sampling via importance weighting or principled search that identifies local maximizers of the reward-tilted distribution. We demonstrate the efficacy of our approach against other look-ahead techniques, and show how the flow map enables engagement with complicated reward functions that make possible new forms of image editing, e.g. by interfacing with vision language models.
摘要：在测试时改进扩散模型以使样本针对用户指定的奖励获得高分的常见方法是将奖励的梯度引入扩散本身的动态中。这个过程通常是不恰当的，因为用户指定的奖励通常只在生成结束时的数据分布上得到很好的定义。虽然解决此问题的常见方法是使用降噪器来估计样本在生成结束时的情况，但我们提出了一种通过直接使用流程图来解决此问题的简单解决方案。通过利用流量图和控制瞬时传输的速度场之间的关系，我们构建了一种算法，流量图轨迹倾斜（FMTT），事实证明，与涉及奖励梯度的标准测试时间方法相比，该算法在奖励上的上升效果更好。该方法可用于通过重要性加权或原则搜索来执行精确采样，以识别奖励倾斜分布的局部最大化。我们展示了我们的方法相对于其他前瞻技术的有效性，并展示了流程图如何能够参与复杂的奖励功能，从而使新形式的图像编辑成为可能，例如通过与视觉语言模型交互。
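
The importance-weighting idea mentioned above can be illustrated with self-normalized weights for a reward-tilted target. This sketch assumes the tilted distribution has the form p(x) * exp(scale * r(x)) and that samples come from the base model p; neither the exact form nor the FMTT weights themselves are given in the abstract.

```python
import math


def tilted_weights(rewards, scale=1.0):
    """Self-normalized weights w_i proportional to exp(scale * r_i)."""
    m = max(rewards)  # subtract the max for numerical stability
    unnorm = [math.exp(scale * (r - m)) for r in rewards]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Rewards of three samples drawn from the base model (toy values).
weights = tilted_weights([0.0, 1.0, 2.0])
print([round(w, 3) for w in weights])
```

Resampling the candidates in proportion to these weights approximates draws from the tilted distribution; a larger `scale` concentrates the weight on high-reward samples.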

Title: Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation

Authors: Shubhankar Borse, Phuc Pham, Farzad Farhadzadeh, Seokeon Choi, Phong Ha Nguyen, Anh Tuan Tran, Sungrack Yun, Munawar Hayat, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22690
Pdf URL: https://arxiv.org/pdf/2511.22690
Copy Paste: [[2511.22690]] Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation(https://arxiv.org/abs/2511.22690)
Keywords: generation
Abstract: Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.
摘要：尽管最近在文本到图像生成方面取得了进展，但现有模型始终无法生成可靠的多人场景，经常会复制面孔、合并身份或错误计数个体。我们提出了 Ar2Can，这是一个新颖的两阶段框架，它将空间规划与多人生成的身份渲染分开。架构师模块预测结构化布局，指定每个人应该出现的位置。然后，Artist 模块在结合了匈牙利空间对齐和 ArcFace 身份相似性的基于空间的人脸匹配奖励的指导下，合成逼真的图像。这种方法可确保在正确的位置渲染面部并忠实地保留参考身份。我们开发了两种 Architect 变体，与基于扩散的 Artist 模型无缝集成，并通过组相对策略优化 (GRPO) 进行优化，使用组合奖励来实现计数准确性、图像质量和身份匹配。在 MultiHuman-Testbench 上进行评估，Ar2Can 在计数准确性和身份保存方面取得了显着改进，同时保持了较高的感知质量。值得注意的是，我们的方法主要使用合成数据来实现这些结果，而不需要真实的多人图像。
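
A spatially grounded face-matching reward of the kind described above could be sketched as follows. Several things here are assumptions: faces are matched to planned slots by box-center distance, a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimal assignment for small groups), plain cosine similarity on toy vectors stands in for ArcFace similarity, and a count mismatch simply yields zero reward.

```python
import math
from itertools import permutations


def center(box):
    """Center of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def face_matching_reward(planned_boxes, ref_embs, found_boxes, found_embs):
    """Match detected faces to planned slots, then score identity agreement.

    Returns (assignment, reward); assignment[i] is the index of the detected
    face matched to planned slot i.
    """
    n = len(planned_boxes)
    if n == 0 or n != len(found_boxes):
        return None, 0.0  # assumed penalty for a wrong face count

    def cost(perm):
        return sum(math.dist(center(planned_boxes[i]), center(found_boxes[j]))
                   for i, j in enumerate(perm))

    best = min(permutations(range(n)), key=cost)  # optimal spatial assignment
    reward = sum(cosine(ref_embs[i], found_embs[j]) for i, j in enumerate(best)) / n
    return list(best), reward


planned = [(0, 0, 10, 10), (20, 0, 30, 10)]
found = [(21, 1, 31, 11), (1, 1, 11, 11)]
refs = [[1.0, 0.0], [0.0, 1.0]]
embs = [[0.0, 1.0], [1.0, 0.0]]
print(face_matching_reward(planned, refs, found, embs))
```

Brute force is factorial in the number of faces, so a real implementation would use a polynomial-time assignment solver instead.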

Title: Generative Anchored Fields: Controlled Data Generation via Emergent Velocity Fields and Transport Algebra

Authors: Deressa Wodajo Deressa, Hannes Mareen, Peter Lambert, Glenn Van Wallendael
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22693
Pdf URL: https://arxiv.org/pdf/2511.22693
Copy Paste: [[2511.22693]] Generative Anchored Fields: Controlled Data Generation via Emergent Velocity Fields and Transport Algebra(https://arxiv.org/abs/2511.22693)
Keywords: generation, generative
Abstract: We present Generative Anchored Fields (GAF), a generative model that learns independent endpoint predictors $J$ (noise) and $K$ (data) rather than a trajectory predictor. The velocity field $v=K-J$ emerges from their time-conditioned disagreement. This factorization enables Transport Algebra: algebraic operations on learned $\{(J_n,K_n)\}_{n=1}^N$ heads for compositional control. With class-specific $K_n$ heads, GAF supports a rich family of directed transport maps between a shared base distribution and multiple modalities, enabling controllable interpolation, hybrid generation, and semantic morphing through vector arithmetic. We achieve strong sample quality (FID 7.5 on CelebA-HQ $64\times 64$) while uniquely providing compositional generation as an architectural primitive. We further demonstrate that GAF has lossless cyclic transport between its initial and final state, with LPIPS=$0.0$. Code available at this https URL
摘要：我们提出了生成锚定场（GAF），这是一种生成模型，可以学习独立的端点预测器 $J$（噪声）和 $K$（数据），而不是轨迹预测器。速度场 $v=K-J$ 从他们的时间条件分歧中出现。这种因式分解使得传输代数（Transport Algebra）成为可能：对学习的 $\{(J_n,K_n)\}_{n=1}^N$ 头进行代数运算以进行成分控制。借助特定于类的 $K_n$ 头，GAF 支持共享基础分布和多种模态之间丰富的定向传输映射系列，从而通过向量算术实现可控插值、混合生成和语义变形。我们实现了强大的样本质量（CelebA-HQ $64\times 64$ 上的 FID 7.5），同时以独特的方式提供组合生成作为架构原语。我们进一步证明，GAF 在其初始状态和最终状态之间具有无损循环传输，LPIPS=$0.0$。代码可在此 https URL 获取
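
To make the relation $v=K-J$ and the idea of arithmetic on heads concrete, here is a toy sketch. Everything beyond $v=K-J$ is an assumption: the heads are constant functions rather than trained networks, the convex blend of two $K$ heads is only one plausible algebraic operation, and the Euler integration from $t=0$ to $t=1$ is an assumed sampler.

```python
def velocity(J, K, x, t):
    """Emergent velocity v = K(x, t) - J(x, t)."""
    return [k - j for k, j in zip(K(x, t), J(x, t))]


def blend_heads(K_a, K_b, weight):
    """Hypothetical head operation: a convex mix of two data heads."""
    return lambda x, t: [weight * a + (1 - weight) * b
                         for a, b in zip(K_a(x, t), K_b(x, t))]


def transport(x0, J, K, steps=10):
    """Euler-integrate dx/dt = v from t=0 to t=1 (assumed sampler)."""
    x, dt = list(x0), 1.0 / steps
    for s in range(steps):
        v = velocity(J, K, x, s * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy heads: constant predictors instead of trained networks.
noise = [0.0, 0.0]
J = lambda x, t: noise
K_cat = lambda x, t: [1.0, 0.0]
K_dog = lambda x, t: [0.0, 1.0]
print(transport(noise, J, blend_heads(K_cat, K_dog, 0.5)))
```

With these constant heads the blended transport ends near the midpoint of the two class targets, which is the kind of hybrid generation the abstract alludes to.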

Title: Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Authors: Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22699
Pdf URL: https://arxiv.org/pdf/2511.22699
Copy Paste: [[2511.22699]] Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer(https://arxiv.org/abs/2511.22699)
Keywords: generation, generative
Abstract: The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
摘要：高性能图像生成模型目前由 Nano Banana Pro 和 Seedream 4.0 等专有系统主导。领先的开源替代方案，包括 Qwen-Image、Hunyuan-Image-3.0 和 FLUX.2，其特点是参数数量庞大（20B 到 80B），这使得它们在消费级硬件上进行推理和微调不切实际。为了解决这一差距，我们提出了 Z-Image，这是一种基于可扩展单流扩散 Transformer (S3-DiT) 架构的高效 6B 参数基础生成模型，挑战了“不惜一切代价扩展”范式。通过系统地优化整个模型生命周期（从精心策划的数据基础设施到简化的训练课程），我们仅用 314K H800 GPU 小时（约 63 万美元）就完成了完整的训练工作流程。我们的带有奖励后训练的几步蒸馏方案进一步产生了 Z-Image-Turbo，在企业级 H800 GPU 上提供亚秒级推理延迟，并与消费级硬件（<16GB VRAM）兼容。此外，我们的全方位预训练范例还可以对 Z-Image-Edit 进行高效训练，Z-Image-Edit 是一种具有令人印象深刻的指令遵循能力的编辑模型。定性和定量实验都表明，我们的模型在各个方面都达到了与领先竞争对手相当或超过的性能。最值得注意的是，Z-Image 在真实感图像生成和双语文本渲染方面表现出卓越的能力，提供的结果可与顶级商业模型相媲美，从而证明可以在显着降低计算开销的情况下实现最先进的结果。我们公开发布我们的代码、权重和在线演示，以促进可访问、预算友好且最先进的生成模型的开发。
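
The two training-cost figures quoted in this abstract imply an hourly rate, which is a quick sanity check on the numbers. This is derived arithmetic only, not a price stated by the authors.

```latex
\frac{\$630{,}000}{314{,}000\ \text{GPU-hours}} \approx \$2.0\ \text{per H800 GPU-hour}
```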

Title: ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Authors: Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2511.22715
Pdf URL: https://arxiv.org/pdf/2511.22715
Copy Paste: [[2511.22715]] ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering(https://arxiv.org/abs/2511.22715)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: this https URL.
摘要：多模态大语言模型 (MLLM) 在共同理解文本、图像和视频方面表现出了令人印象深刻的能力，通常通过视觉问答 (VQA) 进行评估。然而，即使是最先进的 MLLM 也难以应对特定领域或知识密集型查询，其中相关信息在预训练数据中的代表性不足。基于知识的 VQA (KB-VQA) 通过检索外部文档来条件回答生成来解决这个问题，但当前的检索增强方法存在精度低、段落噪音大和推理有限的问题。为了解决这个问题，我们提出了 ReAG，这是一种新颖的推理增强多模态 RAG 方法，它将粗粒度和细粒度检索与过滤不相关段落的批评模型相结合，确保高质量的附加上下文。该模型遵循多阶段训练策略，利用强化学习来增强对检索内容的推理，而监督微调仅作为冷启动。 Encyclopedic-VQA 和 InfoSeek 的大量实验表明，ReAG 显着优于先前的方法，提高了答案准确性并提供基于检索到的证据的可解释推理。我们的源代码可在以下位置公开获取：此 https URL。

Title: VeriDispatcher: Multi-Model Dispatching through Pre-Inference Difficulty Prediction for RTL Generation Optimization

Authors: Zeng Wang, Weihua Xiao, Minghao Shao, Raghu Vamshi Hemadri, Ozgur Sinanoglu, Muhammad Shafique, Ramesh Karri
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22749
Pdf URL: https://arxiv.org/pdf/2511.22749
Copy Paste: [[2511.22749]] VeriDispatcher: Multi-Model Dispatching through Pre-Inference Difficulty Prediction for RTL Generation Optimization(https://arxiv.org/abs/2511.22749)
Keywords: generation
Abstract: Large Language Models (LLMs) show strong performance in RTL generation, but different models excel on different tasks because of architecture and training differences. Prior work mainly prompts or finetunes a single model. What remains understudied is how to coordinate multiple different LLMs so that they jointly improve RTL quality while also reducing cost, instead of running all models and choosing the best output. We define this as the multi-LLM RTL generation problem. We propose VeriDispatcher, a multi-LLM RTL generation framework that dispatches each RTL task to suitable LLMs based on pre-inference difficulty prediction. For each model, we train a compact classifier over semantic embeddings of task descriptions, using difficulty scores derived from benchmark variants that combine syntax, structural similarity, and functional correctness. At inference, VeriDispatcher uses these predictors to route tasks to a selected subset of LLMs. Across 10 diverse LLMs on RTLLM and VerilogEval, VeriDispatcher achieves up to 18% accuracy improvement on RTLLM using only 40% of commercial calls, and on VerilogEval maintains accuracy while reducing commercial usage by 25%, enabling cost-effective, high-quality LLM deployment in hardware design automation.
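The routing step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `dispatch` function, the `threshold` and `max_models` parameters, and the toy predictors are all assumptions made for illustration, standing in for the per-model difficulty classifiers the abstract mentions.

```python
from typing import Callable, Dict, List

# Hypothetical predictor type: maps a task description to a predicted
# difficulty in [0, 1] for one specific model (lower means easier).
DifficultyPredictor = Callable[[str], float]


def dispatch(task: str,
             predictors: Dict[str, DifficultyPredictor],
             max_models: int = 2,
             threshold: float = 0.5) -> List[str]:
    """Return the models to call for this task, easiest-first."""
    scored = sorted((predict(task), name) for name, predict in predictors.items())
    selected = [name for score, name in scored if score <= threshold][:max_models]
    # Fall back to the single most promising model if none clears the threshold.
    return selected or [scored[0][1]]


# Toy stand-ins for trained classifiers (assumptions for illustration only).
predictors = {
    "open_small": lambda t: 0.2 if "adder" in t else 0.9,
    "open_large": lambda t: 0.4,
    "commercial": lambda t: 0.1,
}

easy = dispatch("8-bit adder", predictors)
hard = dispatch("pipelined FFT", predictors, max_models=1)
```

Capping the number of selected models is one plausible way to trade accuracy against the number of commercial calls; the abstract does not say how the actual subset is chosen.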

Title: Exact Learning of Arithmetic with Differentiable Agents

Authors: Hristo Papazov, Francesco D'Angelo, Nicolas Flammarion
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22751
Pdf URL: https://arxiv.org/pdf/2511.22751
Copy Paste: [[2511.22751]] Exact Learning of Arithmetic with Differentiable Agents(https://arxiv.org/abs/2511.22751)
Keywords: generation
Abstract: We explore the possibility of exact algorithmic learning with gradient-based methods and introduce a differentiable framework capable of strong length generalization on arithmetic tasks. Our approach centers on Differentiable Finite-State Transducers (DFSTs), a Turing-complete model family that avoids the pitfalls of prior architectures by enabling constant-precision, constant-time generation, and end-to-end log-parallel differentiable training. Leveraging policy-trajectory observations from expert agents, we train DFSTs to perform binary and decimal addition and multiplication. Remarkably, models trained on tiny datasets generalize without error to inputs thousands of times longer than the training examples. These results show that training differentiable agents on structured intermediate supervision could pave the way towards exact gradient-based learning of algorithmic skills. Code is available at this https URL.

Title: Alzheimer's Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data

Authors: Mahdieh Behjat Khatooni, Mohsen Soryani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22774
Pdf URL: https://arxiv.org/pdf/2511.22774
Copy Paste: [[2511.22774]] Alzheimer's Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data(https://arxiv.org/abs/2511.22774)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a prevalent neurodegenerative disorder that progressively impairs memory, decision-making, and overall cognitive function. As AD is irreversible, early prediction is critical for timely intervention and management. Mild Cognitive Impairment (MCI), a transitional stage between cognitively normal (CN) aging and AD, plays a significant role in early AD diagnosis. However, predicting MCI progression remains a significant challenge, as not all individuals with MCI convert to AD. MCI subjects are categorized into stable MCI (sMCI) and progressive MCI (pMCI) based on conversion status. In this study, we propose a generalized, end-to-end deep learning model for AD prediction using MCI cases from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our hybrid architecture integrates Convolutional Neural Networks and Vision Transformers to capture both local spatial features and global contextual dependencies from Magnetic Resonance Imaging (MRI) scans. To incorporate temporal progression, we further employ Bidirectional Long Short-Term Memory (BiLSTM) networks to process features extracted from four consecutive MRI timepoints along with some other non-image biomarkers, predicting each subject's cognitive status at month 48. Our multimodal model achieved an average progression prediction accuracy of 95.05% between sMCI and pMCI, outperforming existing studies in AD prediction. This work demonstrates state-of-the-art performance in longitudinal AD prediction and highlights the effectiveness of combining spatial and temporal modeling for the early detection of Alzheimer's disease.

Title: From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

Authors: Yiming Chen, Junlin Han, Tianyi Bai, Shengbang Tong, Filippos Kokkinos, Philip Torr
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2511.22805
Pdf URL: https://arxiv.org/pdf/2511.22805
Copy Paste: [[2511.22805]] From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images(https://arxiv.org/abs/2511.22805)
Keywords: generation
Abstract: While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.

Title: LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

Authors: Kai Wang, Siyi Chen, Weicong Pang, Chenchen Zhang, Renjun Gao, Ziru Chen, Cheng Li, Dasa Gu, Rui Huang, Alexis Kai Hon Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22812
Pdf URL: https://arxiv.org/pdf/2511.22812
Copy Paste: [[2511.22812]] LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer(https://arxiv.org/abs/2511.22812)
Keywords: generative
Abstract: Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID), namely Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River, DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen's Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT's attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
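For readers unfamiliar with the three metrics reported in this abstract, the sketch below computes overall accuracy, macro F1, and Cohen's kappa from a confusion matrix using their standard definitions. The function name and the toy counts are illustrative assumptions and are not taken from the paper.

```python
from typing import List


def classification_metrics(cm: List[List[int]]):
    """Compute (overall accuracy, macro F1, Cohen's kappa) from a confusion
    matrix where cm[i][j] counts samples of true class i predicted as j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]                             # true counts
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted counts
    correct = sum(cm[i][i] for i in range(k))

    overall_accuracy = correct / total

    f1_scores = []
    for i in range(k):
        denom = row_sums[i] + col_sums[i]
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (true count + predicted count)
        f1_scores.append(2 * cm[i][i] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / k

    # Agreement expected by chance, derived from the marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (overall_accuracy - expected) / (1 - expected)
    return overall_accuracy, macro_f1, kappa


# Toy 2-class example (made-up counts, not data from the paper).
oa, f1, kappa = classification_metrics([[40, 10], [5, 45]])
```

Kappa discounts the agreement expected by chance, which is why it is lower than overall accuracy in the reported results.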

Title: Captain Safari: A World Engine

Authors: Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, Junfei Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22815
Pdf URL: https://arxiv.org/pdf/2511.22815
Copy Paste: [[2511.22815]] Captain Safari: A World Engine(https://arxiv.org/abs/2511.22815)
Keywords: generation
Abstract: World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.

Title: Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding

Authors: Keliang Liu, Zizhi Chen, Mingcheng Li, Jingqun Tang, Dingkang Yang, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22850
Pdf URL: https://arxiv.org/pdf/2511.22850
Copy Paste: [[2511.22850]] Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding(https://arxiv.org/abs/2511.22850)
Keywords: generation
Abstract: Document understanding is a long-standing practical task. Vision Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. SLEUTH is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving state-of-the-art results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.

Title: TARFVAE: Efficient One-Step Generative Time Series Forecasting via TARFLOW based VAE

Authors: Jiawen Wei, Lan Jiang, Pengbo Wei, Ziwen Ye, Teng Song, Chen Chen, Guangrui Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22853
Pdf URL: https://arxiv.org/pdf/2511.22853
Copy Paste: [[2511.22853]] TARFVAE: Efficient One-Step Generative Time Series Forecasting via TARFLOW based VAE(https://arxiv.org/abs/2511.22853)
Keywords: generation, generative
Abstract: Time series data is ubiquitous, with forecasting applications spanning from finance to healthcare. Beyond popular deterministic methods, generative models are gaining attention due to advancements in areas like image synthesis and video generation, as well as their inherent ability to provide probabilistic predictions. However, existing generative approaches mostly involve recurrent generative operations or repeated denoising steps, making the prediction laborious, particularly for long-term forecasting. Most of them only conduct experiments for relatively short-term forecasting, with limited comparison to deterministic methods in long-term forecasting, leaving their practical advantages unclear. This paper presents TARFVAE, a novel generative framework that combines the Transformer-based autoregressive flow (TARFLOW) and variational autoencoder (VAE) for efficient one-step generative time series forecasting. Motivated by the observation that complex architectures for extracting time series representations might not be necessary, we add a flow module, TARFLOW, to VAE to promote spontaneous learning of latent variables that benefit predictions. TARFLOW enhances VAE's posterior estimation by breaking the Gaussian assumption, thereby enabling a more informative latent space. TARFVAE uses only the forward process of TARFLOW, avoiding autoregressive inverse operations and thus ensuring fast generation. During generation, it samples from the prior latent space and directly generates full-horizon forecasts via the VAE decoder. With simple MLP modules, TARFVAE achieves superior performance over state-of-the-art deterministic and generative models across different forecast horizons on benchmark datasets while maintaining efficient prediction speed, demonstrating its effectiveness as an efficient and powerful solution for generative time series forecasting.

Title: CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

Authors: Fengyi Fang, Sicheng Yang, Wenming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22863
Pdf URL: https://arxiv.org/pdf/2511.22863
Copy Paste: [[2511.22863]] CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation(https://arxiv.org/abs/2511.22863)
Keywords: generation
Abstract: Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the exploration of gesture understanding and captioning to tackle the semantic gap in gesture generation while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.

Title: Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis

Authors: Jungwoo Seo, David Keetae Park, Shinjae Yoo, Jiook Cha
Subjects: cs.CV, q-bio.NC
Abstract URL: https://arxiv.org/abs/2511.22870
Pdf URL: https://arxiv.org/pdf/2511.22870
Copy Paste: [[2511.22870]] Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis(https://arxiv.org/abs/2511.22870)
Keywords: generation
Abstract: Generating whole-brain 4D fMRI sequences conditioned on cognitive tasks remains challenging due to the high-dimensional, heterogeneous BOLD dynamics across subjects/acquisitions and the lack of neuroscience-grounded validation. We introduce the first diffusion transformer for voxelwise 4D fMRI conditional generation, combining 3D VQ-GAN latent compression with a CNN-Transformer backbone and strong task conditioning via AdaLN-Zero and cross-attention. On HCP task fMRI, our model reproduces task-evoked activation maps, preserves the inter-task representational structure observed in real data (RSA), achieves perfect condition specificity, and aligns ROI time-courses with canonical hemodynamic responses. Performance improves predictably with scale, reaching task-evoked map correlation of 0.83 and RSA of 0.98, consistently surpassing a U-Net baseline on all metrics. By coupling latent diffusion with a scalable backbone and strong conditioning, this work establishes a practical path to conditional 4D fMRI synthesis, paving the way for future applications such as virtual experiments, cross-site harmonization, and principled augmentation for downstream neuroimaging models.

Title: Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling

Authors: Minyoung Kim, Paul Hongsuck Seo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22936
Pdf URL: https://arxiv.org/pdf/2511.22936
Copy Paste: [[2511.22936]] Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling(https://arxiv.org/abs/2511.22936)
Keywords: generation
Abstract: The rapid growth of Artificial Intelligence-Generated Content (AIGC) raises concerns about the authenticity of digital media. In this context, image self-recovery, reconstructing original content from its manipulated version, offers a practical solution for understanding the attacker's intent and restoring trustworthy data. However, existing methods often fail to accurately recover tampered regions, falling short of the primary goal of self-recovery. To address this challenge, we propose ReImage, a neural watermarking-based self-recovery framework that embeds a shuffled version of the target image into itself as a watermark. We design a generator that produces watermarks optimized for neural watermarking and introduce an image enhancement module to refine the recovered image. We further analyze and resolve key limitations of shuffled watermarking, enabling its effective use in self-recovery. We demonstrate that ReImage achieves state-of-the-art performance across diverse tampering scenarios, consistently producing high-quality recovered images. The code and pretrained models will be released upon publication.

Title: One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer

Authors: Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, Jiangning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22940
Pdf URL: https://arxiv.org/pdf/2511.22940
Copy Paste: [[2511.22940]] One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer(https://arxiv.org/abs/2511.22940)
Keywords: generation
Abstract: Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned references, we reformulate training as a self-supervised outpainting task that transforms diverse-layout references into a unified occluded-input format. Second, to process partially visible references, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model will be available at this https URL.

Title: Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation

Authors: Taeyeong Kim, SeungJoon Lee, Jung Uk Kim, MyeongAh Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22948
Pdf URL: https://arxiv.org/pdf/2511.22948
Copy Paste: [[2511.22948]] Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation(https://arxiv.org/abs/2511.22948)
Keywords: generation
Abstract: Domain generalization in semantic segmentation faces challenges from domain shifts, particularly under adverse conditions. While diffusion-based data generation methods show promise, they introduce inherent misalignment between generated images and semantic masks. This paper presents FLEX-Seg (FLexible Edge eXploitation for Segmentation), a framework that transforms this limitation into an opportunity for robust learning. FLEX-Seg comprises three key components: (1) Granular Adaptive Prototypes that capture boundary characteristics across multiple scales, (2) Uncertainty Boundary Emphasis that dynamically adjusts learning emphasis based on prediction entropy, and (3) Hardness-Aware Sampling that progressively focuses on challenging examples. By leveraging inherent misalignment rather than enforcing strict alignment, FLEX-Seg learns robust representations while capturing rich stylistic variations. Experiments across five real-world datasets demonstrate consistent improvements over state-of-the-art methods, achieving 2.44% and 2.63% mIoU gains on ACDC and Dark Zurich. Our findings validate that adaptive strategies for handling imperfect synthetic data lead to superior domain generalization. Code is available at this https URL.
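The idea of adjusting learning emphasis based on prediction entropy can be illustrated with a small sketch. The entropy formula is the standard Shannon entropy; the weighting rule (`1 + alpha * normalized entropy`) and the names `uncertainty_weight` and `alpha` are assumptions for illustration, not the weighting used by FLEX-Seg. The sketch expects at least two classes per pixel.

```python
import math
from typing import List


def prediction_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one pixel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_weight(probs: List[float], alpha: float = 1.0) -> float:
    """Hypothetical loss weight: 1 + alpha * normalized entropy, in [1, 1 + alpha]."""
    max_entropy = math.log(len(probs))  # entropy of the uniform distribution
    return 1.0 + alpha * prediction_entropy(probs) / max_entropy


confident = uncertainty_weight([1.0, 0.0, 0.0])        # e.g. an interior pixel
ambiguous = uncertainty_weight([1 / 3, 1 / 3, 1 / 3])  # e.g. a boundary-like pixel
```

Under this rule a confidently predicted pixel keeps weight 1, while a maximally uncertain pixel, as is typical near misaligned mask boundaries, receives weight `1 + alpha`.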

Title: BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

Authors: Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22973
Pdf URL: https://arxiv.org/pdf/2511.22973
Copy Paste: [[2511.22973]] BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation(https://arxiv.org/abs/2511.22973)
Keywords: generation
Abstract: Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: this https URL. Inferix (Code): this https URL.
摘要：生成一分钟长的视频是开发世界模型的关键一步，为真实的扩展场景和先进的人工智能模拟器奠定了基础。新兴的半自回归（块扩散）范式集成了扩散和自回归模型的优点，能够生成任意长度的视频，并通过 KV 缓存和并行采样提高推理效率。然而，它仍然面临两个持久的挑战：(i) KV 缓存引起的长范围错误累积，以及 (ii) 缺乏细粒度的长视频基准和一致性感知指标。为了克服这些限制，我们提出了 BlockVid，一种新颖的块扩散框架，配备了语义感知的稀疏 KV 缓存、称为块强制的有效训练策略以及专用的块式噪声调度和洗牌，以减少错误传播并增强时间一致性。我们进一步介绍了 LV-Bench，这是一个针对一分钟长视频的细粒度基准，并配有评估远程一致性的新指标。 VBench 和 LV-Bench 上的大量实验表明，BlockVid 在生成高质量、连贯的一分钟长视频方面始终优于现有方法。特别是，与最先进的方法相比，它在 LV-Bench 中的 VDE 主题提高了 22.2%，VDE 清晰度提高了 19.4%。项目网站：这个 https URL。 Inferix（代码）：此 https URL。

Title: McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning

Authors: Qiushi Yang, Yingjie Chen, Yuan Yao, Yifang Men, Huaizhuo Liu, Miaomiao Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22974
Pdf URL: https://arxiv.org/pdf/2511.22974
Copy Paste: [[2511.22974]] McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning(https://arxiv.org/abs/2511.22974)
Keywords: generation, generative
Abstract: Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preference remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or use proxy metrics to predict preference, and thus lack an understanding of human preference logic. Moreover, they usually align T2V models directly with the overall preference distribution, ignoring potentially conflicting dimensions such as motion dynamics and visual quality, which may bias models towards low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. First, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Second, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structural multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models, while dynamically re-weighting the alignment objective to mitigate bias towards low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high-motion dynamics.
摘要：文本到视频（T2V）生成在制作与文本提示一致的高质量视频方面取得了显着进展。然而，由于人类判断的主观性和多方面性，使合成视频与人类微妙的偏好保持一致仍然具有挑战性。现有的视频偏好对齐方法依赖于昂贵的人工注释或利用代理指标来预测偏好，缺乏对人类偏好逻辑的理解。此外，他们通常直接将 T2V 模型与整体偏好分布对齐，忽略运动动力学和视觉质量等潜在冲突维度，这可能会使模型偏向于低运动内容。为了解决这些问题，我们提出了带有自我批评分层推理（McSc）的运动校正对齐，这是一个用于稳健偏好建模和对齐的三阶段强化学习框架。首先，自我批评维度推理（ScDR）训练生成奖励模型（RM）将偏好分解为每个维度的评估，使用自我批评推理链进行可靠的学习。其次，为了实现整体视频比较，我们引入了分层比较推理（HCR），用于具有分层奖励监督的结构多维推理。最后，使用 RM 首选视频，我们提出运动校正直接偏好优化 (McDPO) 来优化 T2V 模型，同时动态重新加权对齐目标以减轻对低运动内容的偏差。实验表明，McSc 在人类偏好对齐方面取得了优异的性能，并生成具有高运动动态的视频。
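Note: the McDPO stage described above builds on Direct Preference Optimization. As a minimal sketch, the standard DPO loss for a single preferred/rejected pair can be written as below. The `weight` argument is a hypothetical stand-in for a per-sample re-weighting factor; the abstract does not specify how McDPO computes its weights, so this is not the authors' formulation.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, weight=1.0):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-likelihoods of the preferred / rejected sample.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    weight: hypothetical per-sample factor (assumption, not from the paper).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return weight * math.log1p(math.exp(-margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the preferred sample's likelihood relative to the rejected one.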

Title: MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

Authors: Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22989
Pdf URL: https://arxiv.org/pdf/2511.22989
Copy Paste: [[2511.22989]] MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation(https://arxiv.org/abs/2511.22989)
Keywords: generation
Abstract: Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts. However, existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring how model performance advances, or pointing out model weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis among a variety of text-to-image models reveals their superior performances, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at this https URL.
摘要：最近的文本到图像生成模型已经具备了多参考生成和编辑的能力，即能够从多个参考图像继承主题的外观并在新的上下文中重新渲染它们。然而，现有的基准数据集通常专注于使用单张或少量参考图像的生成，这使得我们无法在不同的多参考条件下衡量模型性能的进步或指出其弱点。此外，它们的任务定义仍然模糊，通常仅限于“编辑什么”或“给出多少参考”等轴，因此无法捕捉多参考设置的内在困难。为了解决这一差距，我们引入了 MultiBanana，它经过精心设计，通过广泛覆盖大规模的多参考特定问题来评估模型能力的边缘：(1) 改变参考的数量，(2) 参考之间的域不匹配（例如照片与动漫），(3) 参考和目标场景之间的尺度不匹配，(4) 包含罕见概念的参考（例如，红香蕉），以及 (5) 用于渲染的多语言文本参考。我们对各种文本到图像模型的分析揭示了它们的优越性能、典型故障模式和需要改进的领域。 MultiBanana 将作为开放基准发布，以突破界限并为多参考图像生成中的公平比较建立标准化基础。我们的数据和代码可在此 https URL 获取。

Title: Guiding Visual Autoregressive Models through Spectrum Weakening

Authors: Chaoyang Wang, Tianmeng Yang, Jingdong Wang, Yunhai Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22991
Pdf URL: https://arxiv.org/pdf/2511.22991
Copy Paste: [[2511.22991]] Guiding Visual Autoregressive Models through Spectrum Weakening(https://arxiv.org/abs/2511.22991)
Keywords: generation
Abstract: Classifier-free guidance (CFG) has become a widely adopted and practical approach for enhancing generation quality and improving condition alignment. Recent studies have explored guidance mechanisms for unconditional generation, yet these approaches remain fundamentally tied to assumptions specific to diffusion models. In this work, we propose a spectrum-weakening framework for visual autoregressive (AR) models. This method works without the need for re-training, specific conditions, or any architectural modifications. It achieves this by constructing a controllable weak model in the spectral domain. We theoretically show that invertible spectral transformations preserve information, while selectively retaining only a subset of the spectrum introduces controlled information reduction. Based on this insight, we perform spectrum selection along the channel dimension of internal representations, which avoids the structural constraints imposed by diffusion models. We further introduce two spectrum renormalization strategies that ensure numerical stability during the weakening process. Extensive experiments were conducted on both discrete and continuous AR models, with text or class conditioning. The results demonstrate that our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.
摘要：无分类器引导（CFG）已成为一种广泛采用的实用方法，用于提高生成质量和改善条件对齐。最近的研究探索了无条件生成的引导机制，但这些方法仍然从根本上与扩散模型特有的假设相关。在这项工作中，我们提出了一种用于视觉自回归（AR）模型的频谱弱化框架。此方法无需重新训练、特定条件或任何架构修改即可发挥作用。它通过在谱域构建可控弱模型来实现这一点。我们从理论上证明，可逆谱变换可以保留信息，而选择性地仅保留谱的子集会引入受控信息减少。基于这种见解，我们沿着内部表示的通道维度进行频谱选择，这避免了扩散模型施加的结构约束。我们进一步引入了两种谱重整化策略，以确保弱化过程中的数值稳定性。在离散和连续 AR 模型上进行了大量实验，并使用文本或类别条件。结果表明，我们的方法能够实现高质量的无条件生成，同时保持条件生成的强大提示对齐。
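Note: as a rough illustration of the guidance idea referenced above, the standard CFG combination extrapolates from a weaker prediction toward the conditional one. The sketch below applies that rule to plain lists of logits. It assumes the weak model's output takes the place of the unconditional branch; how the weak prediction is produced (the paper's spectrum selection and renormalization) is not shown, and the function name is hypothetical.

```python
def guided_logits(cond, weak, scale):
    """Combine two predictions as weak + scale * (cond - weak).

    cond: logits from the full model.
    weak: logits from a weaker (or unconditional) model.
    scale: guidance strength; scale = 1.0 returns the full model's logits.
    """
    return [w + scale * (c - w) for c, w in zip(cond, weak)]
```

With `scale` above 1.0 the result moves past the full model's prediction, away from the weak one, which is the usual mechanism by which guidance sharpens outputs.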

Title: Masked Diffusion for Generative Recommendation

Authors: Kulin Shah, Bhuvesh Kumar, Neil Shah, Liam Collins
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2511.23021
Pdf URL: https://arxiv.org/pdf/2511.23021
Copy Paste: [[2511.23021]] Masked Diffusion for Generative Recommendation(https://arxiv.org/abs/2511.23021)
Keywords: generative
Abstract: Generative recommendation (GR) with semantic IDs (SIDs) has emerged as a promising alternative to traditional recommendation approaches due to its performance gains, capitalization on semantic information provided through language model embeddings, and inference and storage efficiency. Existing work on GR with SIDs frames the probability of a sequence of SIDs corresponding to a user's interaction history using autoregressive modeling. While this has led to impressive next-item prediction performance in certain settings, these autoregressive GR models with SIDs suffer from expensive inference due to sequential token-wise decoding, potentially inefficient use of training data, and a bias towards learning short-context relationships among tokens. Inspired by recent breakthroughs in NLP, we propose to instead model and learn the probability of a user's sequence of SIDs using masked diffusion. Masked diffusion employs discrete masking noise to facilitate learning the sequence distribution, and models the probability of masked tokens as conditionally independent given the unmasked tokens, allowing for parallel decoding of the masked tokens. We demonstrate through thorough experiments that our proposed method consistently outperforms autoregressive modeling. This performance gap is especially pronounced in data-constrained settings and in terms of coarse-grained recall, consistent with our intuitions. Moreover, our approach allows the flexibility of predicting multiple SIDs in parallel during inference while maintaining superior performance to autoregressive modeling.
摘要：具有语义 ID (SID) 的生成推荐 (GR) 因其性能提升、对通过语言模型嵌入提供的语义信息的利用以及推理和存储效率而成为传统推荐方法的有前景的替代方案。现有的具有 SID 的 GR 使用自回归建模来构建与用户交互历史相对应的 SID 序列的概率。虽然这在某些设置下带来了令人印象深刻的下一项预测性能，但这些具有 SID 的自回归 GR 模型由于顺序令牌解码、训练数据的潜在低效使用以及倾向于学习令牌之间的短上下文关系而遭受昂贵的推理。受 NLP 最近突破的启发，我们建议使用掩蔽扩散来建模和学习用户 SID 序列的概率。掩蔽扩散采用离散掩蔽噪声来促进学习序列分布，并将掩蔽标记的概率建模为给定未掩蔽标记的条件独立的模型，从而允许对掩蔽标记进行并行解码。我们通过彻底的实验证明，我们提出的方法始终优于自回归模型。这种性能差距在数据受限的设置和粗粒度召回方面尤其明显，这与我们的直觉一致。此外，我们的方法允许在推理过程中灵活地并行预测多个 SID，同时保持优于自回归建模的性能。
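Note: the "discrete masking noise" mentioned above can be pictured as independently replacing each token with a mask symbol with some probability. The sketch below shows only that forward corruption step on a toy SID sequence; the `MASK` symbol, function name, and token format are assumptions for illustration, and the denoising model that predicts the masked positions in parallel is not shown.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK with probability t (0 <= t <= 1)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    sids = ["sid_12", "sid_7", "sid_93", "sid_4"]  # toy sequence, not real data
    print(mask_tokens(sids, 0.5, rng))
```

At `t = 0.0` the sequence is left untouched and at `t = 1.0` every position is masked; training would sample intermediate values of `t` and ask the model to recover the masked tokens.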

Title: Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation

Authors: Felipe Akio Matsuoka, Eduardo Moreno J. M. Farina, Augusto Sarquis Serpa, Soraya Monteiro, Rodrigo Ragazzini, Nitamar Abdala, Marcelo Straus Takahashi, Felipe Campos Kitamura
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23066
Pdf URL: https://arxiv.org/pdf/2511.23066
Copy Paste: [[2511.23066]] Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation(https://arxiv.org/abs/2511.23066)
Keywords: generative
Abstract: Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.
摘要：生成基础模型可以通过逼真的图像修复来消除视觉伪影，但它们对医疗人工智能性能的影响仍然不确定。儿科手部 X 光照片通常包含非解剖标记，目前尚不清楚修复这些区域是否可以保留骨龄和性别预测所需的特征。为了评估基于生成模型修复去除伪影的临床可靠性，我们使用了 RSNA 骨龄挑战数据集，选择 200 张原始放射线照片，并使用 gpt-image-1 使用自然语言提示生成 600 个修复版本，以针对非解剖伪影。使用深度学习集成进行骨龄估计和性别分类，使用平均绝对误差 (MAE) 和 ROC 曲线下面积 (AUC) 作为指标，并使用像素强度分布来检测结构变化，评估下游性能。修复明显降低了模型性能：骨龄 MAE 从 6.26 个月增加到 30.11 个月，性别分类 AUC 从 0.955 下降到 0.704。修复后的图像显示像素强度变化和不一致，表明结构修改未通过简单校准进行校正。这些发现表明，虽然视觉上真实，但基于基础模型的修复可能会掩盖微妙但与临床相关的特征，并引入潜在的偏差，即使编辑仅限于非诊断区域，这强调了在将此类生成工具集成到临床人工智能工作流程之前需要进行严格的、特定于任务的验证。
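Note: the two metrics reported above, mean absolute error (MAE) and area under the ROC curve (AUC), can be computed as in the following minimal sketch. The function names are illustrative and the AUC uses the pairwise-ranking definition for binary labels.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    labels: 0/1 ground truth; scores: predicted scores. Ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance-level ranking, which gives a sense of how large the reported drop from 0.955 to 0.704 is.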

Title: db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism

Authors: Siqi Chen, Ke Hong, Tianchen Zhao, Ruiqi Xie, Zhenhua Zhu, Xudong Zhang, Yu Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.23113
Pdf URL: https://arxiv.org/pdf/2511.23113
Copy Paste: [[2511.23113]] db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism(https://arxiv.org/abs/2511.23113)
Keywords: generation, generative
Abstract: Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods on average. Code is available at this https URL.
摘要：通过序列并行扩展扩散变换器 (DiT) 推理对于减少视觉生成中的延迟至关重要，但当应用于采用块式稀疏注意力的模型时，会受到工作负载不平衡的严重阻碍。当沿着头维度（如在 Ulysses 中）或块维度（如在 Ring Attention 中）应用序列并行性时，这种不平衡源于注意力头之间稀疏性的固有变化以及稀疏掩模内密集块的不规则分布。在本文中，我们形式化了稀疏不平衡比来量化不平衡，并提出了 db-SP，一种解决这一挑战的稀疏感知序列并行技术。 db-SP 包含双层分区方法，可以在头级和块级实现近乎完美的工作负载平衡，而开销可以忽略不计。此外，为了处理跨降噪步骤和层不断变化的稀疏模式，db-SP 在运行时动态确定头和块维度的并行度。实验结果表明，与最先进的序列并行方法相比，db-SP 的端到端加速平均提高了 1.25 倍，注意力特定加速提高了 1.40 倍。代码可从此 https URL 获取。
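Note: to make the workload-imbalance problem above concrete, the sketch below measures imbalance as the maximum per-device load divided by the mean load and compares a naive contiguous split of attention heads with a greedy largest-first assignment. Both the ratio definition and the greedy heuristic are assumptions chosen for illustration; they are not the paper's sparse imbalance ratio or the db-SP partitioning algorithm.

```python
def imbalance_ratio(loads):
    """Max per-device load over mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


def contiguous_partition(work, n_devices):
    """Split heads into equal-sized contiguous groups (assumes divisibility)."""
    size = len(work) // n_devices
    return [sum(work[i * size:(i + 1) * size]) for i in range(n_devices)]


def greedy_partition(work, n_devices):
    """Assign each head, largest first, to the currently least-loaded device."""
    loads = [0] * n_devices
    for w in sorted(work, reverse=True):
        loads[loads.index(min(loads))] += w
    return loads


if __name__ == "__main__":
    dense_blocks_per_head = [9, 8, 7, 6, 1, 1, 1, 1]  # toy numbers, not measured
    print(imbalance_ratio(contiguous_partition(dense_blocks_per_head, 2)))
    print(imbalance_ratio(greedy_partition(dense_blocks_per_head, 2)))
```

In this toy case the contiguous split puts all the dense heads on one device, while the greedy assignment balances the two devices exactly.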

Title: DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation

Authors: Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu, Yuqi Zhang, Shuai Yang, Kun Zhou, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23127
Pdf URL: https://arxiv.org/pdf/2511.23127
Copy Paste: [[2511.23127]] DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation(https://arxiv.org/abs/2511.23127)
Keywords: generation
Abstract: This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Project page: this https URL.
摘要：本文提出了 DualCamCtrl，一种用于摄像机控制视频生成的新颖的端到端扩散模型。最近的工作通过将相机姿势表示为基于光线的条件来推进这一领域，但它们往往缺乏足够的场景理解和几何意识。 DualCamCtrl 通过引入双分支框架专门针对这一限制，该框架可相互生成相机一致的 RGB 和深度序列。为了协调这两种模式，我们进一步提出了语义引导相互对齐（SIGMA）机制，该机制以语义引导和相互增强的方式执行 RGB 深度融合。这些设计共同使 DualCamCtrl 能够更好地理清外观和几何建模，生成更忠实地遵循指定摄像机轨迹的视频。此外，我们分析并揭示了深度和相机姿势在去噪阶段的独特影响，并进一步证明早期和后期在形成全局结构和细化局部细节方面发挥着互补作用。大量实验表明，DualCamCtrl 实现了更一致的摄像机控制视频生成，与之前的方法相比，摄像机运动误差减少了 40% 以上。项目页面：此 https URL。

Title: InstanceV: Instance-Level Video Generation

Authors: Yuheng Chen, Teng Hu, Jiangning Zhang, Zhucun Xue, Ran Yi, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23146
Pdf URL: https://arxiv.org/pdf/2511.23146
Copy Paste: [[2511.23146]] InstanceV: Instance-Level Video Generation(https://arxiv.org/abs/2511.23146)
Keywords: generation
Abstract: Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for more comprehensive evaluation of instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models in both general quality and instance-aware metrics across qualitative and quantitative evaluations.
摘要：文本到视频扩散模型的最新进展使得能够生成以文本描述为条件的高质量视频。然而，大多数现有的文本到视频模型仅依赖于文本条件，缺乏对视频生成的一般细粒度可控性。为了应对这一挑战，我们提出了 InstanceV，这是一种视频生成框架，可实现 i) 实例级控制和 ii) 全局语义一致性。具体来说，借助所提出的实例感知屏蔽交叉注意机制，InstanceV 最大限度地利用额外的实例级基础信息，在指定的空间位置生成正确归因的实例。为了提高整体一致性，我们引入了共享时间步长自适应提示增强模块，该模块以参数有效的方式将本地实例与全局语义连接起来。此外，我们在训练和推理过程中加入了空间感知无条件指导，以减轻小实例的消失。最后，我们提出了一个名为 InstanceBench 的新基准，它将通用视频质量指标与实例感知指标相结合，以便对实例级视频生成进行更全面的评估。大量实验表明，InstanceV 不仅在视频生成方面实现了卓越的实例级可控性，而且在定性和定量评估中的一般质量和实例感知指标方面均优于现有的最先进模型。

Title: REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection

Authors: Huangsen Cao, Qin Mei, Zhiheng Li, Yuxi Li, Ying Zhang, Chen Li, Zhimeng Zhang, Xin Ding, Yongwei Wang, Jing Lyu, Fei Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23158
Pdf URL: https://arxiv.org/pdf/2511.23158
Copy Paste: [[2511.23158]] REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection(https://arxiv.org/abs/2511.23158)
Keywords: generation, generative
Abstract: With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce REVEAL-Bench, the first reasoning-enhanced multimodal benchmark for AI-generated image detection that is explicitly structured around a chain of evidence derived from multiple lightweight expert models and records step-by-step reasoning traces and evidential justifications. Building upon this dataset, we propose REVEAL (Reasoning-enhanced Forensic Evidence Analysis), an effective and explainable forensic framework that integrates detection with novel expert-grounded reinforcement learning. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and robust cross-model generalization, setting a new state of the art for explainable image forensics.
摘要：随着生成模型的快速进步，视觉逼真的人工智能生成图像变得越来越难以与真实图像区分开，对社会信任和信息完整性构成严重威胁。因此，迫切需要有效且真正可解释的图像取证方法。最近的检测范式已经转向可解释的取证。然而，最先进的方法主要依赖于事后合理化或视觉判别，缺乏可验证的证据链。这种对表面模式匹配的依赖限制了因果解释的生成，并且常常导致泛化能力较差。为了弥补这一关键差距，我们引入了 REVEAL-Bench，这是第一个用于人工智能生成图像检测的推理增强多模态基准，它是围绕从多个轻量级专家模型派生的证据链明确构建的，并记录逐步推理轨迹和证据论证。在此数据集的基础上，我们提出了 REVEAL (Reasoning-enhanced Forensic Evidence Analysis)，这是一种有效且可解释的取证框架，它将检测与新颖的基于专家的强化学习相结合。我们的奖励机制是专门定制的，旨在共同优化基于明确取证证据的检测准确性、解释保真度和逻辑一致性，使 REVEAL 能够在检测结果的同时产生细粒度、可解释和可验证的推理链。大量的实验结果表明，REVEAL 显着提高了检测精度、解释保真度和强大的跨模型泛化能力，为可解释的图像取证树立了新的技术水平。

Title: Fast Multi-view Consistent 3D Editing with Video Priors

Authors: Liyi Chen, Ruihuang Li, Guowen Zhang, Pengfei Wang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23172
Pdf URL: https://arxiv.org/pdf/2511.23172
Copy Paste: [[2511.23172]] Fast Multi-view Consistent 3D Editing with Video Priors(https://arxiv.org/abs/2511.23172)
Keywords: generation, generative
Abstract: Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.
摘要：文本驱动的 3D 编辑支持使用文本指令进行用户友好的 3D 对象或场景编辑。由于缺乏多视图一致性先验，现有方法通常采用 2D 生成或编辑模型来单独处理每个视图，然后迭代 2D-3D-2D 更新。然而，这些方法不仅耗时，而且容易产生过度平滑的结果，因为从不同视图收集的不同编辑信号在迭代过程中被平均。在本文中，我们提出基于生成视频先验的 3D 编辑 (ViP3DE)，以利用预训练视频生成模型的时间一致性先验，在单次前向传递中实现多视图一致 3D 编辑。我们的主要见解是在单个编辑视图上调节视频生成模型，以生成其他一致的编辑视图以直接进行 3D 更新，从而绕过迭代编辑范例。由于 3D 更新需要将编辑的视图与特定的相机姿势配对，因此我们建议视频模型进行运动保留噪声混合，以在预定义的相机姿势下生成编辑的视图。此外，我们引入了几何感知去噪，通过将 3D 几何先验集成到视频模型中来进一步增强多视图一致性。大量实验表明，我们提出的 ViP3DE 即使在单次前向传递中也能实现高质量的 3D 编辑结果，在编辑质量和速度方面都显着优于现有方法。

Title: GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation

Authors: Yuhao Wan, Lijuan Liu, Jingzhi Zhou, Zihan Zhou, Xuying Zhang, Dongbo Zhang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23191
Pdf URL: https://arxiv.org/pdf/2511.23191
Copy Paste: [[2511.23191]] GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation(https://arxiv.org/abs/2511.23191)
Keywords: generation
Abstract: Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models and present our GeoWorld. Instead of exploiting geometric information obtained from a single-frame input, we propose to first generate consecutive video frames and then take advantage of the geometry model to provide full-frame geometry features, which contain richer information than single-frame depth maps or camera embeddings used in previous methods, and use these geometry features as geometrical conditions to aid the video generation model. To enhance the consistency of geometric structures, we further propose a geometry alignment loss to provide the model with real-world geometric constraints and a geometry adaptation module to ensure the effective utilization of geometry features. Extensive experiments show that our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively. Project Page: this https URL.
摘要：之前利用视频模型生成图像到 3D 场景的作品往往会遇到几何失真和内容模糊的问题。在本文中，我们通过释放几何模型的潜力来革新图像到 3D 场景生成的流程，并展示我们的 GeoWorld。我们建议首先生成连续的视频帧，然后利用几何模型提供全帧几何特征，而不是利用从单帧输入获得的几何信息，该特征包含比先前方法中使用的单帧深度图或相机嵌入更丰富的信息，并使用这些几何特征作为几何条件来辅助视频生成模型。为了增强几何结构的一致性，我们进一步提出了几何对齐损失，为模型提供现实世界的几何约束和几何适应模块，以确保几何特征的有效利用。大量实验表明，我们的 GeoWorld 可以从单个图像和给定的相机轨迹生成高保真 3D 场景，在质量和数量上都优于现有方法。项目页面：此 https URL。

Title: Vision Bridge Transformer at Scale

Authors: Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23199
Pdf URL: https://arxiv.org/pdf/2511.23199
Copy Paste: [[2511.23199]] Vision Bridge Transformer at Scale(https://arxiv.org/abs/2511.23199)
Keywords: generation
Abstract: We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
摘要：我们推出 Vision Bridge Transformer (ViBT)，它是专为条件生成而设计的布朗桥模型的大规模实例。与将噪声转换为数据的传统扩散模型不同，桥模型直接对输入和输出之间的轨迹进行建模，从而创建有效的数据到数据的转换范例。通过将这些模型扩展到 20B 和 1.3B 参数，我们展示了它们在图像和视频翻译任务中的有效性。为了支持这种规模，我们采用了 Transformer 架构，并提出了用于稳健训练的方差稳定速度匹配目标。这些进步共同凸显了扩展桥接模型在基于指令的图像编辑和复杂视频翻译方面的强大功能。

Title: Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day

Authors: Milad Abdollahzadeh, Abdul Raheem, Zilong Zhao, Uzair Javaid, Kevin Yee, Nalam Venkata Abhishek, Tram Truong-Huu, Biplab Sikdar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23220
Pdf URL: https://arxiv.org/pdf/2511.23220
Copy Paste: [[2511.23220]] Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day(https://arxiv.org/abs/2511.23220)
Keywords: generation
Abstract: Tabular instruction tuning has emerged as a promising research direction for improving LLMs' understanding of tabular data. However, the majority of existing works only consider question-answering and reasoning tasks over tabular data, leaving tabular data generation largely unnoticed. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs' tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we aim to address the possibility of instruction tuning for tabular data generation with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.
摘要：表格指令调整已成为提高 LLM 对表格数据理解的有前途的研究方向。然而，大多数现有的工作只考虑表格数据的问答和推理任务，而表格数据的生成在很大程度上被忽视。在这项工作中，我们首次探讨了指令调整在提高 LLM 表格数据生成能力方面的功效。更具体地说，考虑到表格指令调整的高数据和计算要求，我们的目标是解决在有限的数据和计算资源下进行表格数据生成的指令调整的可能性。为了实现这一目标，我们首先为表格数据创建高质量的指令数据集，从而实现高效的 LLM 理解。然后，我们在此数据集的训练集上对开源 LLM (Llama3.1-8B-Instruct) 进行指令调整，以提高其表格数据生成性能。我们的实验结果表明，通过使用我们的高质量数据集和 A100 GPU 仅对 7K 指令进行指令调整，在不到 6 小时的时间内，我们实现了与最强大的商业 LLM GPT-4o 相当的表格数据生成性能。

Title: Language-guided 3D scene synthesis for fine-grained functionality understanding

Authors: Jaime Corsetti, Francesco Giuliari, Davide Boscaini, Pedro Hermosilla, Andrea Pilzer, Guofeng Mei, Alexandros Delitzas, Francis Engelmann, Fabio Poiesi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23230
Pdf URL: https://arxiv.org/pdf/2511.23230
Copy Paste: [[2511.23230]] Language-guided 3D scene synthesis for fine-grained functionality understanding(https://arxiv.org/abs/2511.23230)
Keywords: generation
Abstract: Functionality understanding in 3D, which aims to identify the functional element in a 3D scene to complete an action (e.g., the correct handle to "Open the second drawer of the cabinet near the bed"), is hindered by the scarcity of real-world data due to the substantial effort needed for its collection and annotation. To address this, we introduce SynthFun3D, the first method for task-based 3D scene synthesis. Given the action description, SynthFun3D generates a 3D indoor environment using a furniture asset database with part-level annotation, ensuring the action can be accomplished. It reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element, enabling the inexpensive and large-scale generation of high-quality annotated data. We validate SynthFun3D through user studies, which demonstrate improved scene-prompt coherence compared to other approaches. Our quantitative results further show that the generated data can either replace real data with minor performance loss or supplement real data for improved performance, thereby providing an inexpensive and scalable solution for data-hungry 3D applications. Project page: this http URL.
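Summary: Functionality understanding in 3D aims to identify the functional element in a 3D scene needed to complete an action (e.g., the correct handle for "Open the second drawer of the cabinet near the bed"), but it is hindered by the scarcity of real-world data, which takes substantial effort to collect and annotate. To address this, we introduce SynthFun3D, the first method for task-based 3D scene synthesis. Given an action description, SynthFun3D generates a 3D indoor environment using a furniture asset database with part-level annotations, ensuring that the action can be accomplished. It reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element, enabling inexpensive, large-scale generation of high-quality annotated data. We validate SynthFun3D through user studies, which demonstrate improved scene-prompt coherence compared to other approaches. Our quantitative results further show that the generated data can either replace real data with minor performance loss or supplement real data for improved performance, providing an inexpensive and scalable solution for data-hungry 3D applications. Project page: this http URL.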

Title: Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods

Authors: Jose Moises Araya-Martinez, Adrián Sanchis Reig, Gautham Mohan, Sarvenaz Sardari, Jens Lambrecht, Jörg Krüger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23241
Pdf URL: https://arxiv.org/pdf/2511.23241
Copy Paste: [[2511.23241]] Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods(https://arxiv.org/abs/2511.23241)
Keywords: generation, generative
Abstract: Reducing the burden of data generation and annotation remains a major challenge for the cost-effective deployment of machine learning in industrial and robotics settings. While synthetic rendering is a promising solution, bridging the sim-to-real gap often requires expert intervention. In this work, we benchmark a range of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI), and classical rendering approaches, for creating contextualized synthetic data without manual annotation. Our evaluation focuses on the effectiveness and efficiency of low-level and high-level feature alignment, as well as a controlled diffusion-based DA method guided by prompts generated from real-world contexts. We validate our methods on two datasets: a proprietary industrial dataset (automotive and logistics) and a public robotics dataset. Results show that if render-based data with enough variability is available as seed, simpler feature-based methods, such as brightness-based and perceptual hashing filtering, outperform more complex GenAI-based approaches in both accuracy and resource efficiency. Perceptual hashing consistently achieves the highest performance, with mAP50 scores of 98% and 67% on the industrial and robotics datasets, respectively. Additionally, GenAI methods present significant time overhead for data generation at no apparent improvement of sim-to-real mAP values compared to simpler methods. Our findings offer actionable insights for efficiently bridging the sim-to-real gap, enabling high real-world performance from models trained exclusively on synthetic data.
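Summary: Reducing the burden of data generation and annotation remains a major challenge for the cost-effective deployment of machine learning in industrial and robotics settings. While synthetic rendering is a promising solution, bridging the sim-to-real gap often requires expert intervention. In this work, we benchmark a range of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI), and classical rendering approaches, for creating contextualized synthetic data without manual annotation. Our evaluation focuses on the effectiveness and efficiency of low-level and high-level feature alignment, as well as a controlled diffusion-based DA method guided by prompts generated from real-world contexts. We validate our methods on two datasets: a proprietary industrial dataset (automotive and logistics) and a public robotics dataset. Results show that if render-based data with enough variability is available as a seed, simpler feature-based methods, such as brightness-based and perceptual hashing filtering, outperform more complex GenAI-based approaches in both accuracy and resource efficiency. Perceptual hashing consistently achieves the highest performance, with mAP50 scores of 98% and 67% on the industrial and robotics datasets, respectively. In addition, GenAI methods incur significant time overhead for data generation with no apparent improvement in sim-to-real mAP values compared to simpler methods. Our findings offer actionable insights for efficiently bridging the sim-to-real gap, enabling high real-world performance from models trained exclusively on synthetic data.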
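
The abstract names perceptual hashing filtering but does not say which hash or which filtering rule is used. As a rough illustration only, the sketch below assumes an average hash over small grayscale grids (already resized to a common shape) and keeps a synthetic image when its hash lies within a chosen Hamming distance of some real image's hash. The function names, the hash variant, and the `max_distance` threshold are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

# A grayscale image as rows of pixel intensities (assumed already downsampled).
Image = Sequence[Sequence[float]]


def average_hash(img: Image) -> List[int]:
    """Average hash: one bit per pixel, 1 if the pixel is above the image mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of differing bits between two equal-length hashes."""
    if len(a) != len(b):
        raise ValueError("hashes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def filter_synthetic(synthetic: Sequence[Image],
                     real: Sequence[Image],
                     max_distance: int) -> List[Image]:
    """Keep synthetic images whose hash is within max_distance bits of any real image."""
    real_hashes = [average_hash(r) for r in real]
    kept = []
    for img in synthetic:
        h = average_hash(img)
        # Distance to the closest real image decides whether the sample is kept.
        if min(hamming(h, rh) for rh in real_hashes) <= max_distance:
            kept.append(img)
    return kept
```

Hashing is cheap compared with running a generative model per image, which is in line with the resource-efficiency comparison reported in the abstract.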

Title: UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Authors: Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23332
Pdf URL: https://arxiv.org/pdf/2511.23332
Copy Paste: [[2511.23332]] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes(https://arxiv.org/abs/2511.23332)
Keywords: generation
Abstract: Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at this https URL.
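Summary: Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, which hinders effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for instruction-driven segmentation in remote sensing, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building on this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg on GeoSeg-Bench and diverse public benchmarks, together with strong zero-shot generalization. Datasets and source code are released at this https URL.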

Title: Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

Authors: Yu Zhang, Jingyi Liu, Yiwei Shi, Qi Zhang, Duoqian Miao, Changwei Wang, Longbing Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23334
Pdf URL: https://arxiv.org/pdf/2511.23334
Copy Paste: [[2511.23334]] Markovian Scale Prediction: A New Era of Visual Autoregressive Generation(https://arxiv.org/abs/2511.23334)
Keywords: generation
Abstract: Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: Compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 $\times$ 256) and decreases peak memory consumption by 83.8% (1024 $\times$ 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.
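Summary: Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To this end, we reformulate VAR as a non-full-context Markov process and propose Markov-VAR. It is realized via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector, compensating for the historical information lost by dropping full-context dependency. Combining the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 $\times$ 256) and decreases peak memory consumption by 83.8% (1024 $\times$ 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.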

Title: Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories

Authors: Xinxi Zhang, Shiwei Tan, Quang Nguyen, Quan Dao, Ligong Han, Xiaoxiao He, Tunyu Zhang, Alen Mrdovic, Dimitris Metaxas
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23342
Pdf URL: https://arxiv.org/pdf/2511.23342
Copy Paste: [[2511.23342]] Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories(https://arxiv.org/abs/2511.23342)
Keywords: generation, generative
Abstract: Flow-based generative models have recently demonstrated strong performance, yet sampling typically relies on expensive numerical integration of ordinary differential equations (ODEs). Rectified Flow enables one-step sampling by learning nearly straight probability paths, but achieving such straightness requires multiple computationally intensive reflow iterations. MeanFlow achieves one-step generation by directly modeling the average velocity over time; however, when trained on highly curved flows, it suffers from slow convergence and noisy supervision. To address these limitations, we propose Rectified MeanFlow, a framework that models the mean velocity field along the rectified trajectory using only a single reflow step. This eliminates the need for perfectly straightened trajectories while enabling efficient training. Furthermore, we introduce a simple yet effective truncation heuristic that aims to reduce residual curvature and further improve performance. Extensive experiments on ImageNet at 64, 256, and 512 resolutions show that Re-MeanFlow consistently outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency. Code is available at this https URL.
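Summary: Flow-based generative models have recently demonstrated strong performance, yet sampling typically relies on expensive numerical integration of ordinary differential equations (ODEs). Rectified Flow enables one-step sampling by learning nearly straight probability paths, but achieving such straightness requires multiple computationally intensive reflow iterations. MeanFlow achieves one-step generation by directly modeling the average velocity over time; however, when trained on highly curved flows, it suffers from slow convergence and noisy supervision. To address these limitations, we propose Rectified MeanFlow, a framework that models the mean velocity field along the rectified trajectory using only a single reflow step. This eliminates the need for perfectly straightened trajectories while enabling efficient training. Furthermore, we introduce a simple yet effective truncation heuristic that aims to reduce residual curvature and further improve performance. Extensive experiments on ImageNet at 64, 256, and 512 resolutions show that Re-MeanFlow consistently outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency. Code is available at this https URL.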
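
To make the phrase "modeling the average velocity over time" concrete, here is a toy sketch with no network involved. It assumes a hand-picked curved 1-D trajectory z(tau) = tau**2 (purely hypothetical) and approximates the average velocity over an interval [r, t] as (1 / (t - r)) times the integral of the instantaneous velocity. A single step using that average reaches the endpoint, whereas a single Euler step using the instantaneous velocity at the start does not. The forward time direction and all function names are illustrative choices, not the paper's setup.

```python
def position(tau: float) -> float:
    """Toy curved 1-D trajectory z(tau) = tau**2 (hypothetical, for illustration)."""
    return tau ** 2


def instantaneous_velocity(tau: float) -> float:
    """dz/dtau for the toy trajectory."""
    return 2.0 * tau


def average_velocity(r: float, t: float, steps: int = 1000) -> float:
    """Approximate (1 / (t - r)) * integral_r^t v(tau) dtau with the midpoint rule."""
    width = (t - r) / steps
    total = sum(instantaneous_velocity(r + (i + 0.5) * width) for i in range(steps))
    return total * width / (t - r)


def one_step(z_start: float, velocity: float, r: float, t: float) -> float:
    """Move from time r to time t in a single step with a constant velocity."""
    return z_start + (t - r) * velocity


if __name__ == "__main__":
    r, t = 0.0, 1.0
    with_average = one_step(position(r), average_velocity(r, t), r, t)
    with_euler = one_step(position(r), instantaneous_velocity(r), r, t)
    print("true endpoint:", position(t))
    print("one step, average velocity:", with_average)
    print("one step, instantaneous velocity:", with_euler)
```

The gap between the two one-step results grows with the curvature of the path, which is one way to read the abstract's motivation for working along rectified (straighter) trajectories.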

Title: SimScale: Learning to Drive via Real-World Simulation at Scale

Authors: Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Tieniu Tan, Long Chen, Hongyang Li
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.23369
Pdf URL: https://arxiv.org/pdf/2511.23369
Copy Paste: [[2511.23369]] SimScale: Learning to Drive via Real-World Simulation at Scale(https://arxiv.org/abs/2511.23369)
Keywords: generation
Abstract: Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. With the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Our simulation data and code will be released.
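Summary: Achieving fully autonomous driving systems requires learning rational decisions across a wide range of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in the real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive numbers of unseen states from existing driving logs. Our pipeline uses advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. With the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can significantly improve the robustness and generalization of various planning methods on challenging real-world benchmarks, by up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, this policy improvement scales smoothly by increasing simulation data alone, even without additional real-world data. We further reveal several crucial findings about such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties of different policy architectures. Our simulation data and code will be released.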

Title: VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Authors: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23386
Pdf URL: https://arxiv.org/pdf/2511.23386
Copy Paste: [[2511.23386]] VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction(https://arxiv.org/abs/2511.23386)
Keywords: generation
Abstract: Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing separate encoders for understanding and generation respectively, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which is the first exploration of a unified representation that produces Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, we freeze the encoder and learn a high-dimensional semantic VQ codebook with a pixel reconstruction objective; then we jointly optimize the encoder with self-distillation constraints. This design incurs negligible loss of semantic information, maintaining the ability of multimodal understanding, while providing discrete tokens that are compatible with generation and fine-grained reconstruction. Besides, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the previous common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction, with promising scaling properties in the autoregressive paradigm owing to its discrete nature.
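Summary: Unifying the representations for multimodal understanding, generation, and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this within a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which is the first exploration of a unified representation that produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. Specifically, we build on pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, the encoder is frozen and a high-dimensional semantic VQ codebook is learned with a pixel reconstruction objective; then the encoder is jointly optimized under self-distillation constraints. This design incurs negligible loss of semantic information, preserving multimodal understanding, while providing discrete tokens that are compatible with generation and fine-grained reconstruction. In addition, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the previous common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE delivers competitive performance on several benchmarks for visual understanding, generation, and reconstruction, with promising scaling properties in the autoregressive paradigm owing to its discrete nature.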

Title: Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model

Authors: Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Qinglin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23429
Pdf URL: https://arxiv.org/pdf/2511.23429
Copy Paste: [[2511.23429]] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model(https://arxiv.org/abs/2511.23429)
Keywords: generative
Abstract: Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video contents through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally define the concept of interactive video data and develop an automated process to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts (MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free-form user instructions such as "open the door", "draw a torch", or "trigger an explosion".
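Summary: Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, which restricts their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model lets users control game video content through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally define the concept of interactive video data and develop an automated process that transforms large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built on a 14B image-to-video Mixture-of-Experts (MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse, free-form user instructions such as "open the door", "draw a torch", or "trigger an explosion".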

Title: Accelerated Execution of Bayesian Neural Networks using a Single Probabilistic Forward Pass and Code Generation

Authors: Bernhard Klein, Falk Selker, Hendrik Borras, Sophie Steger, Franz Pernkopf, Holger Fröning
Subjects: cs.LG, cs.AR, cs.DC, stat.ML
Abstract URL: https://arxiv.org/abs/2511.23440
Pdf URL: https://arxiv.org/pdf/2511.23440
Copy Paste: [[2511.23440]] Accelerated Execution of Bayesian Neural Networks using a Single Probabilistic Forward Pass and Code Generation(https://arxiv.org/abs/2511.23440)
Keywords: generation
Abstract: Machine learning models perform well across domains such as diagnostics, weather forecasting, NLP, and autonomous driving, but their limited uncertainty handling restricts use in safety-critical settings. Traditional neural networks often fail to detect out-of-domain (OOD) data and may output confident yet incorrect predictions. Bayesian neural networks (BNNs) address this by providing probabilistic estimates, but incur high computational cost because predictions require sampling weight distributions and multiple forward passes. The Probabilistic Forward Pass (PFP) offers a highly efficient approximation to Stochastic Variational Inference (SVI) by assuming Gaussian-distributed weights and activations, enabling fully analytic uncertainty propagation and replacing sampling with a single deterministic forward pass. We present an end-to-end pipeline for training, compiling, optimizing, and deploying PFP-based BNNs on embedded ARM CPUs. Using the TVM deep learning compiler, we implement a dedicated library of Gaussian-propagating operators for multilayer perceptrons and convolutional neural networks, combined with manual and automated tuning strategies. Ablation studies show that PFP consistently outperforms SVI in computational efficiency, achieving speedups of up to 4200x for small mini-batches. PFP-BNNs match SVI-BNNs on Dirty-MNIST in accuracy, uncertainty estimation, and OOD detection while greatly reducing compute cost. These results highlight the potential of combining Bayesian approximations with code generation to enable efficient BNN deployment on resource-constrained systems.
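Summary: Machine learning models perform well across domains such as diagnostics, weather forecasting, NLP, and autonomous driving, but their limited uncertainty handling restricts their use in safety-critical settings. Traditional neural networks often fail to detect out-of-domain (OOD) data and may output confident yet incorrect predictions. Bayesian neural networks (BNNs) address this by providing probabilistic estimates, but incur high computational cost because predictions require sampling weight distributions and multiple forward passes. The Probabilistic Forward Pass (PFP) offers a highly efficient approximation to Stochastic Variational Inference (SVI) by assuming Gaussian-distributed weights and activations, which enables fully analytic uncertainty propagation and replaces sampling with a single deterministic forward pass. We present an end-to-end pipeline for training, compiling, optimizing, and deploying PFP-based BNNs on embedded ARM CPUs. Using the TVM deep learning compiler, we implement a dedicated library of Gaussian-propagating operators for multilayer perceptrons and convolutional neural networks, combined with manual and automated tuning strategies. Ablation studies show that PFP consistently outperforms SVI in computational efficiency, achieving speedups of up to 4200x for small mini-batches. PFP-BNNs match SVI-BNNs on Dirty-MNIST in accuracy, uncertainty estimation, and OOD detection while greatly reducing compute cost. These results highlight the potential of combining Bayesian approximations with code generation to enable efficient BNN deployment on resource-constrained systems.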
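
As a small illustration of what "fully analytic uncertainty propagation" through a layer can look like, the sketch below propagates a mean and a variance through one dense layer in a single pass. It assumes independent Gaussian inputs and weights and a deterministic bias, and uses the standard moment formula for a product of independent variables. Whether PFP uses exactly this formulation, and the function name `gaussian_linear`, are assumptions; the abstract states only the Gaussian assumption, and the paper's operators are implemented in TVM rather than plain Python.

```python
from typing import List, Sequence, Tuple


def gaussian_linear(x_mean: Sequence[float],
                    x_var: Sequence[float],
                    w_mean: Sequence[Sequence[float]],
                    w_var: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Propagate per-element means and variances through y = W x + b.

    Assumes all inputs and weights are mutually independent, so that
    Var[w * x] = mu_w^2 * var_x + var_w * mu_x^2 + var_w * var_x.
    """
    out_mean: List[float] = []
    out_var: List[float] = []
    for wm_row, wv_row, b in zip(w_mean, w_var, bias):
        mean = b      # the bias is treated as deterministic in this sketch
        var = 0.0
        for xm, xv, wm, wv in zip(x_mean, x_var, wm_row, wv_row):
            mean += wm * xm
            var += wm * wm * xv + wv * xm * xm + wv * xv
        out_mean.append(mean)
        out_var.append(var)
    return out_mean, out_var
```

A sampling-based forward pass would instead draw many weight samples and run the layer once per sample; the closed form above replaces that loop with one deterministic evaluation, which is the source of the speedup the abstract describes.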

Title: ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts

Authors: Hang Yu, Di Zhang, Qiwei Du, Yanping Zhao, Hai Zhang, Guang Chen, Eduardo E. Veas, Junqiao Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23442
Pdf URL: https://arxiv.org/pdf/2511.23442
Copy Paste: [[2511.23442]] ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts(https://arxiv.org/abs/2511.23442)
Keywords: generative
Abstract: Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented trajectories present challenges for reward propagation, resulting in inaccurate value estimation and degraded policy performance. While trajectory stitching via generative models offers a promising solution, existing augmentation methods frequently produce trajectories that are either confined to the support of the behavior policy or violate the underlying dynamics, thereby limiting their effectiveness for policy improvement. We propose ASTRO, a data augmentation framework that generates distributionally novel and dynamics-consistent trajectories for offline RL. ASTRO first learns a temporal-distance representation to identify distinct and reachable stitch targets. We then employ a dynamics-guided stitch planner that adaptively generates connecting action sequences via Rollout Deviation Feedback, defined as the gap between target state sequence and the actual arrived state sequence by executing predicted actions, to improve trajectory stitching's feasibility and reachability. This approach facilitates effective augmentation through stitching and ultimately enhances policy learning. ASTRO outperforms prior offline RL augmentation methods across various algorithms, achieving notable performance gain on the challenging OGBench suite and demonstrating consistent improvements on standard offline RL benchmarks such as D4RL.
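Summary: Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented trajectories present challenges for reward propagation, resulting in inaccurate value estimation and degraded policy performance. While trajectory stitching via generative models offers a promising solution, existing augmentation methods frequently produce trajectories that are either confined to the support of the behavior policy or violate the underlying dynamics, which limits their effectiveness for policy improvement. We propose ASTRO, a data augmentation framework that generates distributionally novel and dynamics-consistent trajectories for offline RL. ASTRO first learns a temporal-distance representation to identify distinct and reachable stitch targets. We then employ a dynamics-guided stitch planner that adaptively generates connecting action sequences via Rollout Deviation Feedback, defined as the gap between the target state sequence and the state sequence actually reached by executing the predicted actions, to improve the feasibility and reachability of trajectory stitching. This approach facilitates effective augmentation through stitching and ultimately enhances policy learning. ASTRO outperforms prior offline RL augmentation methods across various algorithms, achieving notable performance gains on the challenging OGBench suite and demonstrating consistent improvements on standard offline RL benchmarks such as D4RL.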
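
The abstract defines Rollout Deviation Feedback as the gap between the target state sequence and the state sequence actually reached by executing the predicted actions. A minimal sketch of that quantity follows, assuming a caller-supplied dynamics function and a per-step Euclidean distance. The distance metric, the function names, and how the planner consumes the feedback are assumptions; the abstract does not specify them.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def rollout(start: State,
            actions: Sequence[Action],
            dynamics: Callable[[State, Action], State]) -> List[State]:
    """Execute the actions one by one and record each state that is reached."""
    states: List[State] = []
    state = start
    for action in actions:
        state = dynamics(state, action)
        states.append(state)
    return states


def rollout_deviation(target_states: Sequence[State],
                      start: State,
                      actions: Sequence[Action],
                      dynamics: Callable[[State, Action], State]) -> List[float]:
    """Per-step Euclidean gap between target states and the states actually reached."""
    reached = rollout(start, actions, dynamics)
    if len(reached) != len(target_states):
        raise ValueError("need exactly one target state per action")
    return [math.dist(t, r) for t, r in zip(target_states, reached)]
```

A planner could, for example, revise the actions at the steps where the deviation is largest; that loop is left out here because the abstract does not describe it.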

Title: Visual Generation Tuning

Authors: Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23469
Pdf URL: https://arxiv.org/pdf/2511.23469
Copy Paste: [[2511.23469]] Visual Generation Tuning(https://arxiv.org/abs/2511.23469)
Keywords: generation
Abstract: Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language models. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT showcases significant scaling promise and is versatile for endowing any VLMs trained for multimodal understanding with the capabilities of visual generation, which paves the new avenue to explore next-generation unified multimodal foundation models. Models and codes are available at this https URL.
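Summary: Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT (Visual Generation Tuning), a novel paradigm designed to unlock the underlying visual generation capabilities of any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly reduce alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we discard the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE by aligning the semantic encoders of pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art results among autoregressive models, with 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, VGT shows significant scaling promise and is versatile enough to endow any VLM trained for multimodal understanding with visual generation capabilities, which opens a new avenue for exploring next-generation unified multimodal foundation models. Models and code are available at this https URL.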

Title: AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

Authors: Zhizhou Zhong, Yicheng Ji, Zhe Kong, Yiying Liu, Jiarui Wang, Jiasun Feng, Lupeng Liu, Xiangyi Wang, Yanjia Li, Yuqing She, Ying Qin, Huan Li, Shuiyang Mao, Wei Liu, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23475
Pdf URL: https://arxiv.org/pdf/2511.23475
Copy Paste: [[2511.23475]] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement(https://arxiv.org/abs/2511.23475)
Keywords: generation, generative
Abstract: Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.
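Summary: Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high cost of collecting diverse multi-person data and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing the number of drivable identities to scale arbitrarily. In addition, training multi-person generative models demands massive multi-person data. Our proposed training pipeline relies solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.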