2025-12-16

Title: Active Inference with Reusable State-Dependent Value Profiles

Authors: Jacob Poschl
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11829
Pdf URL: https://arxiv.org/pdf/2512.11829
Copy Paste: [[2512.11829]] Active Inference with Reusable State-Dependent Value Profiles(https://arxiv.org/abs/2512.11829)
Keywords: generative
Abstract: Adaptive behavior in volatile environments requires agents to switch among value-control regimes across latent contexts, but maintaining separate preferences, policy biases, and action-confidence parameters for every situation is intractable. We introduce value profiles: a small set of reusable bundles of value-related parameters (outcome preferences, policy priors, and policy precision) assigned to hidden states in a generative model. As posterior beliefs over states evolve trial by trial, effective control parameters arise via belief-weighted mixing, enabling state-conditional strategy recruitment without requiring independent parameters for each context. We evaluate this framework in probabilistic reversal learning, comparing static-precision, entropy-coupled dynamic-precision, and profile-based models using cross-validated log-likelihood and information criteria. Model comparison favors the profile-based model over simpler alternatives (about 100-point AIC differences), and parameter-recovery analyses support structural identifiability even when context must be inferred from noisy observations. Model-based inference further suggests that adaptive control in this task is driven primarily by modulation of policy priors rather than policy precision, with gradual belief-dependent profile recruitment consistent with state-conditional (not purely uncertainty-driven) control. Overall, reusable value profiles provide a tractable computational account of belief-conditioned value control in volatile environments and yield testable signatures of belief-dependent control and behavioral flexibility.
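The belief-weighted mixing described in the abstract can be illustrated with a minimal sketch. It assumes a simple convex combination of per-profile parameters under the posterior over hidden states; the paper's exact mixing rule may differ. The function name `mix_profiles`, the dictionary layout, and all numbers are hypothetical, and outcome preferences are omitted for brevity.

```python
def mix_profiles(state_beliefs, state_to_profile, profiles):
    """Return belief-weighted effective control parameters.

    state_beliefs    : posterior probabilities over hidden states (sum to 1)
    state_to_profile : index of the value profile assigned to each state
    profiles         : list of dicts with 'precision' and 'policy_prior'
    """
    n_policies = len(profiles[0]["policy_prior"])
    precision = 0.0
    policy_prior = [0.0] * n_policies
    for belief, k in zip(state_beliefs, state_to_profile):
        profile = profiles[k]
        # Each state contributes its profile's parameters, weighted by belief.
        precision += belief * profile["precision"]
        for j, p in enumerate(profile["policy_prior"]):
            policy_prior[j] += belief * p
    return {"precision": precision, "policy_prior": policy_prior}


# Hypothetical example: four hidden states share two reusable profiles.
PROFILES = [
    {"precision": 2.0, "policy_prior": [0.8, 0.2]},  # illustrative "stay" regime
    {"precision": 4.0, "policy_prior": [0.4, 0.6]},  # illustrative "switch" regime
]
effective = mix_profiles([0.5, 0.25, 0.125, 0.125], [0, 0, 1, 1], PROFILES)
```

Because the mixture is convex, the effective policy prior stays normalized, and the number of free parameters grows with the number of profiles rather than the number of states.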

Title: CR3G: Causal Reasoning for Patient-Centric Explanations in Radiology Report Generation

Authors: Satyam Kumar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11830
Pdf URL: https://arxiv.org/pdf/2512.11830
Copy Paste: [[2512.11830]] CR3G: Causal Reasoning for Patient-Centric Explanations in Radiology Report Generation(https://arxiv.org/abs/2512.11830)
Keywords: generation
Abstract: Automatic chest X-ray report generation is an important area of research aimed at improving diagnostic accuracy and helping doctors make faster decisions. Current AI models are good at finding correlations (or patterns) in medical images. Still, they often struggle to understand the deeper cause-and-effect relationships between those patterns and a patient's condition. Causal inference is a powerful approach that goes beyond identifying patterns to uncover why certain findings in an X-ray relate to a specific diagnosis. In this paper, we explore Causal Reasoning for Patient-Centric Explanations in Radiology Report Generation (CR3G), a prompt-driven framework applied to chest X-ray analysis that aims to improve the understanding of AI-generated reports by focusing on cause-and-effect relationships, reasoning, and the generation of patient-centric explanations. The goal is to enhance the quality of AI-driven diagnostics, making them more useful and trustworthy in clinical practice. CR3G shows better causal-relationship and explanation capability for 2 out of 5 abnormalities.

Title: Generative Stochastic Optimal Transport: Guided Harmonic Path-Integral Diffusion

Authors: Michael Chertkov (University of Arizona)
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, eess.SY, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11859
Pdf URL: https://arxiv.org/pdf/2512.11859
Copy Paste: [[2512.11859]] Generative Stochastic Optimal Transport: Guided Harmonic Path-Integral Diffusion(https://arxiv.org/abs/2512.11859)
Keywords: generative
Abstract: We introduce Guided Harmonic Path-Integral Diffusion (GH-PID), a linearly-solvable framework for guided Stochastic Optimal Transport (SOT) with a hard terminal distribution and soft, application-driven path costs. A low-dimensional guidance protocol shapes the trajectory ensemble while preserving analytic structure: the forward and backward Kolmogorov equations remain linear, the optimal score admits an explicit Green-function ratio, and Gaussian-Mixture Model (GMM) terminal laws yield closed-form expressions. This enables stable sampling and differentiable protocol learning under exact terminal matching. We develop guidance-centric diagnostics -- path cost, centerline adherence, variance flow, and drift effort -- that make GH-PID an interpretable variational ansatz for empirical SOT. Three navigation scenarios are illustrated in 2D: (i) Case A: hand-crafted protocols revealing how geometry and stiffness shape lag, curvature effects, and mode evolution; (ii) Case B: single-task protocol learning, where a PWC centerline is optimized to minimize integrated cost; (iii) Case C: multi-expert fusion, in which a commander reconciles competing expert/teacher trajectories and terminal beliefs through an exact product-of-experts law and learns a consensus protocol. Across all settings, GH-PID generates geometry-aware, trust-aware trajectories that satisfy the prescribed terminal distribution while systematically reducing integrated cost.
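To make the product-of-experts idea in Case C concrete, here is a minimal sketch restricted to single 1D Gaussian experts. The paper works with GMM terminal laws in 2D, so this only shows the underlying identity, not the paper's construction. The function name and the example numbers are hypothetical.

```python
def product_of_gaussians(experts):
    """Fuse 1D Gaussian experts given as (mean, variance) pairs.

    The normalized product of Gaussian densities is again Gaussian:
    precisions (inverse variances) add, and the fused mean is the
    precision-weighted average of the expert means.
    """
    total_precision = sum(1.0 / var for _, var in experts)
    fused_mean = sum(mean / var for mean, var in experts) / total_precision
    return fused_mean, 1.0 / total_precision


# Two equally confident experts that disagree about the terminal position.
fused_mean, fused_var = product_of_gaussians([(0.0, 1.0), (2.0, 1.0)])

# A more confident expert (smaller variance) pulls the consensus toward itself.
skewed_mean, skewed_var = product_of_gaussians([(0.0, 1.0), (3.0, 0.5)])
```

The fused variance is never larger than that of the most confident expert, which is one way to read the "trust-aware" behavior mentioned in the abstract.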

Title: Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL

Authors: Jiahao You, Ziye Jia, Can Cui, Chao Dong, Qihui Wu, Zhu Han
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11862
Pdf URL: https://arxiv.org/pdf/2512.11862
Copy Paste: [[2512.11862]] Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL(https://arxiv.org/abs/2512.11862)
Keywords: generative
Abstract: Low-altitude intelligent networks (LAINs) are emerging as a promising architecture for delivering low-latency and energy-efficient edge intelligence in dynamic and infrastructure-limited environments. By integrating unmanned aerial vehicles (UAVs), aerial base stations, and terrestrial base stations, LAINs can support mission-critical applications such as disaster response, environmental monitoring, and real-time sensing. However, these systems face key challenges, including energy-constrained UAVs, stochastic task arrivals, and heterogeneous computing resources. To address these issues, we propose an integrated air-ground collaborative network and formulate a time-dependent integer nonlinear programming problem that jointly optimizes UAV trajectory planning and task offloading decisions. The problem is challenging to solve due to temporal coupling among decision variables. Therefore, we design a hierarchical learning framework with two timescales. At the large timescale, a Vickrey-Clarke-Groves auction mechanism enables energy-aware and incentive-compatible trajectory assignment. At the small timescale, we propose diffusion-heterogeneous-agent proximal policy optimization, a generative multi-agent reinforcement learning algorithm that embeds latent diffusion models into actor networks. Each UAV samples actions from a Gaussian prior and refines them via observation-conditioned denoising, enhancing adaptability and policy diversity. Extensive simulations show that our framework outperforms baselines in energy efficiency, task success rate, and convergence performance.
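The Vickrey-Clarke-Groves mechanism named in the abstract can be sketched on a toy trajectory-assignment problem. This is a generic brute-force VCG sketch, not the paper's formulation. The valuation matrix is hypothetical (for instance, it could encode task reward minus energy cost), each UAV receives exactly one trajectory, and there are assumed to be at least as many trajectories as UAVs.

```python
from itertools import permutations


def best_assignment(values, bidders, n_items):
    """Brute-force the welfare-maximizing one-item-per-bidder assignment."""
    best = None
    for items in permutations(range(n_items), len(bidders)):
        welfare = sum(values[b][i] for b, i in zip(bidders, items))
        if best is None or welfare > best[0]:
            best = (welfare, dict(zip(bidders, items)))
    return best


def vcg(values):
    """Return (assignment, payments) under the VCG rule.

    values[b][i] is bidder b's reported value for item i. Each bidder pays
    the externality it imposes: the welfare the others could reach without
    it, minus the welfare the others actually get in the chosen assignment.
    """
    n_bidders, n_items = len(values), len(values[0])
    everyone = list(range(n_bidders))
    welfare, assignment = best_assignment(values, everyone, n_items)
    payments = []
    for b in everyone:
        others = [o for o in everyone if o != b]
        welfare_without_b, _ = best_assignment(values, others, n_items)
        others_welfare_with_b = welfare - values[b][assignment[b]]
        payments.append(welfare_without_b - others_welfare_with_b)
    return assignment, payments


# Hypothetical valuations: rows are UAVs, columns are candidate trajectories.
VALUES = [[10, 4], [8, 3]]
assignment, payments = vcg(VALUES)
```

Here UAV 0 wins trajectory 0 and pays 5, the loss it imposes on UAV 1, while UAV 1 displaces nobody and pays 0. Since a bidder's payment does not depend on its own report, truthful bidding is a dominant strategy, which is the incentive-compatibility property the abstract refers to.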

Title: On the Dangers of Bootstrapping Generation for Continual Learning and Beyond

Authors: Daniil Zverev, A. Sophia Koepke, Joao F. Henriques
Subjects: cs.LG, cs.AI, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.11867
Pdf URL: https://arxiv.org/pdf/2512.11867
Copy Paste: [[2512.11867]] On the Dangers of Bootstrapping Generation for Continual Learning and Beyond(https://arxiv.org/abs/2512.11867)
Keywords: generation, generative
Abstract: The use of synthetically generated data for training models is becoming a common practice. While generated data can augment the training data, repeated training on synthetic data raises concerns about distribution drift and degradation of performance due to contamination of the dataset. We investigate the consequences of this bootstrapping process through the lens of continual learning, drawing a connection to Generative Experience Replay (GER) methods. We present a statistical analysis showing that synthetic data introduces significant bias and variance into training objectives, weakening the reliability of maximum likelihood estimation. We provide empirical evidence showing that popular generative models collapse under repeated training with synthetic data. We quantify this degradation and show that state-of-the-art GER methods fail to maintain alignment in the latent space. Our findings raise critical concerns about the use of synthetic data in continual learning.

Title: mmWEAVER: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description

Authors: Mahathir Monjur, Shahriar Nirjon
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11894
Pdf URL: https://arxiv.org/pdf/2512.11894
Copy Paste: [[2512.11894]] mmWEAVER: Environment-Specific mmWave Signal Synthesis from a Photo and Activity Description(https://arxiv.org/abs/2512.11894)
Keywords: generation
Abstract: Realistic signal generation and dataset augmentation are essential for advancing mmWave radar applications such as activity recognition and pose estimation, which rely heavily on diverse, environment-specific signal datasets. However, mmWave signals are inherently complex, sparse, and high-dimensional, making physical simulation computationally expensive. This paper presents mmWeaver, a novel framework that synthesizes realistic, environment-specific complex mmWave signals by modeling them as continuous functions using Implicit Neural Representations (INRs), achieving up to 49-fold compression. mmWeaver incorporates hypernetworks that dynamically generate INR parameters based on environmental context (extracted from RGB-D images) and human motion features (derived from text-to-pose generation via MotionGPT), enabling efficient and adaptive signal synthesis. By conditioning on these semantic and geometric priors, mmWeaver generates diverse I/Q signals at multiple resolutions, preserving phase information critical for downstream tasks such as point cloud estimation and activity classification. Extensive experiments show that mmWeaver achieves a complex SSIM of 0.88 and a PSNR of 35 dB, outperforming existing methods in signal realism while improving activity recognition accuracy by up to 7% and reducing human pose estimation error by up to 15%, all while operating 6-35 times faster than simulation-based approaches.

Title: MPath: Multimodal Pathology Report Generation from Whole Slide Images

Authors: Noorul Wahab, Nasir Rajpoot
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11906
Pdf URL: https://arxiv.org/pdf/2512.11906
Copy Paste: [[2512.11906]] MPath: Multimodal Pathology Report Generation from Whole Slide Images(https://arxiv.org/abs/2512.11906)
Keywords: generation
Abstract: Automated generation of diagnostic pathology reports directly from whole slide images (WSIs) is an emerging direction in computational pathology. Translating high-resolution tissue patterns into clinically coherent text remains difficult due to large morphological variability and the complex structure of pathology narratives. We introduce MPath, a lightweight multimodal framework that conditions a pretrained biomedical language model (BioBART) on WSI-derived visual embeddings through a learned visual-prefix prompting mechanism. Instead of end-to-end vision-language pretraining, MPath leverages foundation-model WSI features (CONCH + Titan) and injects them into BioBART via a compact projection module, keeping the language backbone frozen for stability and data efficiency. MPath was developed and evaluated on the RED 2025 Grand Challenge dataset and ranked 4th in Test Phase 2, despite limited submission opportunities. The results highlight the potential of prompt-based multimodal conditioning as a scalable and interpretable strategy for pathology report generation.

Title: FloraForge: LLM-Assisted Procedural Generation of Editable and Analysis-Ready 3D Plant Geometric Models For Agricultural Applications

Authors: Mozhgan Hadadi, Talukder Z. Jubery, Patrick S. Schnable, Arti Singh, Bedrich Benes, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11925
Pdf URL: https://arxiv.org/pdf/2512.11925
Copy Paste: [[2512.11925]] FloraForge: LLM-Assisted Procedural Generation of Editable and Analysis-Ready 3D Plant Geometric Models For Agricultural Applications(https://arxiv.org/abs/2512.11925)
Keywords: generation
Abstract: Accurate 3D plant models are crucial for computational phenotyping and physics-based simulation; however, current approaches face significant limitations. Learning-based reconstruction methods require extensive species-specific training data and lack editability. Procedural modeling offers parametric control but demands specialized expertise in geometric modeling and an in-depth understanding of complex procedural rules, making it inaccessible to domain scientists. We present FloraForge, an LLM-assisted framework that enables domain experts to generate biologically accurate, fully parametric 3D plant models through iterative natural language Plant Refinements (PR), minimizing programming expertise. Our framework leverages LLM-enabled co-design to refine Python scripts that generate parameterized plant geometries as hierarchical B-spline surface representations with botanical constraints with explicit control points and parametric deformation functions. This representation can be easily tessellated into polygonal meshes with arbitrary precision, ensuring compatibility with functional structural plant analysis workflows such as light simulation, computational fluid dynamics, and finite element analysis. We demonstrate the framework on maize, soybean, and mung bean, fitting procedural models to empirical point cloud data through manual refinement of the Plant Descriptor (PD), human-readable files. The pipeline generates dual outputs: triangular meshes for visualization and triangular meshes with additional parametric metadata for quantitative analysis. This approach uniquely combines LLM-assisted template creation, mathematically continuous representations enabling both phenotyping and rendering, and direct parametric control through PD. The framework democratizes sophisticated geometric modeling for plant science while maintaining mathematical rigor.

Title: Learning to Extract Context for Context-Aware LLM Inference

Authors: Minseon Kim, Lucas Caccia, Zhengyan Shi, Matheus Pereira, Marc-Alexandre Côté, Xingdi Yuan, Alessandro Sordoni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11986
Pdf URL: https://arxiv.org/pdf/2512.11986
Copy Paste: [[2512.11986]] Learning to Extract Context for Context-Aware LLM Inference(https://arxiv.org/abs/2512.11986)
Keywords: generation
Abstract: User prompts to large language models (LLMs) are often ambiguous or under-specified, and subtle contextual cues shaped by user intentions, prior knowledge, and risk factors strongly influence what constitutes an appropriate response. Misinterpreting intent or risks may lead to unsafe outputs, while overly cautious interpretations can cause unnecessary refusal of benign requests. In this paper, we question the conventional framework in which LLMs generate immediate responses to requests without considering broader contextual factors. User requests are situated within broader contexts such as intentions, knowledge, and prior experience, which strongly influence what constitutes an appropriate answer. We propose a framework that extracts and leverages such contextual information from the user prompt itself. Specifically, a reinforcement learning based context generator, designed in an autoencoder-like fashion, is trained to infer contextual signals grounded in the prompt and use them to guide response generation. This approach is particularly important for safety tasks, where ambiguous requests may bypass safeguards while benign but confusing requests can trigger unnecessary refusals. Experiments show that our method reduces harmful responses by an average of 5.6% on the SafetyInstruct dataset across multiple foundation models and improves the harmonic mean of attack success rate and compliance on benign prompts by 6.2% on XSTest and WildJailbreak. These results demonstrate the effectiveness of context extraction for safer and more reliable LLM inferences.
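The abstract reports a harmonic mean of two rates. A short sketch shows why a harmonic mean suits a safety/helpfulness trade-off: it drops when either score is low. The two scores below are hypothetical values in [0, 1]; how the paper orients the attack-success term (for example, whether it enters as one minus the attack success rate) is not stated in the abstract.

```python
from statistics import harmonic_mean, mean

# Two hypothetical models with the same arithmetic-mean score.
balanced = [0.8, 0.8]   # equally good on safety and benign compliance
lopsided = [1.0, 0.6]   # very safe, but refuses many benign prompts

balanced_hm = harmonic_mean(balanced)   # 2ab / (a + b) for two values
lopsided_hm = harmonic_mean(lopsided)
same_arithmetic_mean = abs(mean(balanced) - mean(lopsided)) < 1e-9
```

Both models average 0.8 arithmetically, but the lopsided one scores 0.75 under the harmonic mean, so a model cannot hide over-refusal behind a strong safety score.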

Title: CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos

Authors: Tejas Panambur, Ishan Rajendrakumar Dave, Chongjian Ge, Ersin Yumer, Xue Bai
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2512.12060
Pdf URL: https://arxiv.org/pdf/2512.12060
Copy Paste: [[2512.12060]] CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos(https://arxiv.org/abs/2512.12060)
Keywords: restoration, super-resolution, generative
Abstract: Modern text-to-video (T2V) diffusion models can synthesize visually compelling clips, yet they remain brittle at fine-scale structure: even state-of-the-art generators often produce distorted faces and hands, warped backgrounds, and temporally inconsistent motion. Such severe structural artifacts also appear in very low-quality real-world videos. Classical video restoration and super-resolution (VR/VSR) methods, in contrast, are tuned for synthetic degradations such as blur and downsampling and tend to stabilize these artifacts rather than repair them, while diffusion-prior restorers are usually trained on photometric noise and offer little control over the trade-off between perceptual quality and fidelity. We introduce CreativeVR, a diffusion-prior-guided video restoration framework for AI-generated (AIGC) and real videos with severe structural and temporal artifacts. Our deep-adapter-based method exposes a single precision knob that controls how strongly the model follows the input, smoothly trading off between precise restoration on standard degradations and stronger structure- and motion-corrective behavior on challenging content. Our key novelty is a temporally coherent degradation module used during training, which applies carefully designed transformations that produce realistic structural failures. To evaluate AIGC-artifact restoration, we propose the AIGC54 benchmark with FIQA, semantic and perceptual metrics, and multi-aspect scoring. CreativeVR achieves state-of-the-art results on videos with severe artifacts and performs competitively on standard video restoration benchmarks, while running at practical throughput (about 13 FPS at 720p on a single 80-GB A100).

Title: SigTime: Learning and Visually Explaining Time Series Signatures

Authors: Yu-Chia Huang, Juntong Chen, Dongyu Liu, Kwan-Liu Ma
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.12076
Pdf URL: https://arxiv.org/pdf/2512.12076
Copy Paste: [[2512.12076]] SigTime: Learning and Visually Explaining Time Series Signatures(https://arxiv.org/abs/2512.12076)
Keywords: generation
Abstract: Understanding and distinguishing temporal patterns in time series data is essential for scientific discovery and decision-making. For example, in biomedical research, uncovering meaningful patterns in physiological signals can improve diagnosis, risk assessment, and patient outcomes. However, existing methods for time series pattern discovery face major challenges, including high computational complexity, limited interpretability, and difficulty in capturing meaningful temporal structures. To address these gaps, we introduce a novel learning framework that jointly trains two Transformer models using complementary time series representations: shapelet-based representations to capture localized temporal structures and traditional feature engineering to encode statistical properties. The learned shapelets serve as interpretable signatures that differentiate time series across classification labels. Additionally, we develop a visual analytics system -- SigTime -- with coordinated views to facilitate exploration of time series signatures from multiple perspectives, aiding the generation of useful insights. We quantitatively evaluate our learning framework on eight publicly available datasets and one proprietary clinical dataset. Additionally, we demonstrate the effectiveness of our system through two usage scenarios conducted with domain experts: one involving public ECG data and the other focused on preterm labor analysis.

Title: BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models

Authors: Ryan Po, Eric Ryan Chan, Changan Chen, Gordon Wetzstein
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12080
Pdf URL: https://arxiv.org/pdf/2512.12080
Copy Paste: [[2512.12080]] BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models(https://arxiv.org/abs/2512.12080)
Keywords: generation
Abstract: Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, which can hurt quality and diversity, BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.

Title: RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer

Authors: Guanfang Dong, Luke Schultz, Negar Hassanpour, Chao Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12083
Pdf URL: https://arxiv.org/pdf/2512.12083
Copy Paste: [[2512.12083]] RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer(https://arxiv.org/abs/2512.12083)
Keywords: generation
Abstract: The superior representation capability of pre-trained vision foundation models (VFMs) has been harnessed for enhancing latent diffusion models (LDMs). These approaches inject the rich semantics from high-dimensional VFM representations (e.g., DINOv3) into LDMs at different phases, resulting in accelerated learning and better generation performance. However, the high-dimensionality of VFM representations may also lead to Information Overload, particularly when the VFM features exceed the size of the original image for decoding. To address this issue while preserving the utility of VFM features, we propose RePack (Representation Packing), a simple yet effective framework for improving Diffusion Transformers (DiTs). RePack transforms the VFM representation into a more compact, decoder-friendly representation by projecting onto low-dimensional manifolds. We find that RePack can effectively filter out non-semantic noise while preserving the core structural information needed for high-fidelity reconstruction. Experimental results show that RePack significantly accelerates DiT convergence and outperforms recent methods that directly inject raw VFM features into the decoder for image reconstruction. On DiT-XL/2, RePack achieves an FID of 3.66 in only 64 epochs, which is 35% faster than the state-of-the-art method. This demonstrates that RePack successfully extracts the core semantics of VFM representations while bypassing their high-dimensionality side effects.

Title: CLOAK: Contrastive Guidance for Latent Diffusion-Based Data Obfuscation

Authors: Xin Yang, Omid Ardakanian
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2512.12086
Pdf URL: https://arxiv.org/pdf/2512.12086
Copy Paste: [[2512.12086]] CLOAK: Contrastive Guidance for Latent Diffusion-Based Data Obfuscation(https://arxiv.org/abs/2512.12086)
Keywords: generative
Abstract: Data obfuscation is a promising technique for mitigating attribute inference attacks by semi-trusted parties with access to time-series data emitted by sensors. Recent advances leverage conditional generative models together with adversarial training or mutual information-based regularization to balance data privacy and utility. However, these methods often require modifying the downstream task, struggle to achieve a satisfactory privacy-utility trade-off, or are computationally intensive, making them impractical for deployment on resource-constrained mobile IoT devices. We propose Cloak, a novel data obfuscation framework based on latent diffusion models. In contrast to prior work, we employ contrastive learning to extract disentangled representations, which guide the latent diffusion process to retain useful information while concealing private information. This approach enables users with diverse privacy needs to navigate the privacy-utility trade-off with minimal retraining. Extensive experiments on four public time-series datasets, spanning multiple sensing modalities, and a dataset of facial images demonstrate that Cloak consistently outperforms state-of-the-art obfuscation techniques and is well-suited for deployment in resource-constrained settings.

Title: SPDMark: Selective Parameter Displacement for Robust Video Watermarking

Authors: Samar Fares, Nurbek Tastan, Karthik Nandakumar
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12090
Pdf URL: https://arxiv.org/pdf/2512.12090
Copy Paste: [[2512.12090]] SPDMark: Selective Parameter Displacement for Robust Video Watermarking(https://arxiv.org/abs/2512.12090)
Keywords: generation, generative
Abstract: The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called SPDMark (pronounced "SpeedMark") based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.
摘要：高质量视频生成模型的出现增加了对强大的水印方案的需求，该方案可用于可靠地检测和跟踪生成视频的来源。现有的基于事后和代内方法的视频水印方法无法同时实现不可察觉性、鲁棒性和计算效率。这项工作介绍了一种称为 SPDMark（发音为“SpeedMark”）的新型视频水印框架，该框架基于视频扩散模型的选择性参数位移。通过修改生成模型中的参数子集，将水印嵌入到生成的视频中。为了使问题易于处理，位移被建模为逐层基础位移的加法组合，其中最终组合由水印密钥索引。为了提高参数效率，这项工作特别利用低秩自适应（LoRA）来实现基础转换。在训练阶段，通过最小化消息恢复、感知相似性和时间一致性损失的组合来共同学习基移和水印提取器。为了检测和定位水印视频中的时间修改，我们使用加密哈希函数从给定的基本水印密钥导出特定于帧的水印消息。在水印提取过程中，即使是从时间被篡改的视频中，也会应用最大二分匹配来恢复正确的帧顺序。对文本到视频和图像到视频生成模型的评估表明，SPDMark 能够生成难以察觉的水印，这些水印可以高精度恢复，并建立了其针对各种常见视频修改的鲁棒性。
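
The frame-order recovery step described in the SPDMark abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: HMAC-SHA256 stands in for the unspecified cryptographic hash, the 64-bit message length and the key are hypothetical, and a brute-force minimum-cost assignment over permutations stands in for the bipartite matching (workable only for a handful of frames).

```python
import hashlib
import hmac
from itertools import permutations


def frame_message(base_key: bytes, frame_index: int, n_bits: int = 64) -> list:
    """Derive a frame-specific bit message from the base key.

    HMAC-SHA256 is an assumed keyed hash, not taken from the paper.
    """
    digest = hmac.new(base_key, frame_index.to_bytes(4, "big"), hashlib.sha256).digest()
    # Unpack the digest bytes into individual bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in digest for shift in range(7, -1, -1)]
    return bits[:n_bits]


def hamming(a, b):
    """Number of positions at which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def recover_order(extracted, base_key, n_bits=64):
    """Assign each extracted message to an original frame index.

    The assignment minimises the total Hamming distance between extracted
    and expected messages; brute force is used here for clarity only.
    """
    n = len(extracted)
    expected = [frame_message(base_key, j, n_bits) for j in range(n)]
    cost = [[hamming(msg, expected[j]) for j in range(n)] for msg in extracted]
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)


# Simulate temporal tampering: shuffle five frames and flip one bit in each message.
key = b"hypothetical-base-key"
original = [frame_message(key, i) for i in range(5)]
shuffled_order = [3, 0, 4, 1, 2]
tampered = [list(original[i]) for i in shuffled_order]
for msg in tampered:
    msg[0] ^= 1  # extraction noise
recovered = recover_order(tampered, key)
```

Here `recovered[i]` is the original index of the frame found at position `i` of the tampered video. For realistic frame counts, a polynomial-time assignment solver would replace the permutation search.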

Title: High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical Guarantees

Authors: Elynn Chen, Yuefeng Han, Jiayu Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12122
Pdf URL: https://arxiv.org/pdf/2512.12122
Copy Paste: [[2512.12122]] High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical Guarantees(https://arxiv.org/abs/2512.12122)
Keywords: generative
Abstract: High-dimensional tensor-valued predictors arise in modern applications, increasingly as learned representations from neural networks. Existing tensor classification methods rely on sparsity or Tucker structures and often lack theoretical guarantees. Motivated by empirical evidence that discriminative signals concentrate along a few multilinear components, we introduce CP low-rank structure for the discriminant tensor, a modeling perspective not previously explored. Under a Tensor Gaussian Mixture Model, we propose high-dimensional CP low-rank Tensor Discriminant Analysis (CP-TDA) with Randomized Composite PCA (rc-PCA) initialization, which is essential for handling dependent and anisotropic noise under weaker signal strength and incoherence conditions, followed by an iterative refinement algorithm. We establish global convergence and minimax-optimal misclassification rates. To handle tensor data deviating from tensor normality, we develop the first semiparametric tensor discriminant model, in which learned tensor representations are mapped via deep generative models into a latent space tailored for CP-TDA. Misclassification risk decomposes into representation, approximation, and estimation errors. Numerical studies and real data analysis on graph classification demonstrate substantial gains over existing tensor classifiers and state-of-the-art graph neural networks, particularly in high-dimensional, small-sample regimes.
摘要：高维张量值预测器出现在现代应用中，越来越多地作为从神经网络学习的表示。现有的张量分类方法依赖于稀疏性或塔克结构，并且往往缺乏理论保证。受判别信号集中在几个多线性分量上的经验证据的启发，我们为判别张量引入了 CP 低秩结构，这是一种先前未探讨过的建模视角。在张量高斯混合模型下，我们提出了具有随机复合PCA（\textsc{rc-PCA}）初始化的高维CP低秩张量判别分析（CP-TDA），这对于在较弱信号强度和不相干条件下处理相关噪声和各向异性噪声至关重要，然后是迭代细化算法。我们建立全局收敛和极小极大最优错误分类率。为了处理偏离张量正态性的张量数据，我们开发了第一个半参数张量判别模型，其中学习的张量表示通过深度生成模型映射到为 CP-TDA 定制的潜在空间。错误分类风险分解为表示错误、近似错误和估计错误。图分类的数值研究和实际数据分析表明，与现有的张量分类器和最先进的图神经网络相比，有很大的进步，特别是在高维、小样本情况下。

Title: HydroDiffusion: Diffusion-Based Probabilistic Streamflow Forecasting with a State Space Backbone

Authors: Yihan Wang, Annan Yu, Lujun Zhang, Charuleka Varadharajan, N. Benjamin Erichson
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2512.12183
Pdf URL: https://arxiv.org/pdf/2512.12183
Copy Paste: [[2512.12183]] HydroDiffusion: Diffusion-Based Probabilistic Streamflow Forecasting with a State Space Backbone(https://arxiv.org/abs/2512.12183)
Keywords: generative
Abstract: Recent advances have introduced diffusion models for probabilistic streamflow forecasting, demonstrating strong early flood-warning skill. However, current implementations rely on recurrent Long Short-Term Memory (LSTM) backbones and single-step training objectives, which limit their ability to capture long-range dependencies and produce coherent forecast trajectories across lead times. To address these limitations, we developed HydroDiffusion, a diffusion-based probabilistic forecasting framework with a decoder-only state space model backbone. The proposed framework jointly denoises full multi-day trajectories in a single pass, ensuring temporal coherence and mitigating error accumulation common in autoregressive prediction. HydroDiffusion is evaluated across 531 watersheds in the contiguous United States (CONUS) in the CAMELS dataset. We benchmark HydroDiffusion against two diffusion baselines with LSTM backbones, as well as the recently proposed Diffusion-based Runoff Model (DRUM). Results show that HydroDiffusion achieves strong nowcast accuracy when driven by observed meteorological forcings, and maintains consistent performance across the full simulation horizon. Moreover, HydroDiffusion delivers stronger deterministic and probabilistic forecast skill than DRUM in operational forecasting. These results establish HydroDiffusion as a robust generative modeling framework for medium-range streamflow forecasting, providing both a new modeling benchmark and a foundation for future research on probabilistic hydrologic prediction at continental scales.
摘要：最近的进展引入了用于概率水流预测的扩散模型，展示了强大的早期洪水预警能力。然而，当前的实现依赖于循环长短期记忆 (LSTM) 主干和单步训练目标，这限制了它们捕获长期依赖性并在整个交付周期内生成一致预测轨迹的能力。为了解决这些限制，我们开发了 HydroDiffusion，这是一种基于扩散的概率预测框架，具有仅解码器的状态空间模型主干。所提出的框架在一次传递中对完整的多日轨迹进行联合去噪，确保时间一致性并减轻自回归预测中常见的误差累积。 HydroDiffusion 在 CAMELS 数据集中对美国本土 (CONUS) 的 531 个流域进行了评估。我们将 HydroDiffusion 与具有 LSTM 主干的两个扩散基线以及最近提出的基于扩散的径流模型 (DRUM) 进行基准测试。结果表明，HydroDiffusion 在观测到的气象强迫驱动下实现了很强的临近预报精度，并在整个模拟范围内保持一致的性能。此外，HydroDiffusion 在业务预测方面提供比 DRUM 更强的确定性和概率预测能力。这些结果将 HydroDiffusion 确立为中期水流预测的稳健生成模型框架，为大陆尺度概率水文预测的未来研究提供了新的建模基准和基础。

Title: SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation

Authors: Xuancheng Xu, Yaning Li, Sisi You, Bing-Kun Bao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12193
Pdf URL: https://arxiv.org/pdf/2512.12193
Copy Paste: [[2512.12193]] SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation(https://arxiv.org/abs/2512.12193)
Keywords: generation
Abstract: Customized video generation aims to produce videos that faithfully preserve the subject's appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods struggle to ensure both subject appearance similarity and motion pattern consistency due to the lack of object-level guidance for subject and motion. To address this, we propose SMRABooth, which leverages a self-supervised encoder and an optical flow encoder to provide object-level subject and motion representations. These representations are aligned with the model during the LoRA fine-tuning process. Our approach is structured in three core stages: (1) We exploit subject representations via a self-supervised encoder to guide subject alignment, enabling the model to capture the overall structure of the subject and enhance high-level semantic consistency. (2) We utilize motion representations from an optical flow encoder to capture structurally coherent and object-level motion trajectories independent of appearance. (3) We propose a subject-motion association decoupling strategy that applies sparse LoRA injection across both locations and timing, effectively reducing interference between subject and motion LoRAs. Extensive experiments show that SMRABooth excels in subject and motion customization, maintaining consistent subject appearance and motion patterns, proving its effectiveness in controllable text-to-video generation.
摘要：定制视频生成旨在生成能够忠实地保留参考图像中主体外观的视频，同时保持参考视频中时间一致的运动。由于缺乏对主体和运动的对象级指导，现有方法很难确保主体外观相似性和运动模式一致性。为了解决这个问题，我们提出了 SMRABooth，它利用自监督编码器和光流编码器来提供对象级主题和运动表示。这些表示在 LoRA 微调过程中与模型保持一致。我们的方法分为三个核心阶段：（1）我们通过自监督编码器利用主题表示来指导主题对齐，使模型能够捕获主题的整体结构并增强高级语义一致性。 (2) 我们利用光流编码器的运动表示来捕获结构相干的物体级运动轨迹，而与外观无关。 (3) 我们提出了一种主体-运动关联解耦策略，该策略在位置和时间上应用稀疏 LoRA 注入，有效减少主体和运动 LoRA 之间的干扰。大量实验表明，SMRABooth 在主题和动作定制方面表现出色，保持一致的主题外观和运动模式，证明了其在可控文本到视频生成方面的有效性。

Title: MolGuidance: Advanced Guidance Strategies for Conditional Molecular Generation with Flow Matching

Authors: Jirui Jin, Cheng Zeng, Pawan Prakash, Ellad B. Tadmor, Adrian Roitberg, Richard G. Hennig, Stefano Martiniani, Mingjie Liu
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.12198
Pdf URL: https://arxiv.org/pdf/2512.12198
Copy Paste: [[2512.12198]] MolGuidance: Advanced Guidance Strategies for Conditional Molecular Generation with Flow Matching(https://arxiv.org/abs/2512.12198)
Keywords: generation, generative
Abstract: Key objectives in conditional molecular generation include ensuring chemical validity, aligning generated molecules with target properties, promoting structural diversity, and enabling efficient sampling for discovery. Recent advances in computer vision introduced a range of new guidance strategies for generative models, many of which can be adapted to support these goals. In this work, we integrate state-of-the-art guidance methods -- including classifier-free guidance, autoguidance, and model guidance -- in a leading molecule generation framework built on an SE(3)-equivariant flow matching process. We propose a hybrid guidance strategy that separately guides continuous and discrete molecular modalities -- operating on velocity fields and predicted logits, respectively -- while jointly optimizing their guidance scales via Bayesian optimization. Our implementation, benchmarked on the QM9 and QMe14S datasets, achieves new state-of-the-art performance in property alignment for de novo molecular generation. The generated molecules also exhibit high structural validity. Furthermore, we systematically compare the strengths and limitations of various guidance methods, offering insights into their broader applicability.
摘要：条件分子生成的关键目标包括确保化学有效性、将生成的分子与目标特性对齐、促进结构多样性以及实现高效采样以进行发现。计算机视觉的最新进展为生成模型引入了一系列新的指导策略，其中许多策略可以适应支持这些目标。在这项工作中，我们将最先进的引导方法（包括无分类器引导、自动引导和模型引导）集成到基于 SE(3) 等变流匹配过程的领先分子生成框架中。我们提出了一种混合引导策略，该策略分别引导连续和离散分子模态（分别在速度场和预测逻辑上运行），同时通过贝叶斯优化联合优化其引导尺度。我们的实施以 QM9 和 QMe14S 数据集为基准，在从头分子生成的属性对齐方面实现了最先进的新性能。生成的分子还表现出较高的结构有效性。此外，我们系统地比较了各种指导方法的优点和局限性，提供了对其更广泛适用性的见解。
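
The hybrid guidance idea in the MolGuidance abstract (separate guidance scales for the continuous and discrete modalities) can be sketched with the standard classifier-free guidance extrapolation. This is an illustrative assumption, not the paper's exact formulation: the function names, the toy inputs, and the scales `w_cont` and `w_disc` are hypothetical, and the Bayesian optimization of the scales is not shown.

```python
import math


def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_guidance(v_uncond, v_cond, logits_uncond, logits_cond, w_cont, w_disc):
    """Guide the continuous velocity and the discrete logits with separate scales."""
    velocity = cfg_combine(v_uncond, v_cond, w_cont)
    type_probs = softmax(cfg_combine(logits_uncond, logits_cond, w_disc))
    return velocity, type_probs


# Toy inputs: a 2-D velocity and logits over three hypothetical atom types.
velocity, type_probs = hybrid_guidance(
    v_uncond=[0.0, 1.0], v_cond=[1.0, 3.0],
    logits_uncond=[0.0, 0.0, 0.0], logits_cond=[1.0, 0.0, -1.0],
    w_cont=2.0, w_disc=1.5,
)
```

With a scale of 1 the combination reduces to the conditional prediction; scales above 1 push further in the conditional direction, which is why tuning the two scales jointly matters.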

Title: A Hybrid Deep Learning Framework for Emotion Recognition in Children with Autism During NAO Robot-Mediated Interaction

Authors: Indranil Bhattacharjee, Vartika Narayani Srinet, Anirudha Bhattacharjee, Braj Bhushan, Bishakh Bhattacharya
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.12208
Pdf URL: https://arxiv.org/pdf/2512.12208
Copy Paste: [[2512.12208]] A Hybrid Deep Learning Framework for Emotion Recognition in Children with Autism During NAO Robot-Mediated Interaction(https://arxiv.org/abs/2512.12208)
Keywords: generation
Abstract: Understanding emotional responses in children with Autism Spectrum Disorder (ASD) during social interaction remains a critical challenge in both developmental psychology and human-robot interaction. This study presents a novel deep learning pipeline for emotion recognition in autistic children in response to a name-calling event by a humanoid robot (NAO), under controlled experimental settings. The dataset comprises around 50,000 facial frames extracted from video recordings of 15 children with ASD. A hybrid model combines a fine-tuned ResNet-50-based Convolutional Neural Network (CNN) and a three-layer Graph Convolutional Network (GCN), trained on both visual and geometric features extracted from MediaPipe FaceMesh landmarks. Emotions were probabilistically labeled using a weighted ensemble of two models, DeepFace and FER, each contributing to soft-label generation across seven emotion classes. Final classification leveraged a fused embedding optimized via Kullback-Leibler divergence. The proposed method demonstrates robust performance in modeling subtle affective responses and offers significant promise for affective profiling of ASD children in clinical and therapeutic human-robot interaction contexts, as the pipeline effectively captures micro emotional cues in neurodivergent children, addressing a major gap in autism-specific HRI research. This work represents the first such large-scale, real-world dataset and pipeline from India on autism-focused emotion analysis using social robotics, contributing an essential foundation for future personalized assistive technologies.
摘要：了解自闭症谱系障碍 (ASD) 儿童在社交互动中的情绪反应仍然是发展心理学和人机交互中的一个关键挑战。这项研究提出了一种新颖的深度学习流程，用于在受控实验环境下对自闭症儿童进行情绪识别，以响应人形机器人 (NAO) 的辱骂事件。该数据集包含从 15 名自闭症谱系障碍儿童的视频中提取的约 50,000 个面部帧。一种混合模型，结合了基于 ResNet-50 的微调卷积神经网络 (CNN) 和三层图卷积网络 (GCN)，并根据从 MediaPipe FaceMesh 地标中提取的视觉和几何特征进行训练。使用 DeepFace 和 FER 两个模型的加权集成对情绪进行概率标记，每个模型都有助于跨七个情绪类别生成软标签。最终分类利用了通过 Kullback-Leibler 散度优化的融合嵌入。所提出的方法在建模微妙的情感反应方面表现出强大的性能，并为临床和治疗性人机交互环境中自闭症儿童的情感分析提供了重要的希望，因为该管道有效地捕获了神经分歧儿童的微情感线索，解决了自闭症特定 HRI 研究中的主要空白。这项工作代表了印度第一个使用社交机器人技术进行以自闭症为中心的情绪分析的大规模真实数据集和管道，为未来的个性化辅助技术奠定了重要基础。

Title: CineLOG: A Training Free Approach for Cinematic Long Video Generation

Authors: Zahra Dehghanian, Morteza Abolghasemi, Hamid Beigy, Hamid R. Rabiee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12209
Pdf URL: https://arxiv.org/pdf/2512.12209
Copy Paste: [[2512.12209]] CineLOG: A Training Free Approach for Cinematic Long Video Generation(https://arxiv.org/abs/2512.12209)
Keywords: generation
Abstract: Controllable video synthesis is a central challenge in computer vision, yet current models struggle with fine-grained control beyond textual prompts, particularly for cinematic attributes like camera trajectory and genre. Existing datasets often suffer from severe data imbalance, noisy labels, or a significant simulation-to-real gap. To address this, we introduce CineLOG, a new dataset of 5,000 high-quality, balanced, and uncut video clips. Each entry is annotated with a detailed scene description, explicit camera instructions based on a standard cinematic taxonomy, and a genre label, ensuring balanced coverage across 17 diverse camera movements and 15 film genres. We also present our novel pipeline designed to create this dataset, which decouples the complex text-to-video (T2V) generation task into four easier stages with more mature technology. To enable coherent, multi-shot sequences, we introduce a novel Trajectory Guided Transition Module that generates smooth spatio-temporal interpolation. Extensive human evaluations show that our pipeline significantly outperforms SOTA end-to-end T2V models in adhering to specific camera and screenplay instructions, while maintaining professional visual quality. All code and data are available at this https URL.
摘要：可控视频合成是计算机视觉的一个核心挑战，但当前的模型难以实现文本提示之外的细粒度控制，特别是对于摄像机轨迹和类型等电影属性。现有数据集经常遭受严重的数据不平衡、嘈杂的标签或对真实差距的显着模拟。为了解决这个问题，我们引入了 CineLOG，这是一个包含 5,000 个高质量、平衡且未剪辑的视频剪辑的新数据集。每个条目都附有详细的场景描述、基于标准电影分类的明确摄影机说明以及类型标签，确保均衡覆盖 17 种不同的摄影机动作和 15 种电影类型。我们还展示了旨在创建该数据集的新颖管道，它将复杂的文本到视频（T2V）生成任务分解为四个更简单的阶段，并采用更成熟的技术。为了实现连贯的多镜头序列，我们引入了一种新颖的轨迹引导过渡模块，可以生成平滑的时空插值。广泛的人类评估表明，我们的流程在遵守特定摄像机和剧本指令的同时保持专业的视觉质量，显着优于 SOTA 端到端 T2V 模型。所有代码和数据均可在此 https URL 中获取。

Title: ProImage-Bench: Rubric-Based Evaluation for Professional Image Generation

Authors: Minheng Ni, Zhengyuan Yang, Yaowen Zhang, Linjie Li, Chung-Ching Lin, Kevin Lin, Zhendong Wang, Xiaofei Wang, Shujie Liu, Lei Zhang, Wangmeng Zuo, Lijuan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12220
Pdf URL: https://arxiv.org/pdf/2512.12220
Copy Paste: [[2512.12220]] ProImage-Bench: Rubric-Based Evaluation for Professional Image Generation(https://arxiv.org/abs/2512.12220)
Keywords: generation
Abstract: We study professional image generation, where a model must synthesize information-dense, scientifically precise illustrations from technical descriptions rather than merely produce visually plausible pictures. To quantify the progress, we introduce ProImage-Bench, a rubric-based benchmark that targets biology schematics, engineering/patent drawings, and general scientific diagrams. For 654 figures collected from real textbooks and technical reports, we construct detailed image instructions and a hierarchy of rubrics that decompose correctness into 6,076 criteria and 44,131 binary checks. Rubrics are derived from surrounding text and reference figures using large multimodal models, and are evaluated by an automated LMM-based judge with a principled penalty scheme that aggregates sub-question outcomes into interpretable criterion scores. We benchmark several representative text-to-image models on ProImage-Bench and find that, despite strong open-domain performance, the best base model reaches only 0.791 rubric accuracy and 0.553 criterion score overall, revealing substantial gaps in fine-grained scientific fidelity. Finally, we show that the same rubrics provide actionable supervision: feeding failed checks back into an editing model for iterative refinement boosts a strong generator from 0.653 to 0.865 in rubric accuracy and from 0.388 to 0.697 in criterion score. ProImage-Bench thus offers both a rigorous diagnostic for professional image generation and a scalable signal for improving specification-faithful scientific illustrations.
摘要：我们研究专业的图像生成，模型必须从技术描述中合成信息密集、科学精确的插图，而不仅仅是生成视觉上合理的图片。为了量化进展，我们引入了 ProImage-Bench，这是一个基于标题的基准，针对生物原理图、工程/专利图和一般科学图表。对于从真实教科书和技术报告中收集的 654 个图形，我们构建了详细的图像说明和评分标准层次结构，将正确性分解为 6,076 个标准和 44,131 个二进制检查。评分标准是使用大型多模态模型从周围的文本和参考数字中得出的，并由基于 LMM 的自动化法官使用原则性的惩罚方案进行评估，该惩罚方案将子问题结果汇总为可解释的标准分数。我们在 ProImage-Bench 上对几个具有代表性的文本到图像模型进行了基准测试，发现尽管开放域性能很强，但最佳基础模型的总体准确度仅为 0.791，标准得分为 0.553，揭示了细粒度科学保真度方面的巨大差距。最后，我们证明相同的评分标准提供了可操作的监督：将失败的检查反馈到编辑模型中进行迭代细化，可以将强大的生成器的评分准确率从 0.653 提高到 0.865，将标准得分从 0.388 提高到 0.697。因此，ProImage-Bench 既为专业图像生成提供了严格的诊断，又为改进符合规范的科学插图提供了可扩展的信号。

Title: Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder

Authors: Tianyu Zhang, Dong Liu, Chang Wen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12229
Pdf URL: https://arxiv.org/pdf/2512.12229
Copy Paste: [[2512.12229]] Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder(https://arxiv.org/abs/2512.12229)
Keywords: generative
Abstract: Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders for ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that simultaneously pursues encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging a one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions under extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods on rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency of 35.8 FPS on 1080P input images, while maintaining competitive decoding speed compared to existing methods.
摘要：超低比特率图像压缩（低于每像素 0.05 位）对于带宽受限和计算受限的编码场景（例如边缘设备）越来越重要。现有框架通常依赖于大型预训练编码器（例如，VAE 或基于分词器的模型）并在其生成潜在空间内执行变换编码。虽然这些方法实现了令人印象深刻的感知保真度，但它们对重型编码器网络的依赖使得它们不适合部署在弱发送设备上。在这项工作中，我们探索了应用浅层编码器进行超低比特率压缩的可行性，并提出了一种新颖的非对称极端图像压缩（AEIC）框架，该框架同时追求编码简单性和解码质量。具体来说，AEIC 采用中等甚至浅层编码器网络，同时利用一步扩散解码器在极端比特率下保持高保真度和高真实感重建。为了进一步提高浅层编码器的效率，我们设计了一种双边特征蒸馏方案，将知识从具有中等编码器的 AEIC 转移到其浅层编码器变体。实验表明，AEIC 不仅在超低比特率下的速率-失真-感知性能上优于现有方法，而且在 1080P 输入图像上提供 35.8 FPS 的卓越编码效率，同时与现有方法相比保持有竞争力的解码速度。

Title: Moment and Highlight Detection via MLLM Frame Segmentation

Authors: I Putu Andika Bagas Jiwanta, Ayu Purwarianti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12246
Pdf URL: https://arxiv.org/pdf/2512.12246
Copy Paste: [[2512.12246]] Moment and Highlight Detection via MLLM Frame Segmentation(https://arxiv.org/abs/2512.12246)
Keywords: generation, generative
Abstract: Detecting video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, utilizing its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM's output tokens. The LLM is fed with a fixed number of frames alongside a prompt that forces it to output a sequence of continuous "0" and/or "1" characters, with one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on the probabilities alongside a normal causal LM loss. At inference, beam search generates the sequence and logits, which act as moments and saliency scores, respectively. Despite sampling only 25 frames -- less than half the number used by comparable methods -- our method achieved strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
摘要：从自然语言查询中检测视频时刻和精彩片段已通过基于转换器的方法统一起来。其他作品使用生成式多模态 LLM (MLLM) 来预测时刻和/或亮点作为文本时间戳，利用其推理能力。基于文本的生成虽然有效，但无法为帧级预测提供直接梯度，因为该模型仅发出语言标记。尽管最近的强化学习（RL）方法试图解决这个问题，但我们提出了一种新颖的方法，将分割目标直接应用于 LLM 的输出标记。 LLM 接收固定数量的帧以及提示，强制其输出一系列连续的“0”和/或“1”字符，每帧一个字符。 “0”/“1”字符受益于法学硕士固有的语言能力，同时也分别充当背景和前景概率。训练采用概率分割损失以及正常的因果 LM 损失。在推理时，集束搜索生成序列和逻辑，分别充当矩和显着性分数。尽管仅采样 25 帧（不到同类方法的一半），但我们的方法在 QVHighlights 上实现了强大的高光检测 (56.74 HIT@1)。此外，我们的高效方法在时刻检索方面的得分高于基线（35.28 MAP）。根据经验，即使因果 LM 损失趋于平稳，分割损失也能提供稳定的互补学习信号。
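
The per-frame "0"/"1" output described in the abstract above can be turned into moment intervals by grouping runs of foreground frames. The sketch below is a minimal illustration under stated assumptions: frames are taken to be sampled uniformly over the clip, and the function name `mask_to_moments` and the 10-second clip are hypothetical. The saliency scores derived from logits are not modeled.

```python
def mask_to_moments(mask: str, clip_duration: float):
    """Convert a per-frame "0"/"1" string into (start, end) moments in seconds.

    Assumes the frames were sampled uniformly across `clip_duration`.
    """
    step = clip_duration / len(mask)
    moments = []
    start = None
    # Append a sentinel "0" so a run that reaches the last frame is closed.
    for i, ch in enumerate(mask + "0"):
        if ch == "1" and start is None:
            start = i  # a foreground run begins
        elif ch != "1" and start is not None:
            moments.append((start * step, i * step))  # the run ends
            start = None
    return moments


# Ten frames over a 10-second clip: two foreground runs.
moments = mask_to_moments("0011100110", 10.0)
```

Each returned pair spans from the start of the first foreground frame to the end of the last one in its run, so adjacent "1" characters merge into a single moment.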

Title: Cognitive-YOLO: LLM-Driven Architecture Synthesis from First Principles of Data for Object Detection

Authors: Jiahao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12281
Pdf URL: https://arxiv.org/pdf/2512.12281
Copy Paste: [[2512.12281]] Cognitive-YOLO: LLM-Driven Architecture Synthesis from First Principles of Data for Object Detection(https://arxiv.org/abs/2512.12281)
Keywords: generation
Abstract: Designing high-performance object detection architectures is a complex task, where traditional manual design is time-consuming and labor-intensive, and Neural Architecture Search (NAS) is computationally prohibitive. While recent approaches using Large Language Models (LLMs) show promise, they often function as iterative optimizers within a search loop, rather than generating architectures directly from a holistic understanding of the data. To address this gap, we propose Cognitive-YOLO, a novel framework for LLM-driven architecture synthesis that generates network configurations directly from the intrinsic characteristics of the dataset. Our method consists of three stages: first, an analysis module extracts key meta-features (e.g., object scale distribution and scene density) from the target dataset; second, the LLM reasons upon these features, augmented with state-of-the-art components retrieved via Retrieval-Augmented Generation (RAG), to synthesize the architecture into a structured Neural Architecture Description Language (NADL); finally, a compiler instantiates this description into a deployable model. Extensive experiments on five diverse object detection datasets demonstrate that our proposed Cognitive-YOLO consistently generates superior architectures, achieving highly competitive performance and demonstrating a superior performance-per-parameter trade-off compared to strong baseline models across multiple benchmarks. Crucially, our ablation studies prove that the LLM's data-driven reasoning is the primary driver of performance, demonstrating that a deep understanding of data "first principles" is more critical for achieving a superior architecture than simply retrieving SOTA components.
摘要：设计高性能目标检测架构是一项复杂的任务，传统的手动设计既耗时又费力，而神经架构搜索（NAS）的计算量却令人望而却步。虽然最近使用大型语言模型 (LLM) 的方法显示出了前景，但它们通常在搜索循环中充当迭代优化器，而不是直接根据对数据的整体理解来生成架构。为了解决这一差距，我们提出了 Cognitive-YOLO，这是一种用于 LLM 驱动的架构综合的新颖框架，可直接根据数据集的内在特征生成网络配置。我们的方法由三个阶段组成：首先，分析模块从目标数据集中提取关键元特征（例如，对象尺度分布和场景密度）；其次，法学硕士根据这些特征进行推理，并通过检索增强生成（RAG）检索到最先进的组件进行增强，将架构合成为结构化神经架构描述语言（NADL）；最后，编译器将此描述实例化为可部署模型。对五个不同目标检测数据集的大量实验表明，我们提出的 Cognitive-YOLO 始终能够生成卓越的架构，实现了极具竞争力的性能，并在多个基准测试中与强大的基线模型相比，展示了卓越的每参数性能权衡。至关重要的是，我们的消融研究证明了法学硕士的数据驱动推理是性能的主要驱动因素，这表明对数据“第一原理”的深入理解对于实现卓越的架构比简单地检索 SOTA 组件更为关键。

Title: MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding

Authors: Benjamin Beilharz, Thomas S. A. Wallis
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.12307
Pdf URL: https://arxiv.org/pdf/2512.12307
Copy Paste: [[2512.12307]] MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding(https://arxiv.org/abs/2512.12307)
Keywords: generative
Abstract: While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model's sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.
摘要：尽管深度学习方法在许多视觉基准测试中取得了令人印象深刻的成功，但理解和解释这些模型的表示和决策仍然很困难。尽管视觉模型通常是在 2D 输入上进行训练的，但它们通常被假设为开发底层 3D 场景的隐式表示（例如，显示对部分遮挡的容忍度，或推理相对深度的能力）。在这里，我们介绍 MRD（可微分渲染的同色异体），这是一种使用基于物理的可微分渲染来探测视觉模型对生成 3D 场景属性的隐式理解的方法，通过查找物理上不同但产生相同模型激活的 3D 场景参数（即模型同色异体）。与以前用于评估模型表示的基于像素的方法不同，这些重建结果始终基于物理场景描述。这意味着我们可以在保持材质和光照恒定的情况下探测模型对物体形状的敏感度。作为原理验证，我们评估了多个模型恢复几何形状（形状）和双向反射分布函数（材质）场景参数的能力。结果显示目标场景和优化场景之间的模型激活高度相似，但视觉结果各不相同。定性地讲，这些重建有助于研究模型敏感或不变的物理场景属性。 MRD 有望通过分析物理场景参数如何驱动模型响应的变化来增进我们对计算机和人类视觉的理解。

Title: WeDetect: Fast Open-Vocabulary Object Detection as Retrieval

Authors: Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12309
Pdf URL: https://arxiv.org/pdf/2512.12309
Copy Paste: [[2512.12309]] WeDetect: Fast Open-Vocabulary Object Detection as Retrieval(https://arxiv.org/abs/2512.12309)
Keywords: generation
Abstract: Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, supporting the retrieval of objects in historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.
摘要：开放词汇对象检测旨在通过文本提示检测任意类。没有跨模式融合层（非融合）的方法通过将识别视为检索问题来提供更快的推理，即将区域与共享嵌入空间中的文本查询进行匹配。在这项工作中，我们充分探索了这种检索哲学，并通过名为WeDetect的模型系列展示了其在效率和多功能性方面的独特优势：（1）最先进的性能。 WeDetect是一款双塔架构的实时检测器。我们表明，凭借精心策划的数据和充分的训练，非融合 WeDetect 超越了其他融合模型，并建立了强大的开放词汇基础。 (2)历史数据快速回溯。 WeDetect-Uni 是一个基于 WeDetect 的通用提案生成器。我们冻结整个检测器，仅微调对象提示以检索跨类别的通用对象建议。重要的是，提案嵌入是特定于类的，并启用新的应用程序、对象检索，支持历史数据中的检索对象。 (3) 与 LMM 集成以实现引用表达理解 (REC)。我们进一步提出了 WeDetect-Ref，一种基于 LMM 的对象分类器，用于处理复杂的引用表达式，它从 WeDetect-Uni 提取的建议列表中检索目标对象。它丢弃下一个标记预测并在单个前向传递中对对象进行分类。 WeDetect 系列将检测、提案生成、对象检索和 REC 统一在一个一致的检索框架下，在 15 个基准测试中实现了最先进的性能，并具有高推理效率。

Title: Unified Control for Inference-Time Guidance of Denoising Diffusion Models

Authors: Maurya Goyal, Anuj Singh, Hadi Jamali-Rad
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12339
Pdf URL: https://arxiv.org/pdf/2512.12339
Copy Paste: [[2512.12339]] Unified Control for Inference-Time Guidance of Denoising Diffusion Models(https://arxiv.org/abs/2512.12339)
Keywords: generation
Abstract: Aligning diffusion model outputs with downstream objectives is essential for improving task-specific performance. Broadly, inference-time training-free approaches for aligning diffusion models can be categorized into two main strategies: sampling-based methods, which explore multiple candidate outputs and select those with higher reward signals, and gradient-guided methods, which use differentiable reward approximations to directly steer the generation process. In this work, we propose a universal algorithm, UniCoDe, which brings together the strengths of sampling and gradient-based guidance into a unified framework. UniCoDe integrates local gradient signals during sampling, thereby addressing the sampling inefficiency inherent in complex reward-based sampling approaches. By cohesively combining these two paradigms, UniCoDe enables more efficient sampling while offering better trade-offs between reward alignment and divergence from the diffusion unconditional prior. Empirical results demonstrate that UniCoDe remains competitive with state-of-the-art baselines across a range of tasks. The code is available at this https URL
摘要：将扩散模型的输出与下游目标保持一致对于提高特定任务的绩效至关重要。一般来说，用于对齐扩散模型的推理时间免训练方法可以分为两种主要策略：基于采样的方法，它探索多个候选输出并选择那些具有更高奖励信号的输出；以及梯度引导方法，它使用可微的奖励近似来直接引导生成过程。在这项工作中，我们提出了一种通用算法 UniCoDe，它将采样和基于梯度的引导的优点结合到一个统一的框架中。 UniCoDe 在采样过程中集成了局部梯度信号，从而解决了复杂的基于奖励的采样方法中固有的采样效率低下的问题。通过紧密结合这两种范式，UniCoDe 可以实现更高效的采样，同时在奖励对齐和与扩散无条件先验的分歧之间提供更好的权衡。实证结果表明，UniCoDe 在一系列任务中与最先进的基线相比仍然具有竞争力。该代码可在此 https URL 获取

Title: Synthetic Swarm Mosquito Dataset for Acoustic Classification: A Proof of Concept

Authors: Thai-Duy Dinh, Minh-Luan Vo, Cuong Tuan Nguyen, Bich-Hien Vo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12365
Pdf URL: https://arxiv.org/pdf/2512.12365
Copy Paste: [[2512.12365]] Synthetic Swarm Mosquito Dataset for Acoustic Classification: A Proof of Concept(https://arxiv.org/abs/2512.12365)
Keywords: generation
Abstract: Mosquito-borne diseases pose a serious global health threat, causing over 700,000 deaths annually. This work introduces a proof-of-concept Synthetic Swarm Mosquito Dataset for Acoustic Classification, created to simulate realistic multi-species and noisy swarm conditions. Unlike conventional datasets that require labor-intensive recording of individual mosquitoes, the synthetic approach enables scalable data generation while reducing human resource demands. Using log-mel spectrograms, we evaluated lightweight deep learning architectures for the classification of mosquito species. Experiments show that these models can effectively identify six major mosquito vectors and are suitable for deployment on embedded low-power devices. The study demonstrates the potential of synthetic swarm audio datasets to accelerate acoustic mosquito research and enable scalable real-time surveillance solutions.
摘要：蚊媒疾病对全球健康构成严重威胁，每年导致超过 70 万人死亡。这项工作介绍了用于声学分类的概念验证合成群体蚊子数据集，旨在模拟现实的多物种和嘈杂的群体条件。与需要对个体蚊子进行劳动密集型记录的传统数据集不同，这种合成方法可以实现可扩展的数据生成，同时减少人力资源需求。使用 log-mel 谱图，我们评估了用于蚊子物种分类的轻量级深度学习架构。实验表明，这些模型可以有效识别六种主要蚊媒，适合部署在嵌入式低功耗设备上。该研究证明了合成群体音频数据集在加速声学蚊子研究并实现可扩展的实时监控解决方案方面的潜力。

Title: STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative

Authors: Peixuan Zhang, Zijian Jia, Kaiqi Liu, Shuchen Weng, Si Li, Boxin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12372
Pdf URL: https://arxiv.org/pdf/2512.12372
Copy Paste: [[2512.12372]] STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative(https://arxiv.org/abs/2512.12372)
Keywords: generation, generative
Abstract: While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot coherence, and the two-stage training scheme to learn cinematic inter-shot transition. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.
摘要：尽管生成模型的最新进展在视频合成中实现了显着的视觉保真度，但创建连贯的多镜头叙事仍然是一个重大挑战。为了解决这个问题，基于关键帧的方法已成为计算密集型端到端方法的有前途的替代方案，具有细粒度控制和更高效率的优势。然而，这些方法通常无法保持跨镜头的一致性和捕捉电影语言。在本文中，我们介绍了 STAGE，一种 SToryboard 锚定生成工作流程，用于重新制定基于关键帧的多镜头视频生成任务。我们提出 STEP2 来预测由每个镜头的起始帧对组成的结构故事板，而不是使用稀疏关键帧。我们引入了多镜头内存包以确保远程实体一致性、用于镜头内一致性的双编码策略以及用于学习电影镜头间过渡的两阶段训练方案。我们还贡献了大规模的 ConStoryBoard 数据集，包括高质量的影片剪辑，以及故事进展、电影属性和人类偏好的细粒度注释。大量实验表明，STAGE 在结构化叙事控制和跨镜头连贯性方面取得了卓越的表现。

Title: V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping

Authors: Hyunkoo Lee, Wooseok Jang, Jini Yang, Taehwan Kim, Sangoh Kim, Sangwon Jung, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12375
Pdf URL: https://arxiv.org/pdf/2512.12375
Copy Paste: [[2512.12375]] V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping(https://arxiv.org/abs/2512.12375)
Keywords: generation
Abstract: Video personalization aims to generate videos that faithfully reflect a user-provided subject while following a text prompt. However, existing approaches often rely on heavy video-based finetuning or large-scale video datasets, which impose substantial computational cost and are difficult to scale. Furthermore, they still struggle to maintain fine-grained appearance consistency across frames. To address these limitations, we introduce V-Warper, a training-free coarse-to-fine personalization framework for transformer-based video diffusion models. The framework enhances fine-grained identity fidelity without requiring any additional video training. (1) A lightweight coarse appearance adaptation stage leverages only a small set of reference images, which are already required for the task. This step encodes global subject identity through image-only LoRA and subject-embedding adaptation. (2) An inference-time fine appearance injection stage refines visual fidelity by computing semantic correspondences from RoPE-free mid-layer query-key features. These correspondences guide the warping of appearance-rich value representations into semantically aligned regions of the generation process, with masking ensuring spatial reliability. V-Warper significantly improves appearance fidelity while preserving prompt alignment and motion dynamics, and it achieves these gains efficiently without large-scale video finetuning.
摘要：视频个性化旨在生成忠实反映用户提供的主题的视频，同时遵循文本提示。然而，现有的方法通常依赖于大量基于视频的微调或大规模视频数据集，这会带来大量的计算成本并且难以扩展。此外，他们仍然努力保持跨框架的细粒度外观一致性。为了解决这些限制，我们引入了 V-Warper，这是一种用于基于 Transformer 的视频扩散模型的免训练从粗到细的个性化框架。该框架增强了细粒度的身份保真度，无需任何额外的视频培训。 (1) 轻量级的粗糙外观适应阶段仅利用任务已经需要的一小组参考图像。此步骤通过纯图像 LoRA 和主题嵌入自适应对全局主题身份进行编码。 (2) 推理时精细外观注入阶段通过计算无 RoPE 中间层查询的语义对应关系（关键特征）来细化视觉保真度。这些对应关系引导外观丰富的值表示变形到生成过程的语义对齐区域，并通过掩蔽确保空间可靠性。 V-Warper 显着提高了外观保真度，同时保留了即时对齐和运动动态，并且无需大规模视频微调即可有效实现这些增益。

Title: Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment

Authors: Yawen Shao, Jie Xiao, Kai Zhu, Yu Liu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12387
Pdf URL: https://arxiv.org/pdf/2512.12387
Copy Paste: [[2512.12387]] Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment(https://arxiv.org/abs/2512.12387)
Keywords: generation, generative
Abstract: Group Relative Policy Optimization (GRPO) has proven highly effective in enhancing the alignment capabilities of Large Language Models (LLMs). However, current adaptations of GRPO for flow matching-based image generation neglect a foundational conflict between its core principles and the distinct dynamics of the visual synthesis process. This mismatch leads to two key limitations: (i) Uniformly applying a sparse terminal reward across all timesteps impairs temporal credit assignment, ignoring the differing criticality of generation phases from early structure formation to late-stage tuning. (ii) Exclusive reliance on relative, intra-group rewards causes the optimization signal to fade as training converges, leading to optimization stagnation when reward diversity is entirely depleted. To address these limitations, we propose Value-Anchored Group Policy Optimization (VGPO), a framework that redefines value estimation across both temporal and group dimensions. Specifically, VGPO transforms the sparse terminal reward into dense, process-aware value estimates, enabling precise credit assignment by modeling the expected cumulative reward at each generative stage. Furthermore, VGPO replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal even as reward diversity declines. Extensive experiments on three benchmarks demonstrate that VGPO achieves state-of-the-art image quality while simultaneously improving task-specific accuracy, effectively mitigating reward hacking. Project webpage: this https URL.
摘要：事实证明，组相对策略优化 (GRPO) 在增强大型语言模型 (LLM) 的对齐能力方面非常有效。然而，当前 GRPO 对基于流匹配的图像生成的适应忽略了其核心原理与视觉合成过程的独特动态之间的根本冲突。这种不匹配导致两个关键限制：（i）在所有时间步上统一应用稀疏终端奖励会损害时间信用分配，忽略从早期结构形成到后期调整的生成阶段的不同关键性。 (ii) 完全依赖相对的组内奖励会导致优化信号随着训练收敛而减弱，从而在奖励多样性完全耗尽时导致优化停滞。为了解决这些限制，我们提出了价值锚定组策略优化（VGPO），这是一个跨时间和组维度重新定义价值估计的框架。具体来说，VGPO 将稀疏的终端奖励转化为密集的、过程感知的价值估计，通过对每个生成阶段的预期累积奖励进行建模来实现精确的信用分配。此外，VGPO 用绝对值增强的新颖过程取代了标准组标准化，即使奖励多样性下降也能保持稳定的优化信号。对三个基准的大量实验表明，VGPO 实现了最先进的图像质量，同时提高了特定任务的准确性，有效地缓解了奖励黑客行为。项目网页：此 https URL。
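
Note: Limitation (ii) in the abstract above can be seen in a few lines. The sketch below shows the commonly used GRPO-style group normalization, where each reward is centered on the group mean and scaled by the group standard deviation; it is not VGPO. The function name, the use of the population standard deviation, the `eps` constant, and the reward values are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Group normalization: (r - mean) / (std + eps) within one sampled group.

    Assumption: population standard deviation; some implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for one group of four samples.
diverse = group_relative_advantages([0.2, 0.5, 0.8, 0.5])

# Once every sample in the group earns the same reward, all advantages
# vanish, so the policy gradient carries no signal for this group.
collapsed = group_relative_advantages([0.7, 0.7, 0.7, 0.7])
```

With purely relative rewards, a group in which all samples score equally contributes nothing to the update regardless of how high or low that shared score is, which is the stagnation the abstract describes.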

Title: ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States

Authors: Haowen Wang, Xiaoping Yuan, Fugang Zhang, Rui Jian, Yuanwei Zhu, Xiuquan Qiao, Yakun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12395
Pdf URL: https://arxiv.org/pdf/2512.12395
Copy Paste: [[2512.12395]] ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States(https://arxiv.org/abs/2512.12395)
Keywords: generative
Abstract: Generating articulated assets is crucial for robotics, digital twins, and embodied intelligence. Existing generative models often rely on single-view inputs representing closed states, resulting in ambiguous or unrealistic kinematic structures due to the entanglement between geometric shape and joint dynamics. To address these challenges, we introduce ArtGen, a conditional diffusion-based framework capable of generating articulated 3D objects with accurate geometry and coherent kinematics from single-view images or text descriptions at arbitrary part-level states. Specifically, ArtGen employs cross-state Monte Carlo sampling to explicitly enforce global kinematic consistency, reducing structural-motion entanglement. Additionally, we integrate a Chain-of-Thought reasoning module to infer robust structural priors, such as part semantics, joint types, and connectivity, guiding a sparse-expert Diffusion Transformer to specialize in diverse kinematic interactions. Furthermore, a compositional 3D-VAE latent prior enhanced with local-global attention effectively captures fine-grained geometry and global part-level relationships. Extensive experiments on the PartNet-Mobility benchmark demonstrate that ArtGen significantly outperforms state-of-the-art methods.
摘要：生成铰接式资产对于机器人、数字孪生和实体智能至关重要。现有的生成模型通常依赖于表示闭合状态的单视图输入，由于几何形状和关节动力学之间的纠缠，导致模糊或不切实际的运动结构。为了应对这些挑战，我们引入了 ArtGen，这是一种基于条件扩散的框架，能够从任意零件级状态的单视图图像或文本描述生成具有精确几何形状和连贯运动学的铰接式 3D 对象。具体来说，ArtGen 采用跨状态蒙特卡洛采样来明确强制全局运动学一致性，减少结构运动纠缠。此外，我们集成了思想链推理模块来推断强大的结构先验，例如零件语义、关节类型和连接性，指导稀疏专家扩散变换器专门研究不同的运动学交互。此外，通过局部全局注意力增强的组合 3D-VAE 潜在先验可以有效捕获细粒度几何和全局零件级关系。对 PartNet-Mobility 基准的大量实验表明，ArtGen 的性能显着优于最先进的方法。

Title: BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation

Authors: Hangwei Zhang, Armando Teles Fortes, Tianyi Wei, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12425
Pdf URL: https://arxiv.org/pdf/2512.12425
Copy Paste: [[2512.12425]] BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation(https://arxiv.org/abs/2512.12425)
Keywords: generation
Abstract: Bokeh and monocular depth estimation are tightly coupled through the same lens imaging geometry, yet current methods exploit this connection in incomplete ways. High-quality bokeh rendering pipelines typically depend on noisy depth maps, which amplify estimation errors into visible artifacts, while modern monocular metric depth models still struggle on weakly textured, distant and geometrically ambiguous regions where defocus cues are most informative. We introduce BokehDepth, a two-stage framework that decouples bokeh synthesis from depth prediction and treats defocus as an auxiliary supervision-free geometric cue. In Stage-1, a physically guided controllable bokeh generator, built on a powerful pretrained image editing backbone, produces depth-free bokeh stacks with calibrated bokeh strength from a single sharp input. In Stage-2, a lightweight defocus-aware aggregation module plugs into existing monocular depth encoders, fuses features along the defocus dimension, and exposes stable depth-sensitive variations while leaving the downstream decoder unchanged. Across challenging benchmarks, BokehDepth improves visual fidelity over depth-map-based bokeh baselines and consistently boosts the metric accuracy and robustness of strong monocular depth foundation models.
摘要：散景和单目深度估计通过相同的镜头成像几何结构紧密耦合，但当前的方法以不完整的方式利用这种联系。高质量的散景渲染管道通常依赖于嘈杂的深度图，这会将估计误差放大为可见的伪影，而现代单目度量深度模型仍然在弱纹理、遥远和几何模糊的区域中苦苦挣扎，而在这些区域中，散焦线索的信息最为丰富。我们引入了 BokehDepth，这是一个两阶段框架，它将散景合成与深度预测解耦，并将散焦视为辅助的无监督几何线索。在 Stage-1 中，物理引导的可控散景生成器建立在强大的预训练图像编辑主干基础上，可通过单个锐输入生成具有校准散景强度的无深度散景堆栈。在第 2 阶段，轻量级散焦感知聚合模块插入现有的单目深度编码器，沿散焦维度融合特征，并暴露稳定的深度敏感变化，同时保持下游解码器不变。在具有挑战性的基准测试中，BokehDepth 提高了基于深度图的散景基线的视觉保真度，并持续提高了强大的单目深度基础模型的度量准确性和鲁棒性。

Title: Endless World: Real-Time 3D-Aware Long Video Generation

Authors: Ke Zhang, Yiqun Mei, Jiacong Xu, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12430
Pdf URL: https://arxiv.org/pdf/2512.12430
Copy Paste: [[2512.12430]] Endless World: Real-Time 3D-Aware Long Video Generation(https://arxiv.org/abs/2512.12430)
Keywords: generation
Abstract: Producing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video generation. To support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training. Our Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene generation. Experiments demonstrate that Endless World produces long, stable, and visually coherent videos, achieving competitive or superior performance to existing methods in both visual fidelity and spatial consistency. Our project is available at this https URL.
摘要：生成具有稳定 3D 结构的长而连贯的视频序列仍然是一项重大挑战，特别是在流媒体场景中。受此启发，我们引入了 Endless World，这是一个无限、3D 一致视频的实时框架，该 http URL 支持无限视频生成，我们引入了条件自回归训练策略，将新生成的内容与现有视频帧对齐。这种设计保留了远程依赖性，同时保持计算效率，无需额外训练此 http URL 即可在单个 GPU 上进行实时推理，我们的 Endless World 集成了全局 3D 感知注意力，以提供跨时间的连续几何指导。我们的 3D 注入机制在整个扩展序列中强制执行物理合理性和几何一致性，解决了长视界和动态场景中的关键挑战。该 http URL 实验表明，Endless World 可以生成长、稳定且视觉连贯的视频，在视觉保真度和空间一致性方面实现与现有方法相比具有竞争力或优越的性能。我们的项目已在此 https URL 上提供。

Title: Exploring the Design Space of Transition Matching

Authors: Uriel Singer, Yaron Lipman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12465
Pdf URL: https://arxiv.org/pdf/2512.12465
Copy Paste: [[2512.12465]] Exploring the Design Space of Transition Matching(https://arxiv.org/abs/2512.12465)
Keywords: generation, generative
Abstract: Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, similar to previous paradigms, gradually transforms noise samples to data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module architecture and modeling during training as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state-of-the-art among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while at the same time indicating which design choices are not likely to provide further gains.
摘要：转移匹配 (TM) 是一种新兴的生成建模范例，它概括了扩散和流匹配模型以及连续状态自回归模型。 TM 与之前的范例类似，逐渐将噪声样本转换为数据样本，但它使用第二个“内部”生成模型来实现转换步骤，使转换比扩散模型和流动模型更具表现力。为了使这个范式易于处理，TM 采用了一个大型骨干网络和一个较小的“头”模块来有效地执行生成转换步骤。在这项工作中，我们对 TM 框架中头部的设计、训练和采样进行了大规模、系统的研究，重点关注其时间连续的双向变体。通过涉及训练 56 个不同的 1.7B 文本到图像模型（产生 549 个独特的评估）的全面消融和实验，我们评估了训练期间头部模块架构和建模以及有用的随机 TM 采样器系列的影响。我们分析了对生成质量、训练和推理效率的影响。我们发现，具有 MLP 头、经过特定时间加权训练并使用高频采样器采样的 TM 在所有测试基线中达到最先进的所有指标中提供了最佳排名，而具有序列缩放和低频采样的 Transformer 头在图像美学方面表现出色，位居亚军。最后，我们认为所提出的实验强调了可能提供最大质量和效率增益的设计方面，同时表明哪些设计选择不太可能提供进一步的增益。

Title: Generative Spatiotemporal Data Augmentation

Authors: Jinfan Zhou, Lixin Luo, Sungmin Eum, Heesung Kwon, Jeong Joon Park
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12508
Pdf URL: https://arxiv.org/pdf/2512.12508
Copy Paste: [[2512.12508]] Generative Spatiotemporal Data Augmentation(https://arxiv.org/abs/2512.12508)
Keywords: generative
Abstract: We explore spatiotemporal data augmentation using video foundation models to diversify both camera viewpoints and scene dynamics. Unlike existing approaches based on simple geometric transforms or appearance perturbations, our method leverages off-the-shelf video diffusion models to generate realistic 3D spatial and temporal variations from a given image dataset. Incorporating these synthesized video clips as supplemental training data yields consistent performance gains in low-data settings, such as UAV-captured imagery where annotations are scarce. Beyond empirical improvements, we provide practical guidelines for (i) choosing an appropriate spatiotemporal generative setup, (ii) transferring annotations to synthetic frames, and (iii) addressing disocclusion - regions newly revealed and unlabeled in generated views. Experiments on COCO subsets and UAV-captured datasets show that, when applied judiciously, spatiotemporal augmentation broadens the data distribution along axes underrepresented by traditional and prior generative methods, offering an effective lever for improving model performance in data-scarce regimes.
摘要：我们使用视频基础模型探索时空数据增强，以使摄像机视角和场景动态多样化。与基于简单几何变换或外观扰动的现有方法不同，我们的方法利用现成的视频扩散模型从给定的图像数据集生成真实的 3D 空间和时间变化。将这些合成视频剪辑合并为补充训练数据可以在低数据设置中产生一致的性能增益，例如注释稀缺的无人机捕获的图像。除了经验改进之外，我们还提供了实用指南：（i）选择适当的时空生成设置，（ii）将注释转移到合成框架，以及（iii）解决遮挡问题 - 在生成的视图中新显示和未标记的区域。对 COCO 子集和无人机捕获数据集的实验表明，如果明智地应用，时空增强会沿着传统和先前生成方法未充分代表的轴拓宽数据分布，为提高数据稀缺状态下的模型性能提供有效的杠杆。

Title: Differentiable Energy-Based Regularization in GANs: A Simulator-Based Exploration of VQE-Inspired Auxiliary Losses

Authors: David Strnadel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12581
Pdf URL: https://arxiv.org/pdf/2512.12581
Copy Paste: [[2512.12581]] Differentiable Energy-Based Regularization in GANs: A Simulator-Based Exploration of VQE-Inspired Auxiliary Losses(https://arxiv.org/abs/2512.12581)
Keywords: generative
Abstract: This paper presents an exploratory, simulator-based proof of concept investigating whether differentiable energy terms derived from parameterized quantum circuits can serve as auxiliary regularization signals in Generative Adversarial Networks (GANs). We augment the Auxiliary Classifier GAN (ACGAN) generator objective with a Variational Quantum Eigensolver (VQE)-inspired energy term computed from class-specific Ising Hamiltonians using Qiskit's EstimatorQNN and TorchConnector. Important limitations: All experiments run on a noiseless statevector simulator with only 4 qubits, use a deliberately simple Hamiltonian parameterization, and lack ablation studies comparing against equivalent classical biases. The computational overhead (approximately 200x slower than classical ACGAN) reflects simulator artifacts rather than inherent quantum costs. On MNIST, we observe that the energy-regularized model (termed QACGAN) achieves high classification accuracy (99 to 100 percent) within 5 epochs compared to 87.8 percent for ACGAN, suggesting the auxiliary term influences class conditioning. However, sample quality metrics (FID) show high variance across runs (coefficient of variation approximately 25 percent at epoch 5), with values ranging from 19.92 to 35.96. Extended runs stabilize around FID 23 to 24, comparable to the ACGAN baseline. We explicitly do not claim quantum advantage, improved stability in any general sense, or scalability beyond this toy setting. The contribution is methodological: demonstrating that VQE-style energy computations can be integrated into GAN training loops via differentiable pathways. Whether such auxiliary signals provide benefits beyond equivalent classical regularizers remains an open question requiring systematic ablation studies, which we leave for future work.
摘要：本文提出了一种基于模拟器的探索性概念验证，研究从参数化量子电路导出的可微分能量项是否可以充当生成对抗网络（GAN）中的辅助正则化信号。我们使用 Qiskit 的 EstimatorQNN 和 TorchConnector 根据类特定伊辛哈密顿量计算出受变分量子本征解算器 (VQE) 启发的能量项，从而增强了辅助分类器 GAN (ACGAN) 生成器目标。重要限制：所有实验都在只有 4 个量子位的无噪声状态向量模拟器上运行，使用故意简单的哈密顿参数化，并且缺乏与等效经典偏差进行比较的消融研究。计算开销（比经典 ACGAN 慢大约 200 倍）反映的是模拟器伪影，而不是固有的量子成本。在 MNIST 上，我们观察到能量正则化模型（称为 QACGAN）在 5 个 epoch 内实现了较高的分类准确度（99% 到 100%），而 ACGAN 的分类准确度为 87.8%，这表明辅助项影响了类别调节。然而，样本质量指标 (FID) 在运行中显示出较高的方差（第 5 轮的变异系数约为 25%），值范围为 19.92 至 35.96。延长运行稳定在 FID 23 至 24 左右，与 ACGAN 基线相当。我们明确不声称量子优势、任何一般意义上的稳定性提高或超出此玩具设置的可扩展性。该贡献是方法论上的：证明 VQE 式能量计算可以通过可微分路径集成到 GAN 训练循环中。这些辅助信号是否能提供超越等效经典正则化器的好处仍然是一个悬而未决的问题，需要系统的消融研究，我们将其留给未来的工作。

Title: Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation

Authors: Karthikeya KV
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12595
Pdf URL: https://arxiv.org/pdf/2512.12595
Copy Paste: [[2512.12595]] Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation(https://arxiv.org/abs/2512.12595)
Keywords: generation, generative
Abstract: This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data interpretation. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. A bidirectional tokenization strategy is employed to seamlessly merge inputs from text, image, and video modalities, fostering a unified understanding across diverse data types. By embedding spatial-temporal features and leveraging a hybrid text-image sequence modeling approach, the framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations. The architecture is optimized with a noise-aware learning algorithm, addressing discrepancies in noisy data distributions and improving generative performance under varying input conditions. Rigorous evaluations on benchmark datasets demonstrate a 25% increase in image resolution clarity and a 20% reduction in computational requirements compared to diffusion-based methods. Furthermore, the model exhibits robust scalability and adaptability, showcasing its potential in applications like autonomous systems, creative content generation, and advanced video analysis. This work underscores the role of vision-centric LLMs in redefining capabilities in computer vision and multimodal artificial intelligence.
摘要：这项研究引入了一个变革性框架，用于将视觉增强型大语言模型 (LLM) 与基于变压器的先进架构相集成，以应对高分辨率图像合成和多模态数据解释方面的挑战。所提出的模型采用了整流流机制，通过线性路径连接噪声和数据，从而实现高效、高质量的生成。采用双向标记化策略来无缝合并来自文本、图像和视频模式的输入，从而促进对不同数据类型的统一理解。通过嵌入时空特征并利用混合文本图像序列建模方法，该框架在合成图像和连贯的多模态表示中实现了无与伦比的保真度。该架构通过噪声感知学习算法进行了优化，解决了噪声数据分布中的差异，并提高了不同输入条件下的生成性能。对基准数据集的严格评估表明，与基于扩散的方法相比，图像分辨率清晰度提高了 25%，计算要求降低了 20%。此外，该模型表现出强大的可扩展性和适应性，展示了其在自主系统、创意内容生成和高级视频分析等应用中的潜力。这项工作强调了以视觉为中心的法学硕士在重新定义计算机视觉和多模式人工智能能力方面的作用。
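
Note: The abstract above mentions a rectified flow mechanism that connects noise and data with linear paths. The sketch below illustrates only the standard rectified-flow interpolation, not this paper's model. The convention (noise at t = 0, data at t = 1), the function names, and the sample values are assumptions.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight line from noise (t = 0) to data (t = 1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def velocity_target(noise, data):
    """Constant velocity along the line; the quantity a model would regress."""
    return [d - n for n, d in zip(noise, data)]


rng = random.Random(0)
data = [1.0, -2.0, 0.5]                      # made-up "data" sample
noise = [rng.gauss(0.0, 1.0) for _ in data]  # Gaussian noise sample

x_mid = interpolate(noise, data, 0.5)
v = velocity_target(noise, data)

# Because the path is straight, a few Euler steps along the true velocity
# already carry the noise sample onto the data sample.
steps = 4
x = list(noise)
for _ in range(steps):
    x = [xi + vi / steps for xi, vi in zip(x, v)]
```

The straight path is what makes few-step sampling plausible: when the learned velocity is close to constant along each trajectory, coarse integration loses little accuracy.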

Title: Content-Aware Ad Banner Layout Generation with Two-Stage Chain-of-Thought in Vision Language Models

Authors: Kei Yoshitake, Kento Hosono, Ken Kobayashi, Kazuhide Nakata
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12596
Pdf URL: https://arxiv.org/pdf/2512.12596
Copy Paste: [[2512.12596]] Content-Aware Ad Banner Layout Generation with Two-Stage Chain-of-Thought in Vision Language Models(https://arxiv.org/abs/2512.12596)
Keywords: generation
Abstract: In this paper, we propose a method for generating layouts for image-based advertisements by leveraging a Vision-Language Model (VLM). Conventional advertisement layout techniques have predominantly relied on saliency mapping to detect salient regions within a background image, but such approaches often fail to fully account for the image's detailed composition and semantic content. To overcome this limitation, our method harnesses a VLM to recognize the products and other elements depicted in the background and to inform the placement of text and logos. The proposed layout-generation pipeline consists of two steps. In the first step, the VLM analyzes the image to identify object types and their spatial relationships, then produces a text-based "placement plan" based on this analysis. In the second step, that plan is rendered into the final layout by generating HTML-format code. We validated the effectiveness of our approach through evaluation experiments, conducting both quantitative and qualitative comparisons against existing methods. The results demonstrate that by explicitly considering the background image's content, our method produces noticeably higher-quality advertisement layouts.
摘要：在本文中，我们提出了一种利用视觉语言模型（VLM）生成基于图像的广告布局的方法。传统的广告布局技术主要依靠显着性映射来检测背景图像内的显着区域，但此类方法通常无法完全考虑图像的详细组成和语义内容。为了克服这一限制，我们的方法利用 VLM 来识别背景中描绘的产品和其他元素，并告知文本和徽标的位置。所提出的布局生成流程由两个步骤组成。第一步，VLM 分析图像以识别对象类型及其空间关系，然后根据此分析生成基于文本的“放置计划”。在第二步中，通过生成 HTML 格式的代码将该计划呈现为最终布局。我们通过评估实验验证了我们方法的有效性，与现有方法进行定量和定性比较。结果表明，通过明确考虑背景图像的内容，我们的方法可以产生明显更高质量的广告布局。

Title: Geometry-Aware Scene-Consistent Image Generation

Authors: Cong Xie, Che Wang, Yan Zhang, Zheng Pan, Han Zou, Zhenpeng Zhan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12598
Pdf URL: https://arxiv.org/pdf/2512.12598
Copy Paste: [[2512.12598]] Geometry-Aware Scene-Consistent Image Generation(https://arxiv.org/abs/2512.12598)
Keywords: generation
Abstract: We study geometry-aware scene-consistent image generation: given a reference scene image and a text condition specifying an entity to be generated in the scene and its spatial relation to the scene, the goal is to synthesize an output image that preserves the same physical environment as the reference scene while correctly generating the entity according to the spatial relation described in the text. Existing methods struggle to balance scene preservation with prompt adherence: they either replicate the scene with high fidelity but poor responsiveness to the prompt, or prioritize prompt compliance at the expense of scene consistency. To resolve this trade-off, we introduce two key contributions: (i) a scene-consistent data construction pipeline that generates diverse, geometrically-grounded training pairs, and (ii) a novel geometry-guided attention loss that leverages cross-view cues to regularize the model's spatial reasoning. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image consistency than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method produces geometrically coherent images with diverse compositions that remain faithful to the textual instructions and the underlying scene structure.
摘要：我们研究几何感知场景一致图像生成：给定参考场景图像和指定要在场景中生成的实体及其与场景的空间关系的文本条件，目标是合成一个输出图像，该输出图像保留与参考场景相同的物理环境，同时根据文本中描述的空间关系正确生成实体。现有的方法很难在场景保留与及时遵守之间取得平衡：它们要么以高保真度复制场景，但对提示的响应性较差，要么以牺牲场景一致性为代价优先考虑及时遵守。为了解决这种权衡问题，我们引入了两个关键贡献：（i）场景一致的数据构建管道，可生成多样化的、基于几何的训练对；（ii）新颖的几何引导注意力损失，利用跨视图线索来规范模型的空间推理。根据自动指标和人类偏好研究，对我们的场景一致性基准进行的实验表明，我们的方法比最先进的基线实现了更好的场景对齐和文本图像一致性。我们的方法产生具有不同构图的几何连贯图像，这些图像仍然忠实于文本指令和底层场景结构。

Title: No Cache Left Idle: Accelerating diffusion model via Extreme-slimming Caching

Authors: Tingyan Wen, Haoyu Li, Yihuang Chen, Xing Zhou, Lifei Zhu, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12604
Pdf URL: https://arxiv.org/pdf/2512.12604
Copy Paste: [[2512.12604]] No Cache Left Idle: Accelerating diffusion model via Extreme-slimming Caching(https://arxiv.org/abs/2512.12604)
Keywords: generative
Abstract: Diffusion models achieve remarkable generative quality, but computational overhead scales with step count, model depth, and sequence length. Feature caching is effective since adjacent timesteps yield highly similar features. However, an inherent trade-off remains: aggressive timestep reuse offers large speedups but can easily cross the critical line, hurting fidelity, while block- or token-level reuse is safer but yields limited computational savings. We present X-Slim (eXtreme-Slimming Caching), a training-free, cache-based accelerator that, to our knowledge, is the first unified framework to exploit cacheable redundancy across timesteps, structure (blocks), and space (tokens). Rather than simply mixing levels, X-Slim introduces a dual-threshold controller that turns caching into a push-then-polish process: it first pushes reuse at the timestep level up to an early-warning line, then switches to lightweight block- and token-level refresh to polish the remaining redundancy, and triggers full inference once the critical line is crossed to reset accumulated error. At each level, context-aware indicators decide when and where to cache. Across diverse tasks, X-Slim advances the speed-quality frontier. On FLUX.1-dev and HunyuanVideo, it reduces latency by up to 4.97x and 3.52x with minimal perceptual loss. On DiT-XL/2, it reaches 3.13x acceleration and improves FID by 2.42 over prior methods.
摘要：扩散模型实现了卓越的生成质量，但计算开销随步数、模型深度和序列长度而变化。特征缓存是有效的，因为相邻的时间步产生高度相似的特征。然而，一个固有的权衡仍然存在：积极的时间步重用提供了很大的加速，但很容易跨越临界线，损害保真度，而块或令牌级重用更安全，但产生的计算节省有限。我们提出了 X-Slim (eXtreme-Slimming Caching)，这是一种免训练、基于缓存的加速器，据我们所知，它是第一个跨时间步长、结构（块）和空间（令牌）利用可缓存冗余的统一框架。 X-Slim 不是简单地混合级别，而是引入了双阈值控制器，将缓存转变为先推后抛光的过程：它首先将时间步级别的重用推至预警线，然后切换到轻量级块和令牌级别刷新以抛光剩余的冗余，并在跨越临界线时触发完整推理以重置累积错误。在每个级别，上下文感知指示器决定缓存的时间和位置。在不同的任务中，X-Slim 推进了速度质量前沿。在 FLUX.1-dev 和 HunyuanVideo 上，它最多可将延迟降低 4.97 倍和 3.52 倍，同时将感知损失降至最低。在 DiT-XL/2 上，它实现了 3.13 倍的加速，并且 FID 比之前的方法改善了 2.42。
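
The push-then-polish control flow described in this abstract can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not the X-Slim implementation: the per-step `drifts` scores stand in for the paper's context-aware indicators (which the abstract does not specify), and the names `warn`, `critical`, and `partial_discount` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    REUSE_STEP = "reuse_step"            # reuse the cached output for the whole timestep
    PARTIAL_REFRESH = "partial_refresh"  # lightweight block-/token-level refresh
    FULL_INFERENCE = "full_inference"    # recompute everything and reset the error


def dual_threshold_schedule(drifts, warn=0.25, critical=0.5, partial_discount=0.5):
    """Pick one action per step from an accumulated-error estimate.

    drifts: hypothetical per-step estimates of how much error full reuse would add.
    warn / critical: the early-warning and critical lines (assumed values).
    partial_discount: assumed fraction of the drift that survives a partial refresh.
    """
    if not 0.0 <= warn < critical:
        raise ValueError("expected 0 <= warn < critical")
    actions = []
    err = 0.0
    for d in drifts:
        candidate = err + d  # error if this step were fully reused
        if candidate >= critical:
            # Critical line crossed: run full inference and reset the error.
            actions.append(Action.FULL_INFERENCE)
            err = 0.0
        elif candidate >= warn:
            # Past the early-warning line: polish with a partial refresh,
            # which is assumed to absorb part of the new drift.
            actions.append(Action.PARTIAL_REFRESH)
            err += partial_discount * d
        else:
            # Below the early-warning line: push timestep-level reuse.
            actions.append(Action.REUSE_STEP)
            err = candidate
    return actions


if __name__ == "__main__":
    for step, action in enumerate(dual_threshold_schedule([0.125] * 8)):
        print(step, action.value)
```

With a constant drift the schedule cycles through reuse, several partial refreshes, and an occasional full pass. In a real system the drift would have to be estimated from the model's features rather than supplied up front.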

Title: Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Authors: Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu, Xin Eric Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.12623
Pdf URL: https://arxiv.org/pdf/2512.12623
Copy Paste: [[2512.12623]] Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space(https://arxiv.org/abs/2512.12623)
Keywords: generation
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning, unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into latent think token to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
摘要：多模态大型语言模型 (MLLM) 的最新进展通过在语义空间中结合思想链 (CoT) 推理，显着增强了跨模态理解和推理。在此基础上，最近的研究将 CoT 机制扩展到视觉模态，使模型能够在推理过程中通过外部工具或显式图像生成来整合视觉信息。然而，这些方法仍然依赖于明确的逐步推理、不稳定的感知推理交互和显着的计算开销。受人类认知的启发，我们认为思维不是线性展开的，而是通过思维内推理和感知的动态交错展开的。受此观点的启发，我们提出了 DMLR，这是一种测试时动态多模态潜在推理框架，它采用置信引导的潜在策略梯度优化来细化潜在思考标记以进行深度推理。此外，还引入了动态视觉注入策略，该策略检索每个潜在思考标记最相关的视觉特征并更新最佳视觉补丁集。然后将更新的补丁注入到潜在的思考令牌中，以实现动态的视觉文本交错。七个多模态推理基准和各种模型架构的实验表明，DMLR 显着提高了推理和感知性能，同时保持了高推理效率。

Title: DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

Authors: Zhou Tao, Shida Wang, Yongxiang Hua, Haoyu Cao, Linli Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12633
Pdf URL: https://arxiv.org/pdf/2512.12633
Copy Paste: [[2512.12633]] DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model(https://arxiv.org/abs/2512.12633)
Keywords: generation
Abstract: Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.
摘要：多模态大语言模型在各种视觉语言任务上取得了令人印象深刻的性能，但其细粒度的视觉感知和精确的空间推理仍然有限。在这项工作中，我们引入了 DiG（差分定位），这是一种新颖的代理任务框架，MLLM 通过识别和定位相似图像对之间的所有差异来学习细粒度感知，而无需事先了解其数量。为了支持可扩展的训练，我们开发了一个基于 3D 渲染的自动化数据生成管道，可生成差异完全可控的高质量配对图像。为了解决差异信号的稀疏性，我们进一步采用课程学习，逐步增加复杂性，从单一差异到多重差异，从而实现稳定的优化。大量实验表明，DiG 显着提高了各种视觉感知基准的模型性能，并且学习到的细粒度感知技能可以有效地转移到标准下游任务，包括 RefCOCO、RefCOCO+、RefCOCOg 和通用多模态感知基准。我们的结果强调差分定位是一种可扩展且稳健的方法，可用于推进 MLLM 中的细粒度视觉推理。

Title: InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation

Authors: Sreehari Rajan, Kunal Bhosikar, Charu Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12664
Pdf URL: https://arxiv.org/pdf/2512.12664
Copy Paste: [[2512.12664]] InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation(https://arxiv.org/abs/2512.12664)
Keywords: generation
Abstract: Generating realistic human motions that naturally respond to both spoken language and physical objects is crucial for interactive digital experiences. Current methods, however, address speech-driven gestures or object interactions independently, limiting real-world applicability due to a lack of integrated, comprehensive datasets. To overcome this, we introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation. We achieve this by employing a multi-stage training process to learn a unified motion, speech, and prompt embedding space. To support this, we curate a rich human-object interaction dataset, formed by augmenting an existing text-to-motion dataset with detailed object interaction annotations. Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition, which is then dynamically combined during inference. To address the imbalance between heterogeneous conditioning signals, we propose an adaptive fusion strategy, which dynamically reweights the conditioning signals during diffusion sampling. InteracTalker successfully unifies these previously separate tasks, outperforming prior methods in both co-speech gesture generation and object-interaction synthesis, outperforming gesture-focused diffusion methods, yielding highly realistic, object-aware full-body motions with enhanced realism, flexibility, and control.
摘要：生成对口语和物理对象自然响应的真实人体动作对于交互式数字体验至关重要。然而，当前的方法独立地处理语音驱动的手势或对象交互，由于缺乏集成的、全面的数据集，限制了现实世界的适用性。为了克服这个问题，我们引入了 InteracTalker，这是一个新颖的框架，它将基于提示的对象感知交互与协同语音手势生成无缝集成。我们通过采用多阶段训练过程来学习统一的动作、语音和提示嵌入空间来实现这一目标。为了支持这一点，我们策划了一个丰富的人与物体交互数据集，该数据集是通过使用详细的物体交互注释增强现有的文本到运动数据集而形成的。我们的框架利用通用运动适应模块，可以进行独立训练，适应相应的运动条件，然后在推理过程中动态组合。为了解决异构条件信号之间的不平衡问题，我们提出了一种自适应融合策略，该策略在扩散采样期间动态地重新加权条件信号。 InteracTalker 成功地统一了这些以前独立的任务，在协同语音手势生成和对象交互合成方面都优于先前的方法，优于以手势为中心的扩散方法，产生高度逼真、对象感知的全身运动，并具有增强的真实性、灵活性和控制力。

Title: DynaGen: Unifying Temporal Knowledge Graph Reasoning with Dynamic Subgraphs and Generative Regularization

Authors: Jiawei Shen, Jia Zhu, Hanghui Guo, Weijie Shi, Guoqing Ma, Yidan Liang, Jingjiang Liu, Hao Chen, Shimin Di
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12669
Pdf URL: https://arxiv.org/pdf/2512.12669
Copy Paste: [[2512.12669]] DynaGen: Unifying Temporal Knowledge Graph Reasoning with Dynamic Subgraphs and Generative Regularization(https://arxiv.org/abs/2512.12669)
Keywords: generative
Abstract: Temporal Knowledge Graph Reasoning (TKGR) aims to complete missing factual elements along the timeline. Depending on the temporal position of the query, the task is categorized into interpolation and extrapolation. Existing interpolation methods typically embed temporal information into individual facts to complete missing historical knowledge, while extrapolation techniques often leverage sequence models over graph snapshots to identify recurring patterns for future event prediction. These methods face two critical challenges: limited contextual modeling in interpolation and cognitive generalization bias in extrapolation. To address these, we propose a unified method for TKGR, dubbed DynaGen. For interpolation, DynaGen dynamically constructs entity-centric subgraphs and processes them with a synergistic dual-branch GNN encoder to capture evolving structural context. For extrapolation, it applies a conditional diffusion process, which forces the model to learn underlying evolutionary principles rather than just superficial patterns, enhancing its ability to predict unseen future events. Extensive experiments on six benchmark datasets show DynaGen achieves state-of-the-art performance. On average, compared to the second-best models, DynaGen improves the Mean Reciprocal Rank (MRR) score by 2.61 points for interpolation and 1.45 points for extrapolation.
摘要：时态知识图推理（TKGR）旨在沿着时间线补全缺失的事实元素。根据查询的时间位置，任务分为插值和外推。现有的插值方法通常将时间信息嵌入到单个事实中以完成缺失的历史知识，而外推技术通常利用图形快照上的序列模型来识别未来事件预测的重复模式。这些方法面临两个关键挑战：插值中有限的上下文建模和外推中的认知泛化偏差。为了解决这些问题，我们提出了一种统一的 TKGR 方法，称为 DynaGen。对于插值，DynaGen 动态构建以实体为中心的子图，并使用协同双分支 GNN 编码器对其进行处理，以捕获不断变化的结构上下文。为了进行推断，它应用了条件扩散过程，这迫使模型学习潜在的进化原理而不仅仅是表面模式，从而增强其预测看不见的未来事件的能力。对六个基准数据集的广泛实验表明 DynaGen 实现了最先进的性能。平均而言，与第二好的模型相比，DynaGen 的插值平均倒数排名 (MRR) 得分提高了 2.61 分，外推提高了 1.45 分。
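
For readers unfamiliar with the metric reported in this abstract, Mean Reciprocal Rank averages the reciprocal of the rank at which the correct answer appears for each query. The helper below is a minimal sketch of the standard definition. The function name is illustrative, and reading the reported "points" as values on a 0-100 scale is an assumption, not something the abstract states.

```python
def mean_reciprocal_rank(ranks):
    """Return the mean of 1/rank over queries.

    ranks: 1-based rank of the correct answer for each query.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


if __name__ == "__main__":
    # Correct answers ranked 1st, 2nd and 4th: (1 + 1/2 + 1/4) / 3
    print(round(mean_reciprocal_rank([1, 2, 4]), 4))
```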

Title: On Approaches to Building Surrogate ODE Models for Diffusion Bridges

Authors: Maria Khilchuk, Vladimir Latypov, Pavel Kleshchev, Alexander Hvatov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12671
Pdf URL: https://arxiv.org/pdf/2512.12671
Copy Paste: [[2512.12671]] On Approaches to Building Surrogate ODE Models for Diffusion Bridges(https://arxiv.org/abs/2512.12671)
Keywords: generative
Abstract: Diffusion and Schrödinger Bridge models have established state-of-the-art performance in generative modeling but are often hampered by significant computational costs and complex training procedures. While continuous-time bridges promise faster sampling, overparameterized neural networks describe their optimal dynamics, and the underlying stochastic differential equations can be difficult to integrate efficiently. This work introduces a novel paradigm that uses surrogate models to create simpler, faster, and more flexible approximations of these dynamics. We propose two specific algorithms: SINDy Flow Matching (SINDy-FM), which leverages sparse regression to identify interpretable, symbolic differential equations from data, and a Neural-ODE reformulation of the Schrödinger Bridge (DSBM-NeuralODE) for flexible continuous-time parameterization. Our experiments on Gaussian transport tasks and MNIST latent translation demonstrate that these surrogates achieve competitive performance while offering dramatic improvements in efficiency and interpretability. The symbolic SINDy-FM models, in particular, reduce parameter counts by several orders of magnitude and enable near-instantaneous inference, paving the way for a new class of tractable and high-performing bridge models for practical deployment.
摘要：扩散和薛定谔桥模型已经在生成建模中建立了最先进的性能，但经常受到巨大的计算成本和复杂的训练程序的阻碍。虽然连续时间桥承诺更快的采样，但过度参数化的神经网络描述了它们的最佳动态，并且底层的随机微分方程可能难以有效积分。这项工作引入了一种新颖的范例，它使用代理模型来创建这些动态的更简单、更快、更灵活的近似。我们提出了两种具体算法：SINDy 流匹配 (SINDy-FM)，它利用稀疏回归从数据中识别可解释的符号微分方程；以及薛定谔桥的神经常微分方程重构 (DSBM-NeuralODE)，用于灵活的连续时间参数化。我们在高斯传输任务和 MNIST 潜在翻译上的实验表明，这些代理实现了有竞争力的性能，同时在效率和可解释性方面提供了显着的改进。特别是，符号化的 SINDy-FM 模型将参数数量减少了几个数量级，并实现了近乎瞬时的推理，为新型易于处理的高性能桥梁模型的实际部署铺平了道路。

Title: Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Authors: Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12675
Pdf URL: https://arxiv.org/pdf/2512.12675
Copy Paste: [[2512.12675]] Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling(https://arxiv.org/abs/2512.12675)
Keywords: generation
Abstract: Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: this https URL.
摘要：主题驱动的图像生成已经从单主题合成发展到多主题合成，同时忽略了区别，即当输入包含多个候选者时识别和生成正确主题的能力。这种限制限制了复杂、现实的视觉设置中的有效性。我们提出了 Scone，一种集成了组合和区分的统一理解生成方法。 Scone 使理解专家能够充当语义桥梁，传达语义信息并指导生成专家保留主题身份，同时最大限度地减少干扰。两阶段训练方案首先学习组合，然后通过语义对齐和基于注意力的掩蔽来增强区别。我们还引入了 SconeEval，这是一个评估不同场景的组成和区别的基准。实验表明，Scone 在两个基准测试中的组合和区分任务上优于现有的开源模型。我们的模型、基准测试和训练数据可在以下网址获取：此 https URL。

Title: Robust Motion Generation using Part-level Reliable Data from Videos

Authors: Boyuan Li, Sipeng Zheng, Bin Cao, Ruihua Song, Zongqing Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12703
Pdf URL: https://arxiv.org/pdf/2512.12703
Copy Paste: [[2512.12703]] Robust Motion Generation using Part-level Reliable Data from Videos(https://arxiv.org/abs/2512.12703)
Keywords: generation
Abstract: Extracting human motion from large-scale web videos offers a scalable solution to the data scarcity issue in character animation. However, some human parts in many video frames cannot be seen due to off-screen captures or occlusions. It brings a dilemma: discarding the data missing any part limits scale and diversity, while retaining it compromises data quality and model performance. To address this problem, we propose leveraging credible part-level data extracted from videos to enhance motion generation via a robust part-aware masked autoregression model. First, we decompose a human body into five parts and detect the parts clearly seen in a video frame as "credible". Second, the credible parts are encoded into latent tokens by our proposed part-aware variational autoencoder. Third, we propose a robust part-level masked generation model to predict masked credible parts, while ignoring those noisy parts. In addition, we contribute K700-M, a challenging new benchmark comprising approximately 200k real-world motion sequences, for evaluation. Experimental results indicate that our method successfully outperforms baselines on both clean and noisy datasets in terms of motion quality, semantic consistency and diversity. Project page: this https URL
摘要：从大规模网络视频中提取人体动作为角色动画中的数据稀缺问题提供了可扩展的解决方案。然而，由于离屏捕捉或遮挡，许多视频帧中的某些人体部分无法被看到。它带来了一个困境：丢弃丢失任何部分的数据会限制规模和多样性，而保留它会损害数据质量和模型性能。为了解决这个问题，我们建议利用从视频中提取的可靠的零件级数据，通过强大的零件感知屏蔽自回归模型来增强运动生成。首先，我们将人体分解为五个部分，并将视频帧中清晰可见的部分检测为“可信”。其次，可信部分由我们提出的部分感知变分自动编码器编码为潜在标记。第三，我们提出了一种鲁棒的部分级屏蔽生成模型来预测屏蔽的可信部分，同时忽略那些噪声部分。此外，我们还贡献了 K700-M，这是一个具有挑战性的新基准，包含大约 20 万个真实世界的运动序列，用于评估。实验结果表明，我们的方法在运动质量、语义一致性和多样性方面成功优于干净数据集和噪声数据集上的基线。项目页面：此 https URL

Title: Spinal Line Detection for Posture Evaluation through Training-free 3D Human Body Reconstruction with 2D Depth Images

Authors: Sehyun Kim, Hye Jun Lee, Jiwoo Lee, Changgyun Kim, Taemin Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12718
Pdf URL: https://arxiv.org/pdf/2512.12718
Copy Paste: [[2512.12718]] Spinal Line Detection for Posture Evaluation through Training-free 3D Human Body Reconstruction with 2D Depth Images(https://arxiv.org/abs/2512.12718)
Keywords: restoration
Abstract: The spinal angle is an important indicator of body balance. It is important to restore the 3D shape of the human body and estimate the spine center line. Existing multi-image-based body restoration methods require expensive equipment and complex procedures, and single image-based body restoration methods have limitations in that it is difficult to accurately estimate the internal structure such as the spine center line due to occlusion and viewpoint limitation. This study proposes a method to compensate for the shortcomings of the multi-image-based method and to solve the limitations of the single-image method. We propose a 3D body posture analysis system that integrates depth images from four directions to restore a 3D human model and automatically estimate the spine center line. Through hierarchical matching of global and fine registration, restoration to noise and occlusion is performed. Also, the Adaptive Vertex Reduction is applied to maintain the resolution and shape reliability of the mesh, and the accuracy and stability of spinal angle estimation are simultaneously secured by using the Level of Detail ensemble. The proposed method achieves high-precision 3D spine registration estimation without relying on training data or complex neural network models, and the verification confirms the improvement of matching quality.
摘要：脊柱角度是身体平衡的重要指标。恢复人体的3D形状并估计脊柱中心线非常重要。现有的基于多图像的身体复原方法需要昂贵的设备和复杂的程序，并且基于单图像的身体复原方法存在局限性，因为由于遮挡和视点限制而难以准确估计脊柱中心线等内部结构。本研究提出了一种方法来弥补基于多图像的方法的缺点并解决单图像方法的局限性。我们提出了一种 3D 身体姿势分析系统，该系统集成四个方向的深度图像来恢复 3D 人体模型并自动估计脊柱中心线。通过全局配准和精细配准的分层匹配，对噪声和遮挡进行恢复。此外，应用自适应顶点缩减来保持网格的分辨率和形状可靠性，并通过使用细节层次集成同时保证脊柱角度估计的准确性和稳定性。该方法在不依赖训练数据或复杂的神经网络模型的情况下实现了高精度3D脊柱配准估计，并且验证证实了匹配质量的提高。

Title: GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

Authors: Zhenya Yang, Zhe Liu, Yuxiang Lu, Liping Hou, Chenxuan Miao, Siyi Peng, Bailan Feng, Xiang Bai, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12751
Pdf URL: https://arxiv.org/pdf/2512.12751
Copy Paste: [[2512.12751]] GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation(https://arxiv.org/abs/2512.12751)
Keywords: generation
Abstract: Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.
摘要：物理感知驾驶世界模型对于驾驶规划、分布外数据合成和闭环评估至关重要。然而，现有的方法通常依赖于单一的扩散模型来直接将驾驶动作映射到视频，这使得学习变得困难并导致物理上不一致的输出。为了克服这些挑战，我们提出了 GenieDrive，这是一种专为物理感知驾驶视频生成而设计的新颖框架。我们的方法首先生成 4D 占用，它作为后续视频生成的物理基础。 4D 占用包含丰富的物理信息，包括高分辨率的 3D 结构和动力学。为了促进这种高分辨率占用率的有效压缩，我们提出了一种 VAE，将占用率编码为潜在的三平面表示，将潜在大小减少到仅先前方法中使用的 58%。我们进一步引入相互控制注意（MCA）来准确建模控制对占用演化的影响，并以端到端的方式联合训练 VAE 和后续预测模块，以最大限度地提高预测精度。总之，这些设计以 41 FPS 的推理速度将预测 mIoU 提高了 7.2%，同时仅使用 3.47 M 个参数。此外，视频生成模型中引入了归一化多视图注意力机制，在 4D 占用的指导下生成多视图驾驶视频，显着提高了视频质量，FVD 降低了 20.7%。实验表明，GenieDrive 可实现高度可控、多视图一致且物理感知的驾驶视频生成。

Title: FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning

Authors: Yue Jiang, Dingkang Yang, Minghao Han, Jinghang Han, Zizhi Chen, Yizhou Liu, Mingcheng Li, Peng Zhai, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12756
Pdf URL: https://arxiv.org/pdf/2512.12756
Copy Paste: [[2512.12756]] FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning(https://arxiv.org/abs/2512.12756)
Keywords: generation
Abstract: Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration, suffering from incomplete modality coverage, restricted interaction to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We also propose the Cross-Modal Complementarity Screening (CMCS) strategy integrated in a systematic data construction framework that produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes the performance disparities and limitations across models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.
摘要：尽管多模态大语言模型（MLLM）和新兴的全模态架构取得了快速进展，但当前的基准在范围和集成方面仍然有限，存在模态覆盖不完整、以文本为中心的输出交互受限以及模态之间的相互依赖和互补性较弱等问题。为了弥补这些差距，我们推出了 FysicsWorld，这是第一个统一的全模态基准测试，支持跨图像、视频、音频和文本的双向输入输出，从而实现跨理解、生成和推理的全面的任意对任意评估。 FysicsWorld 包含 16 项主要任务和 3,268 个精选样本，这些样本来自 40 多个高质量来源，涵盖丰富的开放领域类别和不同的问题类型。我们还提出了将跨模态互补筛选（CMCS）策略集成到系统数据构建框架中，该框架为口语交互和依赖于融合的跨模态推理生成全模态数据。通过对 30 多个最先进的基线（涵盖 MLLM、特定模态模型、统一理解生成模型和全模态语言模型）的综合评估，FysicsWorld 揭示了不同模型在理解、生成和推理方面的性能差异和局限性。我们的基准为评估和推进下一代全模态架构建立了统一的基础和强大的基线。

Title: CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence

Authors: Tianjiao Yu, Xinzhuo Li, Yifan Shen, Yuanzhe Liu, Ismini Lourentzou
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12768
Pdf URL: https://arxiv.org/pdf/2512.12768
Copy Paste: [[2512.12768]] CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence(https://arxiv.org/abs/2512.12768)
Keywords: generation
Abstract: Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.
摘要：大型多模态模型的最新进展表明，显式推理机制在提高模型可靠性、可解释性和跨模态对齐方面发挥着关键作用。虽然这种以推理为中心的方法已被证明在语言和视觉任务中有效，但它们向 3D 的扩展仍然不发达。 CoRe3D 引入了统一的 3D 理解和生成推理框架，该框架联合操作语义和空间抽象，使从语言推断出的高级意图能够直接指导低级 3D 内容形成。该设计的核心是基于空间的推理表示，它将 3D 潜在空间分解为局部区域，允许模型以组合和程序的方式对几何进行推理。通过将语义思维链推理与结构化空间推理紧密耦合，CoRe3D 生成的 3D 输出表现出很强的局部一致性以及与语言描述的忠实一致。

Title: Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior

Authors: Hao Wang, Ashish Bastola, Chaoyi Zhou, Wenhui Zhu, Xiwen Chen, Xuanzhao Dong, Siyu Huang, Abolfazl Razi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12774
Pdf URL: https://arxiv.org/pdf/2512.12774
Copy Paste: [[2512.12774]] Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior(https://arxiv.org/abs/2512.12774)
Keywords: generative
Abstract: As generative models become increasingly capable of producing high-fidelity visual content, the demand for efficient, interpretable, and editable image representations has grown substantially. Recent advances in 2D Gaussian Splatting (2DGS) have emerged as a promising solution, offering explicit control, high interpretability, and real-time rendering capabilities (>1000 FPS). However, high-quality 2DGS typically requires post-optimization. Existing methods adopt random or heuristics (e.g., gradient maps), which are often insensitive to image complexity and lead to slow convergence (>10s). More recent approaches introduce learnable networks to predict initial Gaussian configurations, but at the cost of increased computational and architectural complexity. To bridge this gap, we present Fast-2DGS, a lightweight framework for efficient Gaussian image representation. Specifically, we introduce Deep Gaussian Prior, implemented as a conditional network to capture the spatial distribution of Gaussian primitives under different complexities. In addition, we propose an attribute regression network to predict dense Gaussian properties. Experiments demonstrate that this disentangled architecture achieves high-quality reconstruction in a single forward pass, followed by minimal fine-tuning. More importantly, our approach significantly reduces computational cost without compromising visual quality, bringing 2DGS closer to industry-ready deployment.
摘要：随着生成模型生成高保真视觉内容的能力越来越强，对高效、可解释和可编辑图像表示的需求也大幅增长。 2D 高斯泼溅 (2DGS) 的最新进展已成为一种有前景的解决方案，提供显式控制、高可解释性和实时渲染功能（> 1000 FPS）。然而，高质量的 2DGS 通常需要后期优化。现有方法采用随机或启发式（例如梯度图），通常对图像复杂性不敏感并导致收敛速度慢（>10s）。最近的方法引入了可学习网络来预测初始高斯配置，但代价是增加了计算和架构的复杂性。为了弥补这一差距，我们提出了 Fast-2DGS，这是一种用于高效高斯图像表示的轻量级框架。具体来说，我们引入了深度高斯先验，作为条件网络实现，以捕获不同复杂度下高斯基元的空间分布。此外，我们提出了一个属性回归网络来预测密集高斯属性。实验表明，这种解开的架构可以在一次前向传递中实现高质量的重建，然后进行最少的微调。更重要的是，我们的方法在不影响视觉质量的情况下显着降低了计算成本，使 2DGS 更接近行业就绪部署。

Title: Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset

Authors: Atalay Denknalbant, Emre Sezdi, Zeki Furkan Kutlu, Polat Goktas
Subjects: cs.LG, q-fin.ST, stat.AP
Abstract URL: https://arxiv.org/abs/2512.12783
Pdf URL: https://arxiv.org/pdf/2512.12783
Copy Paste: [[2512.12783]] Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset(https://arxiv.org/abs/2512.12783)
Keywords: generation
Abstract: Financial exclusion constrains entrepreneurship, increases income volatility, and widens wealth gaps. Underbanked consumers in Istanbul often have no bureau file because their earnings and payments flow through informal channels. To study how such borrowers can be evaluated we create a synthetic dataset of one hundred thousand Istanbul residents that reproduces first quarter 2025 TÜİK census marginals and telecom usage patterns. Retrieval augmented generation feeds these public statistics into the OpenAI o3 model, which synthesises realistic yet private records. Each profile contains seven socio demographic variables and nine alternative attributes that describe phone specifications, online shopping rhythm, subscription spend, car ownership, monthly rent, and a credit card flag. To test the impact of the alternative financial data CatBoost, LightGBM, and XGBoost are each trained in two versions. Demo models use only the socio demographic variables; Full models include both socio demographic and alternative attributes. Across five fold stratified validation the alternative block raises area under the curve by about 1.3 percentage points and lifts balanced $F_{1}$ from roughly 0.84 to 0.95, a fourteen percent gain. We contribute an open Istanbul 2025 Q1 synthetic dataset, a fully reproducible modeling pipeline, and empirical evidence that a concise set of behavioural attributes can approach bureau level discrimination power while serving borrowers who lack formal credit records. These findings give lenders and regulators a transparent blueprint for extending fair and safe credit access to the underbanked.
摘要：金融排斥限制了创业精神，增加了收入波动，并扩大了贫富差距。伊斯坦布尔银行服务不足的消费者通常没有征信档案，因为他们的收入和付款都是通过非正式渠道流动的。为了研究如何评估此类借款人，我们创建了一个包含 10 万名伊斯坦布尔居民的合成数据集，该数据集再现了 2025 年第一季度 TÜİK 人口普查边际和电信使用模式。检索增强生成将这些公共统计数据输入 OpenAI o3 模型，该模型合成了真实但私密的记录。每个配置文件包含七个社会人口统计变量和九个替代属性，这些属性描述了手机规格、在线购物节奏、订阅支出、汽车拥有量、月租金和信用卡标志。为了测试替代金融数据的影响，CatBoost、LightGBM 和 XGBoost 分别在两个版本中进行训练。演示模型仅使用社会人口统计变量；完整的模型包括社会人口统计和替代属性。经过五倍分层验证，替代块将曲线下面积提高了约 1.3 个百分点，并将平衡 $F_{1}$ 从大约 0.84 提升到 0.95，增益为 14%。我们提供了开放的伊斯坦布尔 2025 年第一季度合成数据集、完全可重复的建模流程以及经验证据，表明一组简明的行为属性可以接近征信机构级别的区分能力，同时为缺乏正式信用记录的借款人提供服务。这些调查结果为贷款机构和监管机构提供了一个透明的蓝图，以向银行服务不足的人群提供公平和安全的信贷服务。

Title: Learning Common and Salient Generative Factors Between Two Image Datasets

Authors: Yunlong He, Gwilherm Lesné, Ziqian Liu, Michaël Soumm, Pietro Gori
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12800
Pdf URL: https://arxiv.org/pdf/2512.12800
Copy Paste: [[2512.12800]] Learning Common and Salient Generative Factors Between Two Image Datasets(https://arxiv.org/abs/2512.12800)
Keywords: generation, generative
Abstract: Recent advancements in image synthesis have enabled high-quality image generation and manipulation. Most works focus on: 1) conditional manipulation, where an image is modified conditioned on a given attribute, or 2) disentangled representation learning, where each latent direction should represent a distinct semantic attribute. In this paper, we focus on a different and less studied research problem, called Contrastive Analysis (CA). Given two image datasets, we want to separate the common generative factors, shared across the two datasets, from the salient ones, specific to only one dataset. Compared to existing methods, which use attributes as supervised signals for editing (e.g., glasses, gender), the proposed method is weaker, since it only uses the dataset signal. We propose a novel framework for CA, that can be adapted to both GAN and Diffusion models, to learn both common and salient factors. By defining new and well-adapted learning strategies and losses, we ensure a relevant separation between common and salient factors, preserving a high-quality generation. We evaluate our approach on diverse datasets, covering human faces, animal images and medical scans. Our framework demonstrates superior separation ability and image quality synthesis compared to prior methods.
摘要：图像合成领域的最新进展使得高质量图像的生成和操作成为可能。大多数工作侧重于：1）条件操作，即根据给定属性修改图像，或 2）解缠结表示学习，其中每个潜在方向应表示不同的语义属性。在本文中，我们关注一个不同的、研究较少的研究问题，称为对比分析（CA）。给定两个图像数据集，我们希望将两个数据集共享的共同生成因素与仅特定于一个数据集的显着因素分开。与使用属性作为编辑监督信号（例如眼镜、性别）的现有方法相比，所提出的方法较弱，因为它仅使用数据集信号。我们提出了一种新颖的 CA 框架，可以适用于 GAN 和扩散模型，以学习常见因素和显着因素。通过定义新的、适应性强的学习策略和损失，我们确保常见因素和显着因素之间的相关分离，从而保留高质量的一代。我们在不同的数据集上评估我们的方法，涵盖人脸、动物图像和医学扫描。与之前的方法相比，我们的框架表现出卓越的分离能力和图像质量合成。

Title: On the continuity of flows

Authors: Congzhou M Sha
Subjects: cs.LG, cs.AI, physics.data-an
Abstract URL: https://arxiv.org/abs/2512.12821
Pdf URL: https://arxiv.org/pdf/2512.12821
Copy Paste: [[2512.12821]] On the continuity of flows(https://arxiv.org/abs/2512.12821)
Keywords: generative
Abstract: Flow matching has emerged as a powerful framework for generative modeling through continuous normalizing flows. We investigate a potential topological constraint: when the prior distribution and target distribution have mismatched topology (e.g., unimodal to multimodal), the optimal velocity field under standard flow matching objectives may exhibit spatial discontinuities. We suggest that this discontinuity arises from the requirement that continuous flows must bifurcate to map a single mode to multiple modes, forcing particles to make discrete routing decisions at intermediate times. Through theoretical analysis on bimodal Gaussian mixtures, we demonstrate that the optimal velocity field exhibits jump discontinuities along decision boundaries, with magnitude approaching infinity as time approaches the target distribution. Our analysis suggests that this phenomenon is not specific to $L^2$ loss, but rather may be a consequence of topological mismatch between distributions. We validate our theory empirically and discuss potential implications for flow matching on manifolds, connecting our findings to recent work on Riemannian flow matching and the challenge of learning discontinuous representations in neural networks.
摘要：流匹配已成为通过连续标准化流进行生成建模的强大框架。我们研究了潜在的拓扑约束：当先验分布和目标分布具有不匹配的拓扑（例如，单峰到多峰）时，标准流匹配目标下的最佳速度场可能表现出空间不连续性。我们认为这种不连续性源于连续流必须分叉以将单一模式映射到多个模式的要求，迫使粒子在中间时间做出离散的路由决策。通过对双峰高斯混合的理论分析，我们证明最佳速度场沿着决策边界表现出跳跃不连续性，随着时间接近目标分布，其幅度接近无穷大。我们的分析表明，这种现象并不是 $L^2$ 损失特有的，而可能是分布之间拓扑不匹配的结果。我们根据经验验证了我们的理论，并讨论了流形上流匹配的潜在影响，将我们的发现与黎曼流匹配的最新工作以及学习神经网络中不连续表示的挑战联系起来。
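
Illustrative sketch (not from the paper): the steepening of the velocity field near the target time can be checked in a one-dimensional toy case. The code below assumes a standard normal prior, a target made of two point masses at +a and -a (a limiting case of a bimodal Gaussian mixture), and the linear interpolation path x_t = (1 - t) x_0 + t x_1. These modelling choices and all function names are assumptions made here for illustration. In this toy case the marginal velocity E[x_1 - x_0 | x_t = x] has the closed form (a tanh(t a x / (1 - t)^2) - x) / (1 - t), which is smooth for every t < 1 but sharpens into a step across the decision boundary x = 0 whose height grows like 2a / (1 - t).

```python
import math


def optimal_velocity(x: float, t: float, a: float = 1.0) -> float:
    """Marginal velocity E[x1 - x0 | x_t = x] for x_t = (1 - t) * x0 + t * x1.

    Toy assumptions (not the paper's setup): x0 ~ N(0, 1) and x1 is +a or -a
    with equal probability.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    # Posterior mean of the target given x_t, from Bayes' rule on the two modes.
    posterior_mean_x1 = a * math.tanh(t * a * x / (1.0 - t) ** 2)
    # x0 = (x_t - t * x1) / (1 - t), so the velocity is (E[x1 | x_t] - x_t) / (1 - t).
    return (posterior_mean_x1 - x) / (1.0 - t)


def gap_across_boundary(t: float, a: float = 1.0, eps: float = 1e-3) -> float:
    """Velocity difference between two points just right and left of x = 0."""
    return optimal_velocity(eps, t, a) - optimal_velocity(-eps, t, a)


if __name__ == "__main__":
    for t in (0.5, 0.9, 0.99):
        print(f"t={t}: gap across x=0 is {gap_across_boundary(t):.3f}")
```

The printed gap stays small at intermediate times and approaches 2a / (1 - t) as t nears 1, which is the kind of routing decision at the boundary that the abstract describes.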

Title: Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners

Authors: N.K.B.M.P.K.B. Narasinghe, Uthayasanker Thayasivam
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12824
Pdf URL: https://arxiv.org/pdf/2512.12824
Copy Paste: [[2512.12824]] Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners(https://arxiv.org/abs/2512.12824)
Keywords: generative
Abstract: Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.
摘要：大规模多模态基础模型，特别是对比字幕（CoCa），通过将对比对齐与生成字幕相结合，已经取得了最先进的结果。虽然零样本迁移能力已得到充分证明，但这些生成对比混合体对数据极度稀缺（少样本学习）的下游任务的适应仍有待探索。现有文献主要关注 CLIP 等双编码器架构，在理解 CoCa 独特的潜在空间如何响应参数高效微调 (PEFT) 方面留下了空白。本文提出了一项关于采用 CoCa 视觉主干进行少镜头图像分类的全面实证研究。我们系统地评估了一系列策略，从免训练混合原型到通过低秩适应（LoRA）进行深度参数适应。首先，我们确定了“增强分歧”：虽然强数据增强会降低低样本设置中线性探测的性能，但它对于稳定 LoRA 微调至关重要。我们还证明，与标准交叉熵相比，结合监督对比 (SupCon) 损失的混合目标在不同的射击次数下产生一致的性能改进。至关重要的是，我们描述了训练配置对数据稀缺性的敏感性，为缩放正则化、排名和采样策略提供经验参考设置，以促进生成对比基础模型的有效适应。

Title: Network Level Evaluation of Hangup Susceptibility of HRGCs using Deep Learning and Sensing Techniques: A Goal Towards Safer Future

Authors: Kaustav Chatterjee, Joshua Li, Kundan Parajulee, Jared Schwennesen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12832
Pdf URL: https://arxiv.org/pdf/2512.12832
Copy Paste: [[2512.12832]] Network Level Evaluation of Hangup Susceptibility of HRGCs using Deep Learning and Sensing Techniques: A Goal Towards Safer Future(https://arxiv.org/abs/2512.12832)
Keywords: generation
Abstract: Steep-profiled Highway Railway Grade Crossings (HRGCs) pose safety hazards to vehicles with low ground clearance, which may become stranded on the tracks, creating risks of train-vehicle collisions. This research develops a framework for network-level evaluation of hangup susceptibility of HRGCs. Profile data from different crossings in Oklahoma were collected using both a walking profiler and the Pave3D8K Laser Imaging System. A hybrid deep learning model, combining Long Short-Term Memory (LSTM) and Transformer architectures, was developed to reconstruct accurate HRGC profiles from Pave3D8K Laser Imaging System data. Vehicle dimension data from around 350 specialty vehicles were collected at various locations across Oklahoma to enable up-to-date statistical design dimensions. Hangup susceptibility was analyzed using three vehicle dimension scenarios: (a) median dimension (median wheelbase and ground clearance), (b) 75th/25th percentile dimension (75th percentile wheelbase, 25th percentile ground clearance), and (c) worst-case dimension (maximum wheelbase and minimum ground clearance). Results indicate 36, 62, and 67 crossings at the highest hangup risk levels under these scenarios, respectively. An ArcGIS database and a software interface were developed to support transportation agencies in mitigating crossing hazards. This framework advances safety evaluation by integrating next-generation sensing, deep learning, and infrastructure datasets into practical decision support tools.
摘要：陡峭的公路铁路平交道口 (HRGC) 对离地间隙较低的车辆构成安全隐患，车辆可能会滞留在轨道上，从而产生火车车辆相撞的风险。本研究开发了 HRGC 挂断敏感性的网络级评估框架。使用步行轮廓仪和 Pave3D8K 激光成像系统收集俄克拉荷马州不同交叉口的轮廓数据。开发了一种结合了长短期记忆 (LSTM) 和 Transformer 架构的混合深度学习模型，用于根据 Pave3D8K 激光成像系统数据重建准确的 HRGC 轮廓。在俄克拉荷马州的不同地点收集了约 350 辆特种车辆的车辆尺寸数据，以实现最新的统计设计尺寸。使用三种车辆尺寸情景分析挂起敏感性：(a) 中值尺寸（中值轴距和离地间隙）、(b) 75 25 百分位尺寸（75 百分位轴距、25 百分位离地间隙）和 (c) 最坏情况尺寸（最大轴距和最小离地间隙）。结果表明，在这些情景下，分别有 36 个、62 个和 67 个交叉口处于最高的挂断风险水平。开发了 ArcGIS 数据库和软件界面，以支持运输机构减轻交叉危险。该框架通过将下一代传感、深度学习和基础设施数据集集成到实际的决策支持工具中来推进安全评估。

Title: Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Authors: Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12858
Pdf URL: https://arxiv.org/pdf/2512.12858
Copy Paste: [[2512.12858]] Information-Consistent Language Model Recommendations through Group Relative Policy Optimization(https://arxiv.org/abs/2512.12858)
Keywords: generation, generative
Abstract: Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios (such as HR onboarding, customer support, or policy disclosure) require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to directly optimize for consistency. Unlike prior applications of GRPO, which have been limited to reasoning and code generation, we adapt GRPO to enforce stability of information content across groups of semantically equivalent prompts. We introduce entropy-based helpfulness and stability rewards, treating prompt variants as groups and resetting conversational context to isolate phrasing effects. Experiments on investment and job recommendation tasks show that our GRPO-trained model reduces variability more effectively than fine-tuning or decoding-based baselines. To our knowledge, this is a novel application of GRPO for aligning LLMs toward information consistency, reframing variability not as an acceptable feature of generative diversity but as a correctable flaw in enterprise deployments.
摘要：大型语言模型 (LLM) 越来越多地部署在金融、教育、医疗保健和客户支持等关键业务领域，用户期望在这些领域获得一致且可靠的建议。然而，当提示的措辞有微小差异时，法学硕士经常表现出可变性，即使在语义上是相同的。这种不一致会破坏信任，使合规性变得复杂，并破坏用户体验。虽然个性化在某些情况下是可取的，但许多企业场景（例如人力资源入职、客户支持或政策披露）需要不变的信息传递，无论措辞或先前的对话历史如何。现有的方法，包括检索增强生成（RAG）和温度调节，可以提高事实性或降低随机性，但不能保证等效提示的稳定性。在本文中，我们提出了一种基于组相对策略优化（GRPO）的强化学习框架来直接优化一致性。与 GRPO 的先前应用程序（仅限于推理和代码生成）不同，我们采用 GRPO 来增强语义等效提示组中信息内容的稳定性。我们引入基于熵的帮助性和稳定性奖励，将提示变体视为组并重置对话上下文以隔离措辞效果。投资和工作推荐任务的实验表明，我们的 GRPO 训练模型比微调或基于解码的基线更有效地减少变异性。据我们所知，这是 GRPO 的一种新颖应用，用于使法学硕士朝着信息一致性的方向调整，将可变性重新定义为企业部署中可纠正的缺陷，而不是生成多样性的可接受特征。
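
Illustrative sketch (not from the paper): the core of GRPO is a group-relative advantage, in which each response's reward is standardized against the other rewards in its group. The snippet below shows only that normalization step, with each group imagined as the responses to a set of semantically equivalent prompt variants. The reward values are made up, the function name is hypothetical, and the paper's entropy-based helpfulness and stability rewards are not reproduced here. Using the population standard deviation is also a choice made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sketch-level choice
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for three responses to paraphrases of one question.
    print(group_relative_advantages([0.9, 0.5, 0.1]))
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one. A group with identical rewards yields all-zero advantages and therefore no learning signal.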

Title: SignRAG: A Retrieval-Augmented System for Scalable Zero-Shot Road Sign Recognition

Authors: Minghao Zhu, Zhihao Zhang, Anmol Sidhu, Keith Redmill
Subjects: cs.CV, cs.AI, cs.CL, cs.IR, cs.RO
Abstract URL: https://arxiv.org/abs/2512.12885
Pdf URL: https://arxiv.org/pdf/2512.12885
Copy Paste: [[2512.12885]] SignRAG: A Retrieval-Augmented System for Scalable Zero-Shot Road Sign Recognition(https://arxiv.org/abs/2512.12885)
Keywords: generation
Abstract: Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework's effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.
摘要：自动道路标志识别是智能交通系统的一项关键任务，但传统的深度学习方法面临着标志类别数量庞大以及创建详尽标记数据集的不切实际的问题。本文介绍了一种新颖的零样本识别框架，该框架采用检索增强生成（RAG）范式来应对这一挑战。我们的方法首先使用视觉语言模型（VLM）从输入图像生成符号的文本描述。此描述用于从参考设计矢量数据库中检索一小组最相关的候选标志。随后，大型语言模型 (LLM) 对检索到的候选者进行推理，以做出最终的细粒度识别。我们在俄亥俄州 MUTCD 的一整套 303 个监管标志上验证了这种方法。实验结果证明了该框架的有效性，在理想参考图像上实现了 95.58% 的准确率，在具有挑战性的现实世界道路数据上实现了 82.45% 的准确率。这项工作展示了基于 RAG 的架构的可行性，无需特定任务的培训即可创建可扩展且准确的路标识别系统。
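
Illustrative sketch (not from the paper): the retrieval stage described above can be pictured as a nearest-neighbour lookup over embedded reference designs. The toy code below ranks candidates by cosine similarity and keeps the top k. The three-dimensional vectors, the sign names, and the function names are all invented for illustration. A real system would embed the VLM's textual description with a text-embedding model and then pass the shortlisted candidates to an LLM, and neither step is shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query, database, k=2):
    """Return the names of the k reference entries most similar to the query."""
    ranked = sorted(
        database.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical reference embeddings standing in for a vector database.
REFERENCE_SIGNS = {
    "stop": [1.0, 0.0, 0.0],
    "yield": [0.8, 0.6, 0.0],
    "speed_limit": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    print(retrieve_top_k([0.9, 0.1, 0.0], REFERENCE_SIGNS, k=2))
```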

Title: Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

Authors: Han Liu, Bogdan Georgescu, Yanbo Zhang, Youngjin Yoo, Michael Baumgartner, Riqiang Gao, Jianing Wang, Gengyan Zhao, Eli Gibson, Dorin Comaniciu, Sasa Grbic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12887
Pdf URL: https://arxiv.org/pdf/2512.12887
Copy Paste: [[2512.12887]] Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification(https://arxiv.org/abs/2512.12887)
Keywords: generation
Abstract: 3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.
摘要：3D 医学图像分类对于现代临床工作流程至关重要。医学基础模型（FM）已成为一种有前景的扩展到新任务的方法，但目前的研究存在三个关键缺陷：数据机制偏差、次优适应和任务覆盖范围不足。在本文中，我们解决了这些陷阱并介绍了 AnyMC3D，这是一种改编自 2D FM 的可扩展 3D 分类器。我们的方法通过在单个冻结主干上仅添加轻量级插件（每个任务大约 1M 个参数）来有效地扩展到新任务。这个多功能框架还支持多视图输入、辅助像素级监督和可解释的热图生成。我们建立了涵盖不同病理学、解剖学和模式的 12 项任务的综合基准，并系统地分析了最先进的 3D 分类技术。我们的分析揭示了关键见解：(1) 有效的适应对于释放 FM 潜力至关重要，(2) 如果适当适应，通用 FM 可以与医疗专用 FM 相匹配，(3) 基于 2D 的方法超越 3D 分类的 3D 架构。我们首次证明了使用单一可扩展框架（包括 VLM3D 挑战赛中的第一名）在不同应用程序中实现最先进性能的可行性，从而无需单独的特定于任务的模型。

Title: Distillation of Discrete Diffusion by Exact Conditional Distribution Matching

Authors: Yansong Gao, Yu Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12889
Pdf URL: https://arxiv.org/pdf/2512.12889
Copy Paste: [[2512.12889]] Distillation of Discrete Diffusion by Exact Conditional Distribution Matching(https://arxiv.org/abs/2512.12889)
Keywords: generative
Abstract: Discrete diffusion models (DDMs) are a powerful class of generative models for categorical data, but they typically require many function evaluations for a single sample, making inference expensive. Existing acceleration methods either rely on approximate simulators, such as $\tau$-leaping, or on distillation schemes that train new student models and auxiliary networks with proxy objectives. We propose a simple and principled distillation alternative based on conditional distribution matching. Our key observation is that the reverse conditional distribution of clean data given a noisy state, $p_{0\mid t}(x_0 \mid x_t)$, admits a Markov decomposition through intermediate times and can be recovered from marginal density ratios and the known forward CTMC kernel. We exploit this structure to define distillation objectives that directly match conditional distributions between a pre-trained teacher and a low-NFE student, both for one-step and few-step samplers.
摘要：离散扩散模型 (DDM) 是一类功能强大的分类数据生成模型，但它们通常需要对单个样本进行多次函数评估，从而导致推理成本高昂。现有的加速方法要么依赖于近似模拟器，例如 $\tau$-leaping，要么依赖于训练新学生模型和具有代理目标的辅助网络的蒸馏方案。我们提出了一种基于\emph{条件分布匹配}的简单且有原则的蒸馏替代方案。我们的主要观察结果是，给定噪声状态的干净数据的反向条件分布 $p_{0\mid t}(x_0 \mid x_t)$ 允许通过中间时间进行马尔可夫分解，并且可以从边际密度比和已知的前向 CTMC 内核中恢复。我们利用这种结构来定义蒸馏目标，直接匹配经过预训练的教师和低 NFE 学生之间的条件分布，无论是一步采样还是少步采样。
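
Illustrative note (not from the paper): one way to write out the decomposition mentioned above, for an intermediate time $0 \le s \le t$, is sketched below. The symbol $q_{t\mid s}$ for the forward CTMC kernel and $p_s$, $p_t$ for the marginals are notation assumed here, and the paper's exact formulation may differ. The first line uses the Markov property (given $x_s$, the clean sample $x_0$ is independent of $x_t$). The second is Bayes' rule, which is where the marginal density ratio and the forward kernel appear.

```latex
\begin{align}
p_{0\mid t}(x_0 \mid x_t)
  &= \sum_{x_s} p_{0\mid s}(x_0 \mid x_s)\, p_{s\mid t}(x_s \mid x_t), \\
p_{s\mid t}(x_s \mid x_t)
  &= q_{t\mid s}(x_t \mid x_s)\, \frac{p_s(x_s)}{p_t(x_t)}.
\end{align}
```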

Title: Wait, Wait, Wait... Why Do Reasoning Models Loop?

Authors: Charilaos Pipis, Shivam Garg, Vasilis Kontonis, Vaishnavi Shrivastava, Akshay Krishnamurthy, Dimitris Papailiopoulos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12895
Pdf URL: https://arxiv.org/pdf/2512.12895
Copy Paste: [[2512.12895]] Wait, Wait, Wait... Why Do Reasoning Models Loop?(https://arxiv.org/abs/2512.12895)
Keywords: generation
Abstract: Reasoning models (e.g., DeepSeek-R1) generate long chains of thought to solve harder problems, but they often loop, repeating the same text at low temperatures or with greedy decoding. We study why this happens and what role temperature plays. With open reasoning models, we find that looping is common at low temperature. Larger models tend to loop less, and distilled students loop significantly even when their teachers rarely do. This points to mismatches between the training distribution and the learned model, which we refer to as errors in learning, as a key cause. To understand how such errors cause loops, we introduce a synthetic graph reasoning task and demonstrate two mechanisms. First, risk aversion caused by hardness of learning: when the correct progress-making action is hard to learn but an easy cyclic action is available, the model puts relatively more probability on the cyclic action and gets stuck. Second, even when there is no hardness, Transformers show an inductive bias toward temporally correlated errors, so the same few actions keep being chosen and loops appear. Higher temperature reduces looping by promoting exploration, but it does not fix the errors in learning, so generations remain much longer than necessary at high temperature; in this sense, temperature is a stopgap rather than a holistic solution. We end with a discussion of training-time interventions aimed at directly reducing errors in learning.
摘要：推理模型（例如 DeepSeek-R1）会生成长的思想链来解决更困难的问题，但它们经常循环，在低温或贪婪解码下重复相同的文本。我们研究为什么会发生这种情况以及温度扮演什么角色。通过开放推理模型，我们发现循环在低温下很常见。较大的模型往往循环较少，而经过精炼的学生循环显着，即使他们的老师很少这样做。这表明训练分布与学习模型之间的不匹配（我们将其称为学习错误）是一个关键原因。为了理解此类错误如何导致循环，我们引入了合成图推理任务并演示了两种机制。首先，学习难度造成的风险厌恶：当正确的进步动作很难学习，但有一个简单的循环动作可用时，模型会将相对较多的概率放在循环动作上，并陷入困境。其次，即使没有困难，变形金刚也会表现出对时间相关错误的归纳偏差，因此不断选择相同的少数动作并出现循环。较高的温度通过促进探索来减少循环，但它不能修复学习中的错误，因此在高温下世代保持的时间比所需的时间长得多；从这个意义上说，温度只是权宜之计，而不是整体解决方案。最后我们讨论了旨在直接减少学习错误的训练时间干预措施。
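
Illustrative sketch (not from the paper): the role of temperature can be seen in a two-action toy example. Sampling at temperature T divides the logits by T before the softmax, so a small logit advantage for one action becomes near-certainty at low temperature and shrinks toward a coin flip at high temperature. The logits below are made up. Index 0 stands for a hypothetical "cyclic" action that the model slightly over-weights, and index 1 for the progress-making action.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed with the usual max-shift for stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    shift = max(scaled)
    exps = [math.exp(value - shift) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0]  # hypothetical: [cyclic action, progress-making action]
    for temperature in (0.1, 1.0, 2.0):
        probs = softmax_with_temperature(logits, temperature)
        print(f"T={temperature}: P(cyclic)={probs[0]:.4f}")
```

Raising T makes the progress-making action more likely to be sampled but leaves the logits unchanged, consistent with the abstract's point that temperature does not fix the underlying errors in learning.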

Title: Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution

Authors: Abhinav Kumar, Tristan Aumentado-Armstrong, Lazar Valkov, Gopal Sharma, Alex Levinshtein, Radek Grzeszczuk, Suren Kumar
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12898
Pdf URL: https://arxiv.org/pdf/2512.12898
Copy Paste: [[2512.12898]] Qonvolution: Towards Learning High-Frequency Signals with Queried Convolution(https://arxiv.org/abs/2512.12898)
Keywords: super-resolution
Abstract: Accurately learning high-frequency signals is a challenge in computer vision and graphics, as neural networks often struggle with these signals due to spectral bias or optimization difficulties. While current techniques like Fourier encodings have made great strides in improving performance, there remains scope for improvement when presented with high-frequency information. This paper introduces Queried-Convolutions (Qonvolutions), a simple yet powerful modification using the neighborhood properties of convolution. Qonvolution convolves a low-frequency signal with queries (such as coordinates) to enhance the learning of intricate high-frequency signals. We empirically demonstrate that Qonvolutions enhance performance across a variety of high-frequency learning tasks crucial to both the computer vision and graphics communities, including 1D regression, 2D super-resolution, 2D image regression, and novel view synthesis (NVS). In particular, by combining Gaussian splatting with Qonvolutions for NVS, we showcase state-of-the-art performance on real-world complex scenes, even outperforming powerful radiance field models on image quality.
摘要：准确学习高频信号是计算机视觉和图形领域的一个挑战，因为神经网络经常由于频谱偏差或优化困难而难以处理这些信号。虽然傅里叶编码等当前技术在提高性能方面取得了长足进步，但在呈现高频信息时仍有改进的空间。本文介绍了查询卷积（Qonvolutions），这是一种利用卷积邻域属性的简单而强大的修改。 Qonvolution 将低频信号与查询（例如坐标）进行卷积，以增强复杂高频信号的学习。我们凭经验证明，Qonvolutions 可以提高对计算机视觉和图形社区至关重要的各种高频学习任务的性能，包括 1D 回归、2D 超分辨率、2D 图像回归和新视图合成 (NVS)。特别是，通过将高斯泼溅与 NVS 的 Qonvolutions 相结合，我们在现实世界的复杂场景中展示了最先进的性能，甚至在图像质量上超越了强大的辐射场模型。

Title: Next-generation reservoir computing validated by classification task

Authors: Ken-ichi Kitayama
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12903
Pdf URL: https://arxiv.org/pdf/2512.12903
Copy Paste: [[2512.12903]] Next-generation reservoir computing validated by classification task(https://arxiv.org/abs/2512.12903)
Keywords: generation
Abstract: An emerging computing paradigm, so-called next-generation reservoir computing (NG-RC), is investigated. True to its name, NG-RC requires no actual reservoir for input data mixing but instead computes the polynomial terms directly from the time-series inputs. However, the benchmark tests reported so far have been one-sided, limited to prediction tasks on temporal waveforms such as the Lorenz 63 attractor and the Mackey-Glass chaotic signal. We demonstrate for the first time that NG-RC can perform a classification task as well as conventional RC. This validates the versatile computational capability of NG-RC in both prediction and classification tasks.
摘要：研究了一种新兴的计算范式，即所谓的下一代水库计算（NG-RC）。顾名思义，NG-RC 不需要实际的存储库来混合输入数据，而是直接从时间序列输入计算多项式项。然而，迄今为止报道的基准测试都是片面的，仅限于 Lorenz 63 吸引子和 Mackey-Glass 混沌信号等时间波形的预测任务。我们将首次证明 NG-RC 可以像传统 RC 一样执行分类任务。这验证了 NG-RC 在预测和分类任务中的多功能计算能力。
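
Illustrative sketch (not from the paper): the idea of computing polynomial terms directly from the time-series inputs can be made concrete with a small feature builder. For a scalar series, the code below stacks a constant, the current and delayed samples, and all monomials of those samples up to a chosen degree. The function name, the number of delays, and the degree are assumptions for illustration. The trained linear readout that would map these features to a prediction or class label is omitted.

```python
from itertools import combinations_with_replacement


def ngrc_features(series, t, delays=2, degree=2):
    """Feature vector at index t: [1, x_t, x_{t-1}, ..., monomials up to `degree`]."""
    if t < delays - 1:
        raise ValueError("not enough history for the requested number of delays")
    linear = [series[t - d] for d in range(delays)]
    features = [1.0] + linear
    for deg in range(2, degree + 1):
        # Every unordered product of `deg` linear terms, e.g. x_t^2, x_t * x_{t-1}, ...
        for combo in combinations_with_replacement(linear, deg):
            product = 1.0
            for value in combo:
                product *= value
            features.append(product)
    return features


if __name__ == "__main__":
    print(ngrc_features([1.0, 2.0, 3.0], t=2))
```

With two delays and degree 2 this gives six features per time step. No recurrent reservoir state is kept, so the only quantities to fit are the readout weights.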

Title: Few-Step Distillation for Text-to-Image Generation: A Practical Guide

Authors: Yifan Pu, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Fan Wang, Bohan Zhuang, Gao Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13006
Pdf URL: https://arxiv.org/pdf/2512.13006
Copy Paste: [[2512.13006]] Few-Step Distillation for Text-to-Image Generation: A Practical Guide(https://arxiv.org/abs/2512.13006)
Keywords: generation
Abstract: Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pretrained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available on this http URL.
摘要：扩散蒸馏极大地加速了类条件图像合成，但其在开放式文本到图像（T2I）生成中的适用性仍不清楚。我们提出了第一个系统研究，该研究在强大的 T2I 教师模型 FLUX.1-lite 上采用和比较了最先进的蒸馏技术。通过将现有方法转化为统一的框架，我们确定了从离散类标签转向自由形式语言提示时出现的关键障碍。除了彻底的方法分析之外，我们还提供有关输入缩放、网络架构和超参数的实用指南，并附有开源实现和预训练的学生模型。我们的研究结果为在现实世界的 T2I 应用中部署快速、高保真且资源高效的扩散发生器奠定了坚实的基础。代码可在此 http URL 上找到。

Title: JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion

Authors: Haoyu Wang, Lei Zhang, Wenrui Liu, Dengyang Jiang, Wei Wei, Chen Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13014
Pdf URL: https://arxiv.org/pdf/2512.13014
Copy Paste: [[2512.13014]] JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion(https://arxiv.org/abs/2512.13014)
Keywords: generation, generative
Abstract: Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods need to either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problems. To mitigate both problems at once, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. Firstly, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing so, JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.
摘要：考虑到像素级注释固有的成本高昂且耗时的性质，包含足够多样化的合成图像与真实像素级注释配对的合成数据集的生成最近在训练高性能语义分割模型方面引起了越来越多的关注。然而，现有方法需要在图像生成后预测伪注释或生成以手动注释掩模为条件的图像，这会导致图像注释语义不一致或可扩展性问题。为了用一块石头解决这两个问题，我们提出了一种用于语义分割的新型数据集生成扩散框架，称为 JoDiffusion。首先，给定一个标准的潜在扩散模型，JoDiffusion 结合了一个独立的注释变分自动编码器（VAE）网络，将注释掩模映射到图像共享的潜在空间中。然后，定制扩散模型以捕获每个图像的联合分布及其以文本提示为条件的注释掩模。通过执行这些操作，JoDiffusion 能够同时生成配对图像和仅以文本提示为条件的语义一致的注释蒙版，从而展示出卓越的可扩展性。此外，还开发了掩码优化策略来减轻生成过程中产生的注释噪声。在 Pascal VOC、COCO 和 ADE20K 数据集上的实验表明，与现有方法相比，JoDiffusion 生成的带注释数据集在语义分割方面产生了显着的性能改进。

Title: What Happens Next? Next Scene Prediction with a Unified Video Model

Authors: Xinjie Li, Zhimin Chen, Rui Zhao, Florian Schiffers, Zhenyu Liao, Vimal Bhat
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13015
Pdf URL: https://arxiv.org/pdf/2512.13015
Copy Paste: [[2512.13015]] What Happens Next? Next Scene Prediction with a Unified Video Model(https://arxiv.org/abs/2512.13015)
Keywords: generation
Abstract: Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
摘要：最近用于联合理解和生成的统一模型具有显着先进的视觉生成能力。然而，他们对文本到视频生成等传统任务的关注使得统一模型的时间推理潜力在很大程度上未被充分开发。为了解决这一差距，我们引入了下一场景预测（NSP），这是一项将统一视频模型推向时间和因果推理的新任务。与文本到视频的生成不同，NSP 需要从先前的上下文中预测可能的未来，需要更深入的理解和推理。为了解决这个任务，我们提出了一个统一的框架，结合了用于理解的 Qwen-VL 和用于合成的 LTX，并通过潜在查询嵌入和连接器模块进行桥接。该模型在我们新策划的大规模 NSP 数据集上分三个阶段进行训练：文本到视频预训练、监督微调和强化学习（通过 GRPO）以及我们提出的因果一致性奖励。实验表明，我们的模型在基准测试中实现了最先进的性能，提高了通用多模态系统预测接下来发生的情况的能力。

Title: SneakPeek: Future-Guided Instructional Streaming Video Generation

Authors: Cheeun Hong, German Barquero, Fadime Sener, Markos Georgopoulos, Edgar Schönfeld, Stefan Popov, Yuming Du, Oscar Mañas, Albert Pumarola
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13019
Pdf URL: https://arxiv.org/pdf/2512.13019
Copy Paste: [[2512.13019]] SneakPeek: Future-Guided Instructional Streaming Video Generation(https://arxiv.org/abs/2512.13019)
Keywords: generation
Abstract: Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.
摘要：教学视频生成是一项新兴任务，旨在从文本描述中合成程序活动的连贯演示。这种能力对内容创建、教育和人机交互具有广泛的影响，但现有的视频传播模型很难在多个动作步骤的长序列中保持时间一致性和可控性。我们引入了一种面向未来的流式教学视频生成管道，称为 SneakPeek，这是一种基于扩散的自回归框架，旨在生成以初始图像和结构化文本提示为条件的精确、逐步的教学视频。我们的方法引入了三个关键创新来增强一致性和可控性：（1）预测因果适应，其中因果模型学习执行下一帧预测并预测未来的关键帧； (2) 未来引导的自强制，采用双区域 KV 缓存方案，解决推理时的暴露偏差问题； (3)多提示调节，它提供对多步指令的细粒度和程序控制。这些组件共同减轻时间漂移，保持运动一致性，并支持交互式视频生成，其中未来的提示更新会动态影响正在进行的流视频生成。实验结果表明，我们的方法产生时间连贯且语义忠实的教学视频，准确地遵循复杂的多步骤任务描述。

Title: Motus: A Unified Latent Action World Model

Authors: Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2512.13030
Pdf URL: https://arxiv.org/pdf/2512.13030
Copy Paste: [[2512.13030]] Motus: A Unified Latent Action World Model(https://arxiv.org/abs/2512.13030)
Keywords: generation, generative
Abstract: While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improved by +11~48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
摘要：虽然通用的实体代理必须作为一个统一的系统发挥作用，但当前的方法是建立在用于理解、世界建模和控制的孤立模型之上的。这种碎片化阻碍了多模式生成能力的统一，并阻碍了从大规模异构数据中进行学习。在本文中，我们提出了 Motus，这是一种统一的潜在动作世界模型，它利用现有的通用预训练模型和丰富的、可共享的运动信息。 Motus引入了Mixture-of-Transformer（MoT）架构来集成三个专家（即理解、视频生成和动作），并采用UniDiffuser式调度器来实现不同建模模式（即世界模型、视觉-语言-动作模型、逆动力学模型、视频生成模型和视频-动作联合预测模型）之间的灵活切换。 Motus进一步利用光流来学习潜在动作，并采用三相训练管道和六层数据金字塔的配方，从而提取像素级的“增量动作”并实现大规模动作预训练。实验表明，Motus 在模拟（比 X-VLA 提高了 +15%，比 Pi0.5 提高了 +45%）和现实场景（提高了 +11~48%）方面均实现了优于最先进方法的性能，证明所有功能和先验的统一建模显着有利于下游机器人任务。

Title: Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models

Authors: Hao Chen, Yiwei Wang, Songze Li
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2512.13039
Pdf URL: https://arxiv.org/pdf/2512.13039
Copy Paste: [[2512.13039]] Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models(https://arxiv.org/abs/2512.13039)
Keywords: generation
Abstract: Concept erasure, which fine-tunes diffusion models to remove undesired or harmful visual concepts, has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image models. However, existing removal methods typically adopt a unidirectional erasure strategy by either suppressing the target concept or reinforcing safe alternatives, making it difficult to achieve a balanced trade-off between concept removal and generation quality. To address this limitation, we propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously. Specifically, based on the joint representation of text prompts and corresponding images, Bi-Erasing introduces two decoupled image branches: a negative branch responsible for suppressing harmful semantics and a positive branch providing visual guidance for safe alternatives. By jointly optimizing these complementary directions, our approach achieves a balance between erasure efficacy and generation usability. In addition, we apply mask-based filtering to the image branches to prevent interference from irrelevant content during the erasure process. Across extensive experimental evaluations, the proposed Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity.
摘要：概念擦除通过微调扩散模型来去除不需要的或有害的视觉概念，已成为减轻文本到图像此http URL中不安全或非法图像生成的主流方法，现有的去除方法通常采用单向擦除策略，通过抑制目标概念或增强安全替代方案，从而难以在概念去除和生成质量之间实现平衡权衡。为了解决这个限制，我们提出了一种新颖的双向图像引导概念擦除（Bi-Erasing）框架，该框架可以同时执行概念抑制和安全增强。具体来说，基于文本提示和相应图像的联合表示，Bi-Erasing 引入了两个解耦的图像分支：一个负责抑制有害语义的负分支和一个为安全替代方案提供视觉指导的正分支。通过共同优化这些互补方向，我们的方法实现了擦除效率和生成可用性之间的平衡。此外，我们对图像分支应用基于掩模的过滤，以防止擦除过程中不相关内容的干扰。通过广泛的实验评估，所提出的双向擦除在平衡概念删除有效性和视觉保真度方面优于基线方法。

Title: Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models

Authors: Zizhi Chen, Yizhen Gao, Minghao Han, Yizhou Liu, Zhaoyu Chen, Dingkang Yang, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13072
Pdf URL: https://arxiv.org/pdf/2512.13072
Copy Paste: [[2512.13072]] Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models(https://arxiv.org/abs/2512.13072)
Keywords: generation
Abstract: Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.
摘要：多模态生物医学视觉语言模型（VLM）在持续学习（CL）领域展现出巨大的潜力。然而，他们面临一个核心困境：如何保留细粒度的模态内特征，同时弥合不同模态之间的显着领域差距。为了应对这一挑战，我们提出了一个全面的框架。利用源自 PubMed 科学论文的 1800 万多模态综合医学检索数据库，我们率先将检索增强生成 (RAG) 集成到 CL 中。具体来说，我们采用多模式、多层 RAG 系统，通过动态、按需知识检索为模型微调提供实时指导。在此基础上，我们引入了动态知识蒸馏框架。该框架通过根据所需的详细程度动态调整参数空间的重要性、提取知识的粒度以及参考数据集的数据分布，精确地解决了上述核心困境。为了彻底验证我们策略的临床价值，我们设计了更严格的\textbf{M}医学通才任务增量学习（MGTIL）基准。该基准旨在同时评估模型适应重大领域变化、保留微妙的域内特征以及实时学习新颖且复杂的医疗任务的能力。大量的实验结果表明，我们提出的方法在所有指标上都实现了最先进的（SOTA）性能。该代码在补充材料中提供。

Title: Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather

Authors: Zhijian He, Feifei Liu, Yuwei Li, Zhanpeng Liu, Jintao Cheng, Xieyuanli Chen, Xiaoyu Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13107
Pdf URL: https://arxiv.org/pdf/2512.13107
Copy Paste: [[2512.13107]] Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather(https://arxiv.org/abs/2512.13107)
Keywords: restoration
Abstract: Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR, which restores images degraded by weather effects, and Point Cloud Restoration (PCR), which compensates for corrupted LiDAR data using image object cues. To tackle misalignments between the two modalities, we develop a Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.
摘要：多模态 3D 物体检测对于机器人和自动驾驶的可靠感知非常重要。然而，由于天气引起的扭曲和不同数据模式之间的不一致，其有效性在恶劣天气条件下仍然有限。在这项工作中，我们提出了 DiffFusion，这是一种新颖的框架，旨在通过基于扩散的恢复和自适应跨模态融合来增强在恶劣天气下的鲁棒性。我们的主要见解是扩散模型具有强大的去噪和生成数据的能力，可以适应各种天气条件。在此基础上，DiffFusion 引入了 Diffusion-IR 来恢复因天气影响而退化的图像，并引入点云恢复 (PCR)，使用图像对象线索来补偿损坏的 LiDAR 数据。为了解决两种模式之间的不一致问题，我们开发了双向自适应融合和对齐模块（BAFAM）。它支持动态多模态融合和双向鸟瞰图 (BEV) 对齐，以保持一致的空间对应关系。对三个公共数据集的大量实验表明，DiffFusion 在恶劣天气下实现了最先进的鲁棒性，同时保持了强大的清洁数据性能。现实世界 DENSE 数据集上的零样本结果进一步验证了其泛化性。我们的 DiffFusion 的实现将作为开源发布。

Title: A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis

Authors: Xianchao Guan, Zhiyuan Fan, Yifeng Wang, Fuqiang Chen, Yanjiang Zhou, Zengyang Che, Hongxue Meng, Xin Li, Yaowei Wang, Hongpeng Wang, Min Zhang, Heng Tao Shen, Zheng Zhang, Yongbing Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13164
Pdf URL: https://arxiv.org/pdf/2512.13164
Copy Paste: [[2512.13164]] A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis(https://arxiv.org/abs/2512.13164)
Keywords: generative
Abstract: The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.
摘要：病理学临床级人工智能的发展受到缺乏多样化、高质量注释数据集的限制。生成模型提供了一种潜在的解决方案，但存在语义不稳定和形态幻觉的问题，从而影响了诊断的可靠性。为了应对这一挑战，我们引入了组织合成的相关调节对齐框架（CRAFTS），这是第一个用于病理学特定文本到图像合成的生成基础模型。通过利用大约 280 万个图像标题对的双阶段训练策略，CRAFTS 结合了一种新颖的对齐机制，可以抑制语义漂移以确保生物学准确性。该模型生成涵盖 30 种癌症类型的多种病理图像，其质量经过客观指标和病理学家评估的严格验证。此外，CRAFTS 增强数据集增强了各种临床任务的性能，包括分类、跨模式检索、自我监督学习和视觉问答。此外，将 CRAFTS 与 ControlNet 相结合，可以通过核分割掩模和荧光图像等输入来精确控制组织结构。通过克服数据稀缺和隐私问题的关键障碍，CRAFTS 提供了无限的多样化、带注释的组织学数据来源，有效地为罕见和复杂的癌症表型创建了强大的诊断工具。

Title: POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

Authors: Zhuo Chen, Chengqun Yang, Zhuo Su, Zheng Lv, Jingnan Gao, Xiaoyuan Zhang, Xiaokang Yang, Yichao Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13192
Pdf URL: https://arxiv.org/pdf/2512.13192
Copy Paste: [[2512.13192]] POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling(https://arxiv.org/abs/2512.13192)
Keywords: generative
Abstract: Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining "chicken-and-egg" cycle for scalable and reproducible portrait illumination.
摘要：面部重新照明旨在在新颖的照明下合成逼真的肖像，同时保留身份和几何形状。然而，大规模、物理一致的照明数据的可用性有限，进展仍然受到限制。为了解决这个问题，我们引入了 POLAR，这是一个大规模且物理校准的一次一光 (OLAT) 数据集，包含在 156 个照明方向、多个视图和不同表情下捕获的 200 多个主题。在 POLAR 的基础上，我们开发了一种基于流的生成模型 POLARNet，它可以预测单个肖像的每光 OLAT 响应，捕获细粒度和方向感知的照明效果，同时保留面部身份。与依赖统计或上下文线索的扩散或背景条件方法不同，我们的公式将照明建模为照明状态之间连续的、物理上可解释的转换，从而实现可扩展和可控的重新照明。 POLAR 和 POLARNet 共同形成了一个统一的照明学习框架，将真实数据、生成合成和物理接地重新照明联系起来，为可扩展和可重复的肖像照明建立了一个自我维持的“先有鸡还是先有蛋”循环。

Title: Evaluating Adversarial Attacks on Federated Learning for Temperature Forecasting

Authors: Karina Chichifoi, Fabio Merizzi, Michele Colajanni
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2512.13207
Pdf URL: https://arxiv.org/pdf/2512.13207
Copy Paste: [[2512.13207]] Evaluating Adversarial Attacks on Federated Learning for Temperature Forecasting(https://arxiv.org/abs/2512.13207)
Keywords: generation
Abstract: Deep learning and federated learning (FL) are becoming powerful partners for next-generation weather forecasting. Deep learning enables high-resolution spatiotemporal forecasts that can surpass traditional numerical models, while FL allows institutions in different locations to collaboratively train models without sharing raw data, addressing efficiency and security concerns. While FL has shown promise across heterogeneous regions, its distributed nature introduces new vulnerabilities. In particular, data poisoning attacks, in which compromised clients inject manipulated training data, can degrade performance or introduce systematic biases. These threats are amplified by spatial dependencies in meteorological data, allowing localized perturbations to influence broader regions through global model aggregation. In this study, we investigate how adversarial clients distort federated surface temperature forecasts trained on the Copernicus European Regional ReAnalysis (CERRA) dataset. We simulate geographically distributed clients and evaluate patch-based and global biasing attacks on regional temperature forecasts. Our results show that even a small fraction of poisoned clients can mislead predictions across large, spatially connected areas. A global temperature bias attack from a single compromised client shifts predictions by up to -1.7 K, while coordinated patch attacks more than triple the mean squared error and produce persistent regional anomalies exceeding +3.5 K. Finally, we assess trimmed mean aggregation as a defense mechanism, showing that it successfully defends against global bias attacks (2-13% degradation) but fails against patch attacks (281-603% amplification), exposing limitations of outlier-based defenses for spatially correlated data.
摘要：深度学习和联邦学习 (FL) 正在成为下一代天气预报的强大合作伙伴。深度学习可以实现超越传统数值模型的高分辨率时空预测，而 FL 允许不同地点的机构协作训练模型，而无需共享原始数据，从而解决了效率和安全问题。虽然 FL 在异构区域中表现出了良好的前景，但其分布式特性引入了新的漏洞。特别是，数据中毒攻击（其中受感染的客户端注入受操纵的训练数据）可能会降低性能或引入系统偏差。这些威胁因气象数据的空间依赖性而被放大，使得局部扰动通过全球模型聚合影响更广泛的区域。在这项研究中，我们研究了对抗性客户端如何扭曲在哥白尼欧洲区域再分析（CERRA）数据集上训练的联合表面温度预测。我们模拟地理上分布的客户端，并评估对区域温度预测的基于补丁和全局偏差的攻击。我们的结果表明，即使是一小部分中毒的客户端也可能会误导大范围的、空间相连的区域的预测。来自单个受感染客户端的全局温度偏差攻击使预测变化高达 -1.7 K，而协调补丁攻击使均方误差超过三倍，并产生超过 +3.5 K 的持续区域异常。最后，我们评估了修剪均值聚合作为防御机制，表明它成功抵御全局偏差攻击（2-13\% 降级），但无法抵御补丁攻击（281-603\% 放大），暴露了基于异常值的空间相关数据防御的局限性。
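Note: the trimmed mean aggregation assessed in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact trimming rule, so the code below assumes the generic coordinate-wise trimmed mean; the function name `trimmed_mean`, the `trim` parameter, and the toy client values are hypothetical and are not taken from the paper.

```python
def trimmed_mean(updates, trim):
    """Coordinate-wise trimmed mean over client update vectors.

    updates: list of equal-length lists, one per client (hypothetical toy data).
    trim: number of smallest and largest values dropped per coordinate (assumed rule).
    """
    n = len(updates)
    if trim < 0 or 2 * trim >= n:
        raise ValueError("trim must leave at least one client per coordinate")
    dim = len(updates[0])
    result = []
    for j in range(dim):
        # Sort the j-th coordinate across clients and drop both tails.
        column = sorted(u[j] for u in updates)
        kept = column[trim:n - trim]
        result.append(sum(kept) / len(kept))
    return result


# Five toy clients with two parameters each; the last one adds a large global bias.
clients = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.0], [9.0, 10.0]]
plain = [sum(c[j] for c in clients) / len(clients) for j in range(2)]
robust = trimmed_mean(clients, trim=1)
print("plain mean:  ", plain)
print("trimmed mean:", robust)
```

In this toy setting the biased client lands in the trimmed tail of every coordinate, so its influence disappears. A perturbation that stays inside the kept range on most coordinates would not be removed, which may be one reason an outlier-based rule can struggle with localized, spatially correlated attacks.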

Title: CORE: Contrastive Masked Feature Reconstruction on Graphs

Authors: Jianyuan Bo, Yuan Fang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13235
Pdf URL: https://arxiv.org/pdf/2512.13235
Copy Paste: [[2512.13235]] CORE: Contrastive Masked Feature Reconstruction on Graphs(https://arxiv.org/abs/2512.13235)
Keywords: generative
Abstract: In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node's own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR's reconstructive power with GCL's discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.
摘要：在快速发展的图自监督学习领域，生成方法和对比方法已成为两种主要方法。我们的研究重点是屏蔽特征重建（MFR），这是一种生成技术，模型学习以自我监督的方式恢复屏蔽节点的原始特征。我们观察到 MFR 和图对比学习（GCL）都旨在最大化相似元素之间的一致性。基于这一观察，我们揭示了一个新颖的理论见解：在特定条件下，MFR 和节点级 GCL 的目标是一致的，尽管它们的运行机制不同。这种理论联系表明这些方法是互补的而不是根本不同的，促使我们探索它们的集成以增强图上的自我监督学习。我们的研究提出了对比屏蔽特征重建（CORE），这是一种新颖的图自监督学习框架，它将对比学习集成到 MFR 中。具体来说，我们专门在屏蔽节点的原始特征和重建特征之间形成正对，鼓励编码器优先考虑上下文信息而不是节点自身的特征。此外，我们利用屏蔽节点本身作为负样本，将 MFR 的重建能力与 GCL 的判别能力相结合，以更好地捕获内在的图结构。根据经验，我们提出的框架 CORE 在节点和图分类任务上显着优于 MFR，展示了最先进的结果。特别是，CORE 在节点分类任务上超过 GraphMAE 和 GraphMAE2 高达 2.80% 和 3.72%，在图分类任务上超过 GraphMAE 和 GraphMAE2 高达 3.82% 和 3.76%。
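Note: the pairing described in this abstract (positives between the original and reconstructed features of each masked node, with the other masked nodes as negatives) can be sketched with an InfoNCE-style loss. The abstract does not state the loss form, so InfoNCE, cosine similarity, the temperature value, and all names below (`cosine`, `masked_contrastive_loss`) are assumptions made for illustration, not the paper's definition.

```python
from math import exp, log, sqrt


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def masked_contrastive_loss(original, reconstructed, temperature=0.5):
    """InfoNCE-style loss over masked nodes (assumed form).

    original[i] and reconstructed[i] are the raw and reconstructed features
    of masked node i; the pair (i, i) is the positive and the original
    features of every other masked node serve as negatives.
    """
    losses = []
    for i, r in enumerate(reconstructed):
        logits = [cosine(r, x) / temperature for x in original]
        log_denominator = log(sum(exp(z) for z in logits))
        # Negative log-probability of picking the matching original feature.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Two toy masked nodes with orthogonal features.
features = [[1.0, 0.0], [0.0, 1.0]]
good = masked_contrastive_loss(features, [[1.0, 0.0], [0.0, 1.0]])
bad = masked_contrastive_loss(features, [[0.0, 1.0], [1.0, 0.0]])
print(good, bad)
```

The loss is small when each reconstruction matches its own node and large when reconstructions are swapped between nodes, which is the behavior a contrastive term of this kind is meant to reward.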

Title: Learning to Retrieve with Weakened Labels: Robust Training under Label Noise

Authors: Arnab Sharma
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2512.13237
Pdf URL: https://arxiv.org/pdf/2512.13237
Copy Paste: [[2512.13237]] Learning to Retrieve with Weakened Labels: Robust Training under Label Noise(https://arxiv.org/abs/2512.13237)
Keywords: generation
Abstract: Neural Encoders are frequently used in the NLP domain to perform dense retrieval tasks, for instance, to generate the candidate documents for a given query in question-answering tasks. However, sparse annotation and label noise in the training data make it challenging to train or fine-tune such retrieval models. Although existing works have attempted to mitigate these problems by incorporating modified loss functions or data cleaning, these approaches either require some hyperparameters to tune during training or add substantial complexity to the training setup. In this work, we consider a label weakening approach to generate robust retrieval models in the presence of label noise. Instead of enforcing a single, potentially erroneous label for each query document pair, we allow for a set of plausible labels derived from both the observed supervision and the model's confidence scores. We perform an extensive evaluation considering two retrieval models, one re-ranking model, considering four diverse ranking datasets. To this end, we also consider a realistic noisy setting by using a semantic-aware noise generation technique to generate different ratios of noise. Our initial results show that label weakening can improve the performance of the retrieval tasks in comparison to 10 different state-of-the-art loss functions.
摘要：神经编码器经常在 NLP 领域中用于执行密集检索任务，例如，为问答任务中的给定查询生成候选文档。然而，训练数据中稀疏的注释和标签噪声使得训练或微调此类检索模型具有挑战性。尽管现有的工作试图通过合并修改后的损失函数或数据清理来缓解这些问题，但这些方法要么需要在训练期间调整一些超参数，要么会增加训练设置的复杂性。在这项工作中，我们考虑使用标签弱化方法来在存在标签噪声的情况下生成稳健的检索模型。我们不是为每个查询文档对强制使用单个可能错误的标签，而是允许从观察到的监督和模型的置信度分数中派生出一组合理的标签。我们考虑两种检索模型、一种重新排名模型、考虑四种不同的排名数据集来进行广泛的评估。为此，我们还通过使用语义感知噪声生成技术来考虑现实的噪声设置，以生成不同比率的噪声。我们的初步结果表明，与 10 种不同的最先进损失函数相比，标签弱化可以提高检索任务的性能。

Title: BézierFlow: Bézier Stochastic Interpolant Schedulers for Few-Step Generation

Authors: Yunhong Min, Juil Koo, Seungwoo Yoo, Minhyuk Sung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.13255
Pdf URL: https://arxiv.org/pdf/2512.13255
Copy Paste: [[2512.13255]] BézierFlow: Bézier Stochastic Interpolant Schedulers for Few-Step Generation(https://arxiv.org/abs/2512.13255)
Keywords: generation
Abstract: We introduce BézierFlow, a lightweight training approach for few-step generation with pretrained diffusion and flow models. BézierFlow achieves a 2-3x performance improvement for sampling with ≤ 10 NFEs while requiring only 15 minutes of training. Recent lightweight training approaches have shown promise by learning optimal timesteps, but their scope remains restricted to ODE discretizations. To broaden this scope, we propose learning the optimal transformation of the sampling trajectory by parameterizing stochastic interpolant (SI) schedulers. The main challenge lies in designing a parameterization that satisfies critical desiderata, including boundary conditions, differentiability, and monotonicity of the SNR. To effectively meet these requirements, we represent scheduler functions as Bézier functions, where control points naturally enforce these properties. This reduces the problem to learning an ordered set of points in the time range, while the interpretation of the points changes from ODE timesteps to Bézier control points. Across a range of pretrained diffusion and flow models, BézierFlow consistently outperforms prior timestep-learning methods, demonstrating the effectiveness of expanding the search space from discrete timesteps to Bézier-based trajectory transformations.
摘要：我们引入了 BézierFlow，这是一种轻量级训练方法，可通过预训练的扩散和流动模型进行几步生成。 BézierFlow 使用 $\leq$ 10 NFE 进行采样，性能提高了 2-3 倍，同时仅需要 15 分钟的训练。最近的轻量级训练方法通过学习最佳时间步显示出前景，但其范围仍然仅限于 ODE 离散化。为了扩大这个范围，我们建议通过参数化随机插值（SI）调度器来学习采样轨迹的最优变换。主要挑战在于设计满足关键需求的参数化，包括边界条件、可微性和信噪比的单调性。为了有效地满足这些要求，我们将调度程序函数表示为贝塞尔函数，其中控制点自然地强制执行这些属性。这将问题简化为学习时间范围内的有序点集，而点的解释从 ODE 时间步长变为贝塞尔控制点。在一系列预训练的扩散和流动模型中，BézierFlow 始终优于先前的时间步长学习方法，证明了将搜索空间从离散时间步长扩展到基于 Bézier 的轨迹变换的有效性。
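Note: the claim that ordered control points enforce boundary conditions and monotonicity can be checked on a scalar Bézier function. The sketch below is not the paper's scheduler parameterization; it only assumes a one-dimensional Bézier function on [0, 1] with endpoints pinned to 0 and 1, and the names `bezier` and `make_schedule` are hypothetical.

```python
from math import comb


def bezier(control_points, t):
    """Evaluate a scalar Bezier function at t in [0, 1] via the Bernstein basis."""
    n = len(control_points) - 1
    return sum(
        comb(n, i) * t ** i * (1 - t) ** (n - i) * p
        for i, p in enumerate(control_points)
    )


def make_schedule(interior):
    """Pin the endpoints to 0 and 1 and sort the interior points (assumed in [0, 1]).

    Sorted control points make every difference p[i+1] - p[i] non-negative,
    so the derivative of the Bezier function is non-negative on [0, 1].
    """
    return [0.0] + sorted(interior) + [1.0]


points = make_schedule([0.7, 0.1, 0.4])
values = [bezier(points, k / 10) for k in range(11)]
print(values)
```

Because a Bézier function interpolates its first and last control points, the boundary conditions hold by construction, and the function is a polynomial, so it is differentiable everywhere. In a learned setting, the interior points would be the trainable quantities.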

Title: Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

Authors: Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13281
Pdf URL: https://arxiv.org/pdf/2512.13281
Copy Paste: [[2512.13281]] Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?(https://arxiv.org/abs/2512.13281)
Keywords: generation
Abstract: Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources: built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-review evaluation: an adversarial creator-reviewer protocol in which video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools even most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random: 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at this https URL.
摘要：视频生成领域的最新进展产生了通常与真实视频无法区分的生动内容，这使得人工智能生成的视频检测成为一个新兴的社会挑战。之前的 AIGC 检测基准主要评估没有音频的视频，针对广泛的叙事领域，并且仅专注于分类。然而，目前尚不清楚最先进的视频生成模型是否可以生成可靠地欺骗人类和 VLM 的沉浸式音频配对视频。为此，我们推出了 Video Reality Test，这是一个源自 ASMR 的视频基准测试套件，用于测试紧密视听耦合下的感知真实感，具有以下维度： \textbf{(i) 沉浸式 ASMR 视频音频源。} 该基准测试基于精心策划的真实 ASMR 视频，旨在细粒度的动作与物体交互，具有跨物体、动作和背景的多样性。 \textbf{(ii) 同行评审评估。} 一种对抗性创作者-评审者协议，其中视频生成模型充当旨在愚弄评审者的创作者，而 VLM 则充当寻求识别虚假内容的评审者。我们的实验结果表明：最好的创建者 Veo3.1-Fast 甚至愚弄了大多数 VLM：最强的审稿人（Gemini 2.5-Pro）仅达到 56% 的准确率（随机 50%），远低于人类专家的准确率（81.25%）。添加音频可以提高真假辨别能力，但水印等表面线索仍然会严重误导模型。这些发现描绘了视频生成现实主义的当前边界，并暴露了 VLM 在感知保真度和视听一致性方面的局限性。我们的代码可以在这个 https URL 上找到。

Title: CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images

Authors: Bo Liu, Qiao Qin, Qinghui He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13285
Pdf URL: https://arxiv.org/pdf/2512.13285
Copy Paste: [[2512.13285]] CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images(https://arxiv.org/abs/2512.13285)
Keywords: generation, generative
Abstract: The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
摘要：生成模型的快速发展增加了对能够泛化各种不断发展的生成技术的生成图像检测器的需求。然而，现有的方法，包括那些利用预先训练的视觉语言模型的方法，通常会产生高度纠缠的表示，将与任务相关的取证线索（因果特征）与虚假或不相关的模式（非因果特征）混合在一起，从而限制了泛化。为了解决这个问题，我们提出了 CausalCLIP，这是一个框架，它明确地将因果特征与非因果特征分开，并采用以因果推理原则为指导的有针对性的过滤，以仅保留最可转移和最具辨别力的取证线索。通过使用结构因果模型对生成过程进行建模，并通过基于 Gumbel-Softmax 的特征屏蔽和希尔伯特-施密特独立准则 (HSIC) 约束强制执行统计独立性，CausalCLIP 隔离了对分布变化具有鲁棒性的稳定因果特征。在不同系列的未见过的生成模型上进行测试时，CausalCLIP 表现出强大的泛化能力，与最先进的方法相比，准确率提高了 6.83%，平均精度提高了 4.06%。
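Note: the Hilbert-Schmidt Independence Criterion (HSIC) named in this abstract can be illustrated with the standard biased empirical estimator, tr(KHLH) / (n - 1)^2, using Gaussian kernels. How the paper applies HSIC to its causal and non-causal features is not stated in the abstract, so the scalar samples, the kernel bandwidth, and the function names below are assumptions for illustration only.

```python
from math import exp


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs] for a in xs]


def center(gram):
    """Return H K H for a symmetric Gram matrix K, with H = I - (1/n) * ones."""
    n = len(gram)
    row = [sum(r) / n for r in gram]
    total = sum(row) / n
    return [[gram[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    kc = center(gaussian_gram(xs, sigma))
    l = gaussian_gram(ys, sigma)
    # trace(HKH L) equals the elementwise sum because L is symmetric.
    return sum(kc[i][j] * l[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


xs = [k / 5 for k in range(-10, 11)]
dependent = hsic(xs, [x * x for x in xs])   # y is a function of x
constant = hsic(xs, [3.0 for _ in xs])      # y carries no information about x
print(dependent, constant)
```

A value near zero indicates no detected dependence under the chosen kernels, so a penalty of this kind pushes two feature groups toward statistical independence. Some references normalize by n^2 instead of (n - 1)^2; the choice only rescales the estimate.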

Title: LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models

Authors: Shu Yu, Chaochao Lu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13290
Pdf URL: https://arxiv.org/pdf/2512.13290
Copy Paste: [[2512.13290]] LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models(https://arxiv.org/abs/2512.13290)
Keywords: generation
Abstract: Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at this https URL.
摘要：扩散模型（DM）在图像和视频生成方面取得了显着的成功。然而，他们仍然在 (1) 物理对齐和 (2) 分配外 (OOD) 指令遵循方面遇到困难。我们认为，这些问题源于模型未能学习因果方向并未能理清新颖重组的因果因素。我们引入因果场景图（CSG）和物理对齐探针（PAP）数据集来实现诊断干预。该分析产生了三个关键见解。首先，DM 很难对提示中未明确确定的元素进行多跳推理。其次，提示嵌入包含纹理和物理的解开表示。第三，视觉因果结构是在最初的、计算有限的去噪步骤中不成比例地建立的。基于这些发现，我们引入了 LINA（自适应学习干预），这是一种学习预测特定提示干预的新颖框架，它采用（1）在提示和视觉潜在空间中进行有针对性的指导，以及（2）重新分配的、因果关系感知的去噪计划。我们的方法在图像和视频 DM 中强制执行物理对齐和 OOD 指令，在具有挑战性的因果生成任务和 Winoground 数据集上实现最先进的性能。我们的项目页面位于此 https URL。

Title: ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement

Authors: Zhihang Liu, Xiaoyi Bao, Pandeng Li, Junjie Zhou, Zhaohe Liao, Yefei He, Kaixun Jiang, Chen-Wei Xie, Yun Zheng, Hongtao Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13303
Pdf URL: https://arxiv.org/pdf/2512.13303
Copy Paste: [[2512.13303]] ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement(https://arxiv.org/abs/2512.13303)
Keywords: generation
Abstract: While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator, reasoning about the visual plan and judging visual errors to provide refined instructions, while the diffusion model executes the MLLM's commands to achieve high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.
摘要：虽然现有的生成模型和统一模型在一般图像生成方面表现出色，但它们在处理超出一般场景之外需要深度推理、规划和精确的数据到视觉映射能力的任务时遇到了困难。为了超越现有的限制，我们引入了一项新的且具有挑战性的任务：创造性的表格可视化，要求模型生成一个信息图表，忠实且美观地可视化给定表格中的数据。为了应对这一挑战，我们提出了 ShowTable，这是一种通过渐进式自我校正过程将 MLLM 与扩散模型协同作用的管道。 MLLM作为中央协调器，推理视觉计划并判断视觉错误以提供细化指令，扩散执行MLLM的命令，实现高保真结果。为了支持这项任务和我们的管道，我们引入了三个自动化数据构建管道来训练不同的模块。此外，我们还引入了 TableVisBench，这是一个新的基准测试，包含 5 个评估维度的 800 个具有挑战性的实例，以评估此任务的性能。实验表明，我们的管道用不同的模型实例化，显着优于基线，突出了其有效的多模式推理、生成和纠错能力。

Title: KlingAvatar 2.0 Technical Report

Authors: Kling Team: Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, Haoxian Zhang, Yuanxing Zhang, Yan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13313
Pdf URL: https://arxiv.org/pdf/2512.13313
Copy Paste: [[2512.13313]] KlingAvatar 2.0 Technical Report(https://arxiv.org/abs/2512.13313)
Keywords: generation
Abstract: Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
摘要：阿凡达视频生成模型近年来取得了显着的进步。然而，先前的工作在生成长时间高分辨率视频方面表现出有限的效率，随着视频长度的增加，会出现时间漂移、质量下降和提示跟随弱等问题。为了应对这些挑战，我们提出了 KlingAvatar 2.0，这是一个时空级联框架，可以在空间分辨率和时间维度上进行升级。该框架首先生成捕获全局语义和运动的低分辨率蓝图视频关键帧，然后使用首尾帧策略将其细化为高分辨率、时间连贯的子剪辑，同时保留长视频中的平滑时间过渡。为了增强扩展视频中的跨模态指令融合和对齐，我们引入了由三位特定模态大语言模型（LLM）专家组成的联合推理总监。这些专家推理模态优先级并推断潜在的用户意图，通过多轮对话将输入转换为详细的故事情节。负面董事进一步细化负面提示，以改善指令一致性。在这些组件的基础上，我们扩展了框架以支持特定于 ID 的多字符控制。大量的实验表明，我们的模型有效地解决了高效、多模态对齐的长格式高分辨率视频生成的挑战，提供增强的视觉清晰度、具有准确唇形同步的逼真唇齿渲染、强大的身份保留和连贯的多模态指令遵循。

Title: ALIGN-FL: Architecture-independent Learning through Invariant Generative component sharing in Federated Learning

Authors: Mayank Gulati, Benedikt Groß, Gerhard Wunder
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13316
Pdf URL: https://arxiv.org/pdf/2512.13316
Copy Paste: [[2512.13316]] ALIGN-FL: Architecture-independent Learning through Invariant Generative component sharing in Federated Learning(https://arxiv.org/abs/2512.13316)
Keywords: generative
Abstract: We present ALIGN-FL, a novel approach to distributed learning that addresses the challenge of learning from highly disjoint data distributions through selective sharing of generative components. Instead of exchanging full model parameters, our framework enables privacy-preserving learning by transferring only generative capabilities across clients, while the server performs global training using synthetic samples. Using two complementary privacy mechanisms (DP-SGD with adaptive clipping and Lipschitz-regularized VAE decoders) and a stateful architecture supporting heterogeneous clients, we experimentally validate our approach on MNIST and Fashion-MNIST datasets with cross-domain outliers. Our analysis demonstrates that both privacy mechanisms effectively map sensitive outliers to typical data points while maintaining utility in extreme Non-IID scenarios typical of cross-silo collaborations. Index Terms: Client-invariant Learning, Federated Learning (FL), Privacy-preserving Generative Models, Non-Independent and Identically Distributed (Non-IID), Heterogeneous Architectures
摘要：我们提出了 ALIGN-FL，一种分布式学习的新方法，它通过选择性共享生成组件来解决从高度不相交的数据分布中学习的挑战。我们的框架不是交换完整的模型参数，而是通过仅在客户端之间传输生成能力来实现隐私保护学习，而服务器则使用合成样本执行全局训练。通过互补的隐私机制：具有自适应裁剪的 DP-SGD 和 Lipschitz 正则化 VAE 解码器以及支持异构客户端的有状态架构，我们在具有跨域异常值的 MNIST 和 Fashion-MNIST 数据集上实验验证了我们的方法。我们的分析表明，这两种隐私机制都可以有效地将敏感异常值映射到典型数据点，同时在跨孤岛协作的极端非独立同分布场景中保持实用性。索引术语：客户端不变学习、联邦学习 (FL)、隐私保护生成模型、非独立同分布 (Non-IID)、异构架构

Title: Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs

Authors: Anran Qi, Changjian Li, Adrien Bousseau, Niloy J. Mitra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13392
Pdf URL: https://arxiv.org/pdf/2512.13392
Copy Paste: [[2512.13392]] Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs(https://arxiv.org/abs/2512.13392)
Keywords: generation, generative
Abstract: We address image-to-video generation with explicit user control over the final frame's disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages over state-of-the-art alternatives on images turned into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable controls, in the form of appearance specification for the disoccluded regions of the final frame, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: this https URL
摘要：我们通过对最终帧的去除遮挡区域进行明确的用户控制来解决图像到视频的生成问题。当前的图像到视频管道可以产生合理的运动，但很难生成可预测的、清晰的运动，同时在新显示的区域中强制执行用户指定的内容。我们的关键思想是将运动规范与外观合成分开：我们引入了一种轻量级、用户可编辑的代理动态图（PDG），它确定性但近似地驱动零件运动，而冻结扩散先验用于合成跟随该运动的合理外观。在我们的免训练管道中，用户松散地注释并放置 PDG，我们从中计算密集的运动流，以利用扩散作为运动引导着色器。然后，我们让用户编辑图像中未遮挡区域的外观，并利用 PDG 编码的可见性信息来执行潜在空间合成，以协调这些区域中的运动与用户意图。这种设计无需微调即可实现可控的清晰度和用户对咬合解除的控制。我们展示了相对于将图像转化为铰接物体、家具、车辆和可变形物体的短视频的最先进替代方案的明显优势。我们的方法将松散姿势和结构形式的生成控制与可预测控制（以去除遮挡区域的最终帧中的外观规范的形式）混合在一起，解锁了新的图像到视频工作流程。代码将在接受后发布。项目页面：此 https URL

Title: Computer vision training dataset generation for robotic environments using Gaussian splatting

Authors: Patryk Niżeniec, Marcin Iwanowski
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.13411
Pdf URL: https://arxiv.org/pdf/2512.13411
Copy Paste: [[2512.13411]] Computer vision training dataset generation for robotic environments using Gaussian splatting(https://arxiv.org/abs/2512.13411)
Keywords: generation
Abstract: This paper introduces a novel pipeline for generating large-scale, highly realistic, and automatically labeled datasets for computer vision tasks in robotic environments. Our approach addresses the critical challenges of the domain gap between synthetic and real-world imagery and the time-consuming bottleneck of manual annotation. We leverage 3D Gaussian Splatting (3DGS) to create photorealistic representations of the operational environment and objects. These assets are then used in a game engine where physics simulations create natural arrangements. A novel, two-pass rendering technique combines the realism of splats with a shadow map generated from proxy meshes. This map is then algorithmically composited with the image to add both physically plausible shadows and subtle highlights, significantly enhancing realism. Pixel-perfect segmentation masks are generated automatically and formatted for direct use with object detection models like YOLO. Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, yields the best detection and segmentation performance, confirming this as an optimal strategy for efficiently achieving robust and accurate models.
摘要：本文介绍了一种新颖的管道，用于为机器人环境中的计算机视觉任务生成大规模、高度真实且自动标记的数据集。我们的方法解决了合成图像和真实世界图像之间的领域差距以及手动注释的耗时瓶颈的关键挑战。我们利用 3D 高斯溅射 (3DGS) 创建操作环境和对象的逼真表示。然后，这些资源将用于游戏引擎，其中物理模拟会创建自然的排列。一种新颖的两次渲染技术将splats的真实感与代理网格生成的阴影贴图结合起来。然后通过算法将该贴图与图像合成，以添加物理上合理的阴影和微妙的高光，从而显着增强真实感。像素完美的分割掩模会自动生成并格式化，以便直接与 YOLO 等对象检测模型一起使用。我们的实验表明，混合训练策略将一小组真实图像与大量合成数据相结合，可以产生最佳的检测和分割性能，证实这是有效实现稳健且准确的模型的最佳策略。

Title: Learning to Generate Cross-Task Unexploitable Examples

Authors: Haoxuan Qu, Qiuchi Xiang, Yujun Cai, Yirui Wu, Majid Mirmehdi, Hossein Rahmani, Jun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13416
Pdf URL: https://arxiv.org/pdf/2512.13416
Copy Paste: [[2512.13416]] Learning to Generate Cross-Task Unexploitable Examples(https://arxiv.org/abs/2512.13416)
Keywords: generation
Abstract: Unexploitable example generation aims to transform personal images into their unexploitable (unlearnable) versions before they are uploaded online, thereby preventing unauthorized exploitation of online personal images. Recently, this task has garnered significant research attention due to its critical relevance to personal data privacy. Yet, despite recent progress, existing methods for this task can still suffer from limited practical applicability, as they can fail to generate examples that are broadly unexploitable across different real-world computer vision tasks. To deal with this problem, in this work, we propose a novel Meta Cross-Task Unexploitable Example Generation (MCT-UEG) framework. At the core of our framework, to optimize the unexploitable example generator for effectively producing broadly unexploitable examples, we design a flat-minima-oriented meta training and testing scheme. Extensive experiments show the efficacy of our framework.
摘要：不可利用的示例生成旨在在个人图像上传到网上之前将其转换为不可利用（不可学习）的版本，从而防止在线个人图像未经授权的利用。最近，这项任务由于其与个人数据隐私的关键相关性而引起了广泛的研究关注。然而，尽管最近取得了进展，但该任务的现有方法仍然受到实际适用性的限制，因为它们可能无法生成在不同的现实世界计算机视觉任务中普遍无法利用的示例。为了解决这个问题，在这项工作中，我们提出了一种新颖的元跨任务不可利用示例生成（MCT-UEG）框架。在我们框架的核心，为了优化不可利用的示例生成器以有效地生成广泛不可利用的示例，我们设计了一个面向平坦最小值的元训练和测试方案。大量的实验证明了我们框架的有效性。

Title: RecTok: Reconstruction Distillation along Rectified Flow

Authors: Qingyu Shi, Size Wu, Jinbin Bai, Kaidong Yu, Yujing Wang, Yunhai Tong, Xiangtai Li, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13421
Pdf URL: https://arxiv.org/pdf/2512.13421
Copy Paste: [[2512.13421]] RecTok: Reconstruction Distillation along Rectified Flow(https://arxiv.org/abs/2512.13421)
Keywords: generation
Abstract: Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction-alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. We further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on gFID-50K both with and without classifier-free guidance, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at this https URL.
摘要：视觉分词器在扩散模型中发挥着至关重要的作用。潜在空间的维数决定了潜在特征的重建保真度和语义表达能力。然而，维度和生成质量之间存在固有的基本权衡，将现有方法限制在低维潜在空间。尽管最近的工作利用视觉基础模型来丰富视觉分词器的语义并加速收敛，但高维分词器的性能仍然不如低维分词器。在这项工作中，我们提出了 RecTok，它通过两个关键创新克服了高维视觉分词器的局限性：流语义蒸馏和重构——对齐蒸馏。我们的主要见解是使流匹配中的前向流语义丰富，作为扩散变压器的训练空间，而不是像以前的工作那样关注潜在空间。具体来说，我们的方法将 VFM 中的语义信息提炼成流匹配中的前向流轨迹。我们通过引入屏蔽特征重建损失来进一步增强语义。我们的 RecTok 实现了卓越的图像重建、生成质量和判别性能。无论有无分类器引导设置，它都在 gFID-50K 上实现了最先进的结果，同时保持了语义丰富的潜在空间结构。此外，随着潜在维度的增加，我们观察到持续的改进。代码和型号可从此 https URL 获取。

Title: Test-Time Modification: Inverse Domain Transformation for Robust Perception

Authors: Arpit Jadon, Joshua Niemeijer, Yuki M. Asano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13454
Pdf URL: https://arxiv.org/pdf/2512.13454
Copy Paste: [[2512.13454]] Test-Time Modification: Inverse Domain Transformation for Robust Perception(https://arxiv.org/abs/2512.13454)
Keywords: generation, generative
Abstract: Generative foundation models contain broad visual knowledge and can produce diverse image variations, making them particularly promising for advancing domain generalization tasks. While they can be used for training data augmentation, synthesizing comprehensive target-domain variations remains slow, expensive, and incomplete. We propose an alternative: using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. This approach requires only a source domain description, preserves the task model, and eliminates large-scale synthetic data generation. We demonstrate consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios with unknown target distributions. Our analysis spans multiple generative and downstream models, including an ensemble variant for enhanced robustness. The method achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.
摘要：生成基础模型包含广泛的视觉知识，可以产生不同的图像变化，这使得它们特别有希望用于推进领域泛化任务。虽然它们可用于训练数据增强，但合成全面的目标域变化仍然缓慢、昂贵且不完整。我们提出了一种替代方案：在测试时使用扩散模型将目标图像映射回训练下游模型的源分布。这种方法只需要源域描述，保留任务模型，并消除大规模合成数据生成。我们在具有未知目标分布的真实到真实域泛化场景中的挑战性环境变化下展示了分割、检测和分类任务的一致改进。我们的分析涵盖多个生成模型和下游模型，包括用于增强鲁棒性的集成变体。该方法实现了显着的相对增益：在 BDD100K-Night 上提升了 137%，在 ImageNet-R 上提升了 68%，在 DarkZurich 上提升了 62%。
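
The relative gains quoted in the abstract above (137%, 68%, 62%) can be read with a small arithmetic sketch. It assumes the usual definition of relative gain, (improved - baseline) / baseline; the abstract does not state which metric or baseline scores are used, so the function name and the numbers below are hypothetical illustrations only.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline` as a fraction.

    Assumed definition: (improved - baseline) / baseline.
    A return value of 1.37 corresponds to a 137% relative gain.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


# Hypothetical scores, chosen only to illustrate the arithmetic:
# a baseline of 20.0 rising to 47.4 is a 137% relative gain,
# i.e. the improved score is 2.37 times the baseline.
gain = relative_gain(20.0, 47.4)
print(f"{gain:.0%}")
```

Under this definition, a relative gain above 100% means the score more than doubled, which is not the same as an absolute increase of that many points.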

Title: PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence

Authors: Ruiyan Wang, Teng Hu, Kaihui Huang, Zihan Su, Ran Yi, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13465
Pdf URL: https://arxiv.org/pdf/2512.13465
Copy Paste: [[2512.13465]] PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence(https://arxiv.org/abs/2512.13465)
Keywords: generation
Abstract: Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to poses of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce a Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
摘要：姿势引导视频生成是指通过一系列姿势控制生成视频中主体的运动。它能够精确控制主体运动，并在动画中具有重要的应用。然而，当前的姿势引导视频生成方法仅限于仅接受人类姿势作为输入，因此对于其他主体的姿势的泛化能力较差。为了解决这个问题，我们提出了 PoseAnything，这是第一个通用姿势引导视频生成框架，能够处理人类和非人类角色，支持任意骨骼输入。为了增强运动过程中的一致性保持，我们引入了Part-aware Temporal Coherence Module，它将主体划分为不同的部分，建立部分对应关系，并计算跨帧的相应部分之间的交叉注意力，以实现细粒度的部分级一致性。此外，我们提出了主体和相机运动解耦 CFG，这是一种新颖的引导策略，通过将主体和相机运动控制信息分别注入 CFG 的正锚点和负锚点，首次在姿势引导视频生成中实现独立的相机运动控制。此外，我们还推出了 XPose，这是一个高质量的公共数据集，包含 50,000 个非人类姿势视频对，以及用于注释和过滤的自动化管道。大量实验表明，Pose-Anything 在有效性和泛化方面都显着优于最先进的方法。
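
For readers unfamiliar with the "positive and negative anchors of CFG" mentioned in the abstract above, the following is a minimal sketch of standard classifier-free guidance only. It is not the paper's decoupled variant, whose details the abstract does not give. The function name `cfg_combine` and the toy vectors are hypothetical; plain Python lists stand in for model outputs.

```python
from typing import List


def cfg_combine(positive: List[float], negative: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance combination.

    guided = negative + scale * (positive - negative)

    `positive` is the prediction under the conditioning to follow (positive anchor),
    `negative` is the prediction under the unconditional or negative prompt
    (negative anchor). scale = 1 returns the positive prediction unchanged;
    scale > 1 extrapolates away from the negative anchor.
    """
    if len(positive) != len(negative):
        raise ValueError("predictions must have the same length")
    return [n + scale * (p - n) for p, n in zip(positive, negative)]


# Toy example with two-element "predictions".
print(cfg_combine([1.0, 2.0], [0.0, 1.0], 2.0))
```

In this form, whatever conditioning differs between the two anchors is what the guidance amplifies, which is presumably why the choice of what goes into each anchor matters for the strategy the abstract describes.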

Title: Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10×

Authors: Jiangning Zhang, Junwei Zhu, Teng Hu, Yabiao Wang, Donghao Luo, Weijian Cao, Zhenye Gan, Xiaobin Hu, Zhucun Xue, Chengjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13492
Pdf URL: https://arxiv.org/pdf/2512.13492
Copy Paste: [[2512.13492]] Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10×(https://arxiv.org/abs/2512.13492)
Keywords: generation
Abstract: Native 4K (2160×3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed T3 (Transform Trained Transformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, T3-Video introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that T3-Video substantially outperforms existing approaches: while delivering performance improvements (+4.29↑ VQA and +0.08↑ VTC), it accelerates native 4K video generation by more than 10×. Project page at this https URL
摘要：原生 4K（2160$\times$3840）视频生成仍然是一个严峻的挑战，因为随着时空分辨率的增加，全注意力的计算量呈二次爆炸，使得模型很难在效率和质量之间取得平衡。本文提出了一种新颖的 Transformer 改造策略，称为 $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer)，该策略在不改变全注意力预训练模型的核心架构的情况下，通过优化其前向逻辑来显着降低计算需求。具体来说，$\textbf{T3-Video}$引入了一种多尺度权重共享窗口注意力机制，并且通过分层阻塞和保留轴的全注意力设计，可以仅使用适度的计算和数据来实现预训练模型的“注意力模式”转换。 4K-VBench 上的结果表明，$\textbf{T3-Video}$ 大大优于现有方法：在提供性能改进（+4.29$\uparrow$ VQA 和 +0.08$\uparrow$ VTC）的同时，它使原生 4K 视频生成速度加快了 10$\times$ 以上。项目页面位于此 https URL

Title: Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation

Authors: Jiangning Zhang, Junwei Zhu, Zhenye Gan, Donghao Luo, Chuming Lin, Feifan Xu, Xu Peng, Jianlong Hu, Yuansen Liu, Yijia Hong, Weijian Cao, Han Feng, Xu Chen, Chencan Fu, Keke He, Xiaobin Hu, Chengjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13495
Pdf URL: https://arxiv.org/pdf/2512.13495
Copy Paste: [[2512.13495]] Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation(https://arxiv.org/abs/2512.13495)
Keywords: generation
Abstract: We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed Soul, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4× speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at this https URL
摘要：我们提出了一种多模态驱动的高保真长期数字人类动画框架，称为$\textbf{Soul}$，它从单帧肖像图像、文本提示和音频生成语义连贯的视频，实现精确的唇形同步、生动的面部表情和强大的身份保存。我们构建了 Soul-1M，包含 100 万个精细注释的样本，具有精确的自动化注释管道（涵盖肖像、上半身、全身和多人场景），以缓解数据稀缺性，并精心策划 Soul-Bench，以对音频/文本引导的动画方法进行全面、公平的评估。该模型建立在 Wan2.2-5B 主干之上，将音频注入层和多种训练策略与阈值感知码本替换集成在一起，以确保长期生成一致性。同时，使用step/CFG蒸馏和轻量级VAE来优化推理效率，实现了11.4$\times$的加速，而质量损失可以忽略不计。大量实验表明，Soul 在视频质量、视频文本对齐、身份保留和口型同步准确性方面显着优于当前领先的开源和商业模型，在虚拟主播和电影制作等现实场景中展示了广泛的适用性。项目页面位于此 https URL

Title: Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Authors: Siyan Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Xuyan Chi, Jian Cong, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Zeyu Sun, Wenjing Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13507
Pdf URL: https://arxiv.org/pdf/2512.13507
Copy Paste: [[2512.13507]] Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model(https://arxiv.org/abs/2512.13507)
Keywords: generation
Abstract: Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at this https URL.
摘要：视频生成领域的最新进展为统一视听生成铺平了道路。在这项工作中，我们展示了 Seedance 1.5 pro，这是一个专门为原生联合音频视频生成而设计的基础模型。该模型利用双分支扩散变压器架构，将跨模态联合模块与专门的多级数据管道集成在一起，实现卓越的视听同步和卓越的生成质量。为了确保实用性，我们实施了细致的训练后优化，包括对高质量数据集的监督微调（SFT）和具有多维奖励模型的人类反馈强化学习（RLHF）。此外，我们还引入了一个加速框架，可将推理速度提高 10 倍以上。 Seedance 1.5 pro 通过精确的多语言和方言口型同步、动态电影摄像机控制和增强的叙事连贯性而脱颖而出，将其定位为专业级内容创作的强大引擎。 Seedance 1.5 pro 现在可以通过 Volcano Engine 访问此 https URL。

Title: MMhops-R1: Multimodal Multi-hop Reasoning

Authors: Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Bing Li, Chunfeng Yuan, Guangting Wang, Fengyun Rao, Ying Shan, Weiming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13573
Pdf URL: https://arxiv.org/pdf/2512.13573
Copy Paste: [[2512.13573]] MMhops-R1: Multimodal Multi-hop Reasoning(https://arxiv.org/abs/2512.13573)
Keywords: generation
Abstract: The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which require models to dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
摘要：通过迭代地集成各种模态和外部知识的信息来执行多模态多跳推理的能力对于解决复杂的现实世界挑战至关重要。然而，现有的多模态大型语言模型（MLLM）主要限于单步推理，因为现有基准缺乏评估和驱动多跳能力所需的复杂性。为了弥补这一差距，我们引入了 MMhops，这是一种新颖的大规模基准，旨在系统地评估和促进多模式多跳推理。 MMhops 数据集包含两种具有挑战性的任务格式：桥接和比较，这需要模型通过集成外部知识动态构建复杂的推理链。为了解决 MMhops 带来的挑战，我们提出了 MMhops-R1，一种用于动态推理的新型多模态检索增强生成（mRAG）框架。我们的框架利用强化学习来优化模型，以自主规划推理路径、制定有针对性的查询并综合多级信息。综合实验表明，MMhops-R1 的性能显着优于 MMhops 的强大基线，这凸显了动态规划和多模态知识集成对于复杂推理至关重要。此外，MMhops-R1 对需要固定跳推理的任务表现出强大的泛化能力，强调了我们动态规划方法的稳健性。总之，我们的工作提供了一个具有挑战性的新基准和强大的基线模型，我们将发布相关的代码、数据和权重，以促进这一关键领域的未来研究。

Title: Image Diffusion Preview with Consistency Solver

Authors: Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.13592
Pdf URL: https://arxiv.org/pdf/2512.13592
Copy Paste: [[2512.13592]] Image Diffusion Preview with Consistency Solver(https://arxiv.org/abs/2512.13592)
Keywords: generation
Abstract: The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via Reinforcement Learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at this https URL.
摘要：图像扩散模型的缓慢推理过程显着降低了交互式用户体验。为了解决这个问题，我们引入了扩散预览，这是一种新颖的范例，采用快速、低步采样来生成用于用户评估的初步输出，推迟全步细化，直到预览被认为令人满意。现有的加速方法，包括免训练求解器和训练后蒸馏，很难提供高质量的预览或确保预览和最终输出之间的一致性。我们提出了源自一般线性多步方法的 ConsistencySolver，这是一种通过强化学习优化的轻量级、可训练的高阶求解器，可提高预览质量和一致性。实验结果表明，ConsistencySolver 显着提高了低步场景中的生成质量和一致性，使其成为高效预览和优化工作流程的理想选择。值得注意的是，它使用减少 47% 的步骤实现了与多步 DPM-Solver 相当的 FID 分数，同时优于蒸馏基线。此外，用户研究表明，我们的方法将总体用户交互时间减少了近 50%，同时保持了生成质量。代码可从此 https URL 获取。

Title: LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Authors: Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13604
Pdf URL: https://arxiv.org/pdf/2512.13604
Copy Paste: [[2512.13604]] LongVie 2: Multimodal Controllable Ultra-Long Video World Model(https://arxiv.org/abs/2512.13604)
Keywords: generation
Abstract: Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
摘要：在预先训练的视频生成系统上构建视频世界模型是迈向通用时空智能的重要但具有挑战性的一步。世界模型应该具备三个基本属性：可控性、长期视觉质量和时间一致性。为此，我们采取渐进的方式，首先增强可控性，然后向长期高质量发电延伸。我们提出了 LongVie 2，一个经过三个阶段训练的端到端自回归框架：（1）多模态引导，集成密集和稀疏控制信号，以提供隐式世界级监督并提高可控性；（2）对输入帧进行退化感知训练，弥合训练和长期推理之间的差距，以保持较高的视觉质量； (3) 历史上下文指导，将相邻剪辑的上下文信息对齐以确保时间一致性。我们进一步介绍了 LongVGenBench，这是一个综合基准测试，包含 100 个高分辨率的一分钟视频，涵盖不同的现实世界和合成环境。大量实验表明，LongVie 2在远程可控性、时间一致性和视觉保真度方面实现了最先进的性能，并支持持续长达五分钟的连续视频生成，标志着向统一视频世界建模迈出了重要一步。

Title: Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models

Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, Fatih Porikli
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13609
Pdf URL: https://arxiv.org/pdf/2512.13609
Copy Paste: [[2512.13609]] Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models(https://arxiv.org/abs/2512.13609)
Keywords: generative
Abstract: We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
摘要：我们引入 Do-Undo 任务和基准来解决视觉语言模型中的关键差距：理解并生成由现实世界动作驱动的物理上合理的场景转换。与之前专注于对象级编辑的工作不同，Do-Undo 需要模型模拟物理动作的结果，然后准确地反转它，反映视觉世界中真实的因果关系。我们从现实世界的视频中收集了一个大规模的可逆动作数据集，并设计了一种训练策略，以增强动作基础的一致性。我们的实验表明，当前的模型与物理可逆性作斗争，强调了这项任务对于具体人工智能、机器人和物理感知生成模型的重要性。 Do-Undo 建立了一个直观的测试平台，用于评估和推进多模态系统中的物理推理。

Title: StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion

Authors: Guransh Singh, Md Shah Fahad
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.13632
Pdf URL: https://arxiv.org/pdf/2512.13632
Copy Paste: [[2512.13632]] StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion(https://arxiv.org/abs/2512.13632)
Keywords: generation
Abstract: Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies (e.g., a 'block' with a 'prolongation') due to the scarcity of these specific combinations in training data. While Retrieval-Augmented Generation (RAG) has revolutionized NLP by grounding models in external knowledge, this paradigm remains unexplored in pathological speech processing. To bridge this gap, we introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection. By conditioning a Conformer encoder on a non-parametric memory bank of clinical examples, we allow the model to classify by reference rather than memorization. We further identify and solve "Modality Collapse", an "Echo Chamber" effect where naive retrieval boosts recall but degrades precision. We mitigate this using: (1) SetCon, a Jaccard-Weighted Metric Learning objective that optimizes for multi-label set similarity, and (2) a Gated Mixture-of-Experts fusion strategy that dynamically arbitrates between acoustic evidence and retrieved context. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, outperforming strong baselines and demonstrating remarkable zero-shot cross-lingual generalization.
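Abstract (translated from Chinese): Stuttering detection breaks down when disfluencies overlap. Because these specific combinations are scarce in training data, existing parametric models struggle to distinguish complex, simultaneous disfluencies (e.g., a 'block' with a 'prolongation'). Although Retrieval-Augmented Generation (RAG) has transformed NLP by grounding models in external knowledge, the paradigm remains unexplored in pathological speech processing. To bridge this gap, we introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection. By conditioning a Conformer encoder on a non-parametric memory bank of clinical examples, we allow the model to classify by reference rather than by memorization. We further identify and solve "Modality Collapse", an "Echo Chamber" effect in which naive retrieval raises recall but lowers precision. We mitigate it with (1) SetCon, a Jaccard-weighted metric learning objective that optimizes multi-label set similarity, and (2) a gated Mixture-of-Experts fusion strategy that dynamically arbitrates between acoustic evidence and retrieved context. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, outperforming strong baselines and showing strong zero-shot cross-lingual generalization.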
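Illustration: the abstract does not spell out the SetCon objective, so the sketch below shows only the Jaccard similarity between two multi-label sets, the quantity the name suggests the objective weights by. The function name `jaccard` and the label strings are hypothetical and chosen for illustration; how the paper turns this score into a loss weight is not stated here.

```python
def jaccard(labels_a, labels_b):
    """Jaccard similarity |A & B| / |A | B| between two label sets."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        # Assumed convention: two empty label sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical multi-label annotations for three audio clips.
clip_1 = {"block", "prolongation"}
clip_2 = {"block"}
clip_3 = {"interjection"}

print(jaccard(clip_1, clip_2))  # 0.5: one shared label out of two distinct labels
print(jaccard(clip_1, clip_3))  # 0.0: no shared labels
```

A graded score of this kind lets partially overlapping label sets count as partial matches instead of forcing a binary same/different decision.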

Title: Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All

Authors: Michal Nazarczuk, Thomas Tanay, Arthur Moreau, Zhensong Zhang, Eduardo Pérez-Pellitero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13639
Pdf URL: https://arxiv.org/pdf/2512.13639
Copy Paste: [[2512.13639]] Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All(https://arxiv.org/abs/2512.13639)
Keywords: generation
Abstract: This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.
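Abstract (translated from Chinese): This paper presents a new dataset for Novel View Synthesis, generated from a high-quality animated film with striking realism and intricate detail. The dataset captures a variety of dynamic scenes with detailed textures, lighting, and motion, making it well suited to training and evaluating state-of-the-art 4D scene reconstruction and novel view generation models. Beyond high-fidelity RGB images, we provide several complementary modalities, including depth, surface normals, object segmentation, and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organized into three benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, supporting a wide range of experiments and comparisons across levels of data sparsity. Combining visual richness, high-quality annotations, and diverse experimental setups, the dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.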

Title: Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency

Authors: Wenhan Chen, Sezer Karaoglu, Theo Gevers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13665
Pdf URL: https://arxiv.org/pdf/2512.13665
Copy Paste: [[2512.13665]] Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency(https://arxiv.org/abs/2512.13665)
Keywords: generation
Abstract: Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
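Abstract (translated from Chinese): Recent advances in diffusion-based generation enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. Existing detection methods, however, explore the 3D geometric patterns present in generated videos only to a limited extent. In this paper, we use vanishing points as an explicit representation of 3D geometric patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework that detects AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct a dataset of AI-generated videos of static scenes, which allows stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments show that Grab-3D significantly outperforms state-of-the-art detectors and generalizes robustly across domains to unseen generators.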
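Illustration: a vanishing point is the image location where the projections of parallel 3D lines meet. The sketch below computes it from two image line segments using homogeneous coordinates. It is a textbook construction, not the paper's feature-extraction pipeline, which the abstract does not describe; the helper names are hypothetical.

```python
def cross(u, v):
    """Cross product of two 3-vectors."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(line_1, line_2, eps=1e-12):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    x, y, w = cross(line_1, line_2)
    if abs(w) < eps:
        return None  # intersection lies at infinity
    return (x / w, y / w)


# Two segments that converge toward the image point (2, 2).
left = line_through((0.0, 0.0), (1.0, 1.0))
right = line_through((4.0, 0.0), (3.0, 1.0))
print(vanishing_point(left, right))  # (2.0, 2.0)
```

Tracking how such points move from frame to frame is one way to express geometric temporal consistency as a signal.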

Title: A Scientific Reasoning Model for Organic Synthesis Procedure Generation

Authors: Guoqing Liu, Junren Li, Zihan Zhao, Eray Inanc, Krzysztof Maziarz, Jose Garrido Torres, Victor Garcia Satorras, Shoko Ueda, Christopher M. Bishop, Marwin Segler
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.13668
Pdf URL: https://arxiv.org/pdf/2512.13668
Copy Paste: [[2512.13668]] A Scientific Reasoning Model for Organic Synthesis Procedure Generation(https://arxiv.org/abs/2512.13668)
Keywords: generation
Abstract: Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, particularly the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset comprising 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model subsequently undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy. Experimental results demonstrate that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, measured by traditional NLP similarity metrics and a chemically aware evaluator using an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG's ability to generate high-quality synthesis procedures represents an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.
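Abstract (translated from Chinese): Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and for improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, in particular the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model that generates precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset of 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model then undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further improve procedural accuracy. Experimental results show that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, as measured by traditional NLP similarity metrics and by a chemically aware evaluator that uses an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG's ability to generate high-quality synthesis procedures is an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.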

Title: Directional Textual Inversion for Personalized Text-to-Image Generation

Authors: Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.13672
Pdf URL: https://arxiv.org/pdf/2512.13672
Copy Paste: [[2512.13672]] Directional Textual Inversion for Personalized Text-to-Image Generation(https://arxiv.org/abs/2512.13672)
Keywords: generation
Abstract: Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
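Abstract (translated from Chinese): Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show that semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only the direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP estimation with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability absent from standard TI. Our findings suggest that direction-only optimization is a robust and scalable path to prompt-faithful personalization.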
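Illustration: the sketch below shows one way to realize the ingredients named in the abstract, under stated assumptions. It keeps a unit-norm direction, takes a Riemannian SGD step by projecting the gradient onto the tangent space and renormalizing (one common retraction; the paper may use a different one), adds the gradient of a von Mises-Fisher log-prior, which is the constant vector `kappa * mu`, and interpolates two directions with slerp. All names and the values of `kappa`, `mu`, and `lr` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def riemannian_sgd_step(direction, loss_grad, lr, mu=None, kappa=0.0):
    """One direction-only update on the unit sphere.

    The MAP objective is loss(d) - kappa * dot(mu, d), so the prior
    contributes the constant gradient -kappa * mu.
    """
    grad = list(loss_grad)
    if mu is not None:
        grad = [g - kappa * m for g, m in zip(grad, mu)]
    # Remove the radial component so the step stays tangent to the sphere.
    radial = sum(d * g for d, g in zip(direction, grad))
    tangent = [g - radial * d for d, g in zip(direction, grad)]
    # Step, then retract back onto the sphere by renormalizing.
    return normalize([d - lr * t for d, t in zip(direction, tangent)])


def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)
    if omega < 1e-8:
        return list(u)  # directions already coincide
    if math.pi - omega < 1e-8:
        raise ValueError("slerp is undefined for antipodal directions")
    s = math.sin(omega)
    w_u = math.sin((1.0 - t) * omega) / s
    w_v = math.sin(t * omega) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]


# The embedding passed to the text encoder would be fixed_norm * direction,
# with fixed_norm held constant at an in-distribution value (assumed here).
direction = normalize([1.0, 0.0, 0.0])
direction = riemannian_sgd_step(direction, [0.0, 1.0, 0.0], lr=0.1)
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Because only the direction is updated, the embedding norm cannot inflate during training, and slerp between two learned directions stays on the same sphere.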

Title: JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

Authors: Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13677
Pdf URL: https://arxiv.org/pdf/2512.13677
Copy Paste: [[2512.13677]] JoVA: Unified Multimodal Learning for Joint Video-Audio Generation(https://arxiv.org/abs/2512.13677)
Keywords: generation
Abstract: In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
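Abstract (translated from Chinese): In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and cannot produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA applies joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without additional alignment modules. Furthermore, to achieve high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which strengthens supervision on the critical mouth region during training without compromising architectural simplicity. Extensive benchmark experiments show that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.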

Title: Feedforward 3D Editing via Text-Steerable Image-to-3D

Authors: Ziqi Ma, Hongqiao Chen, Yisong Yue, Georgia Gkioxari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13678
Pdf URL: https://arxiv.org/pdf/2512.13678
Copy Paste: [[2512.13678]] Feedforward 3D Editing via Text-Steerable Image-to-3D(https://arxiv.org/abs/2512.13678)
Keywords: generation, generative
Abstract: Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: this https URL
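Abstract (translated from Chinese): Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the ability to edit them easily. We present a feedforward method, Steer3D, that adds text steerability to image-to-3D models, making it possible to edit generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D follows the language instruction more faithfully and stays more consistent with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that a new modality (text) can be added to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: this https URL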

Title: I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

Authors: Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13683
Pdf URL: https://arxiv.org/pdf/2512.13683
Copy Paste: [[2512.13683]] I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners(https://arxiv.org/abs/2512.13683)
Keywords: generation
Abstract: Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes consist of randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: this https URL
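Abstract (translated from Chinese): Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, which restricts generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Notably, spatial reasoning still emerges even when the training scenes consist of randomly composed objects. This shows that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. In place of the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: this https URL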

Title: Towards Scalable Pre-training of Visual Tokenizers for Generation

Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13687
Pdf URL: https://arxiv.org/pdf/2512.13687
Copy Paste: [[2512.13687]] Towards Scalable Pre-training of Visual Tokenizers for Generation(https://arxiv.org/abs/2512.13687)
Keywords: generation, generative
Abstract: The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) VTP has much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to the pretraining of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates very early at 1/10 FLOPS. Our pre-trained models are available at this https URL.
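Abstract (translated from Chinese): The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, which leads to a foundational flaw: better pixel-level accuracy does not yield higher-quality generation. This means that pouring extensive compute into visual tokenizer pre-training translates poorly into better generation performance. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework that pioneers the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) VTP has much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to pretraining the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and converges 4.1 times faster on generation than advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates very early at 1/10 FLOPS. Our pre-trained models are available at this https URL.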

Title: DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders

Authors: Susung Hong, Chongjian Ge, Zhifei Zhang, Jui-Hsien Wang
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13690
Pdf URL: https://arxiv.org/pdf/2512.13690
Copy Paste: [[2512.13690]] DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders(https://arxiv.org/abs/2512.13690)
Keywords: generation, generative
Abstract: Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation, keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4x real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
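Abstract (translated from Chinese): Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation, keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that lets users interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4x real-time speed (less than 1 second for a 4-second video), conveying appearance and motion consistent with the final video. With the trained decoder, we show that generation can be guided interactively at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.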