2025-12-04

Title: Safe and Sustainable Electric Bus Charging Scheduling with Constrained Hierarchical DRL

Authors: Jiaju Qi, Lei Lei, Thorsteinn Jonsson, Dusit Niyato
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.03059
Pdf URL: https://arxiv.org/pdf/2512.03059
Copy Paste: [[2512.03059]] Safe and Sustainable Electric Bus Charging Scheduling with Constrained Hierarchical DRL(https://arxiv.org/abs/2512.03059)
Keywords: generation
Abstract: The integration of Electric Buses (EBs) with renewable energy sources such as photovoltaic (PV) panels is a promising approach to promote sustainable and low-carbon public transportation. However, optimizing EB charging schedules to minimize operational costs while ensuring safe operation without battery depletion remains challenging - especially under real-world conditions, where uncertainties in PV generation, dynamic electricity prices, variable travel times, and limited charging infrastructure must be accounted for. In this paper, we propose a safe Hierarchical Deep Reinforcement Learning (HDRL) framework for solving the EB Charging Scheduling Problem (EBCSP) under multi-source uncertainties. We formulate the problem as a Constrained Markov Decision Process (CMDP) with options to enable temporally abstract decision-making. We develop a novel HDRL algorithm, namely Double Actor-Critic Multi-Agent Proximal Policy Optimization Lagrangian (DAC-MAPPO-Lagrangian), which integrates Lagrangian relaxation into the Double Actor-Critic (DAC) framework. At the high level, we adopt a centralized PPO-Lagrangian algorithm to learn safe charger allocation policies. At the low level, we incorporate MAPPO-Lagrangian to learn decentralized charging power decisions under the Centralized Training and Decentralized Execution (CTDE) paradigm. Extensive experiments with real-world data demonstrate that the proposed approach outperforms existing baselines in both cost minimization and safety compliance, while maintaining fast convergence speed.
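
The Lagrangian relaxation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' DAC-MAPPO-Lagrangian code: the function names, the learning rate, and the plain dual-ascent rule are assumptions chosen for illustration. The multiplier grows while the average safety cost exceeds its limit, and the policy is trained on a cost-penalized advantage.

```python
def update_multiplier(lmbda, avg_cost, cost_limit, lr=0.05):
    """Dual ascent step: increase lambda when the constraint is violated.

    The result is clipped at zero so the multiplier stays non-negative.
    All argument names here are hypothetical.
    """
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))


def penalized_advantage(adv_reward, adv_cost, lmbda):
    """Combine reward and cost advantages into one signal for a PPO-style update."""
    return [ar - lmbda * ac for ar, ac in zip(adv_reward, adv_cost)]
```

In a full training loop the multiplier update and the policy update would alternate, so that the penalty weight tracks how often the safety constraint (for example, battery depletion) is violated.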

Title: Optimizing Life Sciences Agents in Real-Time using Reinforcement Learning

Authors: Nihir Chadderwala
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2512.03065
Pdf URL: https://arxiv.org/pdf/2512.03065
Copy Paste: [[2512.03065]] Optimizing Life Sciences Agents in Real-Time using Reinforcement Learning(https://arxiv.org/abs/2512.03065)
Keywords: generation, generative
Abstract: Generative AI agents in life sciences face a critical challenge: determining the optimal approach for diverse queries ranging from simple factoid questions to complex mechanistic reasoning. Traditional methods rely on fixed rules or expensive labeled training data, neither of which adapts to changing conditions or user preferences. We present a novel framework that combines AWS Strands Agents with Thompson Sampling contextual bandits to enable AI agents to learn optimal decision-making strategies from user feedback alone. Our system optimizes three key dimensions: generation strategy selection (direct vs. chain-of-thought), tool selection (literature search, drug databases, etc.), and domain routing (pharmacology, molecular biology, clinical specialists). Through empirical evaluation on life science queries, we demonstrate a 15-30% improvement in user satisfaction compared to random baselines, with clear learning patterns emerging after 20-30 queries. Our approach requires no ground truth labels, adapts continuously to user preferences, and provides a principled solution to the exploration-exploitation dilemma in agentic AI systems.
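
Thompson Sampling, as named in the abstract, can be sketched in its simplest Beta-Bernoulli form. This is a simplification and an assumption: the paper describes contextual bandits, whereas the sketch below keeps one posterior per arm with no context features, and the class and arm names are hypothetical. Each arm's success probability gets a Beta posterior; the sampler draws once from every posterior and plays the arm with the largest draw.

```python
import random


class BetaThompsonSampler:
    """Non-contextual Thompson Sampling over binary (thumbs up/down) feedback."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        # [alpha, beta] per arm, starting from a uniform Beta(1, 1) prior.
        self.stats = {arm: [1.0, 1.0] for arm in arms}

    def select(self):
        # Draw one sample from each arm's posterior and pick the largest.
        draws = {
            arm: self.rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        # Positive feedback increments alpha, negative feedback increments beta.
        if reward:
            self.stats[arm][0] += 1.0
        else:
            self.stats[arm][1] += 1.0
```

A caller would invoke `select()` to choose, say, between a direct and a chain-of-thought strategy, then pass the user's feedback to `update()`. A contextual variant could keep one such sampler per query category.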

Title: Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models

Authors: Xiwen Wei, Mustafa Munir, Radu Marculescu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03125
Pdf URL: https://arxiv.org/pdf/2512.03125
Copy Paste: [[2512.03125]] Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models(https://arxiv.org/abs/2512.03125)
Keywords: generation, generative
Abstract: Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both intra- and inter-modal forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict and leverages knowledge distillation to prevent catastrophic forgetting and preserve pre-trained capabilities. Unlike previous CL methods that remain modality-coupled and suffer from modality gradient conflict, MoDE explicitly decouples modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Codes will be publicly available: this https URL
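
The gradient conflict the abstract refers to is often diagnosed by the cosine similarity between two task gradients. The sketch below assumes that diagnostic (the abstract does not state how the authors measure conflict) and treats gradients as plain lists of floats; the function names are hypothetical.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a zero gradient conflicts with nothing
    return dot / (norm1 * norm2)


def gradients_conflict(g1, g2):
    """Two updates conflict when they point in opposing directions."""
    return cosine_similarity(g1, g2) < 0.0
```

Under this reading, a negative cosine between the understanding-task gradient and the generation-task gradient on shared parameters means a step that helps one modality hurts the other.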

Title: Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra

Authors: Ziyu Xiong, Yichi Zhang, Foyez Alauddin, Chu Xin Cheng, Joon Soo An, Mohammad R. Seyedsayamdost, Ellen D. Zhong
Subjects: cs.LG, cs.AI, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2512.03127
Pdf URL: https://arxiv.org/pdf/2512.03127
Copy Paste: [[2512.03127]] Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra(https://arxiv.org/abs/2512.03127)
Keywords: generation
Abstract: Nuclear Magnetic Resonance (NMR) spectroscopy is a cornerstone technique for determining the structures of small molecules and is especially critical in the discovery of novel natural products and clinical therapeutics. Yet, interpreting NMR spectra remains a time-consuming, manual process requiring extensive domain expertise. We introduce ChefNMR (CHemical Elucidation From NMR), an end-to-end framework that directly predicts an unknown molecule's structure solely from its 1D NMR spectra and chemical formula. We frame structure elucidation as conditional generation from an atomic diffusion model built on a non-equivariant transformer architecture. To model the complex chemical groups found in natural products, we generated a dataset of simulated 1D NMR spectra for over 111,000 natural products. ChefNMR predicts the structures of challenging natural product compounds with an unsurpassed accuracy of over 65%. This work takes a significant step toward solving the grand challenge of automating small-molecule structure elucidation and highlights the potential of deep learning in accelerating molecular discovery. Code is available at this https URL.

Title: Does Head Pose Correction Improve Biometric Facial Recognition?

Authors: Justin Norman, Hany Farid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03199
Pdf URL: https://arxiv.org/pdf/2512.03199
Copy Paste: [[2512.03199]] Does Head Pose Correction Improve Biometric Facial Recognition?(https://arxiv.org/abs/2512.03199)
Keywords: restoration
Abstract: Biometric facial recognition models often demonstrate significant decreases in accuracy when processing real-world images, often characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven, head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.

Title: SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning

Authors: Salman Rahman, Sruthi Gorantla, Arpit Gupta, Swastik Roy, Nanyun Peng, Yang Liu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.03244
Pdf URL: https://arxiv.org/pdf/2512.03244
Copy Paste: [[2512.03244]] SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning(https://arxiv.org/abs/2512.03244)
Keywords: generative
Abstract: Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
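
Aggregating multiple independent verifications at the step level can be sketched as a per-step majority vote. This is an illustrative assumption, not the SPARK implementation: the abstract does not specify the aggregation rule, and the function names and tie-breaking choice below are hypothetical.

```python
from collections import Counter


def aggregate_step_labels(verifications):
    """Majority-vote step labels across verifier runs.

    `verifications` is a list of runs; each run is a list of booleans,
    one per solution step (True means the step was judged correct).
    Ties are resolved conservatively as "incorrect".
    """
    n_steps = len(verifications[0])
    if any(len(run) != n_steps for run in verifications):
        raise ValueError("every verification must label the same steps")
    labels = []
    for step in range(n_steps):
        votes = Counter(run[step] for run in verifications)
        labels.append(votes[True] > votes[False])
    return labels


def first_error_index(labels):
    """Index of the earliest step voted incorrect, or -1 if none."""
    for index, is_correct in enumerate(labels):
        if not is_correct:
            return index
    return -1
```

The aggregated labels could then serve as synthetic step-level targets for fine-tuning a process reward model, with no reference solution involved.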

Title: PyroFocus: A Deep Learning Approach to Real-Time Wildfire Detection in Multispectral Remote Sensing Imagery

Authors: Mark Moussa, Andre Williams, Seth Roffe, Douglas Morton
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.03257
Pdf URL: https://arxiv.org/pdf/2512.03257
Copy Paste: [[2512.03257]] PyroFocus: A Deep Learning Approach to Real-Time Wildfire Detection in Multispectral Remote Sensing Imagery(https://arxiv.org/abs/2512.03257)
Keywords: generation
Abstract: Rapid and accurate wildfire detection is crucial for emergency response and environmental management. In airborne and spaceborne missions, real-time algorithms must distinguish between no fire, active fire, and post-fire conditions, and estimate fire intensity. Multispectral and hyperspectral thermal imagers provide rich spectral information, but high data dimensionality and limited onboard resources make real-time processing challenging. As wildfires increase in frequency and severity, the need for low-latency and computationally efficient onboard detection methods is critical. We present a systematic evaluation of multiple deep learning architectures, including custom Convolutional Neural Networks (CNNs) and Transformer-based models, for multi-class fire classification. We also introduce PyroFocus, a two-stage pipeline that performs fire classification followed by fire radiative power (FRP) regression or segmentation to reduce inference time and computational cost for onboard deployment. Using data from NASA's MODIS/ASTER Airborne Simulator (MASTER), which is similar to a next-generation fire detection sensor, we compare accuracy, inference latency, and resource efficiency. Experimental results show that the proposed two-stage pipeline achieves strong trade-offs between speed and accuracy, demonstrating significant potential for real-time edge deployment in future wildfire monitoring missions.
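
The two-stage idea (classify first, run the costlier second stage only where needed) can be sketched as follows. Everything here is a placeholder: the toy classifier, its thresholds, the toy FRP estimator, and the class names are assumptions standing in for the trained models described in the abstract.

```python
def two_stage_inference(tiles, classify, estimate_frp):
    """Run the second-stage estimator only on tiles classified as active fire."""
    results = []
    for tile in tiles:
        label = classify(tile)
        frp = estimate_frp(tile) if label == "active_fire" else None
        results.append((label, frp))
    return results


def toy_classify(tile):
    """Hypothetical stand-in classifier: thresholds on mean tile brightness."""
    mean = sum(tile) / len(tile)
    if mean > 0.8:
        return "active_fire"
    if mean > 0.4:
        return "post_fire"
    return "no_fire"


def toy_frp(tile):
    """Hypothetical stand-in for a fire radiative power regressor."""
    return sum(tile)
```

Because most tiles in a scene contain no active fire, skipping the second stage for them is where the inference-time saving would come from.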

Title: Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03324
Pdf URL: https://arxiv.org/pdf/2512.03324
Copy Paste: [[2512.03324]] Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs(https://arxiv.org/abs/2512.03324)
Keywords: generation
Abstract: Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
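
Score-based eviction under a memory budget can be sketched as below. This is not the TRIM-KV implementation: the exponential decay form, the shared `decay` constant, and all names are assumptions, and the base scores are supplied by hand where the paper would use a learned retention gate.

```python
from dataclasses import dataclass


@dataclass
class CachedToken:
    position: int      # step at which the token entered the cache
    base_score: float  # would come from a retention gate; given directly here


def current_score(token, now, decay=0.9):
    """Retention score after decaying for (now - position) steps (assumed form)."""
    return token.base_score * decay ** (now - token.position)


def evict_to_budget(cache, now, budget, decay=0.9):
    """Keep the `budget` highest-scoring tokens, returned in position order."""
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(cache, key=lambda t: current_score(t, now, decay), reverse=True)
    return sorted(ranked[:budget], key=lambda t: t.position)
```

In an actual model, one such cache would exist per layer and head, which is consistent with the abstract's remark that scores reflect layer- and head-specific utility.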

Title: Step-by-step Layered Design Generation

Authors: Faizan Farooq Khan, K J Joseph, Koustava Goswami, Mohamed Elhoseiny, Balaji Vasan Srinivasan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.03335
Pdf URL: https://arxiv.org/pdf/2512.03335
Copy Paste: [[2512.03335]] Step-by-step Layered Design Generation(https://arxiv.org/abs/2512.03335)
Keywords: generation
Abstract: Design generation, in its essence, is a step-by-step process where designers progressively refine and enhance their work through careful modifications. Despite this fundamental characteristic, existing approaches mainly treat design synthesis as a single-step generation problem, significantly underestimating the inherent complexity of the creative process. To bridge this gap, we propose a novel problem setting called Step-by-Step Layered Design Generation, which tasks a machine learning model with generating a design that adheres to a sequence of instructions from a designer. Leveraging recent advancements in multi-modal LLMs, we propose SLEDGE: Step-by-step LayEred Design GEnerator to model each update to a design as an atomic, layered change over its previous state, while being grounded in the instruction. To complement our new problem setting, we introduce a new evaluation suite, including a dataset and a benchmark. Our exhaustive experimental analysis and comparison with state-of-the-art approaches tailored to our new setup demonstrate the efficacy of our approach. We hope our work will attract attention to this pragmatic and under-explored research area.

Title: HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration

Authors: Seunghoi Kim, Henry F. J. Tregidgo, Chen Jin, Matteo Figini, Daniel C. Alexander
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03345
Pdf URL: https://arxiv.org/pdf/2512.03345
Copy Paste: [[2512.03345]] HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration(https://arxiv.org/abs/2512.03345)
Keywords: restoration, generative
Abstract: Generative models are prone to hallucinations: plausible but incorrect structures absent in the ground truth. This issue is problematic in image restoration for safety-critical domains such as medical imaging, industrial inspection, and remote sensing, where such errors undermine reliability and trust. For example, in low-field MRI, widely used in resource-limited settings, restoration models are essential for enhancing low-quality scans, yet hallucinations can lead to serious diagnostic errors. Progress has been hindered by a circular dependency: evaluating hallucinations requires labeled data, yet such labels are costly and subjective. We introduce HalluGen, a diffusion-based framework that synthesizes realistic hallucinations with controllable type, location, and severity, producing perceptually realistic but semantically incorrect outputs (segmentation IoU drops from 0.86 to 0.36). Using HalluGen, we construct the first large-scale hallucination dataset comprising 4,350 annotated images derived from 1,450 brain MR images for low-field enhancement, enabling systematic evaluation of hallucination detection and mitigation. We demonstrate its utility in two applications: (1) benchmarking image quality metrics and developing Semantic Hallucination Assessment via Feature Evaluation (SHAFE), a feature-based metric with soft-attention pooling that improves hallucination sensitivity over traditional metrics; and (2) training reference-free hallucination detectors that generalize to real restoration failures. Together, HalluGen and its open dataset establish the first scalable foundation for evaluating hallucinations in safety-critical image restoration.
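
The segmentation IoU quoted in the abstract is the standard intersection-over-union between two masks. A minimal sketch for binary masks stored as nested lists (the representation and the empty-mask convention are assumptions for illustration):

```python
def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0
```

A drop in IoU between the segmentation of a clean restoration and that of a hallucinated one indicates that the semantic content changed even if the image still looks realistic.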

Title: SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation

Authors: Yu Yuan, Tharindu Wickremasinghe, Zeeshan Nadir, Xijun Wang, Yiheng Chi, Stanley H. Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03350
Pdf URL: https://arxiv.org/pdf/2512.03350
Copy Paste: [[2512.03350]] SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation(https://arxiv.org/abs/2512.03350)
Keywords: generation
Abstract: Images and videos are discrete 2D projections of the 4D world (3D space + time). Most visual understanding, prediction, and generation methods operate directly on 2D observations, leading to suboptimal performance. We propose SeeU, a novel approach that learns the continuous 4D dynamics and generates the unseen visual content. The principle behind SeeU is a new 2D→4D→2D learning framework. SeeU first reconstructs the 4D world from sparse and monocular 2D frames (2D→4D). It then learns the continuous 4D dynamics on a low-rank representation and physical constraints (discrete 4D→continuous 4D). Finally, SeeU rolls the world forward in time, re-projects it back to 2D at sampled times and viewpoints, and generates unseen regions based on spatial-temporal context awareness (4D→2D). By modeling dynamics in 4D, SeeU achieves continuous and physically consistent novel visual generation, demonstrating strong potential in multiple tasks including unseen temporal generation, unseen spatial generation, and video editing.

Title: FireSentry: A Multi-Modal Spatio-temporal Benchmark Dataset for Fine-Grained Wildfire Spread Forecasting

Authors: Nan Zhou, Huandong Wang, Jiahao Li, Han Li, Yali Song, Qiuhua Wang, Yong Li, Xinlei Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03369
Pdf URL: https://arxiv.org/pdf/2512.03369
Copy Paste: [[2512.03369]] FireSentry: A Multi-Modal Spatio-temporal Benchmark Dataset for Fine-Grained Wildfire Spread Forecasting(https://arxiv.org/abs/2512.03369)
Keywords: generative
Abstract: Fine-grained wildfire spread prediction is crucial for enhancing emergency response efficacy and decision-making precision. However, existing research predominantly focuses on coarse spatiotemporal scales and relies on low-resolution satellite data, capturing only macroscopic fire states while fundamentally constraining high-precision localized fire dynamics modeling capabilities. To bridge this gap, we present FireSentry, a provincial-scale multi-modal wildfire dataset characterized by sub-meter spatial and sub-second temporal resolution. Collected using synchronized UAV platforms, FireSentry provides visible and infrared video streams, in-situ environmental measurements, and manually validated fire masks. Building on FireSentry, we establish a comprehensive benchmark encompassing physics-based, data-driven, and generative models, revealing the limitations of existing mask-only approaches. Our analysis proposes FiReDiff, a novel dual-modality paradigm that first predicts future video sequences in the infrared modality, and then precisely segments fire masks in the mask modality based on the generated dynamics. FiReDiff achieves state-of-the-art performance, with video quality gains of 39.2% in PSNR, 36.1% in SSIM, 50.0% in LPIPS, 29.4% in FVD, and mask accuracy gains of 3.3% in AUPRC, 59.1% in F1 score, 42.9% in IoU, and 62.5% in MSE when applied to generative models. The FireSentry benchmark dataset and FiReDiff paradigm collectively advance fine-grained wildfire forecasting and dynamic disaster simulation. The processed benchmark dataset is publicly available at: this https URL.

Title: MAGE-ID: A Multimodal Generative Framework for Intrusion Detection Systems

Authors: Mahdi Arab Loodaricheh, Mohammad Hossein Manshaei, Anita Raja
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.03375
Pdf URL: https://arxiv.org/pdf/2512.03375
Copy Paste: [[2512.03375]] MAGE-ID: A Multimodal Generative Framework for Intrusion Detection Systems(https://arxiv.org/abs/2512.03375)
Keywords: generative
Abstract: Modern Intrusion Detection Systems (IDS) face severe challenges due to heterogeneous network traffic, evolving cyber threats, and pronounced data imbalance between benign and attack flows. While generative models have shown promise in data augmentation, existing approaches are limited to single modalities and fail to capture cross-domain dependencies. This paper introduces MAGE-ID (Multimodal Attack Generator for Intrusion Detection), a diffusion-based generative framework that couples tabular flow features with their transformed images through a unified latent prior. By jointly training Transformer and CNN-based variational encoders with an EDM style denoiser, MAGE-ID achieves balanced and coherent multimodal synthesis. Evaluations on CIC-IDS-2017 and NSL-KDD demonstrate significant improvements in fidelity, diversity, and downstream detection performance over TabSyn and TabDDPM, highlighting the effectiveness of MAGE-ID for multimodal IDS augmentation.

Title: MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification

Authors: Yujian Zhao, Hankun Liu, Guanglin Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03404
Pdf URL: https://arxiv.org/pdf/2512.03404
Copy Paste: [[2512.03404]] MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification(https://arxiv.org/abs/2512.03404)
Keywords: generation
Abstract: Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to mitigate the optical-SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL) applies denoising SAR image processing and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. (2) Cross-modal Data Generation and Feature Fusion (CDGF) leverages a Brownian bridge diffusion model to synthesize cross-modal samples, which are subsequently fused with original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving notable improvements of +3.0%, +6.2%, and +16.4% in R1 accuracy under the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively. The code and trained models will be released upon publication.
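
A Brownian bridge is a stochastic path pinned at both endpoints, which is what makes it suitable for translating between two paired domains. The sketch below samples the textbook bridge marginal between two feature vectors; it is an assumption for illustration and does not reproduce the variance schedule or network of the diffusion model used in the paper. The function name and `sigma` parameter are hypothetical.

```python
import math
import random


def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample the bridge state at time t in [0, 1] between endpoints x0 and x1.

    The mean interpolates linearly between the endpoints, and the standard
    deviation sigma * sqrt(t * (1 - t)) vanishes at t = 0 and t = 1.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    std = sigma * math.sqrt(t * (1.0 - t))
    return [
        (1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, x1)
    ]
```

With `x0` standing for an optical-domain representation and `x1` for its SAR counterpart (or the reverse), intermediate samples are noisy blends, while the endpoints are recovered exactly.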

Title: Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation

Authors: Xieji Li, Siyuan Yan, Yingsheng Liu, H. Peter Soyer, Monika Janda, Victoria Mar, Zongyuan Ge
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03445
Pdf URL: https://arxiv.org/pdf/2512.03445
Copy Paste: [[2512.03445]] Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation(https://arxiv.org/abs/2512.03445)
Keywords: generation
Abstract: Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at this https URL.

Title: KeyPointDiffuser: Unsupervised 3D Keypoint Learning via Latent Diffusion Models

Authors: Rhys Newbury, Juyan Zhang, Tin Tran, Hanna Kurniawati, Dana Kulić
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.03450
Pdf URL: https://arxiv.org/pdf/2512.03450
Copy Paste: [[2512.03450]] KeyPointDiffuser: Unsupervised 3D Keypoint Learning via Latent Diffusion Models(https://arxiv.org/abs/2512.03450)
Keywords: generative
Abstract: Understanding and representing the structure of 3D objects in an unsupervised manner remains a core challenge in computer vision and graphics. Most existing unsupervised keypoint methods are not designed for unconditional generative settings, restricting their use in modern 3D generative pipelines; our formulation explicitly bridges this gap. We present an unsupervised framework for learning spatially structured 3D keypoints from point cloud data. These keypoints serve as a compact and interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints exhibit repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. Our method achieves strong performance across diverse object categories, yielding a 6 percentage-point improvement in keypoint consistency compared to prior approaches.

Title: GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers

Authors: Zhiye Song, Steve Dai, Ben Keller, Brucek Khailany
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.03451
Pdf URL: https://arxiv.org/pdf/2512.03451
Copy Paste: [[2512.03451]] GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers(https://arxiv.org/abs/2512.03451)
Keywords: generation
Abstract: Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
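The remark that CFG doubles the required compute follows from the standard classifier-free guidance formula, which needs one unconditional and one conditional forward pass per denoising step. The sketch below illustrates only that general formula, not GalaxyDiT's reuse scheme; `CountingToyModel` and `cfg_noise_estimate` are hypothetical names and the "model" is a toy stand-in.

```python
from typing import Callable, List

Vector = List[float]


def cfg_noise_estimate(
    model: Callable[[Vector, bool], Vector],
    x: Vector,
    guidance_scale: float,
) -> Vector:
    """Standard CFG combination: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_uncond = model(x, False)  # forward pass 1: prompt dropped
    eps_cond = model(x, True)     # forward pass 2: prompt included
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


class CountingToyModel:
    """Toy stand-in for a denoiser (an assumption); it only counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector, conditioned: bool) -> Vector:
        self.calls += 1
        shift = 1.0 if conditioned else 0.0
        return [v + shift for v in x]


toy = CountingToyModel()
guided = cfg_noise_estimate(toy, [0.0, 2.0], guidance_scale=3.0)
print(guided, toy.calls)
```

Each guided step costs two model evaluations, which is why reusing or skipping one of the two branches is a natural target for acceleration.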

Title: GeoVideo: Introducing Geometric Regularization into Video Generation Model

Authors: Yunpeng Bai, Shaoheng Fang, Chaohui Yu, Fan Wang, Qixing Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03453
Pdf URL: https://arxiv.org/pdf/2512.03453
Copy Paste: [[2512.03453]] GeoVideo: Introducing Geometric Regularization into Video Generation Model(https://arxiv.org/abs/2512.03453)
Keywords: generation
Abstract: Recent advances in video generation have enabled the synthesis of high-quality and visually realistic clips using diffusion transformer models. However, most existing approaches operate purely in the 2D pixel space and lack explicit mechanisms for modeling 3D structures, often resulting in temporally inconsistent geometries, implausible motions, and structural artifacts. In this work, we introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. We adopt depth as the geometric representation because of the great progress in depth prediction and its compatibility with image-based latent encoders. Specifically, to enforce structural consistency over time, we propose a multi-view geometric loss that aligns the predicted depth maps across frames within a shared 3D coordinate system. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved spatio-temporal coherence, shape consistency, and physical plausibility. Experiments across multiple datasets show that our approach produces significantly more stable and geometrically consistent results than existing baselines.

Title: Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Authors: Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03454
Pdf URL: https://arxiv.org/pdf/2512.03454
Copy Paste: [[2512.03454]] Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles(https://arxiv.org/abs/2512.03454)
Keywords: generation
Abstract: Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.

Title: Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models

Authors: Shojiro Yamabe, Futa Waseda, Daiki Shiono, Tsubasa Takahashi
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.03463
Pdf URL: https://arxiv.org/pdf/2512.03463
Copy Paste: [[2512.03463]] Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models(https://arxiv.org/abs/2512.03463)
Keywords: generation
Abstract: Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.

Title: Joint Progression Modeling (JPM): A Probabilistic Framework for Mixed-Pathology Progression

Authors: Hongtao Hao, Joseph L. Austerweil
Subjects: cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2512.03475
Pdf URL: https://arxiv.org/pdf/2512.03475
Copy Paste: [[2512.03475]] Joint Progression Modeling (JPM): A Probabilistic Framework for Mixed-Pathology Progression(https://arxiv.org/abs/2512.03475)
Keywords: generation
Abstract: Event-based models (EBMs) infer disease progression from cross-sectional data, and standard EBMs assume a single underlying disease per individual. In contrast, mixed pathologies are common in neurodegeneration. We introduce the Joint Progression Model (JPM), a probabilistic framework that treats single-disease trajectories as partial rankings and builds a prior over joint progressions. We study several JPM variants (Pairwise, Bradley-Terry, Plackett-Luce, and Mallows) and analyze three properties: (i) calibration -- whether lower model energy predicts smaller distance to the ground truth ordering; (ii) separation -- the degree to which sampled rankings are distinguishable from random permutations; and (iii) sharpness -- the stability of sampled aggregate rankings. All variants are calibrated, and all achieve near-perfect separation; sharpness varies by variant and is well-predicted by simple features of the input partial rankings (number and length of rankings, conflict, and overlap). In synthetic experiments, JPM improves ordering accuracy by roughly 21 percent over a strong EBM baseline (SA-EBM) that treats the joint disease as a single condition. Finally, using NACC, we find that the Mallows variant of JPM and the baseline model (SA-EBM) have results that are more consistent with prior literature on the possible disease progression of the mixed pathology of AD and VaD.
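For readers unfamiliar with the Mallows variant named above, the classical Mallows model assigns a ranking a probability that decays exponentially with its distance from a central ranking. The sketch below illustrates that textbook definition only; the choice of Kendall tau distance, the parameter name `theta`, and the toy event labels are assumptions, not details taken from the JPM paper.

```python
import math
from itertools import permutations


def kendall_tau_distance(a, b):
    """Count item pairs that the two rankings order differently."""
    pos_b = {item: i for i, item in enumerate(b)}
    mapped = [pos_b[item] for item in a]
    return sum(
        1
        for i in range(len(mapped))
        for j in range(i + 1, len(mapped))
        if mapped[i] > mapped[j]
    )


def mallows_distribution(center, theta):
    """P(ranking) proportional to exp(-theta * d(ranking, center)).

    Enumerates all permutations, so this is only practical for a few items.
    """
    perms = list(permutations(center))
    weights = [math.exp(-theta * kendall_tau_distance(p, center)) for p in perms]
    z = sum(weights)  # normalizing constant
    return {p: w / z for p, w in zip(perms, weights)}


center = ("A", "B", "C")  # hypothetical event ordering
dist = mallows_distribution(center, theta=1.0)
for ranking, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(ranking, round(prob, 3))
```

A larger `theta` concentrates probability mass on rankings close to the center, while `theta = 0` gives the uniform distribution over permutations.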

Title: Towards Object-centric Understanding for Instructional Videos

Authors: Wenliang Guo, Yu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03479
Pdf URL: https://arxiv.org/pdf/2512.03479
Copy Paste: [[2512.03479]] Towards Object-centric Understanding for Instructional Videos(https://arxiv.org/abs/2512.03479)
Keywords: generation
Abstract: Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning: state evolution, precondition verification, counterfactual reasoning, and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis, and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle with object-level recognition and reasoning, whereas our framework achieves substantial improvement.

Title: CSMapping: Scalable Crowdsourced Semantic Mapping and Topology Inference for Autonomous Driving

Authors: Zhijian Qiao, Zehuan Yu, Tong Li, Chih-Chung Chou, Wenchao Ding, Shaojie Shen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.03510
Pdf URL: https://arxiv.org/pdf/2512.03510
Copy Paste: [[2512.03510]] CSMapping: Scalable Crowdsourced Semantic Mapping and Topology Inference for Autonomous Driving(https://arxiv.org/abs/2512.03510)
Keywords: generative
Abstract: Crowdsourcing enables scalable autonomous driving map construction, but low-cost sensor noise hinders quality from improving with data volume. We propose CSMapping, a system that produces accurate semantic maps and topological road centerlines whose quality consistently increases with more crowdsourced data. For semantic mapping, we train a latent diffusion model on HD maps (optionally conditioned on SD maps) to learn a generative prior of real-world map structure, without requiring paired crowdsourced/HD-map supervision. This prior is incorporated via constrained MAP optimization in latent space, ensuring robustness to severe noise and plausible completion in unobserved areas. Initialization uses a robust vectorized mapping module followed by diffusion inversion; optimization employs efficient Gaussian-basis reparameterization, projected gradient descent with multi-start, and a latent-space factor graph for global consistency. For topological mapping, we apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories, yielding smooth, human-like centerlines robust to trajectory variation. Experiments on nuScenes, Argoverse 2, and a large proprietary dataset achieve state-of-the-art semantic and topological mapping performance, with thorough ablation and scalability studies.

Title: FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

Authors: Yiyi Cai, Yuhan Wu, Kunhang Li, You Zhou, Bo Zheng, Haiyang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03520
Pdf URL: https://arxiv.org/pdf/2512.03520
Copy Paste: [[2512.03520]] FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation(https://arxiv.org/abs/2512.03520)
Keywords: generation
Abstract: We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that, to guarantee modeling the output distribution, vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying way. With these improvements, we demonstrate for the first time that the diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available at this https URL.

Title: Adaptive sampling using variational autoencoder and reinforcement learning

Authors: Adil Rasheed, Mikael Aleksander Jansen Shahly, Muhammad Faisal Aftab
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.03525
Pdf URL: https://arxiv.org/pdf/2512.03525
Copy Paste: [[2512.03525]] Adaptive sampling using variational autoencoder and reinforcement learning(https://arxiv.org/abs/2512.03525)
Keywords: generative
Abstract: Compressed sensing (CS) enables sparse sampling but relies on generic bases and random measurements, limiting efficiency and reconstruction quality. Optimal sensor placement (OSP) uses historical data to design tailored sampling patterns, yet its fixed, linear bases cannot adapt to nonlinear or sample-specific variations. Generative model-based compressed sensing improves reconstruction using deep generative priors but still employs suboptimal random sampling. We propose an adaptive sparse sensing framework that couples a variational autoencoder prior with reinforcement learning to select measurements sequentially. Experiments show that this approach outperforms CS, OSP, and generative model-based reconstruction from sparse measurements.

Title: OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Authors: Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03532
Pdf URL: https://arxiv.org/pdf/2512.03532
Copy Paste: [[2512.03532]] OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation(https://arxiv.org/abs/2512.03532)
Keywords: generation
Abstract: Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.

Title: Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation

Authors: Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu, Jinwoo Shin, Tobias Hinz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03534
Pdf URL: https://arxiv.org/pdf/2512.03534
Copy Paste: [[2512.03534]] Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation(https://arxiv.org/abs/2512.03534)
Keywords: generation
Abstract: Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: this https URL.

Title: CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

Authors: Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, Wen-Huang Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03540
Pdf URL: https://arxiv.org/pdf/2512.03540
Copy Paste: [[2512.03540]] CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation(https://arxiv.org/abs/2512.03540)
Keywords: generation
Abstract: Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.

Title: Towards Irreversible Machine Unlearning for Diffusion Models

Authors: Xun Yuan, Zilong Zhao, Jiayu Li, Aryan Pasikhani, Prosanta Gope, Biplab Sikdar
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2512.03564
Pdf URL: https://arxiv.org/pdf/2512.03564
Copy Paste: [[2512.03564]] Towards Irreversible Machine Unlearning for Diffusion Models(https://arxiv.org/abs/2512.03564)
Keywords: generation, generative
Abstract: Diffusion models are renowned for their state-of-the-art performance in generating synthetic images. However, concerns related to safety, privacy, and copyright highlight the need for machine unlearning, which can make diffusion models forget specific training data and prevent the generation of sensitive or unwanted content. Current machine unlearning methods for diffusion models are primarily designed for conditional diffusion models and focus on unlearning specific data classes or features. Among these methods, finetuning-based machine unlearning methods, which update the parameters of pre-trained diffusion models by minimizing carefully designed loss functions, are recognized for their efficiency and effectiveness. However, in this paper, we propose a novel attack named Diffusion Model Relearning Attack (DiMRA), which can reverse finetuning-based machine unlearning methods, exposing a significant vulnerability in this kind of technique. Without prior knowledge of the unlearning elements, DiMRA optimizes the unlearned diffusion model on an auxiliary dataset to reverse the unlearning, enabling the model to regenerate previously unlearned elements. To mitigate this vulnerability, we propose a novel machine unlearning method for diffusion models, termed Diffusion Model Unlearning by Memorization (DiMUM). Unlike traditional methods that focus on forgetting, DiMUM memorizes alternative data or features to replace targeted unlearning data or features in order to prevent generating such elements. In our experiments, we demonstrate the effectiveness of DiMRA in reversing state-of-the-art finetuning-based machine unlearning methods for diffusion models, highlighting the need for more robust solutions. We extensively evaluate DiMUM, demonstrating its superior ability to preserve the generative performance of diffusion models while enhancing robustness against DiMRA.

Title: GAOT: Generating Articulated Objects Through Text-Guided Diffusion Models

Authors: Hao Sun, Lei Fan, Donglin Di, Shaohui Liu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2512.03566
Pdf URL: https://arxiv.org/pdf/2512.03566
Copy Paste: [[2512.03566]] GAOT: Generating Articulated Objects Through Text-Guided Diffusion Models(https://arxiv.org/abs/2512.03566)
Keywords: generation
Abstract: Articulated object generation has seen increasing advancements, yet existing models often lack the ability to be conditioned on text prompts. To address the significant gap between textual descriptions and 3D articulated object representations, we propose GAOT, a three-phase framework that generates articulated objects from text prompts by leveraging diffusion models and hypergraph learning. First, we fine-tune a point cloud generation model to produce a coarse representation of objects from text prompts. Given the inherent connection between articulated objects and graph structures, we design a hypergraph-based learning method to refine these coarse representations, representing object parts as graph vertices. Finally, leveraging a diffusion model, the joints of articulated objects, represented as graph edges, are generated based on the object parts. Extensive qualitative and quantitative experiments on the PartNet-Mobility dataset demonstrate the effectiveness of our approach, achieving superior performance over previous methods.

Title: Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation

Authors: Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Jie Wang, Feidiao Yang, Yuxing Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03590
Pdf URL: https://arxiv.org/pdf/2512.03590
Copy Paste: [[2512.03590]] Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation(https://arxiv.org/abs/2512.03590)
Keywords: generation
Abstract: Handling fast, complex, and highly non-linear motion patterns has long posed challenges for video frame interpolation. Although recent diffusion-based approaches improve upon traditional optical-flow-based methods, they still struggle to cover diverse application scenarios and often fail to produce sharp, temporally consistent frames in fine-grained motion tasks such as audio-visual synchronized interpolation. To address these limitations, we introduce BBF (Beyond Boundary Frames), a context-aware video frame interpolation framework that can be guided by audio/visual semantics. First, we enhance the input design of the interpolation model so that it can flexibly handle multiple conditional modalities, including text, audio, images, and video. Second, we propose a decoupled multimodal fusion mechanism that sequentially injects different conditional signals into a DiT backbone. Finally, to maintain the generation abilities of the foundation model, we adopt a progressive multi-stage training paradigm, where the start-end frame difference embedding is used to dynamically adjust both the data sampling and the loss weighting. Extensive experimental results demonstrate that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation tasks, establishing a unified framework for video frame interpolation under coordinated multi-channel conditioning.

Title: Harnessing Hypergraphs in Geometric Deep Learning for 3D RNA Inverse Folding

Authors: Guang Yang, Lei Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03592
Pdf URL: https://arxiv.org/pdf/2512.03592
Copy Paste: [[2512.03592]] Harnessing Hypergraphs in Geometric Deep Learning for 3D RNA Inverse Folding(https://arxiv.org/abs/2512.03592)
Keywords: generation, generative
Abstract: The RNA inverse folding problem, a key challenge in RNA design, involves identifying nucleotide sequences that can fold into desired secondary structures, which are critical for ensuring molecular stability and function. The inherent complexity of this task stems from the intricate relationship between sequence and structure, making it particularly challenging. In this paper, we propose a framework, named HyperRNA, a generative model with an encoder-decoder architecture that leverages hypergraphs to design RNA sequences. Specifically, our HyperRNA model consists of three main components: preprocessing, encoding and decoding. In the preprocessing stage, graph structures are constructed by extracting the atom coordinates of RNA backbone based on 3-bead coarse-grained representation. The encoding stage processes these graphs, capturing higher order dependencies and complex biomolecular interactions using an attention embedding module and a hypergraph-based encoder. Finally, the decoding stage generates the RNA sequence in an autoregressive manner. We conducted quantitative and qualitative experiments on the PDBBind and RNAsolo datasets to evaluate the inverse folding task for RNA sequence generation and RNA-protein complex sequence generation. The experimental results demonstrate that HyperRNA not only outperforms existing RNA design methods but also highlights the potential of leveraging hypergraphs in RNA engineering.

Title: LAMP: Language-Assisted Motion Planning for Controllable Video Generation

Authors: Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Erkut Erdem, Aykut Erdem, Duygu Ceylan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03619
Pdf URL: https://arxiv.org/pdf/2512.03619
Copy Paste: [[2512.03619]] LAMP: Language-Assisted Motion Planning for Controllable Video Generation(https://arxiv.org/abs/2512.03619)
Keywords: generation
Abstract: Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, which leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing the program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications.

Title: ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation

Authors: Yaokun Li, Shuaixian Wang, Mantang Guo, Jiehui Huang, Taojun Ding, Mu Hu, Kaixuan Wang, Shaojie Shen, Guang Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03621
Pdf URL: https://arxiv.org/pdf/2512.03621
Copy Paste: [[2512.03621]] ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation(https://arxiv.org/abs/2512.03621)
Keywords: restoration, generation
Abstract: We propose ReCamDriving, a purely vision-based, camera-controlled novel-trajectory video generation framework. While repair-based methods fail to restore complex artifacts and LiDAR-based approaches rely on sparse and incomplete cues, ReCamDriving leverages dense and scene-complete 3DGS renderings for explicit geometric guidance, achieving precise camera-controllable generation. To mitigate overfitting to restoration behaviors when conditioned on 3DGS renderings, ReCamDriving adopts a two-stage training paradigm: the first stage uses camera poses for coarse control, while the second stage incorporates 3DGS renderings for fine-grained viewpoint and geometric guidance. Furthermore, we present a 3DGS-based cross-trajectory data curation strategy to eliminate the train-test gap in camera transformation patterns, enabling scalable multi-trajectory supervision from monocular videos. Based on this strategy, we construct the ParaDrive dataset, containing over 110K parallel-trajectory video pairs. Extensive experiments demonstrate that ReCamDriving achieves state-of-the-art camera controllability and structural consistency.

Title: The promising potential of vision language models for the generation of textual weather forecasts

Authors: Edward C. C. Steele, Dinesh Mane, Emilio Monti, Luis Orus, Rebecca Chantrill-Cheyette, Matthew Couch, Kirstine I. Dale, Simon Eaton, Govindarajan Rangarajan, Amir Majlesi, Steven Ramsdale, Michael Sharpe, Craig Smith, Jonathan Smith, Rebecca Yates, Holly Ellis, Charles Ewen
Subjects: cs.LG, cs.AI, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2512.03623
Pdf URL: https://arxiv.org/pdf/2512.03623
Copy Paste: [[2512.03623]] The promising potential of vision language models for the generation of textual weather forecasts(https://arxiv.org/abs/2512.03623)
Keywords: generation
Abstract: Despite the promising capability of multimodal foundation models, their application to the generation of meteorological products and services remains nascent. To accelerate aspiration and adoption, we explore the novel use of a vision language model for writing the iconic Shipping Forecast text directly from video-encoded gridded weather data. These early results demonstrate promising scalable technological opportunities for enhancing production efficiency and service innovation within the weather enterprise and beyond.

Title: Dynamically Scaled Activation Steering

Authors: Alex Ferrando, Xavier Suau, Jordi Gonzàlez, Pau Rodriguez
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03661
Pdf URL: https://arxiv.org/pdf/2512.03661
Copy Paste: [[2512.03661]] Dynamically Scaled Activation Steering(https://arxiv.org/abs/2512.03661)
Keywords: generation, generative
Abstract: Activation steering has emerged as a powerful method for guiding the behavior of generative models towards desired outcomes such as toxicity mitigation. However, most existing methods apply interventions uniformly across all inputs, degrading model performance when steering is unnecessary. We introduce Dynamically Scaled Activation Steering (DSAS), a method-agnostic steering framework that decouples when to steer from how to steer. DSAS adaptively modulates the strength of existing steering transformations across layers and inputs, intervening strongly only when undesired behavior is detected. At generation time, DSAS computes context-dependent scaling factors that selectively adjust the strength of any steering method. We also show how DSAS can be jointly optimized end-to-end together with the steering function. When combined with existing steering methods, DSAS consistently improves the Pareto front with respect to steering alone, achieving a better trade-off between toxicity mitigation and utility preservation. We further demonstrate DSAS's generality by applying it to a text-to-image diffusion model, showing how adaptive steering allows the modulation of specific concepts. Finally, DSAS introduces minimal computational overhead while improving interpretability, pinpointing which tokens require steering and by how much.
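The idea of decoupling "when to steer" from "how to steer" can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the scaling factor is taken to be the sigmoid of a hypothetical linear probe (`probe_weights`, `probe_bias`), and `steer_fn` stands in for any existing steering transformation. The scaled output interpolates between the original activation and the fully steered one.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def scaled_steer(
    hidden: Sequence[float],
    steer_fn: Callable[[Sequence[float]], Vector],
    probe_weights: Sequence[float],
    probe_bias: float,
) -> Tuple[Vector, float]:
    """Apply steer_fn with a strength chosen per activation.

    The probe decides *when* to steer (scale in [0, 1]);
    steer_fn decides *how* to steer. Both are hypothetical stand-ins.
    """
    logit = sum(w * h for w, h in zip(probe_weights, hidden)) + probe_bias
    scale = sigmoid(logit)
    steered = steer_fn(hidden)
    # Interpolate: scale = 0 leaves the activation untouched, scale = 1 steers fully.
    output = [h + scale * (s - h) for h, s in zip(hidden, steered)]
    return output, scale


def shift_first_dim(hidden: Sequence[float]) -> Vector:
    """Toy steering transformation: add 1.0 to the first coordinate."""
    return [hidden[0] + 1.0] + list(hidden[1:])


output, scale = scaled_steer([1.0, 2.0], shift_first_dim, [0.0, 0.0], 0.0)
print(output, scale)
```

With a probe that is confident no steering is needed (a strongly negative logit), the activation passes through essentially unchanged, which is the behavior the abstract credits with preserving utility.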

Title: Structured Uncertainty Similarity Score (SUSS): Learning a Probabilistic, Interpretable, Perceptual Metric Between Images

Authors: Paula Seidler, Neill D. F. Campbell, Ivor J A Simpson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03701
Pdf URL: https://arxiv.org/pdf/2512.03701
Copy Paste: [[2512.03701]] Structured Uncertainty Similarity Score (SUSS): Learning a Probabilistic, Interpretable, Perceptual Metric Between Images(https://arxiv.org/abs/2512.03701)
Keywords: generative
Abstract: Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties. We introduce the Structured Uncertainty Similarity Score (SUSS); it models each image through a set of perceptual components, each represented by a structured multivariate Normal distribution. These are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling. SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.
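The "weighted sum of component log-probabilities" can be made concrete with a small sketch. It is deliberately simplified and should not be read as the SUSS model: each component is assumed to be a zero-mean Normal with a diagonal covariance over the raw pixel residual, whereas the abstract describes structured covariances and learned, image-specific transformations. The names `component_variances` and `weights` are hypothetical.

```python
import math
from typing import Sequence


def diag_gaussian_logpdf(residual: Sequence[float], variances: Sequence[float]) -> float:
    """Log-density of a zero-mean Normal with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + r * r / var)
        for r, var in zip(residual, variances)
    )


def similarity_score(
    image_a: Sequence[float],
    image_b: Sequence[float],
    component_variances: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Weighted sum of per-component log-probabilities of the pixel residual."""
    residual = [a - b for a, b in zip(image_a, image_b)]
    return sum(
        w * diag_gaussian_logpdf(residual, variances)
        for w, variances in zip(weights, component_variances)
    )


# Two toy components over a 2-pixel "image": one tight, one tolerant.
components = [[1.0, 1.0], [4.0, 4.0]]
weights = [0.7, 0.3]
same = similarity_score([0.2, 0.8], [0.2, 0.8], components, weights)
different = similarity_score([0.2, 0.8], [0.9, 0.1], components, weights)
print(same > different)
```

A higher score means the residual is more plausible under the components, so identical images score highest in this toy setup.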

Title: PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention

Authors: Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, Mingming Gong
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.03724
Pdf URL: https://arxiv.org/pdf/2512.03724
Copy Paste: [[2512.03724]] PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention(https://arxiv.org/abs/2512.03724)
Keywords: generation
Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.

Title: LSRS: Latent Scale Rejection Sampling for Visual Autoregressive Modeling

Authors: Hong-Kai Zheng, Piji Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03796
Pdf URL: https://arxiv.org/pdf/2512.03796
Copy Paste: [[2512.03796]] LSRS: Latent Scale Rejection Sampling for Visual Autoregressive Modeling(https://arxiv.org/abs/2512.03796)
Keywords: generation
Abstract: The Visual Autoregressive (VAR) modeling approach for image generation proposes autoregressive processing across hierarchical scales, decoding multiple tokens per scale in parallel. This method achieves high-quality generation while accelerating synthesis. However, parallel token sampling within a scale may lead to structural errors, resulting in suboptimal generated images. To mitigate this, we propose Latent Scale Rejection Sampling (LSRS), a method that progressively refines token maps in the latent scale during inference to enhance VAR models. Our method uses a lightweight scoring model to evaluate multiple candidate token maps sampled at each scale, selecting a high-quality map to guide subsequent scale generation. By prioritizing early scales critical for structural coherence, LSRS effectively mitigates autoregressive error accumulation while maintaining computational efficiency. Experiments demonstrate that LSRS significantly improves VAR's generation quality with minimal additional computational overhead. For the VAR-d30 model, LSRS increases the inference time by merely 1% while reducing its FID score from 1.95 to 1.78. When the inference time is increased by 15%, the FID score can be further reduced to 1.66. LSRS offers an efficient test-time scaling solution for enhancing VAR-based generation.
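The per-scale selection loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `sample_fn` and `score_fn` are hypothetical stand-ins for the VAR sampler and the lightweight scoring model, and the same number of candidates is drawn at every scale, whereas the abstract suggests early scales are prioritized.

```python
import random
from typing import Callable, List

TokenMap = List[int]


def generate_with_rejection(
    sample_fn: Callable[[int, List[TokenMap], random.Random], TokenMap],
    score_fn: Callable[[int, TokenMap], float],
    num_scales: int,
    num_candidates: int,
    rng: random.Random,
) -> List[TokenMap]:
    """At every scale, sample several token maps and keep the best-scoring one."""
    chosen: List[TokenMap] = []
    for scale in range(num_scales):
        # Draw candidates conditioned on the token maps kept at earlier scales.
        candidates = [sample_fn(scale, chosen, rng) for _ in range(num_candidates)]
        # The kept candidate conditions all later scales, so errors are caught early.
        best = max(candidates, key=lambda tokens: score_fn(scale, tokens))
        chosen.append(best)
    return chosen


def toy_sampler(scale: int, previous: List[TokenMap], rng: random.Random) -> TokenMap:
    """Toy sampler: a (scale+1) x (scale+1) map of random token ids."""
    side = scale + 1
    return [rng.randrange(10) for _ in range(side * side)]


def toy_scorer(scale: int, tokens: TokenMap) -> float:
    """Toy scorer: prefers maps with larger token ids."""
    return float(sum(tokens))


maps = generate_with_rejection(
    toy_sampler, toy_scorer, num_scales=3, num_candidates=4, rng=random.Random(0)
)
print([len(m) for m in maps])
```

The extra cost is `num_candidates` samples plus scoring per scale; because early scales contain few tokens, drawing additional candidates there is cheap, which is consistent with the small overhead reported in the abstract.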

Title: CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation

Authors: Letian Zhou, Songhua Liu, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03844
Pdf URL: https://arxiv.org/pdf/2512.03844
Copy Paste: [[2512.03844]] CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation(https://arxiv.org/abs/2512.03844)
Keywords: generative
Abstract: Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% at the 50-images-per-class (IPC) setup on ImageNet-1K. Our code is available on the project webpage: this https URL

Title: PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation, Diagnosis, and Few-Shot Cross-Modality Clinical Adaptation

Authors: Hania Ghouse, Maryam Alsharqi, Farhad R. Nezami, Muzammil Behzad
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03848
Pdf URL: https://arxiv.org/pdf/2512.03848
Copy Paste: [[2512.03848]] PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation, Diagnosis, and Few-Shot Cross-Modality Clinical Adaptation(https://arxiv.org/abs/2512.03848)
Keywords: generation
Abstract: Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region overlap learning, pixel-wise classification fidelity, and boundary-aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output, allowing the model to transition from pixels to structures and finally to clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation-style cardiac analysis framework.

Title: Traffic Image Restoration under Adverse Weather via Frequency-Aware Mamba

Authors: Liwen Pan, Longguang Wang, Guangwei Gao, Jun Wang, Jun Shi, Juncheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03852
Pdf URL: https://arxiv.org/pdf/2512.03852
Copy Paste: [[2512.03852]] Traffic Image Restoration under Adverse Weather via Frequency-Aware Mamba(https://arxiv.org/abs/2512.03852)
Keywords: restoration
Abstract: Traffic image restoration under adverse weather conditions remains a critical challenge for intelligent transportation systems. Existing methods primarily focus on spatial-domain modeling but neglect frequency-domain priors. Although the emerging Mamba architecture excels at long-range dependency modeling through patch-wise correlation analysis, its potential for frequency-domain feature extraction remains unexplored. To address this, we propose Frequency-Aware Mamba (FAMamba), a novel framework that integrates frequency guidance with sequence modeling for efficient image restoration. Our architecture consists of two key components: (1) a Dual-Branch Feature Extraction Block (DFEB) that enhances local-global interaction via bidirectional 2D frequency-adaptive scanning, dynamically adjusting traversal paths based on sub-band texture distributions; and (2) a Prior-Guided Block (PGB) that refines texture details through wavelet-based high-frequency residual learning, enabling high-quality image reconstruction with precise details. Meanwhile, we design a novel Adaptive Frequency Scanning Mechanism (AFSM) for the Mamba architecture, which enables the Mamba to achieve frequency-domain scanning across distinct subgraphs, thereby fully leveraging the texture distribution characteristics inherent in subgraph structures. Extensive experiments demonstrate the efficiency and effectiveness of FAMamba.

Title: Automatic Attack Discovery for Few-Shot Class-Incremental Learning via Large Language Models

Authors: Haidong Kang, Wei Wu, Hanling Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.03882
Pdf URL: https://arxiv.org/pdf/2512.03882
Copy Paste: [[2512.03882]] Automatic Attack Discovery for Few-Shot Class-Incremental Learning via Large Language Models(https://arxiv.org/abs/2512.03882)
Keywords: generation
Abstract: Few-shot class-incremental learning (FSCIL) is a more realistic and challenging paradigm in continual learning: the model must incrementally learn unseen classes from only a few training examples while overcoming catastrophic forgetting on base classes. Previous efforts have primarily centered on studying more effective FSCIL approaches. By contrast, less attention has been devoted to the security issues of FSCIL. This paper aims to provide a holistic study of the impact of attacks on FSCIL. We first derive insights by systematically exploring how human expert-designed attack methods (i.e., PGD, FGSM) affect FSCIL. We find that those methods either fail to attack base classes or incur huge labor costs because they rely on extensive expert knowledge. This highlights the need to craft a specialized attack method for FSCIL. Grounded in these insights, we propose a simple yet effective ACraft method that automatically steers and discovers optimal attack methods targeted at FSCIL by leveraging Large Language Models (LLMs) without human experts. Moreover, to improve the reasoning between LLMs and FSCIL, we introduce a novel Proximal Policy Optimization (PPO) based reinforcement learning scheme to optimize learning, making LLMs generate better attack methods in the next generation by establishing positive feedback. Experiments on mainstream benchmarks show that our ACraft significantly degrades the performance of state-of-the-art FSCIL methods and dramatically outperforms human expert-designed attack methods while maintaining the lowest attack cost.

Title: Probabilistic Foundations of Fuzzy Simplicial Sets for Nonlinear Dimensionality Reduction

Authors: Janis Keck, Lukas Silvester Barth, Fatemeh (Hannaneh) Fahimi, Parvaneh Joharinad, Jürgen Jost
Subjects: cs.LG, math.AT, stat.ML
Abstract URL: https://arxiv.org/abs/2512.03899
Pdf URL: https://arxiv.org/pdf/2512.03899
Copy Paste: [[2512.03899]] Probabilistic Foundations of Fuzzy Simplicial Sets for Nonlinear Dimensionality Reduction(https://arxiv.org/abs/2512.03899)
Keywords: generative
Abstract: Fuzzy simplicial sets have become an object of interest in dimensionality reduction and manifold learning, most prominently through their role in UMAP. However, their definition through tools from algebraic topology without a clear probabilistic interpretation detaches them from commonly used theoretical frameworks in those areas. In this work we introduce a framework that explains fuzzy simplicial sets as marginals of probability measures on simplicial sets. In particular, this perspective shows that the fuzzy weights of UMAP arise from a generative model that samples Vietoris-Rips filtrations at random scales, yielding cumulative distribution functions of pairwise distances. More generally, the framework connects fuzzy simplicial sets to probabilistic models on the face poset, clarifies the relation between Kullback-Leibler divergence and fuzzy cross-entropy in this setting, and recovers standard t-norms and t-conorms via Boolean operations on the underlying simplicial sets. We then show how new embedding methods may be derived from this framework and illustrate this on an example where we generalize UMAP using Čech filtrations with triplet sampling. In summary, this probabilistic viewpoint provides a unified probabilistic theoretical foundation for fuzzy simplicial sets, clarifies the role of UMAP within this framework, and enables the systematic derivation of new dimensionality reduction methods.
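One way to see how a fuzzy edge weight can arise from a random-scale Vietoris-Rips construction is the small numerical check below. It is a sketch under assumptions that do not come from the abstract: the random scale is taken to be Exp(1)-distributed, and distances are normalized with the usual UMAP local parameters `rho` and `sigma`. An edge belongs to the Vietoris-Rips complex at scale r when its (normalized) length is at most r, so its membership probability is the probability that the random scale exceeds that length, which for an exponential scale reproduces the familiar UMAP weight exp(-max(0, d - rho) / sigma).

```python
import math
import random


def membership_closed_form(distance: float, rho: float, sigma: float) -> float:
    """Standard UMAP-style fuzzy edge weight."""
    return math.exp(-max(0.0, distance - rho) / sigma)


def membership_monte_carlo(
    distance: float, rho: float, sigma: float, num_samples: int, rng: random.Random
) -> float:
    """Fraction of random scales at which the edge is present in the complex.

    Assumption: the filtration scale is drawn from Exp(1).
    """
    normalized = max(0.0, distance - rho) / sigma
    hits = sum(1 for _ in range(num_samples) if rng.expovariate(1.0) >= normalized)
    return hits / num_samples


exact = membership_closed_form(2.0, rho=1.0, sigma=1.0)
estimate = membership_monte_carlo(2.0, rho=1.0, sigma=1.0, num_samples=20000, rng=random.Random(0))
print(round(exact, 4), round(estimate, 4))
```

Swapping the exponential for another scale distribution changes the resulting weight function, which hints at how different membership functions could be obtained from the same construction.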

Title: UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework

Authors: Youxin Pang, Yong Zhang, Ruizhi Shao, Xiang Deng, Feng Gao, Xu Xiaoming, Xiaoming Wei, Yebin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03918
Pdf URL: https://arxiv.org/pdf/2512.03918
Copy Paste: [[2512.03918]] UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework(https://arxiv.org/abs/2512.03918)
Keywords: generation
Abstract: We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality given another as the condition or integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by the LLM's ability to unify different modalities, our method models videos and 3D motions as a unified token sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks within a single framework, proving the effectiveness of unified modeling. Moreover, to efficiently align with visual tokens and preserve 3D spatial information, we design a novel 3D motion tokenizer with a temporal expansion strategy, using a single VQ-VAE to produce quantized motion tokens. It features multiple expert decoders that handle body shapes, translation, global orientation, and body poses for reliable 3D motion reconstruction. Extensive experiments demonstrate that our method simultaneously generates corresponding videos and motions while performing accurate motion capture. This work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.

Title: Beyond the Ground Truth: Enhanced Supervision for Image Restoration

Authors: Donghun Ryou, Inju Ha, Sanghyeok Chu, Bohyung Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03932
Pdf URL: https://arxiv.org/pdf/2512.03932
Copy Paste: [[2512.03932]] Beyond the Ground Truth: Enhanced Supervision for Image Restoration(https://arxiv.org/abs/2512.03932)
Keywords: restoration, super-resolution
Abstract: Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground truth images using super-resolution by incorporating adaptive frequency masks, which are learned by a conditional frequency mask generator. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants, yielding enhanced ground truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models. Extensive experiments demonstrate that our approach consistently improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies. Code is available at this https URL.
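The frequency-domain mixup described above can be illustrated with a minimal 1D sketch: transform the original ground truth and its super-resolved variant, blend the two spectra with a per-frequency mask, and transform back. Everything here is an assumption made for illustration: the signals are 1D lists rather than images, the mask is hand-written rather than produced by a learned conditional generator, and a naive DFT stands in for a real FFT.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def frequency_mixup(gt, sr, mask):
    """Blend spectra: mask[k] = 1 takes frequency k from sr, 0 takes it from gt.

    For a real-valued result the mask should be symmetric (mask[k] == mask[n - k]).
    """
    gt_spec, sr_spec = dft(gt), dft(sr)
    mixed = [m * s + (1.0 - m) * g for g, s, m in zip(gt_spec, sr_spec, mask)]
    return idft(mixed)


if __name__ == "__main__":
    gt = [1.0, 2.0, 3.0, 4.0]
    sr = [1.5, 2.5, 2.0, 4.5]
    # Keep the DC component (overall brightness) from gt, take the rest from sr.
    print(frequency_mixup(gt, sr, [0.0, 1.0, 1.0, 1.0]))
```

With an all-zero mask the output reproduces the original ground truth, and with an all-one mask it reproduces the super-resolved variant; intermediate masks interpolate per frequency.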

Title: Technical Report on Text Dataset Distillation

Authors: Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Edson Bollis, Lucas Pellicer, Rosimeire Pereira Costa, Anna Helena Reali Costa, Artur Jordao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.03967
Pdf URL: https://arxiv.org/pdf/2512.03967
Copy Paste: [[2512.03967]] Technical Report on Text Dataset Distillation(https://arxiv.org/abs/2512.03967)
Keywords: generation
Abstract: In the vision domain, dataset distillation arises as a technique to condense a large dataset into a smaller synthetic one that exhibits a similar result in the training process. While image data presents an extensive literature of distillation methods, text dataset distillation has fewer works in comparison. Text dataset distillation initially grew as an adaptation of efforts from the vision universe; as the particularities of the modality became clear obstacles, it rose into a separate branch of research. Several milestones mark the development of this area, such as the introduction of methods that use transformer models, the generation of discrete synthetic text, and the scaling to decoder-only models with over 1B parameters. Despite major advances in modern approaches, the field remains in a maturing phase, with room for improvement on benchmarking standardization, approaches to overcome the discrete nature of text, handling complex tasks, and providing explicit examples of real-world applications. In this report, we review past and recent advances in dataset distillation for text, highlighting different distillation strategies, key contributions, and general challenges.

Title: BlurDM: A Blur Diffusion Model for Image Deblurring

Authors: Jin-Ting He, Fu-Jen Tsai, Yan-Tsung Peng, Min-Hung Chen, Chia-Wen Lin, Yen-Yu Lin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03979
Pdf URL: https://arxiv.org/pdf/2512.03979
Copy Paste: [[2512.03979]] BlurDM: A Blur Diffusion Model for Image Deblurring(https://arxiv.org/abs/2512.03979)
Keywords: generation
Abstract: Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address this, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The source code is available at this https URL.

Title: DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment

Authors: Sheng-Hao Liao, Shang-Fu Chen, Tai-Ming Huang, Wen-Huang Cheng, Kai-Lung Hua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03981
Pdf URL: https://arxiv.org/pdf/2512.03981
Copy Paste: [[2512.03981]] DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment(https://arxiv.org/abs/2512.03981)
Keywords: generation, generative
Abstract: Drag-based image editing using generative models provides intuitive control over image structures. However, existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision. Removing these constraints creates a fundamental trade-off: visual artifacts without masks and poor spatial control without prompts. To address these limitations, we propose DirectDrag, a novel mask- and prompt-free editing framework. DirectDrag enables precise and efficient manipulation with minimal user input while maintaining high image fidelity and accurate point alignment. DirectDrag introduces two key innovations. First, we design an Auto Soft Mask Generation module that intelligently infers editable regions from point displacement, automatically localizing deformation along movement paths while preserving contextual integrity through the generative model's inherent capacity. Second, we develop a Readout-Guided Feature Alignment mechanism that leverages intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving visual fidelity. Despite operating without manual mask or prompt, DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy. Extensive experiments on DragBench and real-world scenarios demonstrate the effectiveness and practicality of DirectDrag for high-quality, interactive image manipulation. Project Page: this https URL. Code is available at: this https URL.

Title: Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation

Authors: Hang Xu, Linjiang Huang, Feng Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03996
Pdf URL: https://arxiv.org/pdf/2512.03996
Copy Paste: [[2512.03996]] Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation(https://arxiv.org/abs/2512.03996)
Keywords: generation, generative
Abstract: Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, yet the impact of the stochastic characteristic of noise in T2I diffusion models on the method's performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new format of randomness for TTS: text embedding perturbation, which couples with existing randomness like SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these formats of randomness and their impact on generation, and find that these two forms of randomness exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embedding demonstrates varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) Introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation. (2) Adapting the perturbation intensity selectively based on their frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at this https URL.

Title: C3G: Learning Compact 3D Representations with 2K Gaussians

Authors: Honggyu An, Jaewoo Jung, Mungyeom Kim, Sunghwan Hong, Chaehyun Kim, Kazumi Fukuda, Minkyeong Jeon, Jisang Han, Takuya Narihira, Hyuna Ko, Junsu Kim, Yuki Mitsufuji, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04021
Pdf URL: https://arxiv.org/pdf/2512.04021
Copy Paste: [[2512.04021]] C3G: Learning Compact 3D Representations with 2K Gaussians(https://arxiv.org/abs/2512.04021)
Keywords: generation
Abstract: Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.

Title: PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04025
Pdf URL: https://arxiv.org/pdf/2512.04025
Copy Paste: [[2512.04025]] PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation(https://arxiv.org/abs/2512.04025)
Keywords: generation
Abstract: Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance over existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: this http URL
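The idea of replacing a binary keep/drop mask with multi-level pooled KV blocks can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's kernel: the rank-based level assignment, the power-of-two average pooling, and the names `assign_levels` and `pool_block` are all hypothetical choices made for this sketch.

```python
def assign_levels(scores, max_level):
    """Give each KV block a pooling level from its importance score.

    Blocks are ranked by score; the most important share level 0 (no pooling),
    the least important get max_level (coarsest pooling).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    levels = [0] * len(scores)
    for rank, idx in enumerate(order):
        levels[idx] = min(max_level, rank * (max_level + 1) // len(scores))
    return levels


def pool_block(block, level):
    """Average-pool the rows (tokens) of a block by a factor of 2**level."""
    factor = 2 ** level
    pooled = []
    for start in range(0, len(block), factor):
        group = block[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(row[d] for row in group) / len(group)
                       for d in range(dim)])
    return pooled


if __name__ == "__main__":
    block = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    levels = assign_levels([0.9, 0.1, 0.5, 0.3], max_level=1)
    print(levels)                # pooling level chosen for each of four blocks
    print(pool_block(block, 1))  # four tokens averaged down to two
```

A query block would then attend to the pooled keys and values of each KV block, so less important blocks cost fewer tokens instead of being discarded outright.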

Title: Fast & Efficient Normalizing Flows and Applications of Image Generative Models

Authors: Sandeep Nagar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04039
Pdf URL: https://arxiv.org/pdf/2512.04039
Copy Paste: [[2512.04039]] Fast & Efficient Normalizing Flows and Applications of Image Generative Models(https://arxiv.org/abs/2512.04039)
Keywords: restoration, super-resolution, generative, quality assessment
Abstract: This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: 1) development of invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; 2) introduction of a more efficient Quad-coupling layer; 3) design of a fast and efficient parallel inversion algorithm for kxk convolutional layers; 4) a fast and efficient backpropagation algorithm for the inverse of convolution; 5) Inverse-Flow, which uses the inverse of convolution for the forward pass and is trained with the proposed backpropagation algorithm; and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and normalizing flow layers to reduce parameter count while maintaining performance. The second part presents: 1) an automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; 2) an unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; 4) Stable Diffusion-based image inpainting that replaces detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and 5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.

Title: RELIC: Interactive Video World Model with Long-Horizon Memory

Authors: Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04040
Pdf URL: https://arxiv.org/pdf/2512.04040
Copy Paste: [[2512.04040]] RELIC: Interactive Video World Model with Long-Horizon Memory(https://arxiv.org/abs/2512.04040)
Keywords: generation
Abstract: A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.

Title: MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

Authors: Yizhou Zhao, Zhiwei Steven Wu, Adam Block
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2512.04044
Pdf URL: https://arxiv.org/pdf/2512.04044
Copy Paste: [[2512.04044]] MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking(https://arxiv.org/abs/2512.04044)
Keywords: generation
Abstract: Watermarking aims to embed hidden signals in generated text that can be reliably detected when given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermarking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations that noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that MarkTune consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model's representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.

Title: Stable Signer: Hierarchical Sign Language Generative Model

Authors: Sen Fang, Yalin Feng, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas
Subjects: cs.CV, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.04048
Pdf URL: https://arxiv.org/pdf/2512.04048
Copy Paste: [[2512.04048]] Stable Signer: Hierarchical Sign Language Generative Model(https://arxiv.org/abs/2512.04048)
Keywords: generation, generative
Abstract: Sign Language Production (SLP) is the process of converting complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, and Pose2Vid stages, and some concentrated on the Prompt2Gloss and Text2Avatar stages. However, this field has made slow progress because inaccuracies in text conversion, pose generation, and the rendering of poses into real human videos accumulate across these stages. Therefore, in this paper, we streamline the traditional redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines the SLP task as a hierarchical, end-to-end generation task that includes only text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid. Text understanding is handled by our proposed Sign Language Understanding Linker (SLUL), and hand gestures are generated by an SLP-MoE hand-gesture rendering expert block, producing high-quality, multi-style sign language videos end to end. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Its performance has improved by 48.6% compared to the current SOTA generation methods.

Title: PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design

Authors: Jiazhe Wei, Ken Li, Tianyu Lao, Haofan Wang, Liang Wang, Caifeng Shan, Chenyang Si
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04082
Pdf URL: https://arxiv.org/pdf/2512.04082
Copy Paste: [[2512.04082]] PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design(https://arxiv.org/abs/2512.04082)
Keywords: generative
Abstract: Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.

Title: SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows

Authors: Qinyu Zhao, Guangting Zheng, Tao Yang, Rui Zhu, Xingjian Leng, Stephen Gould, Liang Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04084
Pdf URL: https://arxiv.org/pdf/2512.04084
Copy Paste: [[2512.04084]] SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows(https://arxiv.org/abs/2512.04084)
Keywords: generation
Abstract: Normalizing Flows (NFs) learn invertible mappings between the data and a Gaussian distribution. Prior works usually suffer from two limitations. First, they add random noise to training samples or VAE latents as data augmentation, introducing complex pipelines including extra noising and denoising steps. Second, they use a pretrained and frozen VAE encoder, resulting in suboptimal reconstruction and generation quality. In this paper, we find that the two issues can be solved in a very simple way: just fixing the variance (which would otherwise be predicted by the VAE encoder) to a constant (e.g., 0.5). On the one hand, this method allows the encoder to output a broader distribution of tokens and the decoder to learn to reconstruct clean images from the augmented token distribution, avoiding additional noise or denoising design. On the other hand, fixed variance simplifies the VAE evidence lower bound, making it stable to train an NF with a VAE jointly. On the ImageNet $256 \times 256$ generation task, our model SimFlow obtains a gFID score of 2.15, outperforming the state-of-the-art method STARFlow (gFID 2.40). Moreover, SimFlow can be seamlessly integrated with the end-to-end representation alignment (REPA-E) method and achieves an improved gFID of 1.91, setting a new state of the art among NFs.
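Note: The fixed-variance idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a diagonal Gaussian posterior whose mean comes from an encoder (replaced here by random numbers) and whose variance is pinned to the constant 0.5 quoted in the abstract. The function names `sample_latent` and `gaussian_entropy` are hypothetical. With the variance fixed, the posterior entropy no longer depends on the input, which is one plausible reading of how the evidence lower bound simplifies.

```python
import numpy as np

FIXED_VARIANCE = 0.5  # example constant quoted in the abstract


def sample_latent(mean, rng, variance=FIXED_VARIANCE):
    """Reparameterized sample z = mean + sqrt(variance) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.sqrt(variance) * eps


def gaussian_entropy(dim, variance=FIXED_VARIANCE):
    """Entropy (in nats) of N(mean, variance * I); it does not depend on the mean."""
    return 0.5 * dim * np.log(2.0 * np.pi * np.e * variance)


rng = np.random.default_rng(0)

# Stand-in for encoder outputs (assumption: 10000 tokens, 4 latent dimensions).
mean = rng.standard_normal((10000, 4))

z = sample_latent(mean, rng)

# The injected noise has variance close to the fixed constant.
empirical_variance = (z - mean).var()

# Constant entropy term for a 4-dimensional latent.
entropy = gaussian_entropy(dim=4)
```

Because the noise is injected by the sampling step itself, a decoder trained on `z` sees perturbed latents without a separate noising stage, which matches the abstract's description at a high level.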