2026-01-22

Title: SOSControl: Enhancing Human Motion Generation through Saliency-Aware Symbolic Orientation and Timing Control

Authors: Ho Yin Au, Junkun Jiang, Jie Chen
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2601.14258
Pdf URL: https://arxiv.org/pdf/2601.14258
Copy Paste: [[2601.14258]] SOSControl: Enhancing Human Motion Generation through Saliency-Aware Symbolic Orientation and Timing Control(https://arxiv.org/abs/2601.14258)
Keywords: generation
Abstract: Traditional text-to-motion frameworks often lack precise control, and existing approaches based on joint keyframe locations provide only positional guidance, making it challenging and unintuitive to specify body part orientations and motion timing. To address these limitations, we introduce the Salient Orientation Symbolic (SOS) script, a programmable symbolic framework for specifying body part orientations and motion timing at keyframes. We further propose an automatic SOS extraction pipeline that employs temporally-constrained agglomerative clustering for frame saliency detection and a Saliency-based Masking Scheme (SMS) to generate sparse, interpretable SOS scripts directly from motion data. Moreover, we present the SOSControl framework, which treats the available orientation symbols in the sparse SOS script as salient and prioritizes satisfying these constraints during motion generation. By incorporating SMS-based data augmentation and gradient-based iterative optimization, the framework enhances alignment with user-specified constraints. Additionally, it employs a ControlNet-based ACTOR-PAE Decoder to ensure smooth and natural motion outputs. Extensive experiments demonstrate that the SOS extraction pipeline generates human-interpretable scripts with symbolic annotations at salient keyframes, while the SOSControl framework outperforms existing baselines in motion quality, controllability, and generalizability with respect to motion timing and body part orientation control.
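The temporally-constrained agglomerative clustering mentioned in the abstract can be illustrated with a generic sketch. This is not the paper's pipeline: the scalar per-frame feature, the mean-distance merge criterion, and the function name `temporal_agglomerative` are all assumptions made for illustration. The only property taken from the abstract is the temporal constraint, modelled here by allowing merges between adjacent segments only.

```python
def temporal_agglomerative(frames, num_segments):
    """Greedily merge temporally adjacent segments until num_segments remain.

    frames: list of scalar per-frame features (hypothetical stand-in for pose features).
    Returns a list of segments, each a list of consecutive frame indices.
    """
    segments = [[i] for i in range(len(frames))]

    def mean(segment):
        return sum(frames[i] for i in segment) / len(segment)

    while len(segments) > max(num_segments, 1):
        # Only neighbouring segments may merge: this is the temporal constraint.
        gaps = [abs(mean(segments[j]) - mean(segments[j + 1]))
                for j in range(len(segments) - 1)]
        j = gaps.index(min(gaps))
        segments[j:j + 2] = [segments[j] + segments[j + 1]]
    return segments


segments = temporal_agglomerative([0.0, 0.1, 0.0, 5.0, 5.1, 9.0], num_segments=3)
# Segment boundaries are one plausible choice of candidate salient frames.
boundaries = [segment[0] for segment in segments]
```

Restricting merges to neighbours keeps every cluster a contiguous time interval, so the boundaries can be read directly as candidate keyframes.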

Title: A Cloud-Based Cross-Modal Transformer for Emotion Recognition and Adaptive Human-Computer Interaction

Authors: Ziwen Zhong, Zhitao Shu, Yue Zhao
Subjects: cs.CV, cs.AI, cs.HC, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2601.14259
Pdf URL: https://arxiv.org/pdf/2601.14259
Copy Paste: [[2601.14259]] A Cloud-Based Cross-Modal Transformer for Emotion Recognition and Adaptive Human-Computer Interaction(https://arxiv.org/abs/2601.14259)
Keywords: generation
Abstract: Emotion recognition is a fundamental component of next-generation human-computer interaction (HCI), enabling machines to perceive, understand, and respond to users' affective states. However, existing systems often rely on single-modality analysis such as facial expressions, speech tone, or textual sentiment, resulting in limited robustness and poor generalization in real-world environments. To address these challenges, this study proposes a Cloud-Based Cross-Modal Transformer (CMT) framework for multimodal emotion recognition and adaptive human-computer interaction. The proposed model integrates visual, auditory, and textual signals using pretrained encoders (Vision Transformer, Wav2Vec2, and BERT) and employs a cross-modal attention mechanism to capture complex interdependencies among heterogeneous features. By leveraging cloud computing infrastructure with distributed training on Kubernetes and TensorFlow Serving, the system enables scalable, low-latency emotion recognition for large-scale user interactions. Experiments conducted on benchmark datasets including IEMOCAP, MELD, and AffectNet demonstrate that the CMT achieves state-of-the-art performance, improving the F1-score by 3.0 percent and reducing cross-entropy loss by 12.9 percent compared to strong multimodal baselines. Additionally, cloud deployment evaluations show an average response latency of 128 ms, representing a 35 percent reduction compared with conventional transformer-based fusion systems. These results confirm that the proposed framework enables efficient, real-time emotion recognition and adaptive feedback in applications such as intelligent customer service, virtual tutoring systems, and affective computing interfaces, marking an important step toward cloud-native affective computing and emotionally intelligent interactive systems.

Title: GCG Attack On A Diffusion LLM

Authors: Ruben Neyroud, Sam Corley
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2601.14266
Pdf URL: https://arxiv.org/pdf/2601.14266
Copy Paste: [[2601.14266]] GCG Attack On A Diffusion LLM(https://arxiv.org/abs/2601.14266)
Keywords: generation
Abstract: While most LLMs are autoregressive, diffusion-based LLMs have recently emerged as an alternative method for generation. Greedy Coordinate Gradient (GCG) attacks have proven effective against autoregressive models, but their applicability to diffusion language models remains largely unexplored. In this work, we present an exploratory study of GCG-style adversarial prompt attacks on LLaDA (Large Language Diffusion with mAsking), an open-source diffusion LLM. We evaluate multiple attack variants, including prefix perturbations and suffix-based adversarial generation, on harmful prompts drawn from the AdvBench dataset. Our study provides initial insights into the robustness and attack surface of diffusion language models and motivates the development of alternative optimization and evaluation strategies for adversarial analysis in this setting.

Title: On the Limits of Learned Importance Scoring for KV Cache Compression

Authors: Brady Steele
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14279
Pdf URL: https://arxiv.org/pdf/2601.14279
Copy Paste: [[2601.14279]] On the Limits of Learned Importance Scoring for KV Cache Compression(https://arxiv.org/abs/2601.14279)
Keywords: generation
Abstract: We investigate learned KV cache compression through Speculative Importance Prediction (SIP), a 1.7M parameter non-query-aware scorer that predicts token importance from KV representations alone. Despite architectural sophistication (multi-horizon lookahead, cross-attention), SIP does not outperform simple baselines, including random selection, across 5 seeds, 4 retention levels, and 3 tasks. Key findings: (1) position-based heuristics (keep first 4 + last N tokens) match or exceed learned approaches; (2) prefill attention provides equivalent signal to complex learned scorers; (3) marginal information in KV representations beyond position and prefill attention appears limited for importance prediction. We hypothesize that circular dependence between future queries and generation trajectories contributes to this difficulty.
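The position-based heuristic in finding (1), keep the first 4 tokens plus the last N, can be sketched as an index selection over cache positions. The function name `keep_first_and_last` and its parameters are hypothetical and do not come from the paper; the sketch selects indices only and does not touch real key/value tensors.

```python
def keep_first_and_last(num_tokens, num_first=4, num_last=8):
    """Return the sorted cache positions retained by a first-k + last-N policy."""
    if num_tokens <= num_first + num_last:
        # Nothing to evict: the whole cache fits in the budget.
        return list(range(num_tokens))
    head = list(range(num_first))
    tail = list(range(num_tokens - num_last, num_tokens))
    return head + tail


kept = keep_first_and_last(20, num_first=4, num_last=3)
```

Here `num_last` plays the role of N and would be set from the retention level.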

Title: Beyond Affinity: A Benchmark of 1D, 2D, and 3D Methods Reveals Critical Trade-offs in Structure-Based Drug Design

Authors: Kangyu Zheng, Kai Zhang, Jiale Tan, Xuehan Chen, Yingzhou Lu, Zaixi Zhang, Lichao Sun, Marinka Zitnik, Tianfan Fu, Zhiding Liang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14283
Pdf URL: https://arxiv.org/pdf/2601.14283
Copy Paste: [[2601.14283]] Beyond Affinity: A Benchmark of 1D, 2D, and 3D Methods Reveals Critical Trade-offs in Structure-Based Drug Design(https://arxiv.org/abs/2601.14283)
Keywords: generative
Abstract: Currently, the field of structure-based drug design is dominated by three main types of algorithms: search-based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross-algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of fifteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities and poses with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand-centric drug design methods can be used in SBDD by treating the docking function as a black-box oracle, which is typically neglected. Our evaluation reveals distinct patterns across model categories. 3D structure-based models excel in binding affinities but show inconsistencies in chemical validity and pose quality. 1D models demonstrate reliable performance in standard molecular metrics but rarely achieve optimal binding affinities. 2D models offer balanced performance, maintaining high chemical validity while achieving moderate binding scores. Through detailed analysis across multiple protein targets, we identify key improvement areas for each model category, providing insights for researchers to combine strengths of different approaches while addressing their limitations. All the code used for benchmarking is available at this https URL.

Title: Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents

Authors: Xiucheng Xu, Bingbing Xu, Xueyun Tian, Zihe Huang, Rongxin Chen, Yunfan Li, Huawei Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.14287
Pdf URL: https://arxiv.org/pdf/2601.14287
Copy Paste: [[2601.14287]] Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents(https://arxiv.org/abs/2601.14287)
Keywords: generation
Abstract: External memory systems are pivotal for enabling Large Language Model (LLM) agents to maintain persistent knowledge and perform long-horizon decision-making. Existing paradigms typically follow a two-stage process: computationally expensive memory construction (e.g., structuring data into graphs) followed by naive retrieval-augmented generation. However, our empirical analysis reveals two fundamental limitations: complex construction incurs high costs with marginal performance gains, and simple context concatenation fails to bridge the gap between retrieval recall and reasoning accuracy. To address these challenges, we propose CoM (Chain-of-Memory), a novel framework that advocates for a paradigm shift toward lightweight construction paired with sophisticated utilization. CoM introduces a Chain-of-Memory mechanism that organizes retrieved fragments into coherent inference paths through dynamic evolution, utilizing adaptive truncation to prune irrelevant noise. Extensive experiments on the LongMemEval and LoCoMo benchmarks demonstrate that CoM outperforms strong baselines with accuracy gains of 7.5%-10.4%, while drastically reducing computational overhead to approximately 2.7% of token consumption and 6.0% of latency compared to complex memory architectures.

Title: LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models

Authors: Mengyu Sun, Ziyuan Yang, Andrew Beng Jin Teoh, Junxu Liu, Haibo Hu, Yi Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14330
Pdf URL: https://arxiv.org/pdf/2601.14330
Copy Paste: [[2601.14330]] LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models(https://arxiv.org/abs/2601.14330)
Keywords: generation, generative
Abstract: Concept erasure aims to suppress sensitive content in diffusion models, but recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. Existing reawakening methods mainly rely on prompt-level optimization to manipulate sampling trajectories, neglecting other generative factors, which limits a comprehensive understanding of the underlying dynamics. In this paper, we model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors, including text conditions, model parameters, and latent states. We theoretically show that perturbing each factor can reawaken erased concepts. Building on this insight, we propose a novel concept reawakening method: Latent space Unblocking for concept REawakening (LURE), which reawakens erased concepts by reconstructing the latent space and guiding the sampling trajectory. Specifically, our semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions to reestablish severed text-visual associations. However, in multi-concept scenarios, naive reconstruction can cause gradient conflicts and feature entanglement. To address this, we introduce Gradient Field Orthogonalization, which enforces feature orthogonality to prevent mutual interference. Additionally, our Latent Semantic Identification-Guided Sampling (LSIS) ensures stability of the reawakening process via posterior density verification. Extensive experiments demonstrate that LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods.

Title: VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models

Authors: Yongchao Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.14354
Pdf URL: https://arxiv.org/pdf/2601.14354
Copy Paste: [[2601.14354]] VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models(https://arxiv.org/abs/2601.14354)
Keywords: generative
Abstract: Joint Embedding Predictive Architectures (JEPA) offer a scalable paradigm for self-supervised learning by predicting latent representations rather than reconstructing high-entropy observations. However, existing formulations rely on deterministic regression objectives, which mask probabilistic semantics and limit their applicability in stochastic control. In this work, we introduce Variational JEPA (VJEPA), a probabilistic generalization that learns a predictive distribution over future latent states via a variational objective. We show that VJEPA unifies representation learning with Predictive State Representations (PSRs) and Bayesian filtering, establishing that sequential modeling does not require autoregressive observation likelihoods. Theoretically, we prove that VJEPA representations can serve as sufficient information states for optimal control without pixel reconstruction, while providing formal guarantees for collapse avoidance. We further propose Bayesian JEPA (BJEPA), an extension that factorizes the predictive belief into a learned dynamics expert and a modular prior expert, enabling zero-shot task transfer and constraint (e.g., goal, physics) satisfaction via a Product of Experts. Empirically, through a noisy environment experiment, we demonstrate that VJEPA and BJEPA successfully filter out high-variance nuisance distractors that cause representation collapse in generative baselines. By enabling principled uncertainty estimation (e.g., constructing credible intervals via sampling) while remaining likelihood-free regarding observations, VJEPA provides a foundational framework for scalable, robust, uncertainty-aware planning in high-dimensional, noisy environments.
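The Product of Experts step can be made concrete for the simplest case. The sketch below assumes each expert is a one-dimensional Gaussian given as a (mean, variance) pair; the abstract does not state the expert family, so this choice and the name `product_of_gaussians` are illustrative assumptions. For Gaussians, the normalized product has a precision equal to the sum of the expert precisions and a precision-weighted mean.

```python
def product_of_gaussians(experts):
    """Fuse 1-D Gaussian experts, each a (mean, variance) pair with variance > 0."""
    precision = sum(1.0 / variance for _, variance in experts)
    mean = sum(mu / variance for mu, variance in experts) / precision
    return mean, 1.0 / precision


dynamics_expert = (0.0, 1.0)  # hypothetical learned-dynamics belief
prior_expert = (2.0, 1.0)     # hypothetical goal or constraint prior
fused_mean, fused_variance = product_of_gaussians([dynamics_expert, prior_expert])
```

With equal variances the fused mean sits halfway between the experts and the variance halves, which shows how a prior expert can pull the prediction toward a constraint without retraining the dynamics expert.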

Title: Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data

Authors: Yixiong Chen, Zongwei Zhou, Wenxuan Li, Alan Yuille
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2601.14406
Pdf URL: https://arxiv.org/pdf/2601.14406
Copy Paste: [[2601.14406]] Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data(https://arxiv.org/abs/2601.14406)
Keywords: quality assessment
Abstract: Large-scale medical segmentation datasets often combine manual and pseudo-labels of uneven quality, which can compromise training and evaluation. Low-quality labels may hamper performance and make the model training less robust. To address this issue, we propose SegAE (Segmentation Assessment Engine), a lightweight vision-language model (VLM) that automatically predicts label quality across 142 anatomical structures. Trained on over four million image-label pairs with quality scores, SegAE achieves a high correlation coefficient of 0.902 with ground-truth Dice similarity and evaluates a 3D mask in 0.06s. SegAE shows several practical benefits: (I) Our analysis reveals widespread low-quality labeling across public datasets; (II) SegAE improves data efficiency and training performance in active and semi-supervised learning, reducing dataset annotation cost by one-third and quality-checking time by 70% per label. This tool provides a simple and effective solution for quality control in large-scale medical segmentation datasets. The dataset, model weights, and code are released at this https URL.
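The Dice similarity used as the ground-truth quality signal is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a candidate mask A and a reference mask B. A minimal sketch over flattened binary masks follows; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions, not details from the paper.

```python
def dice_score(pred, truth):
    """Dice similarity between two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * overlap / total


score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A 3D mask can be passed in after flattening it to a single list of voxels.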

Title: Vision-Based Natural Language Scene Understanding for Autonomous Driving: An Extended Dataset and a New Model for Traffic Scene Description Generation

Authors: Danial Sadrian Zadeh, Otman A. Basir, Behzad Moshiri
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14438
Pdf URL: https://arxiv.org/pdf/2601.14438
Copy Paste: [[2601.14438]] Vision-Based Natural Language Scene Understanding for Autonomous Driving: An Extended Dataset and a New Model for Traffic Scene Description Generation(https://arxiv.org/abs/2601.14438)
Keywords: generation
Abstract: Traffic scene understanding is essential for enabling autonomous vehicles to accurately perceive and interpret their environment, thereby ensuring safe navigation. This paper presents a novel framework that transforms a single frontal-view camera image into a concise natural language description, effectively capturing spatial layouts, semantic relationships, and driving-relevant cues. The proposed model leverages a hybrid attention mechanism to enhance spatial and semantic feature extraction and integrates these features to generate contextually rich and detailed scene descriptions. To address the limited availability of specialized datasets in this domain, a new dataset derived from the BDD100K dataset has been developed, with comprehensive guidelines provided for its construction. Furthermore, the study offers an in-depth discussion of relevant evaluation metrics, identifying the most appropriate measures for this task. Extensive quantitative evaluations using metrics such as CIDEr and SPICE, complemented by human judgment assessments, demonstrate that the proposed model achieves strong performance and effectively fulfills its intended objectives on the newly developed dataset.

Title: Search over Self-Edit Strategies for LLM Adaptation

Authors: Alistair Cheong, Haolin Cong, Tyler Yang, Dustin Miao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.14532
Pdf URL: https://arxiv.org/pdf/2601.14532
Copy Paste: [[2601.14532]] Search over Self-Edit Strategies for LLM Adaptation(https://arxiv.org/abs/2601.14532)
Keywords: generation
Abstract: Many LLM-based open-ended search systems freeze the foundation model that proposes improvements to existing solutions, which may bottleneck long-run progress. Recent work has explored updating the proposal model at test time [arXiv:2511.23473], but the update strategy is still typically hand-specified. Therefore, this study investigated whether an LLM can use task feedback to decide how it should update its weights. For tractability, we focused on the simpler case where there is only one round of self-improvement, and restricted the update operator to self-supervised next token prediction (NTP), leaving the model freedom in choosing its training data and key NTP hyperparameters. Using the Self-Adapting Language Models (SEAL) [arXiv:2506.10943] framework as a testbed, we relaxed its fixed human template constraint and allowed the model to generate its own self-edit templates, thereby giving it more control over its training data and hyperparameters. Two variants were studied, differing in whether template generation was conditioned on a lightweight archive of past templates. In SEAL's Single-Passage Knowledge Incorporation setting with Qwen3-8B on SQuAD [arXiv:1606.05250], the no-archive variant performed comparably to the weaker "Implications" baseline, while the archive variant outperformed "Implications" and approached the strongest human-designed "Rewrite" baseline without surpassing it. Further analysis of collapse in the model's exploration revealed that a naive archive can confer some short-term robustness but can also accelerate homogenization, suggesting that explicit novelty pressure may be required to consistently advance beyond carefully optimized human strategies. Our code is available at this https URL.

Title: Report for NSF Workshop on AI for Electronic Design Automation

Authors: Deming Chen, Vijay Ganesh, Weikai Li, Yingyan (Celine) Lin, Yong Liu, Subhasish Mitra, David Z. Pan, Ruchir Puri, Jason Cong, Yizhou Sun
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2601.14541
Pdf URL: https://arxiv.org/pdf/2601.14541
Copy Paste: [[2601.14541]] Report for NSF Workshop on AI for Electronic Design Automation(https://arxiv.org/abs/2601.14541)
Keywords: generation
Abstract: This report distills the discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation (EDA), held on December 10, 2024 in Vancouver alongside NeurIPS 2024. Bringing together experts across machine learning and EDA, the workshop examined how AI, spanning large language models (LLMs), graph neural networks (GNNs), reinforcement learning (RL), neurosymbolic methods, etc., can facilitate EDA and shorten design turnaround. The workshop covered four themes: (1) AI for physical synthesis and design for manufacturing (DFM), discussing challenges in the physical manufacturing process and potential AI applications; (2) AI for high-level and logic-level synthesis (HLS/LLS), covering pragma insertion, program transformation, RTL code generation, etc.; (3) AI toolbox for optimization and design, discussing frontier AI developments that could potentially be applied to EDA tasks; and (4) AI for test and verification, including LLM-assisted verification tools, ML-augmented SAT solving, security/reliability challenges, etc. The report recommends that NSF foster AI/EDA collaboration, invest in foundational AI for EDA, develop robust data infrastructures, promote scalable compute infrastructure, and invest in workforce development to democratize hardware design and enable next-generation hardware systems. The workshop information can be found at this https URL.

Title: QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design

Authors: Nilesh Prasad Pandey, Jangseon Park, Onat Gungor, Flavio Ponzina, Tajana Rosing
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.14549
Pdf URL: https://arxiv.org/pdf/2601.14549
Copy Paste: [[2601.14549]] QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design(https://arxiv.org/abs/2601.14549)
Keywords: generative
Abstract: Deploying Small Language Models (SLMs) on edge platforms is critical for real-time, privacy-sensitive generative AI, yet constrained by memory, latency, and energy budgets. Quantization reduces model size and cost but suffers from device noise in emerging non-volatile memories, while conventional memory hierarchies further limit efficiency. SRAM provides fast access but has low density; DRAM must simultaneously accommodate static weights and dynamic KV caches, which creates bandwidth contention; and Flash, although dense, is primarily used for initialization and remains inactive during inference. These limitations highlight the need for hybrid memory organizations tailored to LLM inference. We propose Outlier-aware Quantization with Memory Co-design (QMC), a retraining-free quantization method with a novel heterogeneous memory architecture. QMC identifies inlier and outlier weights in SLMs, storing inlier weights in compact multi-level Resistive-RAM (ReRAM) while preserving critical outliers in high-precision on-chip Magnetoresistive-RAM (MRAM), mitigating noise-induced degradation. On language modeling and reasoning benchmarks, QMC matches or outperforms state-of-the-art quantization methods that use advanced algorithms and hybrid data formats, while achieving greater compression under both algorithm-only evaluation and realistic deployment settings. Specifically, compared against SoTA quantization methods on the latest edge AI platform, QMC reduces memory usage by 6.3x-7.3x, external data transfers by 7.6x, energy by 11.7x, and latency by 12.5x when compared to FP16, establishing QMC as a scalable, deployment-ready co-design for efficient on-device inference.
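The inlier/outlier split described in the abstract can be illustrated with a generic outlier-aware quantization sketch. The abstract does not say how QMC selects outliers or which quantizer it uses, so the choices here (largest-magnitude weights as outliers, symmetric uniform quantization for inliers, and the name `quantize_with_outliers`) are assumptions for illustration only.

```python
def quantize_with_outliers(weights, outlier_fraction=0.25, bits=3):
    """Keep the largest-magnitude weights exact and quantize the rest uniformly.

    Returns (restored_weights, outlier_indices, scale).
    """
    num_outliers = int(len(weights) * outlier_fraction)
    by_magnitude = sorted(range(len(weights)),
                          key=lambda i: abs(weights[i]), reverse=True)
    outliers = set(by_magnitude[:num_outliers])

    # The scale is set by the inliers only, so outliers cannot stretch the grid.
    inlier_max = max((abs(w) for i, w in enumerate(weights) if i not in outliers),
                     default=0.0)
    levels = 2 ** (bits - 1) - 1
    scale = inlier_max / levels if inlier_max > 0 else 1.0

    restored = [w if i in outliers else round(w / scale) * scale
                for i, w in enumerate(weights)]
    return restored, sorted(outliers), scale


restored, outlier_indices, scale = quantize_with_outliers([0.1, -0.2, 0.3, 8.0])
```

In this toy input the weight 8.0 is stored exactly, while the three small weights share a grid sized for their own range instead of one stretched to cover 8.0.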

Title: Anatomically Guided Latent Diffusion for Brain MRI Progression Modeling

Authors: Cheng Wan, Bahram Jafrasteh, Ehsan Adeli, Miaomiao Zhang, Qingyu Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14584
Pdf URL: https://arxiv.org/pdf/2601.14584
Copy Paste: [[2601.14584]] Anatomically Guided Latent Diffusion for Brain MRI Progression Modeling(https://arxiv.org/abs/2601.14584)
Keywords: generative
Abstract: Accurately modeling longitudinal brain MRI progression is crucial for understanding neurodegenerative diseases and predicting individualized structural changes. Existing state-of-the-art approaches, such as Brain Latent Progression (BrLP), often use multi-stage training pipelines with auxiliary conditioning modules but suffer from architectural complexity, suboptimal use of conditional clinical covariates, and limited guarantees of anatomical consistency. We propose Anatomically Guided Latent Diffusion Model (AG-LDM), a segmentation-guided framework that enforces anatomically consistent progression while substantially simplifying the training pipeline. AG-LDM conditions latent diffusion by directly fusing baseline anatomy, noisy follow-up states, and clinical covariates at the input level, a strategy that avoids auxiliary control networks by learning a unified, end-to-end model that represents both anatomy and progression. A lightweight 3D tissue segmentation model (WarpSeg) provides explicit anatomical supervision during both autoencoder fine-tuning and diffusion model training, ensuring consistent brain tissue boundaries and morphometric fidelity. Experiments on 31,713 ADNI longitudinal pairs and zero-shot evaluation on OASIS-3 demonstrate that AG-LDM matches or surpasses more complex diffusion models, achieving state-of-the-art image quality and a 15-20% reduction in volumetric errors in generated images. AG-LDM also exhibits markedly stronger utilization of temporal and clinical covariates (up to 31.5x higher sensitivity than BrLP) and generates biologically plausible counterfactual trajectories, accurately capturing hallmarks of Alzheimer's progression such as limbic atrophy and ventricular expansion. These results highlight AG-LDM as an efficient, anatomically grounded framework for reliable brain MRI progression modeling.
摘要：准确模拟纵向脑 MRI 进展对于了解神经退行性疾病和预测个体化结构变化至关重要。现有最先进的方法，例如大脑潜在进展（BrLP），通常使用带有辅助调理模块的多阶段训练管道，但存在架构复杂性、条件临床协变量使用不理想以及解剖一致性保证有限等问题。我们提出了解剖学引导的潜在扩散模型（AG-LDM），这是一种分割引导的框架，可以强制执行解剖学上一致的进展，同时大大简化训练流程。 AG-LDM 通过直接融合基线解剖结构、噪声后续状态和输入级别的临床协变量来调节潜在扩散，该策略通过学习代表解剖结构和进展的统一的端到端模型来避免辅助控制网络。轻量级 3D 组织分割模型 (WarpSeg) 在自动编码器微调和扩散模型训练期间提供明确的解剖学监督，确保一致的脑组织边界和形态保真度。对 31,713 个 ADNI 纵向对的实验和 OASIS-3 上的零样本评估表明，AG-LDM 匹配或超越了更复杂的扩散模型，实现了最先进的图像质量，并且生成图像的体积误差减少了 15-20%。 AG-LDM 还表现出对时间和临床协变量的明显更强的利用（灵敏度比 BrLP 高出 31.5 倍），并生成生物学上合理的反事实轨迹，准确捕获阿尔茨海默病进展的标志，例如边缘萎缩和心室扩张。这些结果凸显了 AG-LDM 作为一种有效的、基于解剖学的框架，可用于可靠的脑 MRI 进展建模。

Title: Counterfactual Modeling with Fine-Tuned LLMs for Health Intervention Design and Sensor Data Augmentation

Authors: Shovito Barua Soumma, Asiful Arefeen, Stephanie M. Carpenter, Melanie Hingle, Hassan Ghasemzadeh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.14590
Pdf URL: https://arxiv.org/pdf/2601.14590
Copy Paste: [[2601.14590]] Counterfactual Modeling with Fine-Tuned LLMs for Health Intervention Design and Sensor Data Augmentation(https://arxiv.org/abs/2601.14590)
Keywords: generation
Abstract: Counterfactual explanations (CFEs) provide human-centric interpretability by identifying the minimal, actionable changes required to alter a machine learning model's prediction. Therefore, CFs can be used as (i) interventions for abnormality prevention and (ii) augmented data for training robust models. We conduct a comprehensive evaluation of CF generation using large language models (LLMs), including GPT-4 (zero-shot and few-shot) and two open-source models, BioMistral-7B and LLaMA-3.1-8B, in both pretrained and fine-tuned configurations. Using the multimodal AI-READI clinical dataset, we assess CFs across three dimensions: intervention quality, feature diversity, and augmentation effectiveness. Fine-tuned LLMs, particularly LLaMA-3.1-8B, produce CFs with high plausibility (up to 99%), strong validity (up to 0.99), and realistic, behaviorally modifiable feature adjustments. When used for data augmentation under controlled label-scarcity settings, LLM-generated CFs substantially restore classifier performance, yielding an average 20% F1 recovery across three scarcity scenarios. Compared with optimization-based baselines such as DiCE, CFNOW, and NICE, LLMs offer a flexible, model-agnostic approach that generates more clinically actionable and semantically coherent counterfactuals. Overall, this work demonstrates the promise of LLM-driven counterfactuals for both interpretable intervention design and data-efficient model training in sensor-based digital health. Impact: SenseCF fine-tunes an LLM to generate valid, representative counterfactual explanations and to supplement the minority class in an imbalanced dataset, improving model training and boosting model robustness and predictive performance.
摘要：反事实解释 (CFE) 通过识别改变机器学习模型的预测所需的最小的、可操作的更改来提供以人为中心的解释性。因此，CF 可用作（i）异常预防的干预措施和（ii）用于训练稳健模型的增强数据。我们使用大型语言模型 (LLM)，包括 GPT-4（零样本和少样本）和两个开源模型 BioMistral-7B 和 LLaMA-3.1-8B，在预训练和微调配置中对 CF 生成进行了全面评估。使用多模式 AI-READI 临床数据集，我们从三个维度评估 CF：干预质量、特征多样性和增强有效性。经过微调的 LLM，特别是 LLaMA-3.1-8B，生成的 CF 具有高合理性（高达 99%）、强大的有效性（高达 0.99）以及现实的、行为可修改的特征调整。当在受控标签稀缺设置下用于数据增强时，LLM 生成的 CF 显着恢复了分类器性能，在三种稀缺场景中产生平均 20% 的 F1 恢复。与 DiCE、CFNOW 和 NICE 等基于优化的基线相比，法学硕士提供了一种灵活的、与模型无关的方法，可以生成更具临床可操作性和语义一致的反事实。总体而言，这项工作展示了法学硕士驱动的反事实对于基于传感器的数字健康中可解释的干预设计和数据高效模型训练的前景。影响：SenseCF 对法学硕士进行微调，以生成有效的、有代表性的反事实解释，并补充不平衡数据集中的少数类别，以改进模型训练并提高模型的稳健性和预测性能
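
The validity score mentioned in this abstract is commonly read as the share of generated counterfactuals that actually flip the model's prediction. The sketch below illustrates that reading only; the `predict` rule, the feature names, and the toy samples are hypothetical stand-ins, not the paper's classifier or data.

```python
def predict(sample):
    # Hypothetical stand-in classifier: returns 1 ("abnormal") when
    # daily steps are low and sleep is short. Not the paper's model.
    return 1 if sample["steps"] < 5000 and sample["sleep_h"] < 6 else 0


def validity(originals, counterfactuals, model=predict):
    # Fraction of counterfactuals whose predicted label differs from
    # the label predicted for the matching original sample.
    flipped = sum(
        1 for x, cf in zip(originals, counterfactuals) if model(x) != model(cf)
    )
    return flipped / len(originals)


originals = [
    {"steps": 3000, "sleep_h": 5.0},
    {"steps": 4000, "sleep_h": 5.5},
]
counterfactuals = [
    {"steps": 6500, "sleep_h": 5.0},  # more steps: prediction flips
    {"steps": 4500, "sleep_h": 5.5},  # change too small: no flip
]
score = validity(originals, counterfactuals)
```

Here one of the two counterfactuals changes the predicted label, so `score` is 0.5. The paper may compute validity differently in detail.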

Title: 3D Space as a Scratchpad for Editable Text-to-Image Generation

Authors: Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Matheus Gadelha, Kevin Blackburn-Matzen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14602
Pdf URL: https://arxiv.org/pdf/2601.14602
Copy Paste: [[2601.14602]] 3D Space as a Scratchpad for Editable Text-to-Image Generation(https://arxiv.org/abs/2601.14602)
Keywords: generation
Abstract: Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad -- a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision-language models that deliberate not only in language, but also in space. Code and visualizations at this https URL
摘要：大型语言模型 (LLM) 的最新进展表明，当中间思想被具体化到明确的工作空间（例如思想链跟踪或工具增强推理）中时，推理能力会得到改善。然而，视觉语言模型（VLM）缺乏类似的空间推理机制，限制了它们生成准确反映几何关系、对象身份和构图意图的图像的能力。我们引入了空间暂存器的概念——一种连接语言意图和图像合成的 3D 推理基础。给定文本提示，我们的框架会解析主题和背景元素，将它们实例化为可编辑的 3D 网格，并采用代理场景规划来进行放置、方向和视点选择。由此产生的 3D 排列被渲染回具有身份保留线索的图像域，使 VLM 能够生成空间一致和视觉连贯的输出。与之前基于 2D 布局的方法不同，我们的方法支持直观的 3D 编辑，可以可靠地传播到最终图像中。根据经验，它在 GenAI-Bench 上的文本对齐提高了 32%，证明了显式 3D 推理对于精确、可控图像生成的好处。我们的结果凸显了视觉语言模型的新范式，该模型不仅在语言上进行考虑，而且还在空间上进行考虑。代码和可视化位于此 https URL

Title: Mirai: Autoregressive Visual Generation Needs Foresight

Authors: Yonghao Yu, Lang Huang, Zerun Wang, Runyi Li, Toshihiko Yamasaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14671
Pdf URL: https://arxiv.org/pdf/2601.14671
Copy Paste: [[2601.14671]] Mirai: Autoregressive Visual Generation Needs Foresight(https://arxiv.org/abs/2601.14671)
Keywords: generation
Abstract: Autoregressive (AR) visual generators model images as sequences of discrete tokens and are trained with next token likelihood. This strict causality supervision optimizes each step only by its immediate next token, which diminishes global coherence and slows convergence. We ask whether foresight, training signals that originate from later tokens, can help AR visual generation. We conduct a series of controlled diagnostics along the injection level, foresight layout, and foresight source axes, unveiling a key insight: aligning foresight to AR models' internal representation on the 2D image grids improves causality modeling. We formulate this insight with Mirai (meaning "future" in Japanese), a general framework that injects future information into AR training with no architecture change and no extra inference overhead: Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, whereas Mirai-I leverages implicit foresight from matched bidirectional representations. Extensive experiments show that Mirai significantly accelerates convergence and improves generation quality. For instance, Mirai can speed up LlamaGen-B's convergence by up to 10× and reduce the generation FID from 5.34 to 4.34 on the ImageNet class-condition image generation benchmark. Our study highlights that visual autoregressive models need foresight.
摘要：自回归 (AR) 视觉生成器将图像建模为离散标记序列，并使用下一个标记可能性进行训练。这种严格的因果关系监督仅通过其紧邻的下一个标记来优化每个步骤，这会削弱全局一致性并减慢收敛速度。我们询问前瞻、源自后续标记的训练信号是否可以帮助 AR 视觉生成。我们沿着注入水平、前视布局和前视源轴进行了一系列受控诊断，揭示了一个关键见解：将前视与 AR 模型在 2D 图像网格上的内部表示对齐，可以改进因果关系建模。我们用 Mirai（日语中的“未来”的意思）来阐述这一见解，这是一个通用框架，可以将未来信息注入 AR 训练中，无需架构更改，也无需额外的推理开销：Mirai-E 使用来自单向表示的多个未来位置的显式预见，而 Mirai-I 利用来自匹配的双向表示的隐式预见。大量实验表明，Mirai 显着加速了收敛速度并提高了生成质量。例如，Mirai 可以将 LlamaGen-B 的收敛速度提高高达 10$\times$，并将 ImageNet 类条件图像生成基准上的生成 FID 从 5.34 减少到 4.34。我们的研究强调视觉自回归模型需要远见。

Title: LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models

Authors: Mingyang Xie, Numair Khan, Tianfu Wang, Naina Dhingra, Seonghyeon Nam, Haitao Yang, Zhuo Hui, Christopher Metzler, Andrea Vedaldi, Hamed Pirsiavash, Lei Luo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14674
Pdf URL: https://arxiv.org/pdf/2601.14674
Copy Paste: [[2601.14674]] LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models(https://arxiv.org/abs/2601.14674)
Keywords: generation, generative
Abstract: Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage is this https URL
摘要：给定单目视频，视频重新渲染的目标是从新颖的摄像机轨迹生成场景视图。现有方法面临两个不同的挑战。无几何条件的模型缺乏空间意识，导致视点变化时的漂移和变形。另一方面，几何条件模型依赖于估计深度和显式重建，这使得它们容易受到深度误差和校准误差的影响。我们建议通过使用嵌入大型 4D 重建模型潜在空间中的隐式几何知识来调节视频生成过程来解决这些挑战。这些潜伏捕获连续空间中的场景结构，无需显式重建。因此，它们提供了一种灵活的表示，允许在更有效地正则化误差之前进行预训练扩散。通过联合调节这些潜在的和源相机的姿势，我们证明我们的模型在视频重新渲染任务上取得了最先进的结果。项目网页是这个https URL

Title: A comprehensive overview of deep learning models for object detection from videos/images

Authors: Sukana Zulfqar, Sadia Saeed, M. Azam Zia, Anjum Ali, Faisal Mehmood, Abid Ali
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14677
Pdf URL: https://arxiv.org/pdf/2601.14677
Copy Paste: [[2601.14677]] A comprehensive overview of deep learning models for object detection from videos/images(https://arxiv.org/abs/2601.14677)
Keywords: generative
Abstract: Object detection in video and image surveillance is a well-established yet rapidly evolving task, strongly influenced by recent deep learning advancements. This review summarises modern techniques by examining architectural innovations, generative model integration, and the use of temporal information to enhance robustness and accuracy. Unlike earlier surveys, it classifies methods based on core architectures, data processing strategies, and surveillance-specific challenges such as dynamic environments, occlusions, lighting variations, and real-time requirements. The primary goal is to evaluate the current effectiveness of semantic object detection, while secondary aims include analysing deep learning models and their practical applications. The review covers CNN-based detectors, GAN-assisted approaches, and temporal fusion methods, highlighting how generative models support tasks such as reconstructing missing frames, reducing occlusions, and normalising illumination. It also outlines preprocessing pipelines, feature extraction progress, benchmarking datasets, and comparative evaluations. Finally, emerging trends in low-latency, efficient, and spatiotemporal learning approaches are identified for future research.
摘要：视频和图像监控中的目标检测是一项成熟但快速发展的任务，受到最近深度学习进展的强烈影响。这篇综述通过检查架构创新、生成模型集成以及使用时间信息来增强鲁棒性和准确性来总结现代技术。与早期的调查不同，它根据核心架构、数据处理策略和监控特定挑战（例如动态环境、遮挡、照明变化和实时要求）对方法进行分类。主要目标是评估当前语义对象检测的有效性，次要目标包括分析深度学习模型及其实际应用。该评论涵盖了基于 CNN 的检测器、GAN 辅助方法和时间融合方法，强调了生成模型如何支持重建丢失帧、减少遮挡和标准化照明等任务。它还概述了预处理流程、特征提取进度、基准数据集和比较评估。最后，确定了低延迟、高效和时空学习方法的新兴趋势以供未来研究。

Title: DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling

Authors: Jing Lan, Hexiao Ding, Hongzhao Chen, Yufeng Jiang, Nga-Chun Ng, Gwing Kei Yip, Gerald W.Y. Cheng, Yunlin Mao, Jing Cai, Liang-ting Lin, Jung Sun Yoo
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2601.14732
Pdf URL: https://arxiv.org/pdf/2601.14732
Copy Paste: [[2601.14732]] DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling(https://arxiv.org/abs/2601.14732)
Keywords: generation
Abstract: AI models for drug discovery and chemical literature mining must interpret molecular images and generate outputs consistent with 3D geometry and stereochemistry. Most molecular language models rely on strings or graphs, while vision-language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens. We propose DeepMoLM: Deep Molecular Language Modeling, a dual-view framework that grounds high-resolution molecular images in geometric invariants derived from molecular conformations. DeepMoLM preserves high-frequency evidence from 1024 × 1024 inputs, encodes conformer neighborhoods as discrete Extended 3-Dimensional Fingerprints, and fuses visual and geometric streams with cross-attention, enabling physically grounded generation without atom coordinates. DeepMoLM improves PubChem captioning with a 12.3% relative METEOR gain over the strongest generalist baseline while staying competitive with specialist methods. It produces valid numeric outputs for all property queries and attains MAE 13.64 g/mol on Molecular Weight and 37.89 on Complexity in the specialist setting. On ChEBI-20 description generation from images, it exceeds generalist baselines and matches state-of-the-art vision-language models. Code is available at this https URL.
摘要：用于药物发现和化学文献挖掘的 AI 模型必须解释分子图像并生成与 3D 几何和立体化学一致的输出。大多数分子语言模型依赖于字符串或图形，而视觉语言模型通常会错过立体化学细节，并且难以将连续的 3D 结构映射为离散的标记。我们提出了 DeepMoLM：深度分子语言建模，这是一种双视图框架，它将高分辨率分子图像建立在源自分子构象的几何不变量中。 DeepMoLM 保留来自 1024 个 $\times$ 1024 个输入的高频证据，将构象邻域编码为离散的扩展 3 维指纹，并通过交叉注意力融合视觉和几何流，从而无需原子坐标即可实现物理接地生成。 DeepMoLM 改进了 PubChem 字幕，相对最强的通才基线提高了 12.3% 的 METEOR 增益，同时保持与专业方法的竞争力。它为所有属性查询生成有效的数字输出，并在专家设置中获得分子量 MAE 13.64 g/mol 和复杂性 37.89。在从图像生成 ChEBI-20 描述时，它超出了通才基线并匹配最先进的视觉语言模型。代码可从此 https URL 获取。

Title: Safeguarding Facial Identity against Diffusion-based Face Swapping via Cascading Pathway Disruption

Authors: Liqin Wang, Qianyue Hu, Wei Lu, Xiangyang Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14738
Pdf URL: https://arxiv.org/pdf/2601.14738
Copy Paste: [[2601.14738]] Safeguarding Facial Identity against Diffusion-based Face Swapping via Cascading Pathway Disruption(https://arxiv.org/abs/2601.14738)
Keywords: generative
Abstract: The rapid evolution of diffusion models has democratized face swapping but also raises concerns about privacy and identity security. Existing proactive defenses, often adapted from image editing attacks, prove ineffective in this context. We attribute this failure to an oversight of the structural resilience and the unique static conditional guidance mechanism inherent in face swapping systems. To address this, we propose VoidFace, a systemic defense method that views face swapping as a coupled identity pathway. By injecting perturbations at critical bottlenecks, VoidFace induces cascading disruption throughout the pipeline. Specifically, we first introduce localization disruption and identity erasure to degrade physical regression and semantic embeddings, thereby impairing the accurate modeling of the source face. We then intervene in the generative domain by decoupling attention mechanisms to sever identity injection, and corrupting intermediate diffusion features to prevent the reconstruction of source identity. To ensure visual imperceptibility, we perform adversarial search in the latent manifold, guided by a perceptual adaptive strategy to balance attack potency with image quality. Extensive experiments show that VoidFace outperforms existing defenses across various diffusion-based swapping models, while producing adversarial faces with superior visual quality.
摘要：扩散模型的快速发展使换脸变得民主化，但也引发了对隐私和身份安全的担忧。现有的主动防御通常改编自图像编辑攻击，在这种情况下被证明是无效的。我们将这种失败归因于对结构弹性的监督以及换脸系统固有的独特静态条件指导机制。为了解决这个问题，我们提出了 VoidFace，一种系统防御方法，将面部交换视为耦合的身份途径。通过在关键瓶颈处注入扰动，VoidFace 会在整个管道中引发级联中断。具体来说，我们首先引入定位破坏和身份擦除来降低物理回归和语义嵌入，从而损害源面部的准确建模。然后，我们通过解耦注意机制来切断身份注入，并破坏中间扩散特征以防止源身份的重建，从而对生成域进行干预。为了确保视觉不可察觉性，我们在潜在流形中进行对抗性搜索，以感知自适应策略为指导，以平衡攻击效力与图像质量。大量实验表明，VoidFace 的性能优于各种基于扩散的交换模型的现有防御，同时生成具有卓越视觉质量的对抗性面孔。

Title: Enhancing Text-to-Image Generation via End-Edge Collaborative Hybrid Super-Resolution

Authors: Chongbin Yi, Yuxin Liang, Ziqi Zhou, Peng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14741
Pdf URL: https://arxiv.org/pdf/2601.14741
Copy Paste: [[2601.14741]] Enhancing Text-to-Image Generation via End-Edge Collaborative Hybrid Super-Resolution(https://arxiv.org/abs/2601.14741)
Keywords: super-resolution, generation
Abstract: Artificial Intelligence-Generated Content (AIGC) has made significant strides, with high-resolution text-to-image (T2I) generation becoming increasingly critical for improving users' Quality of Experience (QoE). Although resource-constrained edge computing adequately supports fast low-resolution T2I generation, achieving high-resolution output still faces the challenge of ensuring image fidelity at the cost of latency. To address this, we first investigate the performance of super-resolution (SR) methods for image enhancement, confirming a fundamental trade-off: lightweight learning-based SR struggles to recover fine details, while diffusion-based SR achieves higher fidelity at a substantial computational cost. Motivated by these observations, we propose an end-edge collaborative generation-enhancement framework. Upon receiving a T2I generation task, the system first generates a low-resolution image based on adaptively selected denoising steps and super-resolution scales at the edge side, which is then partitioned into patches and processed by a region-aware hybrid SR policy. This policy applies a diffusion-based SR model to foreground patches for detail recovery and a lightweight learning-based SR model to background patches for efficient upscaling, ultimately stitching the enhanced patches into the high-resolution image. Experiments show that our system reduces service latency by 33% compared with baselines while maintaining competitive image quality.
摘要：人工智能生成内容 (AIGC) 已取得重大进展，高分辨率文本到图像 (T2I) 生成对于提高用户体验质量 (QoE) 变得越来越重要。尽管资源受限的边缘计算足以支持快速的低分辨率 T2I 生成，但实现高分辨率输出仍然面临着以延迟为代价确保图像保真度的挑战。为了解决这个问题，我们首先研究了用于图像增强的超分辨率（SR）方法的性能，确认了一个基本的权衡：基于轻量级学习的 SR 难以恢复精细细节，而基于扩散的 SR 以大量的计算成本实现了更高的保真度。受这些观察的启发，我们提出了一个终端边缘协作生成增强框架。收到 T2I 生成任务后，系统首先根据自适应选择的去噪步骤和边缘侧的超分辨率尺度生成低分辨率图像，然后将其分割为补丁并通过区域感知混合 SR 策略进行处理。该策略将基于扩散的 SR 模型应用于前景补丁以进行细节恢复，并将基于轻量级学习的 SR 模型应用于背景补丁以进行有效的升级，最终将增强后的补丁拼接到高分辨率图像中。实验表明，与基线相比，我们的系统将服务延迟减少了 33%，同时保持了具有竞争力的图像质量。
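
The region-aware hybrid SR policy described above (partition into patches, route foreground and background patches to different upscalers, stitch the results) can be sketched roughly as follows. This is a toy illustration under stated assumptions: images are plain 2D lists, the foreground mask is given, and both upscalers are nearest-neighbour stand-ins; the function names are hypothetical and no real diffusion model is involved.

```python
def upscale_nearest(patch, scale):
    # Lightweight stand-in: nearest-neighbour upscaling of a 2D list.
    return [
        [value for value in row for _ in range(scale)]
        for row in patch
        for _ in range(scale)
    ]


def upscale_detail(patch, scale):
    # Placeholder for a diffusion-based SR model; it only reuses
    # nearest-neighbour upscaling so that the sketch stays runnable.
    return upscale_nearest(patch, scale)


def hybrid_sr(image, fg_mask, patch, scale):
    # image and fg_mask are 2D lists of equal size, divisible by patch.
    h, w = len(image), len(image[0])
    out = [[0] * (w * scale) for _ in range(h * scale)]
    routes = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            # A tile counts as foreground if any of its mask pixels is set.
            is_fg = any(
                fg_mask[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            model = upscale_detail if is_fg else upscale_nearest
            routes.append("diffusion" if is_fg else "lightweight")
            big = model(tile, scale)
            # Stitch the enhanced tile back at its scaled position.
            for r, row in enumerate(big):
                out[top * scale + r][left * scale:(left + patch) * scale] = row
    return out, routes


image = [[1, 2], [3, 4]]
fg_mask = [[1, 0], [0, 0]]
result, routes = hybrid_sr(image, fg_mask, patch=1, scale=2)
```

Only the top-left patch is routed to the expensive path here; the other three take the cheap one, which is where the latency saving would come from in a real system.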

Title: ReinPath: A Multimodal Reinforcement Learning Approach for Pathology

Authors: Kangcheng Zhou, Jun Jiang, Qing Zhang, Shuang Zheng, Qingli Li, Shugong Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14757
Pdf URL: https://arxiv.org/pdf/2601.14757
Copy Paste: [[2601.14757]] ReinPath: A Multimodal Reinforcement Learning Approach for Pathology(https://arxiv.org/abs/2601.14757)
Keywords: generation
Abstract: Interpretability is significant in computational pathology, leading to the development of multimodal information integration from histopathological images and corresponding text [...]. However, existing multimodal methods have limited interpretability due to the lack of high-quality datasets that support explicit reasoning and inference and simple reasoning [...]. To address the above problems, we introduce a novel multimodal pathology large language model with strong reasoning [...]. To improve the generation of accurate and contextually relevant textual descriptions, we design a semantic reward strategy integrated with group relative policy optimization. We construct a high-quality pathology visual question answering (VQA) dataset, specifically designed to support complex reasoning [...]. [...] experiments conducted on this dataset demonstrate that our method outperforms state-of-the-art methods, even when trained with only 20% of the data. Our method also achieves comparable performance on downstream zero-shot image classification tasks compared with CLIP.
摘要：可解释性在计算病理学中具有重要意义，导致组织病理学图像和相应文本的多模态信息集成的发展此http URL，由于缺乏支持显式推理和推理的高质量数据集，现有的多模态方法的可解释性有限，简单推理此http URL解决了上述问题，我们引入了一种新颖的多模态病理学大语言模型，具有强推理此http URL提高了准确且上下文相关的文本描述的生成，我们设计了一种与群体相关策略集成的语义奖励策略此http URL构建高质量的病理视觉问答 (VQA) 数据集，专门用于支持复杂推理。在此数据集上进行的 http URL 实验表明，我们的方法优于最先进的方法，即使仅使用 20% 的训练，此 http URL 方法也能在下游零样本图像分类任务上实现与 CLIP 相当的性能。

Title: Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

Authors: Injin Kong, Hyoungjoon Lee, Yohan Jo
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2601.14758
Pdf URL: https://arxiv.org/pdf/2601.14758
Copy Paste: [[2601.14758]] Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models(https://arxiv.org/abs/2601.14758)
Keywords: generation
Abstract: Post-training pretrained Autoregressive models (ARMs) into Masked Diffusion models (MDMs) has emerged as a cost-effective strategy to overcome the limitations of sequential generation. However, the internal algorithmic transformations induced by this paradigm shift remain unexplored, leaving it unclear whether post-trained MDMs acquire genuine bidirectional reasoning capabilities or merely repackage autoregressive heuristics. In this work, we address this question by conducting a comparative circuit analysis of ARMs and their MDM counterparts. Our analysis reveals a systematic "mechanism shift" dependent on the structural nature of the task. Structurally, we observe a distinct divergence: while MDMs largely retain autoregressive circuitry for tasks dominated by local causal dependencies, they abandon initialized pathways for global planning tasks, exhibiting distinct rewiring characterized by increased early-layer processing. Semantically, we identify a transition from sharp, localized specialization in ARMs to distributed integration in MDMs. Through these findings, we conclude that diffusion post-training does not merely adapt model parameters but fundamentally reorganizes internal computation to support non-sequential global planning.
摘要：将预训练自回归模型 (ARM) 训练后转换为掩模扩散模型 (MDM) 已成为克服顺序生成的局限性的一种经济有效的策略。然而，这种范式转变引起的内部算法转换仍未被探索，因此尚不清楚训练后的 MDM 是否获得了真正的双向推理能力，或者仅仅是重新包装了自回归启发式。在这项工作中，我们通过对 ARM 及其 MDM 对应产品进行电路比较分析来解决这个问题。我们的分析揭示了取决于任务结构性质的系统性“机制转变”。在结构上，我们观察到明显的分歧：虽然 MDM 在很大程度上保留了由局部因果依赖性主导的任务的自回归电路，但它们放弃了全局规划任务的初始化路径，表现出以增加早期层处理为特征的独特重新布线。从语义上讲，我们确定了从 ARM 中尖锐的本地化专业化到 MDM 中的分布式集成的转变。通过这些发现，我们得出结论，扩散后训练不仅调整模型参数，而且从根本上重新组织内部计算以支持非顺序全局规划。

Title: Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation

Authors: Yifei Liu, Changxing Ding, Ling Guo, Huaiguang Jiang, Qiong Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14788
Pdf URL: https://arxiv.org/pdf/2601.14788
Copy Paste: [[2601.14788]] Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation(https://arxiv.org/abs/2601.14788)
Keywords: generation, generative
Abstract: Diffusion models have seen widespread adoption for text-driven human motion generation and related tasks due to their impressive generative capabilities and flexibility. However, current motion diffusion models face two major limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during the iterative denoising process. This paper introduces Reconstruction-Anchored Diffusion Model (RAM) to address these challenges. First, RAM leverages a motion latent space as intermediate supervision for text-to-motion generation. To this end, RAM co-trains a motion reconstruction branch with two key objective functions: self-regularization to enhance the discrimination of the motion space and motion-centric latent alignment to enable accurate mapping from text to the motion latent space. Second, we propose Reconstructive Error Guidance (REG), a testing-stage guidance mechanism that exploits the diffusion model's inherent self-correction ability to mitigate error propagation. At each denoising step, REG uses the motion reconstruction branch to reconstruct the previous estimate, reproducing the prior error patterns. By amplifying the residual between the current prediction and the reconstructed estimate, REG highlights the improvements in the current prediction. Extensive experiments demonstrate that RAM achieves significant improvements and state-of-the-art performance. Our code will be released.
摘要：由于其令人印象深刻的生成能力和灵活性，扩散模型已广泛应用于文本驱动的人体运动生成和相关任务。然而，当前的运动扩散模型面临两个主要限制：由缺乏运动特定信息的预训练文本编码器引起的表征差距，以及迭代去噪过程中的错误传播。本文介绍了重建锚定扩散模型（RAM）来应对这些挑战。首先，RAM 利用运动潜在空间作为文本到运动生成的中间监督。为此，RAM 共同训练具有两个关键目标函数的运动重建分支：自正则化以增强运动空间的辨别力，以及以运动为中心的潜在对齐以实现从文本到运动潜在空间的准确映射。其次，我们提出了重构错误指导（REG），这是一种测试阶段的指导机制，利用扩散模型固有的自我纠正能力来减轻错误传播。在每个去噪步骤中，REG 使用运动重建分支来重建先前的估计，再现先前的错误模式。通过放大当前预测和重建估计之间的残差，REG 突出了当前预测的改进。大量的实验表明 RAM 实现了显着的改进和最先进的性能。我们的代码将被发布。
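
The abstract says REG amplifies the residual between the current prediction and the reconstructed previous estimate, but does not give the formula. One plausible form, assumed here by analogy with other guidance schemes, is a linear extrapolation `guided = current + weight * (current - reconstructed)`. The function name, the `weight` parameter, and the toy vectors below are illustrative assumptions, not the paper's implementation.

```python
def amplify_residual(current, reconstructed, weight):
    # Extrapolate away from the reconstructed previous estimate:
    # guided = current + weight * (current - reconstructed).
    # The linear form and the weight are assumptions for illustration.
    return [c + weight * (c - r) for c, r in zip(current, reconstructed)]


current = [1.0, 2.0, 3.0]        # toy "current prediction"
reconstructed = [0.5, 2.0, 4.0]  # toy "reconstructed previous estimate"
guided = amplify_residual(current, reconstructed, weight=2.0)
```

Components where the two estimates already agree (the middle value) are left unchanged, while components that moved between steps are pushed further in the direction of the change.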

Title: Synthetic Data Augmentation for Multi-Task Chinese Porcelain Classification: A Stable Diffusion Approach

Authors: Ziyao Ling, Silvia Mirri, Paola Salomoni, Giovanni Delnevo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14791
Pdf URL: https://arxiv.org/pdf/2601.14791
Copy Paste: [[2601.14791]] Synthetic Data Augmentation for Multi-Task Chinese Porcelain Classification: A Stable Diffusion Approach(https://arxiv.org/abs/2601.14791)
Keywords: generative
Abstract: The scarcity of training data presents a fundamental challenge in applying deep learning to archaeological artifact classification, particularly for the rare types of Chinese porcelain. This study investigates whether synthetic images generated through Stable Diffusion with Low-Rank Adaptation (LoRA) can effectively augment limited real datasets for multi-task CNN-based porcelain classification. Using MobileNetV3 with transfer learning, we conducted controlled experiments comparing models trained on pure real data against those trained on mixed real-synthetic datasets (95:5 and 90:10 ratios) across four classification tasks: dynasty, glaze, kiln and type identification. Results demonstrate task-specific benefits: type classification showed the most substantial improvement (5.5% F1-macro increase with 90:10 ratio), while dynasty and kiln tasks exhibited modest gains (3-4%), suggesting that synthetic augmentation effectiveness depends on the alignment between generated features and task-relevant visual signatures. Our work contributes practical guidelines for deploying generative AI in archaeological research, demonstrating both the potential and limitations of synthetic data when archaeological authenticity must be balanced with data diversity.
摘要：训练数据的稀缺给深度学习应用于考古文物分类带来了根本性挑战，特别是对于稀有类型的中国瓷器。本研究探讨通过稳定扩散与低阶适应 (LoRA) 生成的合成图像是否可以有效增强有限的真实数据集，以实现基于 CNN 的多任务瓷器分类。使用 MobileNetV3 和迁移学习，我们进行了对照实验，将在纯真实数据上训练的模型与在混合真实合成数据集（95:5 和 90:10 比例）上训练的模型进行比较，涉及四个分类任务：王朝、釉料、窑炉和类型识别。结果证明了特定于任务的好处：类型分类显示出最显着的改进（F1 宏增加 5.5%，比例为 90:10），而王朝和窑炉任务则表现出适度的增益（3-4%），这表明合成增强效果取决于生成的特征和任务相关的视觉特征之间的一致性。我们的工作为在考古研究中部署生成式人工智能提供了实用指南，展示了当考古真实性必须与数据多样性取得平衡时合成数据的潜力和局限性。
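
To make the 95:5 and 90:10 real-to-synthetic ratios concrete, the sketch below builds a mixed training list from toy records. It assumes that every real image is kept and synthetic images are added until the ratio is met; the abstract does not say whether the paper mixes this way or subsamples the real data instead. The helper name and the toy tuples are hypothetical.

```python
import random


def mix_real_synthetic(real, synthetic, real_parts, syn_parts, seed=0):
    # Keep every real item and add just enough synthetic items so that
    # real:synthetic is approximately real_parts:syn_parts.
    n_syn = round(len(real) * syn_parts / real_parts)
    if n_syn > len(synthetic):
        raise ValueError("not enough synthetic samples for this ratio")
    rng = random.Random(seed)  # fixed seed for a reproducible mix
    mixed = list(real) + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed


real = [("real", i) for i in range(900)]
synthetic = [("syn", i) for i in range(300)]
mixed = mix_real_synthetic(real, synthetic, real_parts=90, syn_parts=10)
n_syn = sum(1 for kind, _ in mixed if kind == "syn")
```

With 900 real records and a 90:10 target, 100 synthetic records are drawn, giving 1,000 training items in total.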

Title: Reflecting in the Reflection: Integrating a Socratic Questioning Framework into Automated AI-Based Question Generation

Authors: Ondřej Holub (1), Essi Ryymin (2), Rodrigo Alves (1) ((1) Czech Technical University in Prague, (2) Häme University of Applied Sciences)
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2601.14798
Pdf URL: https://arxiv.org/pdf/2601.14798
Copy Paste: [[2601.14798]] Reflecting in the Reflection: Integrating a Socratic Questioning Framework into Automated AI-Based Question Generation(https://arxiv.org/abs/2601.14798)
Keywords: generation
Abstract: Designing good reflection questions is pedagogically important but time-consuming and unevenly supported across teachers. This paper introduces a reflection-in-reflection framework for automated generation of reflection questions with large language models (LLMs). Our approach coordinates two role-specialized agents, a Student-Teacher and a Teacher-Educator, that engage in a Socratic multi-turn dialogue to iteratively refine a single question given a teacher-specified topic, key concepts, student level, and optional instructional materials. The Student-Teacher proposes candidate questions with brief rationales, while the Teacher-Educator evaluates them along clarity, depth, relevance, engagement, and conceptual interconnections, responding only with targeted coaching questions or a fixed signal to stop the dialogue. We evaluate the framework in an authentic lower-secondary ICT setting on the topic, using GPT-4o-mini as the backbone model and a stronger GPT-4-class LLM as an external evaluator in pairwise comparisons of clarity, relevance, depth, and overall quality. First, we study how interaction design and context (dynamic vs. fixed iteration counts; presence or absence of student level and materials) affect question quality. Dynamic stopping combined with contextual information consistently outperforms fixed 5- or 10-step refinement, with very long dialogues prone to drift or over-complication. Second, we show that our two-agent protocol produces questions that are judged substantially more relevant and deeper, and better overall, than a one-shot baseline using the same backbone model.
摘要：设计好的反思问题在教学上很重要，但很耗时，而且教师之间的支持不均衡。本文介绍了一种反思中反思框架，用于使用大型语言模型 (LLM) 自动生成反思问题。我们的方法协调两个角色专门代理，即学生-教师和教师-教育者，它们进行苏格拉底式多轮对话，根据教师指定的主题、关键概念、学生水平和可选教学材料迭代地完善单个问题。学生-教师提出候选人问题并附上简短的理由，而教师-教育者则根据清晰度、深度、相关性、参与度和概念互连性对这些问题进行评估，仅以有针对性的指导问题或停止对话的固定信号来回答。我们在该主题的真实初中 ICT 设置中评估该框架，使用 GPT-4o-mini 作为骨干模型，并使用更强大的 GPT-4 级 LLM 作为外部评估器，对清晰度、相关性、深度和整体质量进行成对比较。首先，我们研究交互设计和上下文（动态此 http URL 迭代计数；存在或不存在学生水平和材料）如何影响问题质量。动态停止与上下文信息相结合始终优于固定的 5 步或 10 步细化，很长的对话容易出现偏差或过于复杂。其次，我们表明，与使用相同骨干模型的一次性基线相比，我们的双代理协议产生的问题被认为更相关、更深入、总体更好。
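
The two-agent dialogue with dynamic stopping described above can be sketched as a simple loop. Everything below is a hedged illustration: `propose` and `coach` stand in for LLM calls, the `STOP` token is a hypothetical choice of fixed stop signal, and the toy agents are hard-coded so the example runs without any model.

```python
STOP = "STOP"  # hypothetical fixed stop signal


def refine_question(propose, coach, max_turns=10):
    # propose(feedback) -> candidate question (Student-Teacher role)
    # coach(question)   -> coaching question or STOP (Teacher-Educator role)
    feedback = None
    question = None
    for turn in range(1, max_turns + 1):
        question = propose(feedback)
        feedback = coach(question)
        if feedback == STOP:  # dynamic stopping: the coach is satisfied
            return question, turn
    return question, max_turns  # budget exhausted without a stop signal


# Toy stand-ins for the two LLM-backed agents.
def toy_propose(feedback):
    drafts = {
        None: "What is a network?",
        "Can it ask why?": "Why do networks need protocols?",
    }
    return drafts[feedback]


def toy_coach(question):
    return STOP if question.startswith("Why") else "Can it ask why?"


final, turns = refine_question(toy_propose, toy_coach)
```

In this toy run the coach stops the dialogue on the second turn. Setting `max_turns` to 5 or 10 and ignoring the stop signal would correspond to the fixed-step variants the abstract compares against.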

Title: Tailoring Adverse Event Prediction in Type 1 Diabetes with Patient-Specific Deep Learning Models

Authors: Giorgia Rigamonti, Mirko Paolo Barbato, Davide Marelli, Paolo Napoletano
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14917
Pdf URL: https://arxiv.org/pdf/2601.14917
Copy Paste: [[2601.14917]] Tailoring Adverse Event Prediction in Type 1 Diabetes with Patient-Specific Deep Learning Models(https://arxiv.org/abs/2601.14917)
Keywords: generation
Abstract: Effective management of Type 1 Diabetes requires continuous glucose monitoring and precise insulin adjustments to prevent hyperglycemia and hypoglycemia. With the growing adoption of wearable glucose monitors and mobile health applications, accurate blood glucose prediction is essential for enhancing automated insulin delivery and decision-support systems. This paper presents a deep learning-based approach for personalized blood glucose prediction, leveraging patient-specific data to improve prediction accuracy and responsiveness in real-world scenarios. Unlike traditional generalized models, our method accounts for individual variability, enabling more effective subject-specific predictions. We compare Leave-One-Subject-Out Cross-Validation with a fine-tuning strategy to evaluate their ability to model patient-specific dynamics. Results show that personalized models significantly improve the prediction of adverse events, enabling more precise and timely interventions in real-world scenarios. To assess the impact of patient-specific data, we conduct experiments comparing a multimodal, patient-specific approach against traditional CGM-only methods. Additionally, we perform an ablation study to investigate model performance with progressively smaller training sets, identifying the minimum data required for effective personalization, an essential consideration for real-world applications where extensive data collection is often challenging. Our findings underscore the potential of adaptive, personalized glucose prediction models for advancing next-generation diabetes management, particularly in wearable and mobile health platforms, enhancing consumer-oriented diabetes care solutions.
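As a rough illustration of the Leave-One-Subject-Out protocol named in the abstract above, the sketch below splits samples so that each patient is held out once while the remaining patients form the training set. The function name `leave_one_subject_out` and the toy patient IDs are assumptions made for illustration; this is not the authors' code, and the fine-tuning strategy they compare against is not shown.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx) once per unique subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        # Train on every other subject, test on the held-out one.
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical toy data: 6 samples recorded from 3 patients.
subjects = ["p1", "p1", "p2", "p2", "p3", "p3"]
folds = list(leave_one_subject_out(subjects))
for subject, train_idx, test_idx in folds:
    print(subject, train_idx.tolist(), test_idx.tolist())
```

Splitting by subject rather than by sample keeps all of a patient's data out of training for that fold, which is what makes the evaluation a test of generalization to an unseen patient.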

Title: TempViz: On the Evaluation of Temporal Knowledge in Text-to-Image Models

Authors: Carolin Holtermann, Nina Krebs, Anne Lauscher
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2601.14951
Pdf URL: https://arxiv.org/pdf/2601.14951
Copy Paste: [[2601.14951]] TempViz: On the Evaluation of Temporal Knowledge in Text-to-Image Models(https://arxiv.org/abs/2601.14951)
Keywords: generation
Abstract: Time alters the visual appearance of entities in our world, like objects, places, and animals. Thus, for accurately generating contextually-relevant images, knowledge and reasoning about time can be crucial (e.g., for generating a landscape in spring vs. in winter). Yet, although substantial work exists on understanding and improving temporal knowledge in natural language processing, research on how temporal phenomena appear and are handled in text-to-image (T2I) models remains scarce. We address this gap with TempViz, the first data set to holistically evaluate temporal knowledge in image generation, consisting of 7.9k prompts and more than 600 reference images. Using TempViz, we study the capabilities of five T2I models across five temporal knowledge categories. Human evaluation shows that temporal competence is generally weak, with no model exceeding 75% accuracy across categories. Towards larger-scale studies, we also examine automated evaluation methods, comparing several established approaches against human judgments. However, none of these approaches provides a reliable assessment of temporal cues, further indicating the pressing need for future research on temporal knowledge in T2I.

Title: Improving Regret Approximation for Unsupervised Dynamic Environment Generation

Authors: Harry Mead, Bruno Lacerda, Jakob Foerster, Nick Hawes
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.14957
Pdf URL: https://arxiv.org/pdf/2601.14957
Copy Paste: [[2601.14957]] Improving Regret Approximation for Unsupervised Dynamic Environment Generation(https://arxiv.org/abs/2601.14957)
Keywords: generation
Abstract: Unsupervised Environment Design (UED) seeks to automatically generate training curricula for reinforcement learning (RL) agents, with the goal of improving generalisation and zero-shot performance. However, designing effective curricula remains a difficult problem, particularly in settings where small subsets of environment parameterisations result in significant increases in the complexity of the required policy. Current methods struggle with a difficult credit assignment problem and rely on regret approximations that fail to identify challenging levels, both of which are compounded as the size of the environment grows. We propose Dynamic Environment Generation for UED (DEGen) to enable a denser level generator reward signal, reducing the difficulty of credit assignment and allowing for UED to scale to larger environment sizes. We also introduce a new regret approximation, Maximised Negative Advantage (MNA), as a significantly improved metric to optimise for, that better identifies more challenging levels. We show empirically that MNA outperforms current regret approximations and when combined with DEGen, consistently outperforms existing methods, especially as the size of the environment grows. We have made all our code available here: this https URL.

Title: Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

Authors: Xinyu Peng, Han Li, Yuyang Huang, Ziyang Zheng, Yaoming Wang, Xin Chen, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14959
Pdf URL: https://arxiv.org/pdf/2601.14959
Copy Paste: [[2601.14959]] Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers(https://arxiv.org/abs/2601.14959)
Keywords: generation
Abstract: Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion. The source code is available at this https URL.

Title: InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

Authors: Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Enhong Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14968
Pdf URL: https://arxiv.org/pdf/2601.14968
Copy Paste: [[2601.14968]] InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement(https://arxiv.org/abs/2601.14968)
Keywords: generative
Abstract: Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.
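The abstract above mentions a discretization module that turns continuous sequences into discrete temporal tokens. The sketch below shows one simple way to do that, equal-width binning with NumPy. This is an assumed stand-in chosen for brevity and is not claimed to be the tokenizer InstructTime uses; the function name `discretize` and the bin count are hypothetical.

```python
import numpy as np

def discretize(series, n_bins=8):
    """Map a 1-D series to integer tokens in [0, n_bins - 1] via equal-width bins."""
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # Constant series: every value falls into the same token.
        return np.zeros(len(x), dtype=int)
    # Interior bin edges only; np.digitize then returns indices 0..n_bins-1.
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
    return np.digitize(x, edges)

tokens = discretize([0.0, 0.1, 0.5, 0.9, 1.0], n_bins=4)
print(tokens.tolist())
```

The resulting integer IDs can be treated like vocabulary entries, which is the general idea behind feeding a time series to a language model alongside text.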

Title: SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation

Authors: Yanan Wang, Linjie Ren, Zihao Li, Junyi Wang, Tian Gan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.15017
Pdf URL: https://arxiv.org/pdf/2601.15017
Copy Paste: [[2601.15017]] SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation(https://arxiv.org/abs/2601.15017)
Keywords: generation
Abstract: While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models' reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive auditory experience, without sacrificing temporal or semantic consistency. All datasets, code, and model checkpoints will be publicly released to facilitate future research.

Title: HyperNet-Adaptation for Diffusion-Based Test Case Generation

Authors: Oliver Weißl, Vincenzo Riccio, Severin Kacianka, Andrea Stocco
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2601.15041
Pdf URL: https://arxiv.org/pdf/2601.15041
Copy Paste: [[2601.15041]] HyperNet-Adaptation for Diffusion-Based Test Case Generation(https://arxiv.org/abs/2601.15041)
Keywords: generation, generative
Abstract: The increasing deployment of deep learning systems requires systematic evaluation of their reliability in real-world scenarios. Traditional gradient-based adversarial attacks introduce small perturbations that rarely correspond to realistic failures and mainly assess robustness rather than functional behavior. Generative test generation methods offer an alternative but are often limited to simple datasets or constrained input domains. Although diffusion models enable high-fidelity image synthesis, their computational cost and limited controllability restrict their applicability to large-scale testing. We present HyNeA, a generative testing method that enables direct and efficient control over diffusion-based generation. HyNeA provides dataset-free controllability through hypernetworks, allowing targeted manipulation of the generative process without relying on architecture-specific conditioning mechanisms or dataset-driven adaptations such as fine-tuning. HyNeA employs a distinct training strategy that supports instance-level tuning to identify failure-inducing test cases without requiring datasets that explicitly contain examples of similar failures. This approach enables the targeted generation of realistic failure cases at substantially lower computational cost than search-based methods. Experimental results show that HyNeA improves controllability and test diversity compared to existing generative test generators and generalizes to domains where failure-labeled training data is unavailable.

Title: Deep Leakage with Generative Flow Matching Denoiser

Authors: Isaac Baglin, Xiatian Zhu, Simon Hadfield
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.15049
Pdf URL: https://arxiv.org/pdf/2601.15049
Copy Paste: [[2601.15049]] Deep Leakage with Generative Flow Matching Denoiser(https://arxiv.org/abs/2601.15049)
Keywords: generative
Abstract: Federated Learning (FL) has emerged as a powerful paradigm for decentralized model training, yet it remains vulnerable to deep leakage (DL) attacks that reconstruct private client data from shared model updates. While prior DL methods have demonstrated varying levels of success, they often suffer from instability, limited fidelity, or poor robustness under realistic FL settings. We introduce a new DL attack that integrates a generative Flow Matching (FM) prior into the reconstruction process. By guiding optimization toward the distribution of realistic images (represented by a flow matching foundation model), our method enhances reconstruction fidelity without requiring knowledge of the private data. Extensive experiments on multiple datasets and target models demonstrate that our approach consistently outperforms state-of-the-art attacks across pixel-level, perceptual, and feature-based similarity metrics. Crucially, the method remains effective across different training epochs, larger client batch sizes, and under common defenses such as noise injection, clipping, and sparsification. Our findings call for the development of new defense strategies that explicitly account for adversaries equipped with powerful generative priors.

Title: Differential Privacy Image Generation with Reconstruction Loss and Noise Injection Using an Error Feedback SGD

Authors: Qiwei Ma, Jun Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.15061
Pdf URL: https://arxiv.org/pdf/2601.15061
Copy Paste: [[2601.15061]] Differential Privacy Image Generation with Reconstruction Loss and Noise Injection Using an Error Feedback SGD(https://arxiv.org/abs/2601.15061)
Keywords: generation
Abstract: Traditional data masking techniques such as anonymization cannot achieve the expected privacy protection while ensuring data utility for privacy-preserving machine learning. Synthetic data plays an increasingly important role as it generates a large number of training samples and prevents information leakage in real data. Existing methods suffer from repeated trade-offs between privacy and utility. We propose a novel framework for differential privacy generation, which employs an Error Feedback Stochastic Gradient Descent (EFSGD) method and introduces a reconstruction loss and noise injection mechanism into the training process. We generate images with higher quality and usability under the same privacy budget as the related work. Extensive experiments demonstrate the effectiveness and generalization of our proposed framework for both grayscale and RGB images. We achieve state-of-the-art results over almost all metrics on three benchmarks: MNIST, Fashion-MNIST, and CelebA.

Title: Field-Space Autoencoder for Scalable Climate Emulators

Authors: Johannes Meuer, Maximilian Witte, Étiénne Plésiat, Thomas Ludwig, Christopher Kadow
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2601.15102
Pdf URL: https://arxiv.org/pdf/2601.15102
Copy Paste: [[2601.15102]] Field-Space Autoencoder for Scalable Climate Emulators(https://arxiv.org/abs/2601.15102)
Keywords: super-resolution, generative
Abstract: Kilometer-scale Earth system models are essential for capturing local climate change. However, these models are computationally expensive and produce petabyte-scale outputs, which limits their utility for applications such as probabilistic risk assessment. Here, we present the Field-Space Autoencoder, a scalable climate emulation framework based on a spherical compression model that overcomes these challenges. By utilizing Field-Space Attention, the model efficiently operates on native climate model output and therefore avoids geometric distortions caused by forcing spherical data onto Euclidean grids. This approach preserves physical structures significantly better than convolutional baselines. By producing a structured compressed field, it serves as a good baseline for downstream generative emulation. In addition, the model can perform zero-shot super-resolution that maps low-resolution large ensembles and scarce high-resolution data into a shared representation. We train a generative diffusion model on these compressed fields. The model can simultaneously learn internal variability from abundant low-resolution data and fine-scale physics from sparse high-resolution data. Our work bridges the gap between the high volume of low-resolution ensemble statistics and the scarcity of high-resolution physical detail.

Title: Overcoming In-Memory Bottlenecks in Graph Foundation Models via Retrieval-Augmented Generation

Authors: Haonan Yuan, Qingyun Sun, Jiacheng Tao, Xingcheng Fu, Jianxin Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.15124
Pdf URL: https://arxiv.org/pdf/2601.15124
Copy Paste: [[2601.15124]] Overcoming In-Memory Bottlenecks in Graph Foundation Models via Retrieval-Augmented Generation(https://arxiv.org/abs/2601.15124)
Keywords: generation
Abstract: Graph Foundation Models (GFMs) have emerged as a frontier in graph learning and are expected to deliver transferable representations across diverse tasks. However, GFMs remain constrained by in-memory bottlenecks: they attempt to encode knowledge into model parameters, which limits semantic capacity, introduces heavy lossy compression with conflicts, and entangles graph representation with the knowledge in ways that hinder efficient adaptation, undermining scalability and interpretability. In this work, we propose RAG-GFM, a Retrieval-Augmented Generation aided Graph Foundation Model that offloads knowledge from parameters and complements parameterized learning. To externalize graph knowledge, we build a dual-modal unified retrieval module, comprising a semantic store built from prefix-structured text and a structural store built from centrality-based motifs. To preserve heterogeneous information, we design a dual-view alignment objective that contrasts both modalities to capture both content and relational patterns. To enable efficient downstream adaptation, we perform in-context augmentation to enrich supporting instances with retrieved texts and motifs as contextual evidence. Extensive experiments on five benchmark graph datasets demonstrate that RAG-GFM consistently outperforms 13 state-of-the-art baselines in both cross-domain node and graph classification, achieving superior effectiveness and efficiency.

Title: DeepFedNAS: A Unified Framework for Principled, Hardware-Aware, and Predictor-Free Federated Neural Architecture Search

Authors: Bostan Khan, Masoud Daneshtalab
Subjects: cs.LG, cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2601.15127
Pdf URL: https://arxiv.org/pdf/2601.15127
Copy Paste: [[2601.15127]] DeepFedNAS: A Unified Framework for Principled, Hardware-Aware, and Predictor-Free Federated Neural Architecture Search(https://arxiv.org/abs/2601.15127)
Keywords: generation
Abstract: Federated Neural Architecture Search (FedNAS) aims to automate model design for privacy-preserving Federated Learning (FL) but currently faces two critical bottlenecks: unguided supernet training that yields suboptimal models, and costly multi-hour pipelines for post-training subnet discovery. We introduce DeepFedNAS, a novel, two-phase framework underpinned by a principled, multi-objective fitness function that synthesizes mathematical network design with architectural heuristics. Enabled by a re-engineered supernet, DeepFedNAS introduces Federated Pareto Optimal Supernet Training, which leverages a pre-computed Pareto-optimal cache of high-fitness architectures as an intelligent curriculum to optimize shared supernet weights. Subsequently, its Predictor-Free Search Method eliminates the need for costly accuracy surrogates by utilizing this fitness function as a direct, zero-cost proxy for accuracy, enabling on-demand subnet discovery in mere seconds. DeepFedNAS achieves state-of-the-art accuracy (e.g., up to 1.21% absolute improvement on CIFAR-100), superior parameter and communication efficiency, and a substantial ~61x speedup in total post-training search pipeline time. By reducing the pipeline from over 20 hours to approximately 20 minutes (including initial cache generation) and enabling 20-second individual subnet searches, DeepFedNAS makes hardware-aware FL deployments instantaneous and practical. The complete source code and experimental scripts are available at: this https URL
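The DeepFedNAS abstract above refers to a pre-computed Pareto-optimal cache of architectures. As a generic illustration of what Pareto-optimal means under two objectives, the sketch below keeps only candidates that no other candidate beats on both. The objectives (maximize a fitness score, minimize a cost), the function name `pareto_front`, and the candidate values are all hypothetical; the paper's actual fitness function and cache construction are not described in the abstract and are not reproduced here.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated by any other.

    Each candidate is (name, fitness, cost): higher fitness is better,
    lower cost is better. A candidate is dominated if another one is at
    least as good on both objectives and strictly better on one.
    """
    front = []
    for name, fitness, cost in candidates:
        dominated = any(
            (f >= fitness and c <= cost) and (f > fitness or c < cost)
            for _, f, c in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical architectures: (name, fitness score, cost in arbitrary units).
archs = [("A", 0.70, 10), ("B", 0.75, 20), ("C", 0.72, 25), ("D", 0.80, 40)]
front = pareto_front(archs)
print(front)
```

Here "C" is dropped because "B" has both higher fitness and lower cost, while "A", "B", and "D" each represent a different trade-off that no other candidate beats outright.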

Title: ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation

Authors: Hanlei Guo, Jiahao Shao, Xinya Chen, Xiyang Tan, Sheng Miao, Yujun Shen, Yiyi Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.15221
Pdf URL: https://arxiv.org/pdf/2601.15221
Copy Paste: [[2601.15221]] ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation(https://arxiv.org/abs/2601.15221)
Keywords: generation
Abstract: Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.

Title: FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion

Authors: Zichen Xi, Hao-Xiang Chen, Nan Xue, Hongyu Yan, Qi-Yuan Feng, Levent Burak Kara, Joaquim Jorge, Qun-Ce Xu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2601.15250
Pdf URL: https://arxiv.org/pdf/2601.15250
Copy Paste: [[2601.15250]] FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion(https://arxiv.org/abs/2601.15250)
Keywords: generation, generative
Abstract: Semantic Scene Completion (SSC) from monocular RGB images is a fundamental yet challenging task due to the inherent ambiguity of inferring occluded 3D geometry from a single view. While feed-forward methods have made progress, they often struggle to generate plausible details in occluded regions and preserve the fundamental spatial relationships of objects. Such accurate generative reasoning capability for the entire 3D space is critical in real-world applications. In this paper, we present FlowSSC, the first generative framework applied directly to monocular semantic scene completion. FlowSSC treats the SSC task as a conditional generation problem and can seamlessly integrate with existing feed-forward SSC methods to significantly boost their performance. To achieve real-time inference without compromising quality, we introduce Shortcut Flow-matching that operates in a compact triplane latent space. Unlike standard diffusion models that require hundreds of steps, our method utilizes a shortcut mechanism to achieve high-fidelity generation in a single step, enabling practical deployment in autonomous systems. Extensive experiments on SemanticKITTI demonstrate that FlowSSC achieves state-of-the-art performance, significantly outperforming existing baselines.

Title: MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs

Authors: Christoph Bartmann, Johannes Schimunek, Mykyta Ielanskyi, Philipp Seidl, Günter Klambauer, Sohvi Luukkonen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.15279
Pdf URL: https://arxiv.org/pdf/2601.15279
Copy Paste: [[2601.15279]] MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs(https://arxiv.org/abs/2601.15279)
Keywords: generation
Abstract: A molecule's properties are fundamentally determined by its composition and structure encoded in its molecular graph. Thus, reasoning about molecular properties requires the ability to parse and understand the molecular graph. Large Language Models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. Most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions. We introduce MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks. MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and reveals capability patterns that localize model failures to specific tasks and molecular structures. This provides actionable insights into the strengths and limitations of current chemistry LLMs and guides the development of models that reason faithfully over molecular structure.

Title: StableWorld: Towards Stable and Consistent Long Interactive Video Generation

Authors: Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Ziwei Liu, Chenyang Si
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.15281
Pdf URL: https://arxiv.org/pdf/2601.15281
Copy Paste: [[2601.15281]] StableWorld: Towards Stable and Consistent Long Interactive Video Generation(https://arxiv.org/abs/2601.15281)
Keywords: generation
Abstract: In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we initially investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, StableWorld, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporally consistent interactive generation. Promising results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.
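The StableWorld abstract describes filtering degraded frames out of the generation context while retaining geometrically consistent ones. The sketch below illustrates that general idea only; it is not the authors' implementation. The function name `evict_degraded`, the `consistency` scoring callable, the `threshold`, and the `min_keep` fallback are all hypothetical, and frames are stood in for by plain numbers.

```python
def evict_degraded(context, reference, consistency, threshold, min_keep=1):
    """Drop context frames whose consistency with a reference falls below a threshold.

    All names here are illustrative assumptions; the abstract does not specify
    how consistency is measured or how the threshold is chosen.
    """
    scores = [consistency(reference, frame) for frame in context]
    keep = [i for i, score in enumerate(scores) if score >= threshold]
    if len(keep) < min_keep:
        # Assumed fallback: keep the best-scoring frames so the context is never empty.
        best = sorted(range(len(context)), key=lambda i: scores[i], reverse=True)
        keep = sorted(best[:min_keep])  # restore temporal order
    return [context[i] for i in keep]


# Toy usage: each "frame" is a scalar drift value relative to a clean reference of 0.0.
frames = [0.0, 0.1, 0.9, 0.2, 1.5]
toy_consistency = lambda ref, frame: 1.0 - abs(ref - frame)
print(evict_degraded(frames, 0.0, toy_consistency, threshold=0.7))
```

In a real pipeline the scalar stand-ins would be video frames or latents, and the scoring function would be some geometric consistency measure; both are left open here.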

Title: Rethinking Video Generation Model for the Embodied World

Authors: Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2601.15282
Pdf URL: https://arxiv.org/pdf/2601.15282
Copy Paste: [[2601.15282]] Rethinking Video Generation Model for the Embodied World(https://arxiv.org/abs/2601.15282)
Keywords: generation
Abstract: Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
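The RBench abstract reports a Spearman correlation coefficient of 0.96 between benchmark scores and human evaluations. As a reminder of what that statistic measures, here is a minimal pure-Python sketch of Spearman's rank correlation, computed as the Pearson correlation of average ranks. The score lists are made-up illustrative values, not data from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores (illustrative only).
benchmark_scores = [0.81, 0.42, 0.67, 0.55, 0.90]
human_scores = [4.5, 2.0, 3.0, 3.5, 4.8]
print(round(spearman(benchmark_scores, human_scores), 3))
```

Because only ranks matter, the two score scales do not need to match; a value near 1 means the benchmark orders models almost exactly as human raters do.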

Title: LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

Authors: Ruofan Liang, Norman Müller, Ethan Weber, Duncan Zauss, Nandita Vijaykumar, Peter Kontschieder, Christian Richardt
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2601.15283
Pdf URL: https://arxiv.org/pdf/2601.15283
Copy Paste: [[2601.15283]] LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes(https://arxiv.org/abs/2601.15283)
Keywords: generative
Abstract: We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques. For video results and interactive demos, see this https URL.
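Once a scene's illumination has been factorized into per-light contributions, the controls listed in the LuxRemix abstract (on/off state, chromaticity, intensity) can be illustrated with a simple recombination, since light transport is additive in linear radiance. The sketch below assumes per-light images in linear RGB and models chromaticity as a per-channel tint multiplier; `LightEdit` and `remix` are hypothetical names, and this is not the paper's decomposition model or its Gaussian splatting renderer.

```python
from dataclasses import dataclass


@dataclass
class LightEdit:
    """Edit applied to one light source (hypothetical structure)."""
    on: bool = True
    intensity: float = 1.0
    tint: tuple = (1.0, 1.0, 1.0)  # assumed per-channel RGB multiplier


def remix(per_light_images, edits):
    """Sum per-light linear-RGB images after applying each light's edit.

    Each image is a list of [r, g, b] pixels; all images have the same length.
    """
    n_pixels = len(per_light_images[0])
    out = [[0.0, 0.0, 0.0] for _ in range(n_pixels)]
    for image, edit in zip(per_light_images, edits):
        if not edit.on:
            continue  # a switched-off light contributes nothing
        for p, pixel in enumerate(image):
            for c in range(3):
                out[p][c] += edit.intensity * edit.tint[c] * pixel[c]
    return out


# Toy two-pixel scene lit by a lamp and a window (illustrative values).
lamp = [[0.2, 0.2, 0.2], [0.0, 0.0, 0.0]]
window = [[0.1, 0.1, 0.3], [0.5, 0.5, 0.5]]
print(remix([lamp, window], [LightEdit(on=False), LightEdit(intensity=2.0)]))
```

The recombination is only valid before tone mapping or gamma encoding; display-referred images would need to be linearized first.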

Title: Walk through Paintings: Egocentric World Models from Internet Priors

Authors: Anurag Bagchi, Zhipeng Bao, Homanga Bharadhwaj, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.15284
Pdf URL: https://arxiv.org/pdf/2601.15284
Copy Paste: [[2601.15284]] Walk through Paintings: Egocentric World Models from Internet Priors(https://arxiv.org/abs/2601.15284)
Keywords: generation
Abstract: What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.

Title: Iterative Refinement Improves Compositional Image Generation

Authors: Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2601.15286
Pdf URL: https://arxiv.org/pdf/2601.15286
Copy Paste: [[2601.15286]] Iterative Refinement Improves Compositional Image Generation(https://arxiv.org/abs/2601.15286)
Keywords: generation
Abstract: Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at this https URL
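The iterative test-time strategy in this abstract, a generator refining its output across steps under feedback from a critic, can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the authors' method: `iterative_refine`, `generate`, `critique`, and `max_rounds` are hypothetical names, and the toy stand-ins below replace the T2I model and vision-language critic with simple set operations.

```python
def iterative_refine(prompt, generate, critique, max_rounds=4):
    """Alternate generation and critique until the critic accepts or rounds run out.

    generate(prompt, feedback, previous) -> new output
    critique(prompt, output) -> (accepted, feedback_text)
    Both callables are assumptions standing in for real models.
    """
    feedback, output, history = "", None, []
    for _ in range(max_rounds):
        output = generate(prompt, feedback, output)
        accepted, feedback = critique(prompt, output)
        history.append((output, feedback))
        if accepted:
            break
    return output, history


# Toy stand-ins: an "image" is the set of objects it contains.
def toy_generate(prompt, feedback, previous):
    objects = set(previous or [])
    if not objects:
        objects.add(prompt.split(", ")[0])    # first draft covers one object
    elif feedback:
        objects.add(feedback.split(", ")[0])  # fix one missing object per round
    return frozenset(objects)


def toy_critique(prompt, image):
    missing = [obj for obj in prompt.split(", ") if obj not in image]
    return (not missing, ", ".join(missing))


final, steps = iterative_refine("cat, dog, ball", toy_generate, toy_critique)
print(sorted(final), len(steps))
```

The loop spends its compute sequentially, each round conditioned on the last critique, whereas the parallel-sampling baseline mentioned in the abstract would draw independent samples and pick the best one.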