2026-01-09

Title: Generation of synthetic delay time series for air transport applications

Authors: Pau Esteve, Massimiliano Zanin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.04279
Pdf URL: https://arxiv.org/pdf/2601.04279
Copy Paste: [[2601.04279]] Generation of synthetic delay time series for air transport applications(https://arxiv.org/abs/2601.04279)
Keywords: generation
Abstract: The generation of synthetic data is receiving increasing attention from the scientific community, thanks to its ability to solve problems like data scarcity and privacy, and is starting to find applications in air transport. We here tackle the problem of generating synthetic, yet realistic, time series of delays at airports, starting from large collections of operations in Europe and the US. We specifically compare three models, two of them based on state of the art Deep Learning algorithms, and one simplified Genetic Algorithm approach. We show how the latter can generate time series that are almost indistinguishable from real ones, while maintaining a high variability. We further validate the resulting time series in a problem of detecting delay propagations between airports. We finally make the synthetic data available to the scientific community.

Title: LEGATO: Good Identity Unlearning Is Continuous

Authors: Qiang Chen, Chun-Wun Cheng, Xiu Su, Hongyan Xu, Xi Lin, Shan You, Angelica I. Aviles-Rivero, Yi Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.04282
Pdf URL: https://arxiv.org/pdf/2601.04282
Copy Paste: [[2601.04282]] LEGATO: Good Identity Unlearning Is Continuous(https://arxiv.org/abs/2601.04282)
Keywords: generative
Abstract: Machine unlearning plays a crucial role in enabling generative models trained on large datasets to remove sensitive, private, or copyright-protected data. However, existing machine unlearning methods face three challenges in learning to forget identities in generative models: 1) inefficiency, where identity erasure requires fine-tuning all of the model's parameters; 2) limited controllability, where forgetting intensity cannot be controlled and explainability is lacking; 3) catastrophic collapse, where the model's retention capability degrades drastically as forgetting progresses. Forgetting has typically been handled through discrete and unstable updates, often requiring full-model fine-tuning and leading to catastrophic collapse. In this work, we argue that identity forgetting should be modeled as a continuous trajectory, and introduce LEGATO - Learn to ForgEt Identity in GenerAtive Models via Trajectory-consistent Neural Ordinary Differential Equations. LEGATO augments pre-trained generators with fine-tunable lightweight Neural ODE adapters, enabling smooth, controllable forgetting while keeping the original model weights frozen. This formulation allows forgetting intensity to be precisely modulated via ODE step size, offering interpretability and robustness. To further ensure stability, we introduce trajectory consistency constraints that explicitly prevent catastrophic collapse during unlearning. Extensive experiments across in-domain and out-of-domain identity unlearning benchmarks show that LEGATO achieves state-of-the-art forgetting performance, avoids catastrophic collapse and reduces the number of fine-tuned parameters.

Title: ArtCognition: A Multimodal AI Framework for Affective State Sensing from Visual and Kinematic Drawing Cues

Authors: Behrad Binaei-Haghighi, Nafiseh Sadat Sajadi, Mehrad Liviyan, Reyhane Akhavan Kharazi, Fatemeh Amirkhani, Behnam Bahrak
Subjects: cs.LG, cs.CV, cs.HC, cs.IR
Abstract URL: https://arxiv.org/abs/2601.04297
Pdf URL: https://arxiv.org/pdf/2601.04297
Copy Paste: [[2601.04297]] ArtCognition: A Multimodal AI Framework for Affective State Sensing from Visual and Kinematic Drawing Cues(https://arxiv.org/abs/2601.04297)
Keywords: generation
Abstract: The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features from the final artwork, captured by computer vision models, and dynamic behavioral kinematic cues derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture. This grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that the fusion of visual and behavioral kinematic cues provides a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework's potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.

Title: Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes

Authors: Chenye Meng, Zejian Li, Zhongni Liu, Yize Li, Changle Xie, Kaixin Jia, Ling Yang, Huanghuang Deng, Shiying Ding, Shengyuan Zhang, Jiayi Li, Lingyun Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04300
Pdf URL: https://arxiv.org/pdf/2601.04300
Copy Paste: [[2601.04300]] Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes(https://arxiv.org/abs/2601.04300)
Keywords: generation
Abstract: Post-training alignment of diffusion models relies on simplified signals, such as scalar rewards or binary preferences. This limits alignment with complex human expertise, which is hierarchical and fine-grained. To address this, we first construct a set of hierarchical, fine-grained evaluation criteria with domain experts, which decomposes image quality into multiple positive and negative attributes organized in a tree structure. Building on this, we propose a two-stage alignment framework. First, we inject domain knowledge into an auxiliary diffusion model via Supervised Fine-Tuning. Second, we introduce Complex Preference Optimization (CPO), which extends DPO to align the target diffusion model with our non-binary, hierarchical criteria. Specifically, we reformulate the alignment problem to simultaneously maximize the probability of positive attributes while minimizing the probability of negative attributes with the auxiliary diffusion model. We instantiate our approach in the domain of painting generation and conduct CPO training with an annotated dataset of paintings with fine-grained attributes based on our criteria. Extensive experiments demonstrate that CPO significantly enhances generation quality and alignment with expertise, opening new avenues for fine-grained criteria alignment.
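
For readers unfamiliar with the DPO objective that CPO is said to extend, the sketch below shows the standard pairwise DPO loss in its original sequence-level form. It is not the CPO objective from the paper, and diffusion variants typically replace the exact log-likelihoods with denoising-error surrogates. The function name and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of samples."""
    # Implicit reward margin: how much more strongly the trained policy favors
    # the preferred sample over the rejected one, relative to the frozen
    # reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    # -log(sigmoid(beta * margin)), written as softplus(-beta * margin).
    return math.log1p(math.exp(-beta * margin))
```

With a zero margin the loss equals log 2, and it shrinks as the policy assigns relatively more probability to the preferred sample. A binary pair is exactly the signal the abstract describes as too coarse for hierarchical, attribute-level criteria.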

Title: Quantifying the Effect of Test Set Contamination on Generative Evaluations

Authors: Rylan Schaeffer, Joshua Kazdan, Baber Abbasi, Ken Ziyu Liu, Brando Miranda, Ahmed Ahmed, Abhay Puri, Niloofar Mireshghallah, Sanmi Koyejo
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2601.04301
Pdf URL: https://arxiv.org/pdf/2601.04301
Copy Paste: [[2601.04301]] Quantifying the Effect of Test Set Contamination on Generative Evaluations(https://arxiv.org/abs/2601.04301)
Keywords: generation, generative
Abstract: As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.
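
The two inference-time observations (temperature and solution length) can be illustrated with a toy model. The sketch assumes each memorized token is reproduced independently with a fixed probability. That simplification is introduced here for illustration only and is not the paper's analysis. Function names and example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def exact_reproduction_probability(per_token_prob, length):
    """Chance of reproducing a memorized solution token by token,
    assuming independent tokens with the same probability (toy assumption)."""
    return per_token_prob ** length

# Index 0 plays the role of the memorized token.
p_low_temperature = softmax_with_temperature([4.0, 1.0, 0.0], 0.5)[0]
p_high_temperature = softmax_with_temperature([4.0, 1.0, 0.0], 2.0)[0]
```

Under this toy model, raising the temperature lowers the per-token probability of the memorized continuation, and the probability of an exact match decays geometrically with solution length. Both effects point the same way as the abstract's findings.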

Title: Embedding Textual Information in Images Using Quinary Pixel Combinations

Authors: A V Uday Kiran Kandala
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04302
Pdf URL: https://arxiv.org/pdf/2601.04302
Copy Paste: [[2601.04302]] Embedding Textual Information in Images Using Quinary Pixel Combinations(https://arxiv.org/abs/2601.04302)
Keywords: generative
Abstract: This paper presents a novel technique for embedding textual data into images using quinary combinations of pixel intensities in RGB space. Existing methods for hiding textual information in the spatial domain of images predominantly rely on least and most significant bit (LSB and MSB) manipulation, Pixel Value Differencing (PVD), spatial perturbations in RGB channels, transform-domain methods, quantization methods, edge- and region-based methods and, more recently, deep learning and generative AI techniques. Most of them depend on flipping pixel intensities over multiple pixels, as in LSB and combinations of LSB-based methodologies, or on transform coefficients, often resulting in noise. Encoding and decoding are deterministic in most existing approaches and are computationally heavy for more complex models such as deep learning and generative AI approaches. The proposed method works on quinary pixel intensity combinations in RGB space, where five controlled pixel intensity variations in each of the R, G, and B channels yield up to one hundred and twenty-five (5^3 = 125) distinct pixel intensity combinations. These combinations are mapped to textual symbols, enabling the representation of uppercase and lowercase alphabetic characters, numeric digits, whitespace, and commonly used special characters. Different metrics, such as MSE, MAE, SNR, PSNR, SSIM, histogram comparison and heatmap analysis, were evaluated on both original and encoded images and showed no significant distortion. Furthermore, the method achieves improved embedding efficiency by encoding a complete textual symbol within a single RGB pixel, in contrast to LSB- and MSB-based approaches that typically require multiple pixels or multi-step processes, as well as transform- and learning-based methods that incur higher computational overhead.
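
The 5 x 5 x 5 = 125 counting argument can be made concrete with a small sketch. The abstract does not specify the intensity variations or the symbol table, so this example makes two assumptions: each channel stores one base-5 digit as its remainder modulo 5, and the alphabet is a hypothetical 95-symbol table. It should be read as one possible instance of "one symbol per RGB pixel", not as the author's scheme.

```python
import string

# Hypothetical symbol table (95 symbols); the paper's actual mapping is not given.
ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + " " + string.punctuation)
assert len(ALPHABET) <= 5 ** 3  # at most 125 symbols fit in three base-5 digits

def to_digits(index):
    """Write an index in 0..124 as three base-5 digits, one per channel."""
    return (index // 25, (index // 5) % 5, index % 5)

def nearest_with_residue(value, digit):
    """Closest intensity in 0..255 whose remainder modulo 5 equals `digit`."""
    low, high = max(0, value - 4), min(255, value + 4)
    candidates = [v for v in range(low, high + 1) if v % 5 == digit]
    return min(candidates, key=lambda v: abs(v - value))

def embed_symbol(pixel, symbol):
    """Return a new (R, G, B) pixel that carries `symbol`."""
    digits = to_digits(ALPHABET.index(symbol))
    return tuple(nearest_with_residue(c, d) for c, d in zip(pixel, digits))

def extract_symbol(pixel):
    """Recover the symbol from a pixel produced by `embed_symbol`."""
    r, g, b = (channel % 5 for channel in pixel)
    return ALPHABET[r * 25 + g * 5 + b]
```

Away from the 0 and 255 boundaries, this assumed variant changes each channel by at most two intensity levels, which is consistent in spirit with the low distortion the abstract reports.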

Title: Unified Text-Image Generation with Weakness-Targeted Post-Training

Authors: Jiahui Chen, Philippe Hansen-Estruch, Xiaochuang Han, Yushi Hu, Emily Dinan, Amita Kamath, Michal Drozdzal, Reyhane Askari-Hemmat, Luke Zettlemoyer, Marjan Ghazvininejad
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04339
Pdf URL: https://arxiv.org/pdf/2601.04339
Copy Paste: [[2601.04339]] Unified Text-Image Generation with Weakness-Targeted Post-Training(https://arxiv.org/abs/2601.04339)
Keywords: generation
Abstract: Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.

Title: ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

Authors: Mohsen Ghafoorian, Amirhossein Habibian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04342
Pdf URL: https://arxiv.org/pdf/2601.04342
Copy Paste: [[2601.04342]] ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers(https://arxiv.org/abs/2601.04342)
Keywords: generation
Abstract: Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while remaining competitive in quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at this https URL.
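
The constant-memory claim rests on a general property of linear attention: it can be rewritten as a recurrence over a fixed-size state. The sketch below demonstrates that property for generic causal linear attention at token granularity, with an `elu(x) + 1` feature map chosen here as an assumption. It does not reproduce ReHyAt's hybrid softmax/linear design or its chunking, which the abstract does not detail.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, a common choice for linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention_recurrent(queries, keys, values):
    """Causal linear attention with a state whose size is independent of length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fq, fk = phi(q), phi(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        denom = dot(fq, norm)
        outputs.append([sum(fq[i] * state[i][j] for i in range(d_k)) / denom
                        for j in range(d_v)])
    return outputs

def linear_attention_quadratic(queries, keys, values):
    """Same result, computed by revisiting every earlier token at each step."""
    outputs = []
    for t, q in enumerate(queries):
        fq = phi(q)
        weights = [dot(fq, phi(keys[s])) for s in range(t + 1)]
        total = sum(weights)
        outputs.append([sum(w * values[s][j] for s, w in enumerate(weights)) / total
                        for j in range(len(values[0]))])
    return outputs

queries = [[0.1, -0.4], [0.7, 0.2], [-0.3, 0.5]]
keys = [[0.3, 0.9], [-0.6, 0.1], [0.2, -0.8]]
values = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [-2.0, 1.5, 1.0]]
fast = linear_attention_recurrent(queries, keys, values)
slow = linear_attention_quadratic(queries, keys, values)
```

Both functions return the same outputs up to floating-point error, but the recurrent form only keeps `state` and `norm`, whose sizes depend on the head dimensions rather than on the number of tokens.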

Title: PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache

Authors: Kunyang Li, Mubarak Shah, Yuzhang Shang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04359
Pdf URL: https://arxiv.org/pdf/2601.04359
Copy Paste: [[2601.04359]] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache(https://arxiv.org/abs/2601.04359)
Keywords: generation, generative
Abstract: A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, on the final four frames, the portion most affected by the progressively expanding KV-cache and thus the most expensive segment of the clip, PackCache delivers 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.
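
To make "allocates cache budget according to temporal distance" more tangible, here is a minimal sketch of one way such an allocation could look. The exponential decay, the `decay` value, and the function name are assumptions made for illustration. The abstract does not state PackCache's actual decay model or selection rule.

```python
def allocate_cache_budget(num_past_frames, total_budget, num_anchor_tokens, decay=0.7):
    """Split a KV-cache token budget between anchors and past frames.

    Returns (anchor tokens kept, per-frame budgets), with the most recent
    past frame first. Anchors (text and conditioning-image tokens) are kept
    first; the remaining budget is shared with exponentially decaying weights.
    """
    anchors_kept = min(num_anchor_tokens, total_budget)
    remaining = total_budget - anchors_kept
    # Weight for a frame `distance` steps in the past (hypothetical decay law).
    weights = [decay ** distance for distance in range(num_past_frames)]
    norm = sum(weights)
    budgets = [int(remaining * w / norm) for w in weights]
    return anchors_kept, budgets

anchors, budgets = allocate_cache_budget(4, 1000, 200, decay=0.5)
```

In this toy allocation the nearest frame receives the largest share and each older frame receives less, mirroring the observation that attention to previous frames decays with temporal distance.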

Title: Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

Authors: Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2601.04442
Pdf URL: https://arxiv.org/pdf/2601.04442
Copy Paste: [[2601.04442]] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization(https://arxiv.org/abs/2601.04442)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.

Title: UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Authors: Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04453
Pdf URL: https://arxiv.org/pdf/2601.04453
Copy Paste: [[2601.04453]] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving(https://arxiv.org/abs/2601.04453)
Keywords: generation, generative
Abstract: World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at this https URL.

Title: Meta-probabilistic Modeling

Authors: Kevin Zhang, Yixin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.04462
Pdf URL: https://arxiv.org/pdf/2601.04462
Copy Paste: [[2601.04462]] Meta-probabilistic Modeling(https://arxiv.org/abs/2601.04462)
Keywords: generative
Abstract: While probabilistic graphical models can discover latent structure in data, their effectiveness hinges on choosing well-specified models. Identifying such models is challenging in practice, often requiring iterative checking and revision through trial and error. To this end, we propose meta-probabilistic modeling (MPM), a meta-learning algorithm that learns generative model structure directly from multiple related datasets. MPM uses a hierarchical architecture where global model specifications are shared across datasets while local parameters remain dataset-specific. For learning and inference, we propose a tractable VAE-inspired surrogate objective, and optimize it through bi-level optimization: local variables are updated analytically via coordinate ascent, while global parameters are trained with gradient-based methods. We evaluate MPM on object-centric image modeling and sequential text modeling, demonstrating that it adapts generative models to data while recovering meaningful latent representations.

Title: IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation

Authors: Yinghao Tang, Xueding Liu, Boyuan Zhang, Tingfeng Lan, Yupeng Xie, Jiale Lao, Yiyao Wang, Haoxuan Li, Tingting Gao, Bo Pan, Luoxuan Weng, Xiuqi Huang, Minfeng Zhu, Yingchaojie Feng, Yuyu Luo, Wei Chen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2601.04498
Pdf URL: https://arxiv.org/pdf/2601.04498
Copy Paste: [[2601.04498]] IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation(https://arxiv.org/abs/2601.04498)
Keywords: generation
Abstract: Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy with the top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at this https URL.
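
The gap between a Q-ACC of 0.90 and an I-ACC of 0.49 is easier to see with the two metrics written out. The sketch assumes that Q-ACC pools all yes/no checks and that an infographic counts as correct only when every one of its checks passes. The abstract does not spell out the exact definitions, so both are assumptions, and the data below is made up.

```python
def q_acc(results):
    """Fraction of all yes/no checks that passed, pooled over infographics."""
    answers = [answer for checks in results for answer in checks]
    return sum(answers) / len(answers)

def i_acc(results):
    """Fraction of infographics whose checks all passed (assumed definition)."""
    return sum(all(checks) for checks in results) / len(results)

# Four hypothetical infographics with ten checks each.
results = [
    [True] * 10,
    [True] * 9 + [False],
    [True] * 8 + [False] * 2,
    [True] * 10,
]
question_level = q_acc(results)      # 37 of 40 checks pass
infographic_level = i_acc(results)   # 2 of 4 infographics are fully correct
```

A handful of failed checks spread across different infographics barely moves the question-level score but can halve the infographic-level one, which is the pattern the reported numbers show.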

Title: Surface-based Molecular Design with Multi-modal Flow Matching

Authors: Fang Wu, Zhengyuan Zhou, Shuting Jin, Xiangxiang Zeng, Jure Leskovec, Jinbo Xu
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2601.04506
Pdf URL: https://arxiv.org/pdf/2601.04506
Copy Paste: [[2601.04506]] Surface-based Molecular Design with Multi-modal Flow Matching(https://arxiv.org/abs/2601.04506)
Keywords: generation, generative
Abstract: Therapeutic peptides show promise in targeting previously undruggable binding sites, with recent advancements in deep generative models enabling full-atom peptide co-design for specific protein receptors. However, the critical role of molecular surfaces in protein-protein interactions (PPIs) has been underexplored. To bridge this gap, we propose an omni-design peptide generation paradigm called SurfFlow, a novel surface-based generative algorithm that enables comprehensive co-design of sequence, structure, and surface for peptides. SurfFlow employs a multi-modality conditional flow matching (CFM) architecture to learn distributions of surface geometries and biochemical properties, enhancing peptide binding accuracy. Evaluated on the comprehensive PepMerge benchmark, SurfFlow consistently outperforms full-atom baselines across all metrics. These results highlight the advantages of considering molecular surfaces in de novo peptide discovery and demonstrate the potential of integrating multiple protein modalities for more effective therapeutic peptide discovery.
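
As background on conditional flow matching, the sketch below builds one training pair for the widely used straight-line probability path between a noise sample and a data sample. The abstract does not say which path or which per-modality formulation SurfFlow uses, so the linear path, the function names, and the toy vectors are all assumptions.

```python
def cfm_training_pair(x0, x1, t):
    """Interpolated point and regression target for a straight-line path.

    x0 is a noise sample, x1 a data sample, and t a time in [0, 1].
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)

noise = [0.0, 1.0, -1.0]
data = [2.0, 1.0, 3.0]
midpoint, velocity = cfm_training_pair(noise, data, 0.5)
```

In training, a network would receive the interpolated point and `t` and be regressed onto the target velocity. A multi-modal variant would presumably do this separately for each modality, such as sequence, structure, and surface, under whatever path suits that modality.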

Title: FaceRefiner: High-Fidelity Facial Texture Refinement with Differentiable Rendering-based Style Transfer

Authors: Chengyang Li, Baoping Cheng, Yao Cheng, Haocheng Zhang, Renshuai Liu, Yinglin Zheng, Jing Liao, Xuan Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04520
Pdf URL: https://arxiv.org/pdf/2601.04520
Copy Paste: [[2601.04520]] FaceRefiner: High-Fidelity Facial Texture Refinement with Differentiable Rendering-based Style Transfer(https://arxiv.org/abs/2601.04520)
Keywords: generation
Abstract: Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods' generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Different from current style transfer methods that only transfer high and middle level information to the result, our style transfer method integrates differentiable rendering to also transfer low level (or pixel level) information in the visible face regions. The main benefit of such multi-level information transfer is that, the details, structures and semantics in the input can thus be well preserved. The extensive experiments on Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method can improve the texture quality and the face identity preserving ability, compared with state-of-the-arts.
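Summary: Recent facial texture generation methods tend to use deep networks to synthesize image content and then fill in the UV map, producing a compelling full texture from a single image. However, the synthesized texture UV map usually comes from a space constructed by the training data or a 2D face generator, which limits how well these methods generalize to in-the-wild input images. As a result, the facial details, structures, and identity may be inconsistent with the input. In this paper, we address this issue by proposing FaceRefiner, a style-transfer-based facial texture refinement method. FaceRefiner treats the 3D sampled texture as the style and the output of a texture generation method as the content. The photo-realistic style is then expected to transfer from the style image to the content image. Unlike current style transfer methods, which transfer only high- and middle-level information to the result, our method integrates differentiable rendering to also transfer low-level (pixel-level) information in the visible face regions. The main benefit of this multi-level information transfer is that the details, structures, and semantics of the input are well preserved. Extensive experiments on the Multi-PIE, CelebA, and FFHQ datasets show that, compared with state-of-the-art methods, our refinement method improves texture quality and face identity preservation.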

Title: TSSR: Two-Stage Swap-Reward-Driven Reinforcement Learning for Character-Level SMILES Generation

Authors: Jacob Ede Levine, Yun Lyan Luo, Sai Chandra Kosaraju
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04521
Pdf URL: https://arxiv.org/pdf/2601.04521
Copy Paste: [[2601.04521]] TSSR: Two-Stage Swap-Reward-Driven Reinforcement Learning for Character-Level SMILES Generation(https://arxiv.org/abs/2601.04521)
Keywords: generation
Abstract: The design of reliable, valid, and diverse molecules is fundamental to modern drug discovery, as improved molecular generation supports efficient exploration of the chemical space for potential drug candidates and reduces the cost of early design efforts. Despite these needs, current chemical language models that generate molecules as SMILES strings are vulnerable to compounding token errors: many samples are unparseable or chemically implausible, and hard constraints meant to prevent failure can restrict exploration. To address this gap, we introduce TSSR, a Two-Stage, Swap-Reward-driven reinforcement learning (RL) framework for character-level SMILES generation. Stage one rewards local token swaps that repair syntax, promoting transitions from invalid to parseable strings. Stage two provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity issues. The reward decomposes into interpretable terms (swap efficiency, error reduction, distance to validity), is model agnostic, and requires no task-specific labels or hand-crafted grammars. We evaluated TSSR on the MOSES benchmark using a GRU policy trained with PPO in both pure RL (P-RL) from random initialization and fine-tuning RL (F-RL) starting from a pretrained chemical language model, assessing 10,000 generated SMILES per run. In P-RL, TSSR significantly improves syntactic validity, chemical validity, and novelty. In F-RL, TSSR preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows that syntax edits and chemistry fixes act jointly to reduce RDKit detected errors. TSSR converts a sparse terminal objective into a denser and more interpretable reward, improving both syntactic and chemical quality without reducing diversity. TSSR is dataset-agnostic and can be adapted to various reinforcement learning approaches.
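Summary: Designing reliable, valid, and diverse molecules is fundamental to modern drug discovery, because better molecular generation supports efficient exploration of chemical space for potential drug candidates and lowers the cost of early design work. Despite these needs, current chemical language models that generate molecules as SMILES strings are prone to compounding token errors: many samples are unparseable or chemically implausible, and hard constraints meant to prevent failure can restrict exploration. To address this gap, we introduce TSSR, a Two-Stage, Swap-Reward-driven reinforcement learning (RL) framework for character-level SMILES generation. Stage one rewards local token swaps that repair syntax, promoting transitions from invalid to parseable strings. Stage two provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity problems. The reward decomposes into interpretable terms (swap efficiency, error reduction, distance to validity), is model-agnostic, and requires no task-specific labels or hand-crafted grammars. We evaluate TSSR on the MOSES benchmark using a GRU policy trained with PPO, both in pure RL (P-RL) from random initialization and in fine-tuning RL (F-RL) starting from a pretrained chemical language model, assessing 10,000 generated SMILES per run. In P-RL, TSSR significantly improves syntactic validity, chemical validity, and novelty. In F-RL, TSSR preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows that syntax edits and chemistry fixes act jointly to reduce RDKit-detected errors. TSSR converts a sparse terminal objective into a denser, more interpretable reward, improving syntactic and chemical quality without reducing diversity. TSSR is dataset-agnostic and can be adapted to various reinforcement learning approaches.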

Title: GEnSHIN: Graphical Enhanced Spatio-temporal Hierarchical Inference Network for Traffic Flow Prediction

Authors: Zhiyan Zhou, Junjie Liao, Manho Zhang, Yingyi Liao, Ziai Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.04550
Pdf URL: https://arxiv.org/pdf/2601.04550
Copy Paste: [[2601.04550]] GEnSHIN: Graphical Enhanced Spatio-temporal Hierarchical Inference Network for Traffic Flow Prediction(https://arxiv.org/abs/2601.04550)
Keywords: generation
Abstract: With the acceleration of urbanization, intelligent transportation systems have an increasing demand for accurate traffic flow prediction. This paper proposes a novel Graph Enhanced Spatio-temporal Hierarchical Inference Network (GEnSHIN) to handle the complex spatio-temporal dependencies in traffic flow prediction. The model integrates three innovative designs: 1) An attention-enhanced Graph Convolutional Recurrent Unit (GCRU), which strengthens the modeling capability for long-term temporal dependencies by introducing Transformer modules; 2) An asymmetric dual-embedding graph generation mechanism, which leverages the real road network and data-driven latent asymmetric topology to generate graph structures that better fit the characteristics of actual traffic flow; 3) A dynamic memory bank module, which utilizes learnable traffic pattern prototypes to provide personalized traffic pattern representations for each sensor node, and introduces a lightweight graph updater during the decoding phase to adapt to dynamic changes in road network states. Extensive experiments on the public dataset METR-LA show that GEnSHIN achieves or surpasses the performance of comparative models across multiple metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). Notably, the model demonstrates excellent prediction stability during peak morning and evening traffic hours. Ablation experiments further validate the effectiveness of each core module and its contribution to the final performance.
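Summary: As urbanization accelerates, intelligent transportation systems increasingly demand accurate traffic flow prediction. This paper proposes a novel Graph Enhanced Spatio-temporal Hierarchical Inference Network (GEnSHIN) to handle the complex spatio-temporal dependencies in traffic flow prediction. The model integrates three innovative designs: 1) an attention-enhanced Graph Convolutional Recurrent Unit (GCRU), which strengthens the modeling of long-term temporal dependencies by introducing Transformer modules; 2) an asymmetric dual-embedding graph generation mechanism, which uses the real road network and a data-driven latent asymmetric topology to generate graph structures that better fit actual traffic flow; 3) a dynamic memory bank module, which uses learnable traffic pattern prototypes to provide a personalized traffic pattern representation for each sensor node, and introduces a lightweight graph updater in the decoding phase to adapt to dynamic changes in road network state. Extensive experiments on the public METR-LA dataset show that GEnSHIN matches or surpasses the comparison models on multiple metrics, including Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). Notably, the model shows excellent prediction stability during morning and evening peak traffic hours. Ablation experiments further validate the effectiveness of each core module and its contribution to the final performance.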
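For reference, the three error metrics named in this abstract (MAE, RMSE, MAPE) can be computed as in the minimal sketch below. This is a generic illustration, not the paper's evaluation code; the function names and the `eps` threshold for skipping near-zero ground-truth values in MAPE are assumptions (traffic benchmarks commonly mask zero readings, but the abstract does not say how GEnSHIN handles them).

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error over paired observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error over paired observations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred, eps=1e-8):
    """Mean Absolute Percentage Error in percent.

    Pairs whose ground truth is (near) zero are skipped to avoid division
    by zero; this masking rule is an assumption of the sketch.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if abs(t) > eps]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

# Made-up speeds, for illustration only.
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 40.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```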

Title: A Vision for Multisensory Intelligence: Sensing, Synergy, and Science

Authors: Paul Pu Liang
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.04563
Pdf URL: https://arxiv.org/pdf/2601.04563
Copy Paste: [[2601.04563]] A Vision for Multisensory Intelligence: Sensing, Synergy, and Science(https://arxiv.org/abs/2601.04563)
Keywords: generation
Abstract: Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, developing a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross-modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab, see this https URL.
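Summary: Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet artificial intelligence has advanced mainly in digital modalities such as text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and to a rich spectrum of signals, from physiological and tactile cues on the body to physical and social signals in homes, cities, and the environment. We outline how the field must advance through three interrelated themes: sensing, science, and synergy. First, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Second, the field should develop a principled science for quantifying multimodal heterogeneity and interactions, develop unified modeling architectures and representations, and understand cross-modal transfer. Finally, we present new technical challenges in learning synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper is a series of projects, resources, and demos of the latest advances from the Multisensory Intelligence group at the MIT Media Lab; see this https URL.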

Title: Spatial-Temporal Feedback Diffusion Guidance for Controlled Traffic Imputation

Authors: Xiaowei Mao, Huihu Ding, Yan Lin, Tingrui Wu, Shengnan Guo, Dazhuo Qiu, Feiling Fang, Jilin Hu, Huaiyu Wan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04572
Pdf URL: https://arxiv.org/pdf/2601.04572
Copy Paste: [[2601.04572]] Spatial-Temporal Feedback Diffusion Guidance for Controlled Traffic Imputation(https://arxiv.org/abs/2601.04572)
Keywords: generative
Abstract: Imputing missing values in spatial-temporal traffic data is essential for intelligent transportation systems. Among advanced imputation methods, score-based diffusion models have demonstrated competitive performance. These models generate data by reversing a noising process, using observed values as conditional guidance. However, existing diffusion models typically apply a uniform guidance scale across both spatial and temporal dimensions, which is inadequate for nodes with high missing data rates. Sparse observations provide insufficient conditional guidance, causing the generative process to drift toward the learned prior distribution rather than closely following the conditional observations, resulting in suboptimal imputation performance. To address this, we propose FENCE, a spatial-temporal feedback diffusion guidance method designed to adaptively control guidance scales during imputation. First, FENCE introduces a dynamic feedback mechanism that adjusts the guidance scale based on the posterior likelihood approximations. The guidance scale is increased when generated values diverge from observations and reduced when alignment improves, preventing overcorrection. Second, because alignment to observations varies across nodes and denoising steps, a global guidance scale for all nodes is suboptimal. FENCE computes guidance scales at the cluster level by grouping nodes based on their attention scores, leveraging spatial-temporal correlations to provide more accurate guidance. Experimental results on real-world traffic datasets show that FENCE significantly enhances imputation accuracy.
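Summary: Imputing missing values in spatial-temporal traffic data is essential for intelligent transportation systems. Among advanced imputation methods, score-based diffusion models have shown competitive performance. These models generate data by reversing a noising process, using observed values as conditional guidance. However, existing diffusion models typically apply a uniform guidance scale across both spatial and temporal dimensions, which is inadequate for nodes with high missing-data rates. Sparse observations provide insufficient conditional guidance, so the generative process drifts toward the learned prior distribution instead of closely following the conditional observations, resulting in suboptimal imputation performance. To address this, we propose FENCE, a spatial-temporal feedback diffusion guidance method designed to adaptively control guidance scales during imputation. First, FENCE introduces a dynamic feedback mechanism that adjusts the guidance scale based on posterior likelihood approximations. The guidance scale is increased when generated values diverge from the observations and reduced when alignment improves, preventing overcorrection. Second, because alignment to the observations varies across nodes and denoising steps, a single global guidance scale for all nodes is suboptimal. FENCE computes guidance scales at the cluster level by grouping nodes according to their attention scores, using spatial-temporal correlations to provide more accurate guidance. Experimental results on real-world traffic datasets show that FENCE significantly improves imputation accuracy.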
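The feedback idea described in this abstract (raise the guidance scale when generated values drift from the observations, lower it when alignment improves) can be illustrated with the toy sketch below. It is not FENCE's actual rule: the abstract says the adjustment is based on posterior likelihood approximations and is computed per cluster, whereas this sketch uses a plain mean absolute error on observed entries and a hypothetical multiplicative update with assumed parameters `gain`, `lo`, and `hi`.

```python
def observed_error(generated, observed, mask):
    """Mean absolute error restricted to entries where mask is 1 (observed)."""
    diffs = [abs(g - o) for g, o, m in zip(generated, observed, mask) if m]
    return sum(diffs) / len(diffs)

def update_guidance_scale(scale, prev_error, curr_error, gain=0.5, lo=0.1, hi=10.0):
    """Hypothetical feedback rule, for illustration only.

    Increase the scale when the error on observed entries grew since the
    previous denoising step, decrease it when the error shrank, and clip
    the result to [lo, hi] so the correction cannot run away.
    """
    if curr_error > prev_error:
        scale *= 1.0 + gain
    elif curr_error < prev_error:
        scale /= 1.0 + gain
    return min(hi, max(lo, scale))

# Two consecutive denoising steps for one node (made-up values).
observed = [1.0, 0.0, 3.0]
mask = [1, 0, 1]  # the middle entry is missing
err_prev = observed_error([1.2, 5.0, 2.8], observed, mask)
err_curr = observed_error([1.6, 5.0, 2.4], observed, mask)
print(update_guidance_scale(1.0, err_prev, err_curr))
```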

Title: 3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks

Authors: Yusri Al-Sanaani, Rebecca Thornhill, Sreeraman Rajan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04588
Pdf URL: https://arxiv.org/pdf/2601.04588
Copy Paste: [[2601.04588]] 3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks(https://arxiv.org/abs/2601.04588)
Keywords: generative
Abstract: Segmentation of the left atrial (LA) wall and endocardium from late gadolinium-enhanced (LGE) MRI is essential for quantifying atrial fibrosis in patients with atrial fibrillation. The development of accurate machine learning-based segmentation models remains challenging due to the limited availability of data and the complexity of anatomical structures. In this work, we investigate 3D conditional generative models as potential solution for augmenting scarce LGE training data and improving LA segmentation performance. We develop a pipeline to synthesize high-fidelity 3D LGE MRI volumes from composite semantic label maps combining anatomical expert annotations with unsupervised tissue clusters, using three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM). The synthetic images are evaluated for realism and their impact on downstream LA segmentation. SPADE-LDM generates the most realistic and structurally accurate images, achieving an FID of 4.063 and surpassing GAN models, which have FIDs of 40.821 and 7.652 for Pix2Pix and SPADE-GAN, respectively. When augmented with synthetic LGE images, the Dice score for LA cavity segmentation with a 3D U-Net model improved from 0.908 to 0.936, showing a statistically significant improvement (p < 0.05) over the this http URL findings demonstrate the potential of label-conditioned 3D synthesis to enhance the segmentation of under-represented cardiac structures.
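Summary: Segmenting the left atrial (LA) wall and endocardium from late gadolinium-enhanced (LGE) MRI is essential for quantifying atrial fibrosis in patients with atrial fibrillation. Developing accurate machine-learning-based segmentation models remains challenging because of limited data availability and the complexity of the anatomical structures. In this work, we investigate 3D conditional generative models as a potential solution for augmenting scarce LGE training data and improving LA segmentation performance. We develop a pipeline that synthesizes high-fidelity 3D LGE MRI volumes from composite semantic label maps, which combine expert anatomical annotations with unsupervised tissue clusters, using three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM). The synthetic images are evaluated for realism and for their impact on downstream LA segmentation. SPADE-LDM generates the most realistic and structurally accurate images, achieving an FID of 4.063 and surpassing the GAN models, which have FIDs of 40.821 and 7.652 for Pix2Pix and SPADE-GAN, respectively. When augmented with synthetic LGE images, the Dice score for LA cavity segmentation with a 3D U-Net model improved from 0.908 to 0.936, showing a statistically significant improvement (p < 0.05) compared with this http URL results, which demonstrates the potential of label-conditioned 3D synthesis to enhance the segmentation of under-represented cardiac structures.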
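The Dice score reported in this abstract (0.908 without and 0.936 with synthetic augmentation) measures overlap between a predicted and a reference mask. A minimal sketch of the standard definition on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions of this sketch, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / total

# Made-up 4-voxel masks, for illustration only.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_score(pred, target))  # 0.5: one shared voxel, two voxels in each mask
```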

Title: MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing

Authors: Zihao Lin, Wanrong Zhu, Jiuxiang Gu, Jihyung Kil, Christopher Tensmeyer, Lin Zhang, Shilong Liu, Ruiyi Zhang, Lifu Huang, Vlad I. Morariu, Tong Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04589
Pdf URL: https://arxiv.org/pdf/2601.04589
Copy Paste: [[2601.04589]] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing(https://arxiv.org/abs/2601.04589)
Keywords: generation
Abstract: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions including instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.
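Summary: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify the relevant layers and coordinate modifications. Prior work has largely overlooked multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what to modify and where. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To benchmark this setting systematically, we introduce MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions: instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models show that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and reaching performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.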

Title: HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment

Authors: Wenzhi Chen, Bo Hu, Leida Li, Lihuo He, Wen Lu, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04614
Pdf URL: https://arxiv.org/pdf/2601.04614
Copy Paste: [[2601.04614]] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment(https://arxiv.org/abs/2601.04614)
Keywords: generation
Abstract: With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.
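Summary: With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, and lack the ability to adapt to different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that uses hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating the Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single-database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.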
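The step "map Euclidean features to hyperbolic space" can be illustrated with the exponential map at the origin of the Poincaré ball, a common choice for this kind of mapping. The abstract does not state which hyperbolic model or curvature HyperAlign uses, so the Poincaré ball with curvature -1 and the function names below are assumptions of this sketch.

```python
import math

def exp_map_origin(v):
    """Map a Euclidean vector into the unit Poincaré ball (curvature -1).

    Uses exp_0(v) = tanh(||v||) * v / ||v||, so the result always has
    norm strictly less than 1.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    factor = math.tanh(norm) / norm
    return [factor * x for x in v]

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * sq_dist / ((1.0 - nx) * (1.0 - ny)))

# A made-up 2-D "feature" with Euclidean norm 0.5.
point = exp_map_origin([0.3, 0.4])
# The distance from the origin equals twice the Euclidean norm of the input.
print(poincare_distance([0.0, 0.0], point))
```

Distances grow rapidly near the boundary of the ball, which is one reason hyperbolic space is often used for hierarchical or entailment-like structure.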

Title: Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning

Authors: Wentao Zhang, Lifei Wang, Lina Lu, MingKun Xu, Shangyang Li, Yanchao Yang, Tao Fang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2601.04672
Pdf URL: https://arxiv.org/pdf/2601.04672
Copy Paste: [[2601.04672]] Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning(https://arxiv.org/abs/2601.04672)
Keywords: generation
Abstract: Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a novel proposed reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.
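Summary: Agricultural disease diagnosis is challenging for VLMs, because conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. Although reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automatically generates high-quality reasoning data through vision-language synthesis and LLM-based filtering, using only 19% of the available samples. Training uses Group Relative Policy Optimization (GRPO) with a novel reward function that integrates domain-specific lexicons and fuzzy matching to assess both the correctness and the linguistic flexibility of open-ended responses. Evaluated on CDDMBench, the resulting 3B-parameter model performs competitively with 7B- to 13B-parameter baselines, showing a 23.2% relative gain in disease recognition accuracy, a 33.3% gain in agricultural knowledge QA, and a 26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, and that the benefits grow as question complexity increases.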

Title: HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution

Authors: Yang Zou, Xingyue Zhu, Kaiqi Han, Jun Ma, Xingyuan Li, Zhiying Jiang, Jinyuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04682
Pdf URL: https://arxiv.org/pdf/2601.04682
Copy Paste: [[2601.04682]] HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution(https://arxiv.org/abs/2601.04682)
Keywords: super-resolution
Abstract: Infrared video has been of great interest in visual tasks under challenging environments, but often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of degradation between turbulence and resolution. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 X 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR. Project page: this https URL
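Summary: Infrared video has attracted great interest for visual tasks in challenging environments, but it often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation, because the degradations from turbulence and from resolution loss are modeled separately. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure faithful structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation through turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 X 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR. Project page: this https URL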

Title: Do LLMs Benefit from User and Item Embeddings in Recommendation Tasks?

Authors: Mir Rayat Imtiaz Hossain, Leo Feng, Leonid Sigal, Mohamed Osama Ahmed
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.04690
Pdf URL: https://arxiv.org/pdf/2601.04690
Copy Paste: [[2601.04690]] Do LLMs Benefit from User and Item Embeddings in Recommendation Tasks?(https://arxiv.org/abs/2601.04690)
Keywords: generative
Abstract: Large Language Models (LLMs) have emerged as promising recommendation systems, offering novel ways to model user preferences through generative approaches. However, many existing methods often rely solely on text semantics or incorporate collaborative signals in a limited manner, typically using only user or item embeddings. These methods struggle to handle multiple item embeddings representing user history, reverting to textual semantics and neglecting richer collaborative information. In this work, we propose a simple yet effective solution that projects user and item embeddings, learned from collaborative filtering, into the LLM token space via separate lightweight projector modules. A finetuned LLM then conditions on these projected embeddings alongside textual tokens to generate recommendations. Preliminary results show that this design effectively leverages structured user-item interaction data, improves recommendation performance over text-only LLM baselines, and offers a practical path for bridging traditional recommendation systems with modern LLMs.
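Summary: Large Language Models (LLMs) have emerged as promising recommendation systems, offering novel ways to model user preferences through generative approaches. However, many existing methods rely solely on text semantics or incorporate collaborative signals only in a limited way, typically using only user or item embeddings. These methods struggle to handle the multiple item embeddings that represent a user's history, falling back on textual semantics and neglecting richer collaborative information. In this work, we propose a simple yet effective solution that projects user and item embeddings, learned from collaborative filtering, into the LLM token space through separate lightweight projector modules. A fine-tuned LLM then conditions on these projected embeddings alongside textual tokens to generate recommendations. Preliminary results show that this design effectively uses structured user-item interaction data, improves recommendation performance over text-only LLM baselines, and offers a practical path for bridging traditional recommendation systems with modern LLMs.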

Title: Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models

Authors: Yanbing Zeng, Jia Wang, Hanghang Ma, Junqiang Wu, Jie Zhu, Xiaoming Wei, Jie Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04706
Pdf URL: https://arxiv.org/pdf/2601.04706
Copy Paste: [[2601.04706]] Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models(https://arxiv.org/abs/2601.04706)
Keywords: generation
Abstract: Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and codes are available at this https URL.
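Summary: Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous work, which focuses on using the reasoning abilities and world knowledge of understanding models, this paper introduces a new perspective: using understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In our framework's generation process, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, through a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and the Bridge Adapter. Our framework shows exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while maintaining instruction-following accuracy and enhancing the application of world knowledge. Models and code are available at this https URL.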

Title: MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training

Authors: Irfan Ullah, Young-Koo Lee
Subjects: cs.LG, cs.AI, cs.DC, cs.PF
Abstract URL: https://arxiv.org/abs/2601.04707
Pdf URL: https://arxiv.org/pdf/2601.04707
Copy Paste: [[2601.04707]] MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training(https://arxiv.org/abs/2601.04707)
Keywords: generation
Abstract: Graph Neural Networks (GNNs) are powerful tools for learning graph-structured data, but their scalability is hindered by inefficient mini-batch generation, data transfer bottlenecks, and costly inter-GPU synchronization. Existing training frameworks fail to overlap these stages, leading to suboptimal resource utilization. This paper proposes MQ-GNN, a multi-queue pipelined framework that maximizes training efficiency by interleaving GNN training stages and optimizing resource utilization. MQ-GNN introduces Ready-to-Update Asynchronous Consistent Model (RaCoM), which enables asynchronous gradient sharing and model updates while ensuring global consistency through adaptive periodic synchronization. Additionally, it employs global neighbor sampling with caching to reduce data transfer overhead and an adaptive queue-sizing strategy to balance computation and memory efficiency. Experiments on four large-scale datasets and ten baseline models demonstrate that MQ-GNN achieves up to 4.6× faster training time and 30% improved GPU utilization while maintaining competitive accuracy. These results establish MQ-GNN as a scalable and efficient solution for multi-GPU GNN training.
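Summary: Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is hindered by inefficient mini-batch generation, data transfer bottlenecks, and costly inter-GPU synchronization. Existing training frameworks fail to overlap these stages, leading to suboptimal resource utilization. This paper proposes MQ-GNN, a multi-queue pipelined framework that maximizes training efficiency by interleaving GNN training stages and optimizing resource utilization. MQ-GNN introduces the Ready-to-Update Asynchronous Consistent Model (RaCoM), which enables asynchronous gradient sharing and model updates while ensuring global consistency through adaptive periodic synchronization. In addition, it uses global neighbor sampling with caching to reduce data transfer overhead, and an adaptive queue-sizing strategy to balance computation and memory efficiency. Experiments on four large-scale datasets and ten baseline models show that MQ-GNN achieves up to 4.6× faster training and 30% better GPU utilization while maintaining competitive accuracy. These results establish MQ-GNN as a scalable and efficient solution for multi-GPU GNN training.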
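The general idea of overlapping training stages through bounded queues can be sketched with Python's standard `queue` and `threading` modules, as below. This is a toy producer-consumer pipeline, not MQ-GNN's implementation: the three stages (sampling, transfer, training), the fixed `queue_size`, and all function names are stand-ins assumed for illustration, and RaCoM, caching, and adaptive queue sizing are not modeled.

```python
import queue
import threading

SENTINEL = None  # marks the end of the batch stream

def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            break
        outbox.put(fn(item))

def run_pipeline(batch_ids, queue_size=2):
    # Bounded queues let a fast stage run ahead by at most queue_size batches.
    sampled = queue.Queue(maxsize=queue_size)
    on_device = queue.Queue(maxsize=queue_size)

    def sampler():
        # Stand-in for mini-batch generation (neighbor sampling).
        for i in batch_ids:
            sampled.put({"batch": i, "nodes": [i, i + 1]})
        sampled.put(SENTINEL)

    def transfer(batch):
        # Stand-in for the host-to-GPU copy.
        return {**batch, "on_device": True}

    threads = [
        threading.Thread(target=sampler),
        threading.Thread(target=stage, args=(transfer, sampled, on_device)),
    ]
    for t in threads:
        t.start()

    results = []
    while True:  # stand-in for the training loop, run in the main thread
        batch = on_device.get()
        if batch is SENTINEL:
            break
        results.append((batch["batch"], sum(batch["nodes"])))

    for t in threads:
        t.join()
    return results

print(run_pipeline(range(4)))
```

Because each stage runs in a single thread and the queues are FIFO, batches arrive at the training loop in their original order while sampling and transfer of later batches proceed concurrently.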

Title: AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection

Authors: Yunqing Hu, Zheming Yang, Chang Zhao, Qi Guo, Meng Gao, Pengcheng Li, Wen Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04734
Pdf URL: https://arxiv.org/pdf/2601.04734
Copy Paste: [[2601.04734]] AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection(https://arxiv.org/abs/2601.04734)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.
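Abstract (translated from the Chinese): Multimodal large language models (MLLMs) show exceptional capabilities in semantic understanding and visual reasoning, but they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, the paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through collaboration between lightweight edge detectors and cloud-based MLLMs. To make the cloud MLLM more robust to noise in edge-cropped boxes and to scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, which significantly improves classification accuracy and semantic consistency. In addition, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results show that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.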

Title: Intraday spatiotemporal PV power prediction at national scale using satellite-based solar forecast models

Authors: Luca Lanzilao, Angela Meyer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.04751
Pdf URL: https://arxiv.org/pdf/2601.04751
Copy Paste: [[2601.04751]] Intraday spatiotemporal PV power prediction at national scale using satellite-based solar forecast models(https://arxiv.org/abs/2601.04751)
Keywords: generation
Abstract: We present a novel framework for spatiotemporal photovoltaic (PV) power forecasting and use it to evaluate the reliability, sharpness, and overall performance of seven intraday PV power nowcasting models. The model suite includes satellite-based deep learning and optical-flow approaches and physics-based numerical weather prediction models, covering both deterministic and probabilistic formulations. Forecasts are first validated against satellite-derived surface solar irradiance (SSI). Irradiance fields are then converted into PV power using station-specific machine learning models, enabling comparison with production data from 6434 PV stations across Switzerland. To our knowledge, this is the first study to investigate spatiotemporal PV forecasting at a national scale. We additionally provide the first visualizations of how mesoscale cloud systems shape national PV production on hourly and sub-hourly timescales. Our results show that satellite-based approaches outperform the Integrated Forecast System (IFS-ENS), particularly at short lead times. Among them, SolarSTEPS and SHADECast deliver the most accurate SSI and PV power predictions, with SHADECast providing the most reliable ensemble spread. The deterministic model IrradianceNet achieves the lowest root mean square error, while probabilistic forecasts of SolarSTEPS and SHADECast provide better-calibrated uncertainty. Forecast skill generally decreases with elevation. At a national scale, satellite-based models forecast the daily total PV generation with relative errors below 10% for 82% of the days in 2019-2020, demonstrating robustness and their potential for operational use.
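Abstract (translated from the Chinese): We present a novel framework for spatiotemporal photovoltaic (PV) power forecasting and use it to evaluate the reliability, sharpness, and overall performance of seven intraday PV power nowcasting models. The model suite includes satellite-based deep learning and optical-flow approaches as well as physics-based numerical weather prediction models, covering both deterministic and probabilistic formulations. Forecasts are first validated against satellite-derived surface solar irradiance (SSI). Irradiance fields are then converted into PV power using station-specific machine learning models, which allows comparison with production data from 6434 PV stations across Switzerland. To our knowledge, this is the first study to investigate spatiotemporal PV forecasting at a national scale. We also provide the first visualizations of how mesoscale cloud systems shape national PV production on hourly and sub-hourly timescales. Our results show that satellite-based approaches outperform the Integrated Forecast System (IFS-ENS), particularly at short lead times. Among them, SolarSTEPS and SHADECast deliver the most accurate SSI and PV power predictions, with SHADECast providing the most reliable ensemble spread. The deterministic model IrradianceNet achieves the lowest root mean square error, while the probabilistic forecasts of SolarSTEPS and SHADECast provide better-calibrated uncertainty. Forecast skill generally decreases with elevation. At a national scale, satellite-based models forecast the daily total PV generation with relative errors below 10% on 82% of the days in 2019-2020, demonstrating their robustness and potential for operational use.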
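
The headline metric in the entry above (share of days whose daily-total forecast has a relative error below 10%) can be illustrated with a minimal sketch. The function name `share_of_days_within` and the toy numbers are assumptions made for illustration; they are not taken from the paper, and the paper's exact error definition may differ.

```python
def share_of_days_within(forecast, observed, threshold=0.10):
    """Fraction of days whose relative daily-total error is below `threshold`.

    Relative error is taken here as |forecast - observed| / observed
    (an assumed definition). Days with non-positive observed production
    are skipped because the ratio is undefined for them.
    """
    if len(forecast) != len(observed):
        raise ValueError("forecast and observed must have the same length")
    hits = 0
    valid = 0
    for f, o in zip(forecast, observed):
        if o <= 0:
            continue
        valid += 1
        if abs(f - o) / o < threshold:
            hits += 1
    return hits / valid if valid else float("nan")


# Toy daily totals in arbitrary energy units (made-up values).
observed = [100.0, 80.0, 120.0, 50.0, 90.0]
forecast = [105.0, 70.0, 118.0, 56.0, 91.0]
share = share_of_days_within(forecast, observed)
print(f"{share:.0%} of days are within the 10% band")
```

Skipping zero-production days is one possible convention; a real evaluation would need to state how such days are handled.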

Title: CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models

Authors: Tobia Poppi, Burak Uzkent, Amanmeet Garg, Lucas Porto, Garin Kessler, Yezhou Yang, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, Florian Schiffers
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2601.04778
Pdf URL: https://arxiv.org/pdf/2601.04778
Copy Paste: [[2601.04778]] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models(https://arxiv.org/abs/2601.04778)
Keywords: generation
Abstract: Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
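Abstract (translated from the Chinese): Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of about 26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.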
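
For readers unfamiliar with Direct Preference Optimization, which the entry above builds on, here is a minimal sketch of the standard per-pair DPO loss. It is not MixDPO itself: the abstract does not say how textual and visual preferences are combined. The function name `dpo_loss` and the default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the trained policy and under
    a frozen reference model. The loss is -log(sigmoid(margin)), where the
    margin measures how much more the policy favors the chosen response over
    the rejected one, relative to the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference has zero margin, so the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
```

The loss falls below log(2) once the policy raises the chosen response's likelihood relative to the rejected one, and rises above it in the opposite case.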

Title: SRU-Pix2Pix: A Fusion-Driven Generator Network for Medical Image Translation with Few-Shot Learning

Authors: Xihe Qiu, Yang Dai, Xiaoyu Tan, Sijia Li, Fenghao Sun, Lu Gan, Liang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04785
Pdf URL: https://arxiv.org/pdf/2601.04785
Copy Paste: [[2601.04785]] SRU-Pix2Pix: A Fusion-Driven Generator Network for Medical Image Translation with Few-Shot Learning(https://arxiv.org/abs/2601.04785)
Keywords: generation
Abstract: Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results demonstrate that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.
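Abstract (translated from the Chinese): Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results show that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.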

Title: Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers

Authors: Lee Hyoseok, Sohwi Lim, Eunju Cha, Tae-Hyun Oh
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04791
Pdf URL: https://arxiv.org/pdf/2601.04791
Copy Paste: [[2601.04791]] Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers(https://arxiv.org/abs/2601.04791)
Keywords: restoration, generative
Abstract: With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems across domains. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver's dynamics and the true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce the Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.
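Abstract (translated from the Chinese): With recent advances in generative models, diffusion models have become powerful priors for solving inverse problems across domains. Because Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers remain unstable, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver's dynamics and the true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce the Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption and therefore behaves more stably and reliably. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. We also analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.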

Title: On the Definition and Detection of Cherry-Picking in Counterfactual Explanations

Authors: James Hinns, Sofie Goethals, Stephan Van der Veeken, Theodoros Evgeniou, David Martens
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04977
Pdf URL: https://arxiv.org/pdf/2601.04977
Copy Paste: [[2601.04977]] On the Definition and Detection of Cherry-Picking in Counterfactual Explanations(https://arxiv.org/abs/2601.04977)
Keywords: generation
Abstract: Counterfactual explanations are widely used to communicate how inputs must change for a model to alter its prediction. For a single instance, many valid counterfactuals can exist, which leaves open the possibility for an explanation provider to cherry-pick explanations that better suit a narrative of their choice, highlighting favourable behaviour and withholding examples that reveal problematic behaviour. We formally define cherry-picking for counterfactual explanations in terms of an admissible explanation space, specified by the generation procedure, and a utility function. We then study to what extent an external auditor can detect such manipulation. Considering three levels of access to the explanation process: full procedural access, partial procedural access, and explanation-only access, we show that detection is extremely limited in practice. Even with full procedural access, cherry-picked explanations can remain difficult to distinguish from non cherry-picked explanations, because the multiplicity of valid counterfactuals and flexibility in the explanation specification provide sufficient degrees of freedom to mask deliberate selection. Empirically, we demonstrate that this variability often exceeds the effect of cherry-picking on standard counterfactual quality metrics such as proximity, plausibility, and sparsity, making cherry-picked explanations statistically indistinguishable from baseline explanations. We argue that safeguards should therefore prioritise reproducibility, standardisation, and procedural constraints over post-hoc detection, and we provide recommendations for algorithm developers, explanation providers, and auditors.
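Abstract (translated from the Chinese): Counterfactual explanations are widely used to communicate how inputs must change for a model to alter its prediction. For a single instance, many valid counterfactuals can exist, which leaves open the possibility for an explanation provider to cherry-pick explanations that better suit a narrative of their choice, highlighting favourable behaviour and withholding examples that reveal problematic behaviour. We formally define cherry-picking for counterfactual explanations in terms of an admissible explanation space, specified by the generation procedure, and a utility function. We then study to what extent an external auditor can detect such manipulation. Considering three levels of access to the explanation process (full procedural access, partial procedural access, and explanation-only access), we show that detection is extremely limited in practice. Even with full procedural access, cherry-picked explanations can remain difficult to distinguish from non-cherry-picked ones, because the multiplicity of valid counterfactuals and the flexibility in the explanation specification provide enough degrees of freedom to mask deliberate selection. Empirically, we demonstrate that this variability often exceeds the effect of cherry-picking on standard counterfactual quality metrics such as proximity, plausibility, and sparsity, making cherry-picked explanations statistically indistinguishable from baseline explanations. We therefore argue that safeguards should prioritise reproducibility, standardisation, and procedural constraints over post-hoc detection, and we provide recommendations for algorithm developers, explanation providers, and auditors.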

Title: OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction

Authors: Minseong Kweon, Jinsun Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04984
Pdf URL: https://arxiv.org/pdf/2601.04984
Copy Paste: [[2601.04984]] OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction(https://arxiv.org/abs/2601.04984)
Keywords: restoration
Abstract: We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. Furthermore, these translated camera views are used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their $z$-component and viewing direction, deterring the formation of medium-induced primitives. With our contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater and simulated scenes demonstrate that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.
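Abstract (translated from the Chinese): We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. These translated camera views are also used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their $z$-component and viewing direction, deterring the formation of medium-induced primitives. With these contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater scenes and simulated scenes show that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.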

Title: DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights

Authors: Saumya Gupta, Scott Biggs, Moritz Laber, Zohair Shafi, Robin Walters, Ayan Paul
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2601.05052
Pdf URL: https://arxiv.org/pdf/2601.05052
Copy Paste: [[2601.05052]] DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights(https://arxiv.org/abs/2601.05052)
Keywords: generation, generative
Abstract: Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.
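Abstract (translated from the Chinese): Building efficient and effective generative models for neural network weights has been a research focus of significant interest, and it faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require fine-tuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well, and the approach scales to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.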
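
The permutation symmetry mentioned in the entry above can be shown with a small sketch: reordering the hidden units of a two-layer ReLU network, together with the matching columns of the next layer, leaves its output unchanged. This is the general symmetry that canonicalization methods address; it is not the Git Re-Basin or TransFusion procedure. The helper names and the toy weights are assumptions for illustration.

```python
def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP with ReLU; weights are plain nested lists."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]


def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1 and b1, and the matching columns of W2."""
    W1p = [W1[i] for i in perm]
    b1p = [b1[i] for i in perm]
    W2p = [[row[i] for i in perm] for row in W2]
    return W1p, b1p, W2p


# Toy weights (made-up values): 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
b1 = [0.0, 1.0, -0.5]
W2 = [[2.0, -1.0, 0.5]]
b2 = [0.25]
x = [1.0, 2.0]

W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm=[2, 0, 1])
y = mlp(x, W1, b1, W2, b2)
y_perm = mlp(x, W1p, b1p, W2p, b2)
print(y, y_perm)  # the two weight settings define the same function
```

Because many distinct weight vectors encode the same function, a generative model trained on raw weights sees one network as many unrelated points, which is one motivation for mapping networks to a canonical ordering first.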

Title: From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)

Authors: Suyash Mishra, Qiang Li, Srikanth Patil, Anubhav Girdhar
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05059
Pdf URL: https://arxiv.org/pdf/2601.05059
Copy Paste: [[2601.05059]] From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)(https://arxiv.org/abs/2601.05059)
Keywords: generation
Abstract: Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient e2e pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on the Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate a 3 to 4 times speedup, a 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improves clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance-supporting video summarization for life sciences.
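Abstract (translated from the Chinese): Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficient content utilization. The sheer volume of long video and audio data, such as long clinical trial interviews and educational seminars, further exacerbates these challenges. Here, we introduce a domain-adapted video-to-video-clip generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient e2e pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on the Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas show a 3 to 4 times speedup, a 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improves clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance-supporting video summarization for life sciences.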
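
The abstract above names a Cut & Merge step with timestamp normalization but gives no details. The following is a hypothetical sketch of one plausible reading: overlapping highlight segments are merged, then mapped onto the timeline of the output clip. The function names, the `max_gap` parameter, and the example segments are assumptions for illustration, and fade in/out is not modeled.

```python
def merge_segments(segments, max_gap=0.0):
    """Merge (start, end) segments in seconds that overlap or lie within `max_gap`."""
    merged = []
    for start, end in sorted(segments):
        if end <= start:
            continue  # ignore empty or inverted segments
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(seg) for seg in merged]


def normalize_timestamps(merged):
    """Map merged source segments onto a contiguous output timeline starting at 0."""
    out = []
    cursor = 0.0
    for start, end in merged:
        duration = end - start
        out.append((cursor, cursor + duration))
        cursor += duration
    return out


# Made-up highlight segments, in seconds of the source video.
highlights = [(30.0, 45.0), (10.0, 20.0), (18.0, 25.0), (44.0, 50.0)]
merged = merge_segments(highlights)
timeline = normalize_timestamps(merged)
print(merged)
print(timeline)
```

Sorting before merging makes the result independent of the order in which segments were proposed, which is one simple way to keep such a step reproducible.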

Title: Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

Authors: Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05124
Pdf URL: https://arxiv.org/pdf/2601.05124
Copy Paste: [[2601.05124]] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing(https://arxiv.org/abs/2601.05124)
Keywords: generation
Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.
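Abstract (translated from the Chinese): In-context image generation and editing (ICGE) lets users specify visual concepts through interleaved image-text prompts, which demands precise understanding and faithful execution of user intent. Although recent unified multimodal models show promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. In addition, Re-Align introduces an effective RL training scheme that uses a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.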

Title: VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding

Authors: Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05125
Pdf URL: https://arxiv.org/pdf/2601.05125
Copy Paste: [[2601.05125]] VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding(https://arxiv.org/abs/2601.05125)
Keywords: generation
Abstract: This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
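Abstract (translated from the Chinese): This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also helps identify problematic regions and guides the generation of synthetic data to improve performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.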

Title: VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

Authors: Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05138
Pdf URL: https://arxiv.org/pdf/2601.05138
Copy Paste: [[2601.05138]] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control(https://arxiv.org/abs/2601.05138)
Keywords: generation
Abstract: Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.
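Abstract (translated from the Chinese): Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, because videos inherently express dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.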

Title: A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering

Authors: Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Siam Ansary
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2601.05143
Pdf URL: https://arxiv.org/pdf/2601.05143
Copy Paste: [[2601.05143]] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering(https://arxiv.org/abs/2601.05143)
Keywords: generation
Abstract: Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.
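Abstract (translated from the Chinese): Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.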

Title: Multi-Scale Local Speculative Decoding for Image Generation

Authors: Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05149
Pdf URL: https://arxiv.org/pdf/2601.05149
Copy Paste: [[2601.05149]] Multi-Scale Local Speculative Decoding for Image Generation(https://arxiv.org/abs/2601.05149)
Keywords: generation
Abstract: Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups of up to $1.7\times$, outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
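Abstract (translated from the Chinese): Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and a lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method uses a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, which corrects draft errors efficiently by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups of up to $1.7\times$, outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state of the art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.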
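
As background for the entry above, the sketch below shows the standard token-level accept/reject rule used in speculative sampling: a drafted token is kept with probability min(1, p/q), and otherwise a replacement is drawn from the normalized residual max(0, p - q). This is the generic rule, not MuLo-SD's local rejection and resampling scheme, whose details the abstract does not give. The function name and the toy distributions are assumptions for illustration.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Verify one drafted token against the target distribution.

    `p_target` and `q_draft` are probability lists over the same vocabulary,
    and `token` is assumed to have been sampled from `q_draft`.
    Returns (kept_token, accepted).
    """
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    threshold = rng.random() * sum(residual)
    running = 0.0
    for idx, weight in enumerate(residual):
        running += weight
        if threshold < running:
            return idx, False
    # Floating-point edge case: fall back to the last token with positive residual.
    return max(i for i, w in enumerate(residual) if w > 0.0), False


# Toy 3-token vocabulary (made-up probabilities).
p = [0.5, 0.3, 0.2]  # target model
q = [0.2, 0.5, 0.3]  # draft model
rng = random.Random(0)
draft = rng.choices(range(3), weights=q)[0]
print(verify_draft_token(draft, p, q, rng))
```

With this rule the kept token is distributed exactly according to the target model, so drafting changes speed but not the output distribution.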

Title: FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

Authors: Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05212
Pdf URL: https://arxiv.org/pdf/2601.05212
Copy Paste: [[2601.05212]] FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching(https://arxiv.org/abs/2601.05212)
Keywords: generative
Abstract: Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
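Abstract (translated from the Chinese): Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and disease. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, which limits fairness and generalizability. Acquiring new data is costly and ethically constrained, which motivates generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low-dimensional latent spaces to cope with the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, which affects BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms that anatomical structures are preserved.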
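
The entry above relies on an invertible wavelet domain. A minimal one-dimensional sketch shows what invertibility means here: the signal is split into half-length approximation and detail bands and can be reconstructed exactly. The Haar wavelet, the 1D setting, and the function names are assumptions for illustration; the abstract does not state which wavelet FlowLet uses, and the paper works with 3D volumes.

```python
def haar_forward(signal):
    """One level of an (unnormalized) Haar transform on an even-length list."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / 2 for a, b in pairs]  # local averages (low-pass band)
    detail = [(a - b) / 2 for a, b in pairs]  # local differences (high-pass band)
    return approx, detail


def haar_inverse(approx, detail):
    """Exact inverse of `haar_forward`."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([a + d, a - d])
    return signal


# Made-up intensity profile along one axis.
profile = [4.0, 2.0, 5.0, 9.0]
approx, detail = haar_forward(profile)
restored = haar_inverse(approx, detail)
print(approx, detail, restored)
```

Unlike a learned autoencoder latent, this transform has a closed-form inverse, so no decoder reconstruction error is introduced when mapping generated coefficients back to the signal.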

Title: Plenoptic Video Generation

Authors: Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05239
Pdf URL: https://arxiv.org/pdf/2601.05239
Copy Paste: [[2601.05239]] Plenoptic Video Generation(https://arxiv.org/abs/2601.05239)
Keywords: generation, generative
Abstract: Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: this https URL

Title: RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2601.05241
Pdf URL: https://arxiv.org/pdf/2601.05241
Copy Paste: [[2601.05241]] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation(https://arxiv.org/abs/2601.05241)
Keywords: generation
Abstract: The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

Title: GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation

Authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05244
Pdf URL: https://arxiv.org/pdf/2601.05244
Copy Paste: [[2601.05244]] GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation(https://arxiv.org/abs/2601.05244)
Keywords: generation
Abstract: Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real-world applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves state-of-the-art results on both the GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at this https URL.

Title: Pixel-Perfect Visual Geometry Estimation

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05246
Pdf URL: https://arxiv.org/pdf/2601.05246
Copy Paste: [[2601.05246]] Pixel-Perfect Visual Geometry Estimation(https://arxiv.org/abs/2601.05246)
Keywords: generative
Abstract: Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.