2026-01-21

Title: GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment

Authors: Lukas Abrie Nel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11574
Pdf URL: https://arxiv.org/pdf/2601.11574
Copy Paste: [[2601.11574]] GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment(https://arxiv.org/abs/2601.11574)
Keywords: generation
Abstract: Reinforcement learning from human feedback (RLHF) has become the dominant paradigm for aligning large language models with human preferences. However, policy gradient methods such as PPO suffer from high variance gradient estimates, requiring careful hyperparameter tuning and extensive computational resources. We introduce GRADE (Gumbel-softmax Relaxation for Alignment via Differentiable Estimation), a method that replaces high-variance policy gradient estimation with direct backpropagation through a differentiable relaxation of the discrete token sampling process. Using the Gumbel-Softmax reparameterization with straight-through estimation (GRADE-STE), we enable end-to-end gradient flow from reward signals through generated tokens to model parameters. On sentiment-controlled text generation using the IMDB dataset, GRADE-STE achieves a test reward of 0.763 ± 0.344 compared to PPO's 0.510 ± 0.313 and REINFORCE's 0.617 ± 0.378, representing a 50% relative improvement over PPO. Critically, GRADE-STE exhibits gradient variance over 14 times lower than REINFORCE and maintains stable training dynamics throughout optimization. Our rigorous evaluation with proper train/validation/test splits demonstrates that these improvements generalize to held-out data, with GRADE-STE showing the best generalization characteristics among all methods tested. GRADE offers a simpler, more stable, and more effective alternative to reinforcement learning for LLM alignment.
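The Gumbel-Softmax relaxation with straight-through estimation named in this abstract can be illustrated for a single token position. This is a minimal pure-Python sketch, not the authors' implementation: the function name `gumbel_softmax_ste`, the toy logits, and the temperature value are assumptions made for illustration, and no gradient is actually computed here.

```python
import math
import random

def gumbel_softmax_ste(logits, tau, rng):
    """Return (soft, hard) samples for one token position (hypothetical helper).

    soft: relaxed sample on the probability simplex, differentiable in logits.
    hard: one-hot vector at argmax(soft); under straight-through estimation the
          forward pass uses `hard` while the backward pass reuses the gradient
          of `soft`.
    """
    # Gumbel(0, 1) noise via inverse transform sampling.
    noise = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, noise)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard

rng = random.Random(0)
soft, hard = gumbel_softmax_ste([2.0, 0.5, -1.0], tau=0.5, rng=rng)
print(soft, hard)
```

In an autograd framework the two outputs are usually combined as `hard - stop_gradient(soft) + soft`, so the forward value is discrete while gradients follow the relaxed sample. PyTorch exposes the same idea through `torch.nn.functional.gumbel_softmax(logits, tau, hard=True)`.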

Title: Multi-modal MRI-Based Alzheimer's Disease Diagnosis with Transformer-based Image Synthesis and Transfer Learning

Authors: Jason Qiu
Subjects: cs.CV, cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2601.11614
Pdf URL: https://arxiv.org/pdf/2601.11614
Copy Paste: [[2601.11614]] Multi-modal MRI-Based Alzheimer's Disease Diagnosis with Transformer-based Image Synthesis and Transfer Learning(https://arxiv.org/abs/2601.11614)
Keywords: generation, generative
Abstract: Alzheimer's disease (AD) is a progressive neurodegenerative disorder in which pathological changes begin many years before the onset of clinical symptoms, making early detection essential for timely intervention. T1-weighted (T1w) Magnetic Resonance Imaging (MRI) is routinely used in clinical practice to identify macroscopic brain alterations, but these changes typically emerge relatively late in the disease course. Diffusion MRI (dMRI), in contrast, is sensitive to earlier microstructural abnormalities by probing water diffusion in brain tissue. dMRI metrics, including fractional anisotropy (FA) and mean diffusivity (MD), provide complementary information about white matter integrity and neurodegeneration. However, dMRI acquisitions are time-consuming and susceptible to motion artifacts, limiting their routine use in clinical populations. To bridge this gap, I propose a 3D TransUNet image synthesis framework that predicts FA and MD maps directly from T1w MRI. My model generates high-fidelity maps, achieving a structural similarity index (SSIM) exceeding 0.93 and a strong Pearson correlation (>0.94) with ground-truth dMRI. When integrated into a multi-modal diagnostic model, these synthetic features boost AD classification accuracy by 5 percentage points (78.75% to 83.75%) and, most importantly, improve mild cognitive impairment (MCI) detection by 12.5%. This study demonstrates that high-quality diffusion microstructural information can be inferred from routinely acquired T1w MRI, effectively transferring the benefits of multi-modality imaging to settings where diffusion data are unavailable. By reducing scan time while preserving complementary structural and microstructural information, the proposed approach has the potential to improve the accessibility, efficiency, and accuracy of AD diagnosis in clinical practice.
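Two of the quantities quoted in this abstract are simple to reproduce: the Pearson correlation between a synthesized map and its ground truth, and the accuracy gain expressed in percentage points versus relative terms. The sketch below is illustrative only; the voxel values in `pred` and `truth` are made-up numbers, not data from the paper.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences of voxel values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical flattened FA values: synthesized map vs. ground-truth dMRI.
pred = [0.2, 0.4, 0.6, 0.8]
truth = [0.25, 0.35, 0.65, 0.75]
r = pearson(pred, truth)

# Accuracy figures quoted in the abstract.
baseline, improved = 78.75, 83.75
points = improved - baseline   # absolute gain in percentage points
relative = points / baseline   # relative gain as a fraction of the baseline
print(r, points, relative)
```

The distinction matters when comparing papers: a gain of 5 percentage points from a 78.75% baseline corresponds to a relative gain of a little over 6%.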

Title: A one-step generation model with a Single-Layer Transformer: Layer number re-distillation of FreeFlow

Authors: Haonan Wei, Linyuan Wang, Nuolin Sun, Zhizhong Zheng, Lei Li, Bin Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11630
Pdf URL: https://arxiv.org/pdf/2601.11630
Copy Paste: [[2601.11630]] A one-step generation model with a Single-Layer Transformer: Layer number re-distillation of FreeFlow(https://arxiv.org/abs/2601.11630)
Keywords: generation
Abstract: Flow matching methods aim to compress the iterative generation process of diffusion models into a few steps or even a single step, with MeanFlow and FreeFlow being representative achievements of one-step generation based on Ordinary Differential Equations (ODEs). We observe that the 28-layer Transformer architecture of FreeFlow can be characterized as an Euler discretization scheme for an ODE along the depth axis, where the layer index serves as the discrete time step. Therefore, we distill the number of layers of the FreeFlow model, following the same derivation logic as FreeFlow, and propose SLT (Single-Layer Transformer), which uses a single shared DiT block to approximate the depth-wise feature evolution of the 28-layer teacher. During training, it matches the teacher's intermediate features at several depth patches, fuses those patch-level representations, and simultaneously aligns with the teacher's final velocity prediction. Through distillation training, we compress the 28 independent Transformer blocks of the teacher model DiT-XL/2 into a single Transformer block, reducing the parameter count from 675M to 4.3M. Furthermore, leveraging its minimal parameter count and rapid sampling speed, SLT can screen more candidate points in the noise space within the same timeframe, thereby selecting higher-quality initial points for the teacher model FreeFlow and ultimately enhancing the quality of generated images. Experimental results demonstrate that within a time budget comparable to two random samplings of the teacher model, our method performs over 100 noise screenings and produces a high-quality sample through the teacher model using the selected points. Quality fluctuations caused by low-quality initial noise under a limited number of FreeFlow sampling calls are effectively avoided, substantially improving the stability and average generation quality of one-step generation.
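The reading of network depth as an Euler discretization can be illustrated with a toy example in which the same block is applied at every layer and the layer index plays the role of time. This is only a sketch of the general idea: `shared_block` is a hypothetical scalar stand-in (the velocity field of dh/dt = -h), not a DiT block, and nothing here reflects the paper's training procedure.

```python
import math

def shared_block(h, t):
    """Toy stand-in for one shared block: the velocity of dh/dt = -h."""
    return -h

def run_depth(h0, num_layers):
    """Apply the same block num_layers times as explicit Euler steps on [0, 1]."""
    dt = 1.0 / num_layers
    h = h0
    for layer in range(num_layers):
        t = layer * dt  # the layer index acts as the discrete time step
        h = h + dt * shared_block(h, t)  # residual update = one Euler step
    return h

h28 = run_depth(1.0, 28)       # 28 "layers", as in the teacher's depth
exact = math.exp(-1.0)         # closed-form solution of the toy ODE at t = 1
print(h28, exact)
```

Each residual update `h + dt * f(h, t)` is one Euler step, so a stack of layers with shared weights traces an approximate ODE trajectory; increasing the number of steps brings the result closer to the exact solution of the toy ODE.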

Title: Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos

Authors: Anil Egin, Andrea Tangherloni, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11635
Pdf URL: https://arxiv.org/pdf/2601.11635
Copy Paste: [[2601.11635]] Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos(https://arxiv.org/abs/2601.11635)
Keywords: generative
Abstract: Face video anonymization aims at privacy preservation while allowing for the analysis of videos in a number of downstream computer vision tasks such as expression recognition, people tracking, and action recognition. We propose here a novel unified framework referred to as Anon-NET, streamlined to de-identify facial videos, while preserving age, gender, race, pose, and expression of the original video. Specifically, we inpaint faces by a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer. We then animate de-identified faces by video-driven animation, which accepts the de-identified face and the original video as input. Extensive experiments on the datasets VoxCeleb2, CelebV-HQ, and HDTF, which include diverse facial dynamics, demonstrate the effectiveness of AnonNET in obfuscating identity while retaining visual realism and temporal consistency. The code of AnonNet will be publicly released.

Title: Global Optimization By Gradient from Hierarchical Score-Matching Spaces

Authors: Ming Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.11639
Pdf URL: https://arxiv.org/pdf/2601.11639
Copy Paste: [[2601.11639]] Global Optimization By Gradient from Hierarchical Score-Matching Spaces(https://arxiv.org/abs/2601.11639)
Keywords: generative
Abstract: Gradient descent is the most commonly used optimization method, but it is limited to local optimality and confined to continuous, differentiable problems with simple convex constraints. This work addresses these limitations by unifying all optimization problems with various complex constraints into a general, unconstrained hierarchical optimization objective, which is optimized with gradients obtained through score matching. In this way, global optimization by a deterministic method using strict gradients is achieved for the first time and verified through both simply constructed and complex practical experiments. Even more importantly, it reveals the profound connection between global optimization and diffusion-based generative modeling.

Title: Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

Authors: Yuxi Liu, Yipeng Hu, Zekun Zhang, Kunze Jiang, Kun Yuan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.11641
Pdf URL: https://arxiv.org/pdf/2601.11641
Copy Paste: [[2601.11641]] Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers(https://arxiv.org/abs/2601.11641)
Keywords: generation
Abstract: While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers to practical deployment. Although sparse attention methods attempt to address this challenge, existing approaches either rely on oversimplified static patterns or require computationally expensive sampling operations to achieve dynamic sparsity, resulting in inaccurate pattern predictions and degraded generation quality. To overcome these limitations, we propose Mixture-of-Distribution DiT (MOD-DiT), a novel sampling-free dynamic attention framework that accurately models evolving attention patterns through a two-stage process. First, MOD-DiT leverages prior information from early denoising steps and adopts a distributed mixing approach to model an efficient linear approximation model, which is then used to predict mask patterns for a specific denoising interval. Second, an online block masking strategy dynamically applies these predicted masks while maintaining historical sparsity information, eliminating the need for repetitive sampling operations. Extensive evaluations demonstrate consistent acceleration and quality improvements across multiple benchmarks and model architectures, validating MOD-DiT's effectiveness for efficient, high-quality video generation while overcoming the computational limitations of traditional sparse attention approaches.

Title: Predicting When to Trust Vision-Language Models for Spatial Reasoning

Authors: Muhammad Imran, Yugyung Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11644
Pdf URL: https://arxiv.org/pdf/2601.11644
Copy Paste: [[2601.11644]] Predicting When to Trust Vision-Language Models for Spatial Reasoning(https://arxiv.org/abs/2601.11644)
Keywords: generative
Abstract: Vision-Language Models (VLMs) demonstrate impressive capabilities across multimodal tasks, yet exhibit systematic spatial reasoning failures, achieving only 49% (CLIP) to 54% (BLIP-2) accuracy on basic directional relationships. For safe deployment in robotics and autonomous systems, we need to predict when to trust VLM spatial predictions rather than accepting all outputs. We propose a vision-based confidence estimation framework that validates VLM predictions through independent geometric verification using object detection. Unlike text-based approaches relying on self-assessment, our method fuses four signals via gradient boosting: geometric alignment between VLM claims and coordinates, spatial ambiguity from overlap, detection quality, and VLM internal uncertainty. We achieve 0.674 AUROC on BLIP-2 (34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP (16.1% improvement), generalizing across generative and classification architectures. Our framework enables selective prediction: at 60% target accuracy, we achieve 61.9% coverage versus 27.6% baseline (2.2x improvement) on BLIP-2. Feature analysis reveals vision-based signals contribute 87.4% of model importance versus 12.7% from VLM confidence, validating that external geometric verification outperforms self-assessment. We demonstrate reliable scene graph construction where confidence-based pruning improves precision from 52.1% to 78.3% while retaining 68.2% of edges.
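The selective-prediction numbers in this abstract (coverage at a target accuracy) follow a standard recipe: rank predictions by confidence, keep the most confident ones, and report the largest kept fraction whose accuracy still meets the target. The sketch below shows that recipe on made-up data; the function name `coverage_at_accuracy` and the toy confidences are assumptions, and the paper's exact protocol may differ.

```python
def coverage_at_accuracy(confidences, correct, target):
    """Largest fraction of predictions that can be kept, taking the most
    confident first, while accuracy on the kept set stays >= target."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    best, hits = 0, 0
    for kept, i in enumerate(order, start=1):
        hits += correct[i]
        if hits / kept >= target:
            best = kept  # remember the largest prefix meeting the target
    return best / len(order)

# Hypothetical confidence scores and correctness flags (1 = VLM was right).
conf = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
ok = [1, 1, 0, 1, 1, 0, 0, 0]

cov60 = coverage_at_accuracy(conf, ok, 0.60)
cov80 = coverage_at_accuracy(conf, ok, 0.80)
print(cov60, cov80)  # 0.75 0.625
```

A better confidence estimator pushes correct predictions toward the top of the ranking, which is what raises coverage at a fixed target accuracy. The same thresholding can be applied to prune low-confidence edges when building a scene graph.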

Title: Aesthetics as Structural Harm: Algorithmic Lookism Across Text-to-Image Generation and Classification

Authors: Miriam Doh, Aditya Gulati, Corina Canali, Nuria Oliver
Subjects: cs.CV, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2601.11651
Pdf URL: https://arxiv.org/pdf/2601.11651
Copy Paste: [[2601.11651]] Aesthetics as Structural Harm: Algorithmic Lookism Across Text-to-Image Generation and Classification(https://arxiv.org/abs/2601.11651)
Keywords: generation, generative
Abstract: This paper examines algorithmic lookism (the systematic preferential treatment based on physical appearance) in text-to-image (T2I) generative AI and a downstream gender classification task. Through the analysis of 26,400 synthetic faces created with Stable Diffusion 2.1 and 3.5 Medium, we demonstrate how generative AI models systematically associate facial attractiveness with positive attributes and vice-versa, mirroring socially constructed biases rather than evidence-based correlations. Furthermore, we find significant gender bias in three gender classification algorithms depending on the attributes of the input faces. Our findings reveal three critical harms: (1) the systematic encoding of attractiveness-positive attribute associations in T2I models; (2) gender disparities in classification systems, where women's faces, particularly those generated with negative attributes, suffer substantially higher misclassification rates than men's; and (3) intensifying aesthetic constraints in newer models through age homogenization, gendered exposure patterns, and geographic reductionism. These convergent patterns reveal algorithmic lookism as systematic infrastructure operating across AI vision systems, compounding existing inequalities through both representation and recognition. Disclaimer: This work includes visual and textual content that reflects stereotypical associations between physical appearance and socially constructed attributes, including gender, race, and traits associated with social desirability. Any such associations found in this study emerge from the biases embedded in generative AI systems, not from empirical truths or the authors' views.

Title: Generating metamers of human scene understanding

Authors: Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras, Gregory J. Zelinsky
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11675
Pdf URL: https://arxiv.org/pdf/2601.11675
Copy Paste: [[2601.11675]] Generating metamers of human scene understanding(https://arxiv.org/abs/2601.11675)
Keywords: generation
Abstract: Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high and low resolution (i.e. "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen generated images to latent human scene representations, we conducted a same-different behavioral experiment where participants were asked for a "same" or "different" response between the generated and the original image. With that, we identify scene generations that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is a powerful tool for understanding scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While it can generate metamers even conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers' own fixated regions.

Title: Telling Human and Machine Handwriting Apart

Authors: Luis A. Leiva, Moises Diaz, Nuwan T. Attygalle, Miguel A. Ferrer, Rejean Plamondon
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.11700
Pdf URL: https://arxiv.org/pdf/2601.11700
Copy Paste: [[2601.11700]] Telling Human and Machine Handwriting Apart(https://arxiv.org/abs/2601.11700)
Keywords: generative
Abstract: Handwriting movements can be leveraged as a unique form of behavioral biometrics, to verify whether a real user is operating a device or application. This task can be framed as a reverse Turing test in which a computer has to detect if an input instance has been generated by a human or artificially. To tackle this task, we study ten public datasets of handwritten symbols (isolated characters, digits, gestures, pointing traces, and signatures) that are artificially reproduced using seven different synthesizers, including, among others, the Kinematic Theory (Sigma h model), generative adversarial networks, Transformers, and Diffusion models. We train a shallow recurrent neural network that achieves excellent performance (98.3 percent Area Under the ROC Curve (AUC) score and 1.4 percent equal error rate on average across all synthesizers and datasets) using nonfeaturized trajectory data as input. In few-shot settings, we show that our classifier achieves such an excellent performance when trained on just 10 percent of the data, as evaluated on the remaining 90 percent of the data as a test set. We further challenge our classifier in out-of-domain settings, and observe very competitive results as well. Our work has implications for computerized systems that need to verify human presence, and adds an additional layer of security to keep attackers at bay.
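The two metrics reported in this abstract, AUC and equal error rate (EER), can be computed from classifier scores as sketched below. This is a small pure-Python illustration on made-up scores, not the authors' evaluation code; the EER here is approximated at the observed threshold where the false accept and false reject rates are closest.

```python
def auc_and_eer(scores, labels):
    """AUC via pairwise ranking and an approximate equal error rate.

    scores: higher means "more likely human"; labels: 1 human, 0 synthetic.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # AUC = P(score_pos > score_neg), counting ties as one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    # EER: sweep observed thresholds, keep the point where FAR and FRR are closest.
    best = None
    for thr in sorted(set(scores)):
        far = sum(n >= thr for n in neg) / len(neg)  # synthetic accepted as human
        frr = sum(p < thr for p in pos) / len(pos)   # human rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return auc, best[1]

# Hypothetical classifier outputs for eight handwriting samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
auc, eer = auc_and_eer(scores, labels)
print(auc, eer)  # 0.9375 0.25
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch; for large test sets a rank-based formulation or a library routine such as scikit-learn's `roc_auc_score` is the usual choice.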

Title: MixFlow: Mixture-Conditioned Flow Matching for Out-of-Distribution Generalization

Authors: Andrea Rubbi, Amir Akbarnejad, Mohammad Vali Sanian, Aryan Yazdan Parast, Hesam Asadollahzadeh, Arian Amani, Naveed Akhtar, Sarah Cooper, Andrew Bassett, Pietro Liò, Lassi Paavolainen, Sattar Vakili, Mo Lotfollahi
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2601.11827
Pdf URL: https://arxiv.org/pdf/2601.11827
Copy Paste: [[2601.11827]] MixFlow: Mixture-Conditioned Flow Matching for Out-of-Distribution Generalization(https://arxiv.org/abs/2601.11827)
Keywords: generation, generative
Abstract: Achieving robust generalization under distribution shift remains a central challenge in conditional generative modeling, as existing conditional flow-based methods often struggle to extrapolate beyond the training conditions. We introduce MixFlow, a conditional flow-matching framework for descriptor-controlled generation that directly targets this limitation by jointly learning a descriptor-conditioned base distribution and a descriptor-conditioned flow field via shortest-path flow matching. By modeling the base distribution as a learnable, descriptor-dependent mixture, MixFlow enables smooth interpolation and extrapolation to unseen conditions, leading to substantially improved out-of-distribution generalization. We provide analytical insights into the behavior of the proposed framework and empirically demonstrate its effectiveness across multiple domains, including prediction of responses to unseen perturbations in single-cell transcriptomic data and high-content microscopy-based drug screening tasks. Across these diverse settings, MixFlow consistently outperforms standard conditional flow-matching baselines. Overall, MixFlow offers a simple yet powerful approach for achieving robust, generalizable, and controllable generative modeling across heterogeneous domains.

Title: TF-CoDiT: Conditional Time Series Synthesis with Diffusion Transformers for Treasury Futures

Authors: Yingxiao Zhang, Jiaxin Duan, Junfu Zhang, Ke Feng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11880
Pdf URL: https://arxiv.org/pdf/2601.11880
Copy Paste: [[2601.11880]] TF-CoDiT: Conditional Time Series Synthesis with Diffusion Transformers for Treasury Futures(https://arxiv.org/abs/2601.11880)
Keywords: generation
Abstract: Diffusion Transformers (DiT) have achieved milestones in synthesizing financial time-series data, such as stock prices and order flows. However, their performance in synthesizing treasury futures data is still underexplored. This work emphasizes the characteristics of treasury futures data, including its low volume, market dependencies, and the grouped correlations among multiple variables. To overcome these challenges, we propose TF-CoDiT, the first DiT framework for language-controlled treasury futures synthesis. To facilitate low-data learning, TF-CoDiT adapts the standard DiT by transforming multi-channel 1-D time series into Discrete Wavelet Transform (DWT) coefficient matrices. A U-shape VAE is proposed to encode cross-channel dependencies hierarchically into a latent variable and bridge the latent and DWT spaces through decoding, thereby enabling latent diffusion generation. To derive prompts that cover essential conditions, we introduce the Financial Market Attribute Protocol (FinMAP), a multi-level description system that standardizes daily/periodical market dynamics by recognizing 17/23 economic indicators from 7/8 perspectives. In our experiments, we gather four types of treasury futures data covering the period from 2015 to 2025, and define data synthesis tasks with durations ranging from one week to four months. Extensive evaluations demonstrate that TF-CoDiT can produce highly authentic data with errors of at most 0.433 (MSE) and 0.453 (MAE) relative to the ground truth. Further studies demonstrate the robustness of TF-CoDiT across contracts and temporal horizons.
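The step of turning multi-channel 1-D series into DWT coefficient matrices can be illustrated with a single-level Haar transform. The abstract does not say which wavelet or how many decomposition levels are used, so the Haar choice, the single level, the row layout, and the toy price series below are all assumptions made for illustration.

```python
import math

def haar_dwt(x):
    """One level of the orthonormal Haar DWT for an even-length series."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Invert haar_dwt, recovering the original series."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x

def to_coefficient_matrix(channels):
    """One row per channel, laid out as [approximation | detail] (assumed layout)."""
    rows = []
    for ch in channels:
        approx, detail = haar_dwt(ch)
        rows.append(approx + detail)
    return rows

# Hypothetical two-channel series (e.g. two price fields over four days).
channels = [[4.0, 6.0, 10.0, 12.0], [1.0, 1.0, 2.0, 0.0]]
matrix = to_coefficient_matrix(channels)
restored = haar_idwt(*haar_dwt(channels[0]))
print(matrix, restored)
```

Because the transform is invertible, a model can generate in coefficient space and map back to the time domain afterward. Libraries such as PyWavelets provide multi-level transforms and other wavelet families if more than this toy version is needed.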

Title: DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

Authors: Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2601.11895
Pdf URL: https://arxiv.org/pdf/2601.11895
Copy Paste: [[2601.11895]] DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models(https://arxiv.org/abs/2601.11895)
Keywords: generation
Abstract: DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry, such as API usage and code purpose understanding. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. Nine state-of-the-art models were assessed, revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, a level of detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.

Title: RemoteVAR: Autoregressive Visual Modeling for Remote Sensing Change Detection

Authors: Yilmaz Korkmaz, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11898
Pdf URL: https://arxiv.org/pdf/2601.11898
Copy Paste: [[2601.11898]] RemoteVAR: Autoregressive Visual Modeling for Remote Sensing Change Detection(https://arxiv.org/abs/2601.11898)
Keywords: generation
Abstract: Remote sensing change detection aims to localize and characterize scene changes between two time points and is central to applications such as environmental monitoring and disaster assessment. Meanwhile, visual autoregressive models (VARs) have recently shown impressive image generation capability, but their adoption for pixel-level discriminative tasks remains limited due to weak controllability, suboptimal dense prediction performance and exposure bias. We introduce RemoteVAR, a new VAR-based change detection framework that addresses these limitations by conditioning autoregressive prediction on multi-resolution fused bi-temporal features via cross-attention, and by employing an autoregressive training strategy designed specifically for change map prediction. Extensive experiments on standard change detection benchmarks show that RemoteVAR delivers consistent and significant improvements over strong diffusion-based and transformer-based baselines, establishing a competitive autoregressive alternative for remote sensing change detection. Code will be made available.
摘要：遥感变化检测旨在定位和表征两个时间点之间的场景变化，是环境监测和灾害评估等应用的核心。与此同时，视觉自回归模型（VAR）最近表现出了令人印象深刻的图像生成能力，但由于可控性弱、密集预测性能欠佳和曝光偏差，它们在像素级判别任务中的采用仍然受到限制。我们引入了 RemoteVAR，一种新的基于 VAR 的变化检测框架，它通过交叉注意力对多分辨率融合双时态特征进行自回归预测，并采用专门为变化图预测设计的自回归训练策略来解决这些限制。对标准变化检测基准的大量实验表明，RemoteVAR 相对于强大的基于扩散和基于变压器的基线提供了一致且显着的改进，为遥感变化检测建立了具有竞争力的自回归替代方案。代码将在 \href{此 https URL}{\underline{此处}} 中提供。

Title: Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal

Authors: Haonan An, Guang Hua, Wei Du, Hangcheng Cao, Yihang Tao, Guowen Xu, Susanto Rahardja, Yuguang Fang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11952
Pdf URL: https://arxiv.org/pdf/2601.11952
Copy Paste: [[2601.11952]] Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal(https://arxiv.org/abs/2601.11952)
Keywords: generation, generative
Abstract: Box-free model watermarking has gained significant attention in deep neural network (DNN) intellectual property protection due to its model-agnostic nature and its ability to flexibly manage high-entropy image outputs from generative models. Typically operating in a black-box manner, it employs an encoder-decoder framework for watermark embedding and extraction. While existing research has focused primarily on the encoders for the robustness to resist various attacks, the decoders have been largely overlooked, leading to attacks against the watermark. In this paper, we identify one such attack against the decoder, where query responses are utilized to obtain backpropagated gradients to train a watermark remover. To address this issue, we propose Decoder Gradient Shields (DGSs), a family of defense mechanisms, including DGS at the output (DGS-O), at the input (DGS-I), and in the layers (DGS-L) of the decoder, with a closed-form solution for DGS-O and provable performance for all DGS. Leveraging the joint design of reorienting and rescaling of the gradients from watermark channel gradient leaking queries, the proposed DGSs effectively prevent the watermark remover from achieving training convergence to the desired low-loss value, while preserving image quality of the decoder output. We demonstrate the effectiveness of our proposed DGSs in diverse application scenarios. Our experimental results on deraining and image generation tasks with the state-of-the-art box-free watermarking show that our DGSs achieve a defense success rate of 100% under all settings.
摘要：无盒模型水印由于其与模型无关的性质以及灵活管理生成模型的高熵图像输出的能力，在深度神经网络（DNN）知识产权保护中受到了广泛关注。它通常以黑盒方式运行，采用编码器-解码器框架来嵌入和提取水印。虽然现有的研究主要集中在编码器抵抗各种攻击的鲁棒性上，但解码器在很大程度上被忽视，导致了针对水印的攻击。在本文中，我们确定了一种针对解码器的此类攻击，其中查询响应用于获取反向传播梯度来训练水印去除器。为了解决这个问题，我们提出了解码器梯度屏蔽（DGS），这是一系列防御机制，包括输出端（DGS-O）、输入端（DGS-I）和解码器层（DGS-L）的DGS，为DGS-O提供封闭式解决方案，并为所有DGS提供可证明的性能。利用水印通道梯度泄漏查询的梯度重新定向和重新缩放的联合设计，所提出的DGS有效地防止水印去除器实现训练收敛到所需的低损失值，同时保持解码器输出的图像质量。我们展示了我们提出的 DGS 在不同应用场景中的有效性。我们使用最先进的无盒水印进行除雨和图像生成任务的实验结果表明，我们的 DGS 在所有设置下均实现了 100% 的防御成功率。

Title: R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning

Authors: Jingchu Wang, Bingbing Xu, Yige Yuan, Bin Xie, Xiaoqian Sun, Huawei Shen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2601.11960
Pdf URL: https://arxiv.org/pdf/2601.11960
Copy Paste: [[2601.11960]] R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning(https://arxiv.org/abs/2601.11960)
Keywords: generation
Abstract: Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between generating stable inference responses and diverse training trajectories leads to insufficient exploration, which harms reasoning capability. In this paper, to address the problem, we propose R$^2$PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout-Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that our method consistently outperforms baselines, achieving average accuracy gains of 3.1% on MATH-500 and 2.4% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at this https URL.
摘要：强化学习已成为改善法学硕士推理的核心范式。然而，现有方法使用单一策略来产生推理响应和训练优化轨迹。生成稳定的推理响应和多样化的训练轨迹之间的客观冲突导致探索不足，从而损害推理能力。在本文中，为了解决这个问题，我们提出了 R$^2$PO（Residual Rollout Policy Optimization），它在策略之上引入了一个轻量级的 Residual Rollout-Head，将训练轨迹与推理响应解耦，从而在训练过程中实现受控轨迹多样化，同时保持推理生成稳定。跨多个基准测试的实验表明，我们的方法始终优于基线，在 MATH-500 上实现了 3.1% 的平均准确度增益，在 APPS 上实现了 2.4% 的平均准确度增益，同时还减少了格式错误并减轻了长度偏差，以实现稳定的优化。我们的代码可通过此 https URL 公开获取。

Title: AVIR: Adaptive Visual In-Document Retrieval for Efficient Multi-Page Document Question Answering

Authors: Zongmin Li, Yachuan Li, Lei Kang, Dimosthenis Karatzas, Wenkang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11976
Pdf URL: https://arxiv.org/pdf/2601.11976
Copy Paste: [[2601.11976]] AVIR: Adaptive Visual In-Document Retrieval for Efficient Multi-Page Document Question Answering(https://arxiv.org/abs/2601.11976)
Keywords: generation
Abstract: Multi-page Document Visual Question Answering (MP-DocVQA) remains challenging because long documents not only strain computational resources but also reduce the effectiveness of the attention mechanism in large vision-language models (LVLMs). We tackle these issues with an Adaptive Visual In-document Retrieval (AVIR) framework. A lightweight retrieval model first scores each page for question relevance. Pages are then clustered according to the score distribution to adaptively select relevant content. The clustered pages are screened again by Top-K to keep the context compact. However, for short documents, clustering reliability decreases, so we use a relevance probability threshold to select pages. The selected pages alone are fed to a frozen LVLM for answer generation, eliminating the need for model fine-tuning. The proposed AVIR framework reduces the average page count required for question answering by 70%, while achieving an ANLS of 84.58% on the MP-DocVQA dataset, surpassing previous methods at significantly lower computational cost. The effectiveness of the proposed AVIR is also verified on the SlideVQA and DUDE benchmarks. The code is available at this https URL.
摘要：多页文档视觉问答（MP-DocVQA）仍然具有挑战性，因为长文档不仅会占用计算资源，还会降低大型视觉语言模型（LVLM）中注意力机制的有效性。我们使用自适应视觉文档内检索 (AVIR) 框架来解决这些问题。轻量级检索模型首先对每个页面的问题相关性进行评分。然后根据分数分布对页面进行聚类，以自适应地选择相关内容。聚类后的页面会被 Top-K 再次筛选，以保持上下文紧凑。然而，对于短文档，聚类可靠性会降低，因此我们使用相关概率阈值来选择页面。仅选定的页面会被馈送到冻结的 LVLM 来生成答案，从而无需模型微调。所提出的 AVIR 框架将问答所需的平均页数减少了 70%，同时在 MP-DocVQA 数据集上实现了 84.58% 的 ANLS，以显着降低的计算成本超越了以前的方法。所提出的 AVIR 的有效性也在 SlideVQA 和 DUDE 基准测试上得到了验证。该代码可从此 https URL 获取。
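The page-selection logic described in the AVIR abstract (score pages, cluster by score, keep the Top-K, fall back to a probability threshold for short documents) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper's code: the function names, the default values of `top_k`, `short_doc_pages`, and `prob_threshold`, and the use of a largest-gap split as a stand-in for the clustering step.

```python
from typing import List


def high_score_cluster(scores: List[float]) -> List[int]:
    """Indices of the high-relevance group, ordered by descending score.

    Assumption: the sorted scores are split at their largest gap, a simple
    1-D stand-in for the clustering step mentioned in the abstract.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if len(order) < 2:
        return order
    gaps = [scores[order[k]] - scores[order[k + 1]] for k in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return order[: cut + 1]


def select_pages(
    scores: List[float],
    top_k: int = 3,              # hypothetical Top-K budget
    short_doc_pages: int = 4,    # hypothetical "short document" cutoff
    prob_threshold: float = 0.5  # hypothetical relevance threshold
) -> List[int]:
    """Return the page indices (in reading order) to pass to the frozen LVLM."""
    if not scores:
        return []
    if len(scores) <= short_doc_pages:
        # Short document: clustering is unreliable, so threshold the scores.
        chosen = [i for i, s in enumerate(scores) if s >= prob_threshold]
        if not chosen:
            # Assumption: always keep at least the single best page.
            chosen = [max(range(len(scores)), key=lambda i: scores[i])]
    else:
        # Long document: cluster first, then screen the cluster with Top-K.
        chosen = high_score_cluster(scores)[:top_k]
    return sorted(chosen)
```

For an eight-page document with scores `[0.05, 0.9, 0.1, 0.85, 0.8, 0.07, 0.88, 0.02]`, the sketch keeps pages `[1, 3, 6]`: four pages fall in the high-score cluster and Top-K trims them to three.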

Title: Task-Driven Prompt Learning: A Joint Framework for Multi-modal Cloud Removal and Segmentation

Authors: Zaiyan Zhang, Jie Li, Shaowei Shi, Qiangqiang Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12052
Pdf URL: https://arxiv.org/pdf/2601.12052
Copy Paste: [[2601.12052]] Task-Driven Prompt Learning: A Joint Framework for Multi-modal Cloud Removal and Segmentation(https://arxiv.org/abs/2601.12052)
Keywords: restoration
Abstract: Optical remote sensing imagery is indispensable for Earth observation, yet persistent cloud occlusion limits its downstream utility. Most cloud removal (CR) methods are optimized for low-level fidelity and can over-smooth textures and boundaries that are critical for analysis-ready data (ARD), leading to a mismatch between visually plausible restoration and semantic utility. To bridge this gap, we propose TDP-CR, a task-driven multimodal framework that jointly performs cloud removal and land-cover segmentation. Central to our approach is a Prompt-Guided Fusion (PGF) mechanism, which utilizes a learnable degradation prompt to encode cloud thickness and spatial uncertainty. By combining global channel context with local prompt-conditioned spatial bias, PGF adaptively integrates Synthetic Aperture Radar (SAR) information only where optical data is corrupted. We further introduce a parameter-efficient two-phase training strategy that decouples reconstruction and semantic representation learning. Experiments on the LuojiaSET-OSFCR dataset demonstrate the superiority of our framework: TDP-CR surpasses heavy state-of-the-art baselines by 0.18 dB in PSNR while using only 15% of the parameters, and consistently achieves a 1.4% improvement in mIoU over multi-task competitors, effectively delivering analysis-ready data.
摘要：光学遥感图像对于地球观测是不可或缺的，但持续的云遮挡限制了其下游用途。大多数去云 (CR) 方法都针对低保真度进行了优化，并且可能会过度平滑对于分析就绪数据 (ARD) 至关重要的纹理和边界，从而导致视觉上合理的恢复与语义实用性之间的不匹配。为了弥补这一差距，我们提出了 TDP-CR，这是一种任务驱动的多模式框架，可以联合执行云去除和土地覆盖分割。我们方法的核心是提示引导融合（PGF）机制，它利用可学习的退化提示来编码云厚度和空间不确定性。通过将全局通道上下文与局部提示条件空间偏差相结合，PGF 仅在光学数据损坏的情况下自适应地集成合成孔径雷达 (SAR) 信息。我们进一步引入了一种参数有效的两阶段训练策略，该策略将重建和语义表示学习解耦。在 LuojiaSET-OSFCR 数据集上的实验证明了我们框架的优越性：TDP-CR 在仅使用 15% 的参数的情况下，PSNR 超过了最先进的基线 0.18 dB，并且与多任务竞争对手相比，mIoU 持续提高了 1.4%，有效地提供了可用于分析的数据。

Title: Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

Authors: Zijie Lou, Xiangwei Feng, Jiaxin Wang, Xiaochao Qu, Luoqi Liu, Ting Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12066
Pdf URL: https://arxiv.org/pdf/2601.12066
Copy Paste: [[2601.12066]] Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation(https://arxiv.org/abs/2601.12066)
Keywords: generation, generative
Abstract: Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene's physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency.
摘要：现有的视频对象去除方法主要依赖于遵循噪声到数据范式的扩散模型，其中从无信息的高斯噪声开始生成。这种方法丢弃了原始输入视频中存在的丰富的结构和上下文先验。因此，此类方法通常缺乏足够的指导，导致对象擦除不完整或合成与场景物理逻辑相冲突的不可信内容。在本文中，我们通过随机桥模型将视频对象去除重新表述为视频到视频的翻译任务。与噪声初始化方法不同，我们的框架建立了从源视频（带有对象）到目标视频（已删除对象）的直接随机路径。这种桥梁公式有效地利用输入视频作为强大的结构先验，指导模型执行精确的去除，同时确保填充区域在逻辑上与周围环境一致。为了解决强桥先验阻碍大型物体去除的权衡问题，我们提出了一种新颖的自适应掩模调制策略。该机制根据掩模特征动态调节输入嵌入，平衡背景保真度与生成灵活性。大量的实验表明，我们的方法在视觉质量和时间一致性方面都明显优于现有方法。

Title: ARMARecon: An ARMA Convolutional Filter based Graph Neural Network for Neurodegenerative Dementias Classification

Authors: VSS Tejaswi Abburi, Ananya Singhal, Saurabh J. Shigwan, Nitin Kumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12067
Pdf URL: https://arxiv.org/pdf/2601.12067
Copy Paste: [[2601.12067]] ARMARecon: An ARMA Convolutional Filter based Graph Neural Network for Neurodegenerative Dementias Classification(https://arxiv.org/abs/2601.12067)
Keywords: generative
Abstract: Early detection of neurodegenerative diseases such as Alzheimer's Disease (AD) and Frontotemporal Dementia (FTD) is essential for reducing the risk of progression to severe disease stages. As AD and FTD propagate along white-matter regions in a global, graph-dependent manner, graph-based neural networks are well suited to capture these patterns. Hence, we introduce ARMARecon, a unified graph learning framework that integrates Autoregressive Moving Average (ARMA) graph filtering with a reconstruction-driven objective to enhance feature representation and improve classification accuracy. ARMARecon effectively models both local and global connectivity by leveraging 20-bin Fractional Anisotropy (FA) histogram features extracted from white-matter regions, while mitigating over-smoothing. Overall, ARMARecon achieves superior performance compared to state-of-the-art methods on the multi-site dMRI datasets ADNI and NIFD.
摘要：早期发现阿尔茨海默病 (AD) 和额颞叶痴呆 (FTD) 等神经退行性疾病对于降低进展为严重疾病阶段的风险至关重要。由于 AD 和 FTD 以全局的、依赖图的方式沿着白质区域传播，因此基于图的神经网络非常适合捕获这些模式。因此，我们引入了 ARMARecon，这是一个统一的图学习框架，它将自回归移动平均（ARMA）图过滤与重建驱动的目标集成在一起，以增强特征表示并提高分类精度。 ARMARecon 通过利用从白质区域提取的 20 箱分数各向异性 (FA) 直方图特征，有效地模拟局部和全局连接，同时减轻过度平滑。总体而言，与多站点 dMRI 数据集 ADNI 和 NIFD 上最先进的方法相比，ARMARecon 实现了卓越的性能。

Title: CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation

Authors: H. Jiang, Y. Sun, Z. Dong, T. Liu, Y. Gu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2601.12076
Pdf URL: https://arxiv.org/pdf/2601.12076
Copy Paste: [[2601.12076]] CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation(https://arxiv.org/abs/2601.12076)
Keywords: quality assessment
Abstract: Remote sensing video referring object segmentation (RS-RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large-scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality-aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM). MQC-SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short-term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. Furthermore, it incorporates a decoupled attention-based memory integration mechanism with dynamic quality assessment, selectively updating high-confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS-RVOS Bench demonstrate that MQC-SAM achieves state-of-the-art performance.
摘要：遥感视频参考对象分割（RS-RVOS）面临动态场景中较弱的目标显着性和严重的视觉信息截断的挑战，使得在分割过程中保持有辨别力的目标表示变得极其困难。此外，由于缺乏大规模的专用基准，该领域的进展受到阻碍，而现有模型经常受到有偏差的初始内存构造的影响，这会损害复杂场景中实例的准确定位，以及不加区别的内存积累，对遮挡或错误分类的噪声进行编码，导致持续的错误传播。本文通过数据和方法论的双重贡献推进了 RS-RVOS 研究。首先，我们构建了 RS-RVOS Bench，这是第一个大规模基准测试，包含 111 个视频序列、约 25,000 帧和 213,000 个时间参考注释。与常见的 RVOS 基准测试不同，在常见的 RVOS 基准测试中，许多表达式都是通过访问完整视频上下文来编写的，我们的数据集采用严格的因果感知注释策略，其中语言参考仅从初始帧中的目标状态生成。其次，我们提出了一种内存质量感知的在线引用分段框架，称为带有分段任意模型的内存质量控制（MQC-SAM）。 MQC-SAM 引入了用于初始记忆校准的时间运动一致性模块，利用短期运动轨迹先验来纠正结构偏差并建立准确的记忆锚定。此外，它结合了基于注意力的解耦记忆集成机制和动态质量评估，选择性更新高置信度语义特征，同时过滤不可靠信息，从而有效防止错误积累和传播。 RS-RVOS Bench 上的大量实验表明 MQC-SAM 实现了最先进的性能。

Title: RCDN: Real-Centered Detection Network for Robust Face Forgery Identification

Authors: Wyatt McCurdy, Xin Zhang, Yuqi Song, Min Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12111
Pdf URL: https://arxiv.org/pdf/2601.12111
Copy Paste: [[2601.12111]] RCDN: Real-Centered Detection Network for Robust Face Forgery Identification(https://arxiv.org/abs/2601.12111)
Keywords: generation
Abstract: Image forgery has become a critical threat with the rapid proliferation of AI-based generation tools, which make it increasingly easy to synthesize realistic but fraudulent facial content. Existing detection methods achieve near-perfect performance when training and testing are conducted within the same domain, yet their effectiveness deteriorates substantially in cross-domain scenarios. This limitation is problematic, as new forgery techniques continuously emerge and detectors must remain reliable against unseen manipulations. To address this challenge, we propose the Real-Centered Detection Network (RCDN), a frequency-spatial convolutional neural network (CNN) framework with an Xception backbone that anchors its representation space around authentic facial images. Instead of modeling the diverse and evolving patterns of forgeries, RCDN emphasizes the consistency of real images, leveraging a dual-branch architecture and a real-centered loss design to enhance robustness under distribution shifts. Extensive experiments on the DiFF dataset, focusing on three representative forgery types (FE, I2I, T2I), demonstrate that RCDN achieves both state-of-the-art in-domain accuracy and significantly stronger cross-domain generalization. Notably, RCDN reduces the generalization gap compared to leading baselines and achieves the highest cross/in-domain stability ratio, highlighting its potential as a practical solution for defending against evolving and unseen image forgery techniques.
摘要：随着基于人工智能的生成工具的迅速普及，图像伪造已成为一个严重威胁，这使得合成真实但欺诈性的面部内容变得越来越容易。现有的检测方法在同一域内进行训练和测试时可以实现近乎完美的性能，但在跨域场景中其有效性会大幅下降。这种限制是有问题的，因为新的伪造技术不断出现，探测器必须保持可靠，以防止看不见的操纵。为了应对这一挑战，我们提出了真实中心检测网络（RCDN），这是一种频率空间卷积神经网络（CNN）框架，具有 Xception 主干，将其表示空间锚定在真实的面部图像周围。 RCDN 没有对伪造的多样化和不断演变的模式进行建模，而是强调真实图像的一致性，利用双分支架构和真实中心损失设计来增强分布变化下的鲁棒性。在 DiFF 数据集上进行的大量实验，重点关注三种代表性的伪造类型（FE、I2I、T2I），表明 RCDN 实现了最先进的域内准确性和显着更强的跨域泛化能力。值得注意的是，与领先的基线相比，RCDN 缩小了泛化差距，并实现了最高的跨域/域内稳定性比，凸显了其作为防御不断发展和看不见的图像伪造技术的实用解决方案的潜力。

Title: SynQP: A Framework and Metrics for Evaluating the Quality and Privacy Risk of Synthetic Data

Authors: Bing Hu, Yixin Li, Asma Bahamyirou, Helen Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12124
Pdf URL: https://arxiv.org/pdf/2601.12124
Copy Paste: [[2601.12124]] SynQP: A Framework and Metrics for Evaluating the Quality and Privacy Risk of Synthetic Data(https://arxiv.org/abs/2601.12124)
Keywords: generation
Abstract: The use of synthetic data in health applications raises privacy concerns, yet the lack of open frameworks for privacy evaluations has slowed its adoption. A major challenge is the absence of accessible benchmark datasets for evaluating privacy risks, due to difficulties in acquiring sensitive data. To address this, we introduce SynQP, an open framework for benchmarking privacy in synthetic data generation (SDG) using simulated sensitive data, ensuring that original data remains confidential. We also highlight the need for privacy metrics that fairly account for the probabilistic nature of machine learning models. As a demonstration, we use SynQP to benchmark CTGAN and propose a new identity disclosure risk metric that offers a more accurate estimation of privacy risks compared to existing approaches. Our work provides a critical tool for improving the transparency and reliability of privacy evaluations, enabling safer use of synthetic data in health-related applications. In our quality evaluations, non-private models achieved near-perfect machine-learning efficacy $\ge 0.97$. Our privacy assessments (Table II) reveal that DP consistently lowers both identity disclosure risk (SD-IDR) and membership-inference attack risk (SD-MIA), with all DP-augmented models staying below the 0.09 regulatory threshold. Code available at this https URL.
摘要：在健康应用中使用合成数据引起了隐私问题，但缺乏隐私评估的开放框架已经减缓了其采用速度。一个主要挑战是由于获取敏感数据的困难，缺乏可访问的基准数据集来评估隐私风险。为了解决这个问题，我们引入了 SynQP，这是一个开放框架，用于使用模拟敏感数据对合成数据生成 (SDG) 中的隐私进行基准测试，以确保原始数据的机密性。我们还强调需要公平地考虑机器学习模型的概率性质的隐私指标。作为演示，我们使用 SynQP 对 CTGAN 进行基准测试，并提出一种新的身份泄露风险指标，与现有方法相比，该指标可以更准确地估计隐私风险。我们的工作为提高隐私评估的透明度和可靠性提供了一个关键工具，从而能够在健康相关应用程序中更安全地使用合成数据。 % 在我们的质量评估中，非私有模型实现了近乎完美的机器学习功效 $\ge0.97$。我们的隐私评估（表 II）显示，DP 持续降低身份泄露风险 (SD-IDR) 和成员推断攻击风险 (SD-MIA)，所有 DP 增强模型均保持在 0.09 监管阈值以下。代码可在此 https URL 获取

Title: Speculative Sampling with Reinforcement Learning

Authors: Chenan Wang, Daniel H. Shi, Haipeng Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12212
Pdf URL: https://arxiv.org/pdf/2601.12212
Copy Paste: [[2601.12212]] Speculative Sampling with Reinforcement Learning(https://arxiv.org/abs/2601.12212)
Keywords: generation
Abstract: Inference-time latency has remained an open challenge for real-world applications of large language models (LLMs). State-of-the-art (SOTA) speculative sampling (SpS) methods for LLMs, like EAGLE-3, use tree-based drafting to explore multiple candidate continuations in parallel. However, the hyperparameters controlling the tree structure are static, which limits flexibility and efficiency across diverse contexts and domains. We introduce Reinforcement learning for Speculative Sampling (Re-SpS), the first reinforcement learning (RL)-based framework for draft tree hyperparameter optimization. Re-SpS dynamically adjusts draft tree hyperparameters in real-time, learning context-aware policies that maximize generation speed by balancing speculative aggression with computational overhead. It leverages efficient state representations from target model hidden states and introduces multi-step action persistence for better context modeling. Evaluation results across five diverse benchmarks demonstrate consistent improvements over the SOTA method EAGLE-3, achieving up to 5.45$\times$ speedup over the backbone LLM and up to 1.12$\times$ speedup over EAGLE-3, with no loss in output fidelity.
摘要：对于大型语言模型 (LLM) 的现实世界应用来说，推理时间延迟仍然是一个开放的挑战。 LLM 最先进的 (SOTA) 推测抽样 (SpS) 方法（例如 EAGLE-3）使用基于树的绘图来并行探索多个候选延续。然而，控制树结构的超参数是静态的，这限制了跨不同上下文和域的灵活性和效率。我们引入了推测采样强化学习 (Re-SpS)，这是第一个基于强化学习 (RL) 的草图树超参数优化框架。 Re-SpS 实时动态调整草图树超参数，学习上下文感知策略，通过平衡推测攻击与计算开销来最大化生成速度。它利用目标模型隐藏状态的有效状态表示，并引入多步骤动作持久性以实现更好的上下文建模。五个不同基准的评估结果表明，SOTA 方法 EAGLE-3 取得了一致的改进，在五个不同基准中，与主干 LLM 相比实现了高达 5.45$\times$ 的加速，与 EAGLE-3 相比高达 1.12$\times$ 的加速，且输出保真度没有损失。

Title: S^2F-Net: A Robust Spatial-Spectral Fusion Framework for Cross-Model AIGC Detection

Authors: Xiangyu Hu, Yicheng Hong, Hongchuang Zheng, Wenjun Zeng, Bingyao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12313
Pdf URL: https://arxiv.org/pdf/2601.12313
Copy Paste: [[2601.12313]] S^2F-Net: A Robust Spatial-Spectral Fusion Framework for Cross-Model AIGC Detection(https://arxiv.org/abs/2601.12313)
Keywords: generative
Abstract: The rapid development of generative models has imposed an urgent demand for detection schemes with strong generalization capabilities. However, existing detection methods generally suffer from overfitting to specific source models, leading to significant performance degradation when confronted with unseen generative architectures. To address these challenges, this paper proposes a cross-model detection framework called S^2F-Net, whose core lies in exploring and leveraging the inherent spectral discrepancies between real and synthetic textures. Considering that upsampling operations leave unique and distinguishable frequency fingerprints in both texture-poor and texture-rich regions, we focus our research on the detection of frequency-domain artifacts, aiming to fundamentally improve the generalization performance of the model. Specifically, we introduce a learnable frequency attention module that adaptively weights and enhances discriminative frequency bands by synergizing spatial texture analysis and spectral [...]. On the AIGCDetectBenchmark, which includes 17 categories of generative models, S^2F-Net achieves a detection accuracy of 90.49%, significantly outperforming various existing baseline methods in cross-domain detection scenarios.
摘要：生成模型的快速发展对具有强泛化能力的检测方案提出了迫切的需求。然而，现有的检测方法通常会过度拟合特定的源模型，导致在面对看不见的生成架构时性能显着下降。为了应对这些挑战，本文提出了一种称为 S 2 F-Net 的跨模型检测框架，其核心在于探索和利用真实纹理和合成纹理之间固有的光谱差异。考虑到上采样操作在纹理贫乏和纹理丰富的区域都会留下独特且可区分的频率指纹，我们将研究重点放在频域伪影的检测上，旨在从根本上提高模型的泛化性能。具体来说，我们引入了一个可学习的频率注意模块，通过协同空间纹理分析和频谱来自适应加权和增强判别频段。AIGCDetectBenchmark包含17类生成模型，S 2 F-Net实现了90.49%的检测准确率，在跨域检测场景中显着优于现有的各种基线方法。

Title: MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

Authors: Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, Chaofan Tao, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12346
Pdf URL: https://arxiv.org/pdf/2601.12346
Copy Paste: [[2601.12346]] MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents(https://arxiv.org/abs/2601.12346)
Keywords: generation
Abstract: Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
摘要：深度研究代理 (DRA) 通过多步骤搜索和合成生成引用丰富的报告，但现有基准主要针对纯文本设置或简短的多模式 QA，缺少端到端多模式证据使用。我们引入了 MMDeepResearch-Bench (MMDR-Bench)，这是跨 21 个领域的 140 个专家制作的任务的基准，其中每个任务提供一个图像文本包来评估多模态理解和基于引文的报告生成。与之前的设置相比，MMDR-Bench 强调使用明确证据的报告式综合，其中模型必须将视觉工件与来源声明联系起来，并保持叙述、引文和视觉参考之间的一致性。我们进一步提出了一个统一的、可解释的评估管道：用于报告质量的 Formula-LLM 自适应评估 (FLAE)、用于基于引文的证据对齐的可信检索对齐引文评估 (TRACE) 和用于文本视觉完整性的多模态支持对齐完整性检查 (MOSAIC)，每个管道都会产生细粒度的信号，支持超出单一总体评分的错误诊断。跨 25 个最先进模型的实验揭示了生成质量、引用规则和多模态基础之间的系统权衡，强调仅靠强大的散文并不能保证忠实的证据使用，而多模态完整性仍然是深度研究代理的关键瓶颈。

Title: From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles

Authors: Omar Y. Goba, Ahmed Y. Gado, Catherine M. Elias, Ahmed Hussein
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2601.12358
Pdf URL: https://arxiv.org/pdf/2601.12358
Copy Paste: [[2601.12358]] From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles(https://arxiv.org/abs/2601.12358)
Keywords: generation
Abstract: Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.
摘要：自动驾驶汽车 (AV) 需要自适应行为规划器来安全地驾驭不可预测的现实环境。传统的行为树 (BT) 提供结构化决策逻辑，但本质上是静态的，并且需要劳动密集型的手动调整，限制了它们在 SAE 5 级自治中的适用性。本文提出了一个代理框架，该框架利用大型语言模型 (LLM) 和多模态视觉模型 (LVM) 来动态生成和调整 BT。专门的描述符代理应用符号链提示来评估场景关键性，规划器代理通过上下文学习构建高级子目标，生成器代理以 XML 格式合成可执行的 BT 子树。我们的系统集成到 CARLA+Nav2 模拟中，仅在基线 BT 故障时触发，展示了在没有人为干预的情况下成功绕过意外障碍（例如街道堵塞）的导航。与静态 BT 基线相比，这种方法是一种可扩展到不同驾驶场景的概念验证。

Title: Utilizing the Score of Data Distribution for Hyperspectral Anomaly Detection

Authors: Jiahui Sheng, Yidan Shi, Shu Xiang, Xiaorun Li, Shuhan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12379
Pdf URL: https://arxiv.org/pdf/2601.12379
Copy Paste: [[2601.12379]] Utilizing the Score of Data Distribution for Hyperspectral Anomaly Detection(https://arxiv.org/abs/2601.12379)
Keywords: generative
Abstract: Hyperspectral images (HSIs) are a type of image that contains abundant spectral information. As a type of real-world data, the high-dimensional spectra in hyperspectral images are actually determined by only a few factors, such as chemical composition and illumination. Thus, spectra in hyperspectral images are highly likely to satisfy the manifold hypothesis. Based on the hyperspectral manifold hypothesis, we propose a novel hyperspectral anomaly detection method (named ScoreAD) that leverages the time-dependent gradient field of the data distribution (i.e., the score), as learned by a score-based generative model (SGM). Our method first trains the SGM on the entire set of spectra from the hyperspectral image. At test time, each spectrum is passed through a perturbation kernel, and the resulting perturbed spectrum is fed into the trained SGM to obtain the estimated score. The manifold hypothesis of HSIs posits that background spectra reside on one or more low-dimensional manifolds. Conversely, anomalous spectra, owing to their unique spectral signatures, are considered outliers that do not conform to the background manifold. Based on this fundamental discrepancy in their manifold distributions, we leverage a generative SGM to achieve hyperspectral anomaly detection. Experiments on four hyperspectral datasets demonstrate the effectiveness of the proposed method. The code is available at this https URL.
摘要：高光谱图像（HSIs）是一类包含丰富光谱信息的图像。作为现实世界数据的一种，高光谱图像中的高维光谱实际上仅由化学成分和光照等少数因素决定。因此，高光谱图像中的光谱很可能满足流形假设。基于高光谱流形假设，我们提出了一种新颖的高光谱异常检测方法（称为 ScoreAD），该方法利用基于分数的生成模型（SGM）学习的数据分布（即分数）的时间相关梯度场。我们的方法首先在高光谱图像的整组光谱上训练 SGM。在测试时，每个频谱都通过扰动核，并将得到的扰动频谱输入到经过训练的 SGM 中以获得估计分数。 HSI 的流形假设假设背景光谱位于一个或多个低维流形上。相反，异常光谱由于其独特的光谱特征，被认为是不符合背景流形的异常值。基于其流形分布的这种根本差异，我们利用生成 SGM 来实现高光谱异常检测。对四个高光谱数据集的实验证明了该方法的有效性。该代码可从此 https URL 获取。
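The test-time procedure in the ScoreAD abstract (perturb each spectrum, then query a score model) can be sketched on a toy problem. This is an illustration under loud assumptions and not the authors' method: the trained SGM is replaced by the closed-form score of an isotropic Gaussian background, the perturbation kernel is assumed to be additive Gaussian noise, and the norm of the score is used as an anomaly indicator, which the abstract does not specify. All function and parameter names are hypothetical.

```python
import math
import random
from typing import List


def perturb(spectrum: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Assumed Gaussian perturbation kernel: x_t = x + sigma * eps."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in spectrum]


def toy_gaussian_score(x: List[float], mean: List[float],
                       data_var: float, sigma: float) -> List[float]:
    """Closed-form score of N(mean, (data_var + sigma^2) I), evaluated at x.

    Stand-in for a trained score network: grad_x log p_sigma(x).
    """
    var = data_var + sigma ** 2
    return [-(xi - mi) / var for xi, mi in zip(x, mean)]


def anomaly_indicator(spectrum: List[float], mean: List[float],
                      data_var: float, sigma: float, rng: random.Random) -> float:
    """Norm of the score at the perturbed spectrum (an assumed detector)."""
    noisy = perturb(spectrum, sigma, rng)
    score = toy_gaussian_score(noisy, mean, data_var, sigma)
    return math.sqrt(sum(s * s for s in score))


rng = random.Random(0)
background_mean = [0.5, 0.5, 0.5]   # toy 3-band background spectrum
background_pixel = [0.5, 0.52, 0.48]
anomalous_pixel = [0.9, 0.1, 0.9]

bg_value = anomaly_indicator(background_pixel, background_mean, 0.01, 0.01, rng)
an_value = anomaly_indicator(anomalous_pixel, background_mean, 0.01, 0.01, rng)
```

In this toy setting, the spectrum far from the background distribution receives a much larger score norm than the one close to it, which is the intuition behind using the score to separate anomalies from background.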

Title: Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation

Authors: Dasith de Silva Edirimuni, Ajmal Saeed Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12391
Pdf URL: https://arxiv.org/pdf/2601.12391
Copy Paste: [[2601.12391]] Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation(https://arxiv.org/abs/2601.12391)
Keywords: generation
Abstract: Most 3D scene generation methods are limited to only generating object bounding box parameters while newer diffusion methods also generate class labels and latent features. Using object size or latent feature, they then retrieve objects from a predefined database. For complex scenes of varied, multi-categorical objects, diffusion-based latents cannot be effectively decoded by current autoencoders into the correct point cloud objects which agree with target classes. We introduce a Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) that is trained to effectively decode object latent features, by employing a pioneering class-partitioned codebook where codevectors are labeled by class. To address the problem of codebook collapse, we propose a class-aware running average update which reinitializes dead codevectors within each partition. During inference, object features and class labels, both generated by a Latent-space Flow Matching Model (LFMM) designed specifically for scene generation, are consumed by the CPVQ-VAE. The CPVQ-VAE's class-aware inverse look-up then maps generated latents to codebook entries that are decoded to class-specific point cloud shapes. Thereby, we achieve pure point cloud generation without relying on an external object database for retrieval. Extensive experiments reveal that our method reliably recovers plausible point cloud scenes, with up to 70.4% and 72.3% reduction in Chamfer and Point2Mesh errors on complex living room scenes.
摘要：大多数 3D 场景生成方法仅限于生成对象边界框参数，而较新的扩散方法还生成类标签和潜在特征。然后，它们使用对象大小或潜在特征从预定义的数据库中检索对象。对于包含多种类别对象的复杂场景，当前的自动编码器无法有效地将基于扩散的潜在特征解码为与目标类别一致的正确点云对象。我们引入了一种类划分向量量化变分自动编码器（CPVQ-VAE），它采用开创性的 $\textit{类划分码本}$（其中码向量按类进行标记），经过训练可以有效地解码对象潜在特征。为了解决 $\textit{码本坍塌}$（codebook collapse）的问题，我们提出了 $\textit{类感知}$ 的滑动平均更新，它重新初始化每个分区内的失效码向量。在推理过程中，CPVQ-VAE 会使用由专为场景生成设计的潜在空间流匹配模型 (LFMM) 生成的对象特征和类标签。然后，CPVQ-VAE 的类感知逆向查找将生成的潜在特征映射到码本条目，这些条目再被解码为特定于类的点云形状。因此，我们实现了纯粹的点云生成，而不依赖于外部对象数据库进行检索。大量实验表明，我们的方法可靠地恢复了合理的点云场景，在复杂的客厅场景中，Chamfer 和 Point2Mesh 误差最多分别减少了 70.4% 和 72.3%。
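
The class-aware inverse look-up described in this abstract can be pictured as a nearest-neighbour search restricted to the codevectors that carry the requested class label. The following is a minimal sketch under that assumption only; the toy codebook, the class names, and the function `class_aware_lookup` are hypothetical and are not taken from the paper's implementation.

```python
from math import dist

# Toy class-partitioned codebook: every codevector carries a class label.
# The labels and coordinates are made-up illustration values.
CODEBOOK = [
    ("chair", (0.0, 0.0)),
    ("chair", (1.0, 0.0)),
    ("table", (0.1, 0.1)),
    ("table", (5.0, 5.0)),
]


def class_aware_lookup(latent, label, codebook=CODEBOOK):
    """Return the index of the nearest codevector whose class equals `label`."""
    # Restrict the search to the partition that belongs to the requested class.
    candidates = [i for i, (cls, _) in enumerate(codebook) if cls == label]
    if not candidates:
        raise ValueError(f"no codevectors for class {label!r}")
    # Plain Euclidean nearest neighbour inside that partition.
    return min(candidates, key=lambda i: dist(latent, codebook[i][1]))
```

In this toy setup, the latent `(0.1, 0.1)` lies closest to a "table" codevector overall, but a look-up with the label "chair" still returns a "chair" entry, which is the behaviour the class partition is meant to guarantee.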

Title: Beyond the Dirac Delta: Mitigating Diversity Collapse in Reinforcement Fine-Tuning for Versatile Image Generation

Authors: Jinmei Liu, Haoru Li, Zhenhong Sun, Chaofeng Chen, Yatao Bian, Bo Wang, Daoyi Dong, Chunlin Chen, Zhi Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12401
Pdf URL: https://arxiv.org/pdf/2601.12401
Copy Paste: [[2601.12401]] Beyond the Dirac Delta: Mitigating Diversity Collapse in Reinforcement Fine-Tuning for Versatile Image Generation(https://arxiv.org/abs/2601.12401)
Keywords: generation, generative
Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning large-scale generative models, such as diffusion and flow models, to align with complex human preferences and user-specified tasks. A fundamental limitation remains the curse of diversity collapse, where the objective formulation and optimization landscape inherently collapse the policy to a Dirac delta distribution. To address this challenge, we propose DRIFT (DiveRsity-Incentivized Reinforcement Fine-Tuning for Versatile Image Generation), an innovative framework that systematically incentivizes output diversity throughout the on-policy fine-tuning process, reconciling strong task alignment with high generation diversity to enhance versatility essential for applications that demand diverse candidate generations. We approach the problem across three representative perspectives: i) sampling a reward-concentrated subset that filters out reward outliers to prevent premature collapse; ii) prompting with stochastic variations to expand the conditioning space; and iii) optimization of the intra-group diversity with a potential-based reward shaping mechanism. Experimental results show that DRIFT achieves superior Pareto dominance regarding task alignment and generation diversity, yielding a 9.08% to 43.46% increase in diversity at equivalent alignment levels and a 59.65% to 65.86% increase in alignment at equivalent levels of diversity.
摘要：强化学习 (RL) 已成为微调大规模生成模型（例如扩散模型和流模型）的强大范式，使其符合复杂的人类偏好和用户指定的任务。一个根本性的局限仍然是多样性坍缩的诅咒，即目标形式和优化景观本质上会使策略坍缩为狄拉克 δ 分布。为了应对这一挑战，我们提出了 DRIFT（DiveRsity-Incentivized Reinforcement Fine-Tuning for Versatile Image Generation），这是一种创新框架，可以在整个在线策略（on-policy）微调过程中系统地激励输出多样性，兼顾强任务对齐与高生成多样性，以增强多功能性，这对于需要多样化候选生成结果的应用至关重要。我们从三个有代表性的角度来解决这个问题：i) 采样：选取奖励集中的子集，过滤掉奖励异常值以防止过早坍缩；ii) 提示：使用随机变化来扩大条件空间；iii) 优化：通过基于势函数的奖励塑造机制优化组内多样性。实验结果表明，DRIFT 在任务对齐和生成多样性方面实现了更优的帕累托占优，在同等对齐水平下多样性提高了 9.08% 至 43.46%，在同等多样性水平下对齐度提高了 59.65% 至 65.86%。

Title: SDCoNet: Saliency-Driven Multi-Task Collaborative Network for Remote Sensing Object Detection

Authors: Ruo Qi, Linhui Dai, Yusong Qin, Chaolei Yang, Yanshan Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12507
Pdf URL: https://arxiv.org/pdf/2601.12507
Copy Paste: [[2601.12507]] SDCoNet: Saliency-Driven Multi-Task Collaborative Network for Remote Sensing Object Detection(https://arxiv.org/abs/2601.12507)
Keywords: super-resolution
Abstract: In remote sensing images, complex backgrounds, weak object signals, and small object scales make accurate detection particularly challenging, especially under low-quality imaging conditions. A common strategy is to integrate single-image super-resolution (SR) before detection; however, such serial pipelines often suffer from misaligned optimization objectives, feature redundancy, and a lack of effective interaction between SR and detection. To address these issues, we propose a Saliency-Driven multi-task Collaborative Network (SDCoNet) that couples SR and detection through implicit feature sharing while preserving task specificity. SDCoNet employs a Swin Transformer-based shared encoder, where hierarchical window-shifted self-attention supports cross-task feature collaboration and adaptively balances the trade-off between texture refinement and semantic representation. In addition, a multi-scale saliency prediction module produces importance scores to select key tokens, enabling focused attention on weak object regions, suppression of background clutter, and suppression of adverse features introduced by multi-task coupling. Furthermore, a gradient routing strategy is introduced to mitigate optimization conflicts. It first stabilizes detection semantics and subsequently routes SR gradients along a detection-oriented direction, enabling the framework to guide the SR branch to generate high-frequency details that are explicitly beneficial for detection. Experiments on public datasets, including NWPU VHR-10-Split, DOTAv1.5-Split, and HRSSD-Split, demonstrate that the proposed method, while maintaining competitive computational efficiency, significantly outperforms existing mainstream algorithms in small object detection on low-quality remote sensing images. Our code is available at this https URL.
摘要：在遥感图像中，复杂的背景、微弱的物体信号和较小的物体尺度使得精确检测特别具有挑战性，特别是在低质量成像条件下。常见的策略是在检测前集成单图像超分辨率（SR）；然而，这种串行管道通常会遇到优化目标不一致、特征冗余以及 SR 和检测之间缺乏有效交互的问题。为了解决这些问题，我们提出了一种显着性驱动的多任务协作网络（SDCoNet），它通过隐式特征共享将 SR 和检测结合起来，同时保留任务特异性。 SDCoNet 采用基于 swin Transformer 的共享编码器，其中分层窗口移位自注意力支持跨任务特征协作，并自适应地平衡纹理细化和语义表示之间的权衡。此外，多尺度显着性预测模块会生成重要性分数来选择关键标记，从而能够将注意力集中在弱对象区域、抑制背景杂波以及抑制多任务耦合引入的不利特征。此外，引入梯度路由策略来减轻优化冲突。它首先稳定检测语义，然后沿着面向检测的方向路由 SR 梯度，使框架能够指导 SR 分支生成明显有利于检测的高频细节。在NWPU VHR-10-Split、DOTAv1.5-Split和HRSSD-Split等公共数据集上的实验表明，该方法在保持有竞争力的计算效率的同时，在低质量遥感图像的小目标检测方面显着优于现有主流算法。我们的代码可以在这个 https URL 上找到。

Title: Towards Robust Universal Perturbation Attacks: A Float-Coded, Penalty-Driven Evolutionary Approach

Authors: Shiqi Wang, Mahdi Khosravy, Neeraj Gupta, Olaf Witkowski
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2601.12624
Pdf URL: https://arxiv.org/pdf/2601.12624
Copy Paste: [[2601.12624]] Towards Robust Universal Perturbation Attacks: A Float-Coded, Penalty-Driven Evolutionary Approach(https://arxiv.org/abs/2601.12624)
Keywords: generation
Abstract: Universal adversarial perturbations (UAPs) have garnered significant attention due to their ability to undermine deep neural networks across multiple inputs using a single noise pattern. Evolutionary algorithms offer a promising approach to generating such perturbations due to their ability to navigate non-convex, gradient-free landscapes. In this work, we introduce a float-coded, penalty-driven single-objective evolutionary framework for UAP generation that achieves lower visibility perturbations while enhancing attack success rates. Our approach leverages continuous gene representations aligned with contemporary deep learning scales, incorporates dynamic evolutionary operators with adaptive scheduling, and utilizes a modular PyTorch implementation for seamless integration with modern architectures. Additionally, we ensure the universality of the generated perturbations by testing across diverse models and by periodically switching batches to prevent overfitting. Experimental results on the ImageNet dataset demonstrate that our framework consistently produces perturbations with smaller norms, higher misclassification effectiveness, and faster convergence compared to existing evolutionary-based methods. These findings highlight the robustness and scalability of our approach for universal adversarial attacks across various deep learning architectures.
摘要：通用对抗性扰动（UAP）因其能够使用单一噪声模式破坏跨多个输入的深层神经网络而引起了广泛关注。进化算法提供了一种有前途的方法来产生这种扰动，因为它们能够导航非凸、无梯度的景观。在这项工作中，我们引入了一种用于 UAP 生成的浮点编码、惩罚驱动的单目标进化框架，该框架可实现较低的可见性扰动，同时提高攻击成功率。我们的方法利用与当代深度学习规模相一致的连续基因表示，将动态进化算子与自适应调度相结合，并利用模块化 PyTorch 实现与现代架构无缝集成。此外，我们通过跨不同模型进行测试并定期切换批次以防止过度拟合，确保生成的扰动的普遍性。 ImageNet 数据集上的实验结果表明，与现有的基于进化的方法相比，我们的框架始终能够产生具有更小范数、更高的误分类有效性和更快收敛的扰动。这些发现凸显了我们跨各种深度学习架构的通用对抗攻击方法的稳健性和可扩展性。
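
The abstract names two ingredients, a float-coded representation and a penalty-driven single objective, without giving their exact form. The following is a minimal sketch of what such pieces could look like, assuming an L-infinity norm penalty and Gaussian mutation with clipping. Both choices, the parameter names, and the default `penalty_weight` are assumptions made for illustration and are not the authors' operators.

```python
import random


def fitness(perturbation, fooled_fraction, penalty_weight=0.1):
    """Single objective: attack success minus a penalty on perturbation size.

    `fooled_fraction` is the share of inputs misclassified under the
    perturbation; here it is supplied by the caller rather than computed.
    """
    linf = max(abs(g) for g in perturbation)
    return fooled_fraction - penalty_weight * linf


def mutate(perturbation, sigma, bound, rng):
    """Gaussian mutation of float-coded genes, clipped to [-bound, bound]."""
    return [
        max(-bound, min(bound, g + rng.gauss(0.0, sigma)))
        for g in perturbation
    ]
```

Because the genes are plain floats, mutation is a small additive step rather than a bit flip, and the penalty term lets one scalar trade attack success against visibility.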

Title: VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

Authors: Qimao Chen, Fang Li, Shaoqing Xu, Zhiyi Lai, Zixun Xie, Yuechen Luo, Shengyin Jiang, Hanbing Li, Long Chen, Bing Wang, Yi Zhang, Zhi-Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12672
Pdf URL: https://arxiv.org/pdf/2601.12672
Copy Paste: [[2601.12672]] VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness(https://arxiv.org/abs/2601.12672)
Keywords: generation, generative
Abstract: The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, where rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions, including safety-critical scenario generation and closed-loop learning, often rely on rule-based heuristics, resampling methods, and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such a two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a VLM into the closed-loop training of AD agents. Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment and strategically generating challenging scenarios through direct, fine-grained editing of surrounding agents' future trajectories. This direct-editing approach fully leverages the VLM's powerful generalization capabilities to create a diverse curriculum of plausible yet challenging scenarios that extend beyond the scope of traditional methods. We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy, particularly in its ability to navigate critical long-tail events.
摘要：自动驾驶（AD）系统的安全部署从根本上受到长尾问题的阻碍，即罕见但关键的驾驶场景在现实世界数据中的代表性严重不足。现有的解决方案，包括安全关键场景生成和闭环学习，通常依赖于基于规则的启发式方法、重采样方法和从离线数据集中学习的生成模型，限制了它们产生多样化和新颖挑战的能力。虽然最近的工作利用视觉语言模型（VLM）来生成场景描述，指导单独的下游模型为代理生成危险轨迹，但这种两阶段框架限制了 VLM 的生成潜力，因为最终轨迹的多样性最终受到下游算法的泛化上限的限制。为了克服这些限制，我们引入了 VILTA（VLM-In-the-Loop Trajectory Adversary），这是一种将 VLM 集成到 AD 智能体闭环训练中的新颖框架。与之前的工作不同，VILTA 通过理解动态驾驶环境并通过直接、细粒度地编辑周围智能体的未来轨迹来战略性地生成具有挑战性的场景，从而积极参与训练循环。这种直接编辑方法充分利用 VLM 强大的泛化能力来创建多样化的课程，其中包含看似合理但具有挑战性的场景，超出了传统方法的范围。我们证明，我们的方法大大增强了最终 AD 政策的安全性和稳健性，特别是在应对关键长尾事件的能力方面。

Title: Fusion-Restoration Image Processing Algorithm to Improve the High-Temperature Deformation Measurement

Authors: Banglei Guan, Dongcai Tan, Jing Tao, Ang Su, Yang Shang, Qifeng Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12682
Pdf URL: https://arxiv.org/pdf/2601.12682
Copy Paste: [[2601.12682]] Fusion-Restoration Image Processing Algorithm to Improve the High-Temperature Deformation Measurement(https://arxiv.org/abs/2601.12682)
Keywords: restoration
Abstract: In the deformation measurement of high-temperature structures, image degradation caused by thermal radiation and random errors introduced by heat haze restrict the accuracy and effectiveness of deformation measurement. This work aims to suppress thermal radiation and heat haze using fusion-restoration image processing methods, thereby improving the accuracy and effectiveness of digital image correlation (DIC) in the measurement of high-temperature deformation. For image degradation caused by thermal radiation, based on the image layered representation, the image is decomposed into positive and negative channels for parallel processing, and then optimized for quality by multi-exposure image fusion. To counteract the high-frequency, random errors introduced by heat haze, we adopt the FSIM as the objective function to guide the iterative optimization of model parameters, and the grayscale average algorithm is applied to equalize anomalous gray values, thereby reducing measurement error. The proposed multi-exposure image fusion algorithm effectively suppresses image degradation caused by complex illumination conditions, boosting the effective computation area from 26% to 50% for under-exposed images and from 32% to 40% for over-exposed images without degrading measurement accuracy in the experiment. Meanwhile, the image restoration combined with the grayscale average algorithm reduces static thermal deformation measurement errors. The error in $\epsilon_{xx}$ is reduced by 85.3%, while the errors in $\epsilon_{yy}$ and $\gamma_{xy}$ are reduced by 36.0% and 36.4%, respectively. We present image processing methods to suppress the interference of thermal radiation and heat haze in high-temperature deformation measurement using DIC. The experimental results verify that the proposed method can effectively improve image quality and reduce deformation measurement errors, and that it has potential application value in thermal deformation measurement.
摘要：在高温结构的变形测量中，热辐射引起的图像劣化和热雾引入的随机误差限制了变形测量的准确性和有效性。利用融合恢复图像处理方法抑制热辐射和热雾，从而提高DIC测量高温变形的准确性和有效性。针对热辐射引起的图像劣化，基于图像分层表示，将图像分解为正通道和负通道并行处理，然后通过多曝光图像融合进行质量优化。为了抵消热霾引入的高频随机误差，采用FSIM作为目标函数指导模型参数迭代优化，并采用灰度平均算法均衡异常灰度值，从而降低测量误差。所提出的多重曝光图像融合算法有效地抑制了复杂光照条件引起的图像劣化，将曝光不足图像的有效计算面积从26％提高到50％，将过度曝光图像的有效计算面积从32％提高到40％，而实验中没有降低测量精度。同时，图像复原结合灰度平均算法，降低了静态热变形测量误差。 $\epsilon_{xx}$ 中的误差减少了 85.3%，而 $\epsilon_{yy}$ 和 $\gamma_{xy}$ 中的误差分别减少了 36.0% 和 36.4%。我们提出了图像处理方法，以抑制使用 DIC 进行高温变形测量中热辐射和热雾的干扰。实验结果验证了该方法能够有效提高图像质量、减少变形测量误差，在热变形测量中具有潜在的应用价值。

Title: S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

Authors: Lin Zhao, Yushu Wu, Aleksei Lebedev, Dishani Lahiri, Meng Dong, Arpit Sahni, Michael Vasilkovsky, Hao Chen, Ju Hu, Aliaksandr Siarohin, Sergey Tulyakov, Yanzhi Wang, Anil Kag, Yanyu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12719
Pdf URL: https://arxiv.org/pdf/2601.12719
Copy Paste: [[2601.12719]] S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation(https://arxiv.org/abs/2601.12719)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, and streaming video generation on mobile hardware. S2DiT generates more tokens but maintains efficiency with novel efficient attentions: a mixture of LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Based on this, we uncover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, S2DiT achieves quality on par with state-of-the-art server video models, while streaming at over 10 FPS on an iPhone.
摘要：扩散变压器 (DiT) 最近提高了视频生成质量。然而，其繁重的计算成本使得实时或设备上生成变得不可行。在这项工作中，我们介绍了 S2DiT，一种流式三明治扩散变压器，专为在移动硬件上生成高效、高保真和流式视频而设计。 S2DiT 生成更多令牌，但通过新颖的高效注意力保持效率：LinConv 混合注意力 (LCHA) 和跨步自注意力 (SSA) 的混合。在此基础上，我们通过预算敏感的动态规划搜索来揭示三明治设计，从而实现卓越的质量和效率。我们进一步提出了一个二合一蒸馏框架，将大型教师模型（例如，Wan 2.2-14B）的能力转移到紧凑的少步三明治模型。总之，S2DiT 的质量可与最先进的服务器视频模型相媲美，同时在 iPhone 上以超过 10 FPS 的速度进行流传输。
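
The abstract mentions a budget-aware dynamic programming search over block designs but does not spell it out. One generic way to frame such a search is a multiple-choice knapsack: pick one attention type per block so that the summed quality proxy is maximal while the summed cost stays within a budget. The sketch below shows only that generic formulation. The integer costs, the scores, and the function `best_layout` are hypothetical and are not the paper's search procedure.

```python
def best_layout(options_per_block, budget):
    """Pick one option per block to maximise total score under a cost budget.

    `options_per_block` is a list of dicts mapping an option name to a
    (cost, score) pair, with integer costs (hypothetical units).
    Returns (total_score, [chosen option name for each block]).
    """
    # best[c] = best (score, layout) found so far with total cost exactly c.
    best = {0: (0.0, [])}
    for options in options_per_block:
        nxt = {}
        for used, (score, layout) in best.items():
            for name, (cost, gain) in options.items():
                total = used + cost
                if total > budget:
                    continue  # this choice would exceed the budget
                cand = (score + gain, layout + [name])
                if total not in nxt or cand[0] > nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        raise ValueError("no layout fits within the budget")
    return max(best.values(), key=lambda item: item[0])
```

With three identical blocks offering `{"LCHA": (1, 1.0), "SSA": (3, 2.5)}` and a budget of 5, the search can afford exactly one of the costlier blocks and fills the rest with the cheaper one.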

Title: SSPFormer: Self-Supervised Pretrained Transformer for MRI Images

Authors: Jingkai Li, Xiaoze Tian, Yuhang Shen, Jia Wang, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12747
Pdf URL: https://arxiv.org/pdf/2601.12747
Copy Paste: [[2601.12747]] SSPFormer: Self-Supervised Pretrained Transformer for MRI Images(https://arxiv.org/abs/2601.12747)
Keywords: super-resolution
Abstract: The pre-trained transformer demonstrates remarkable generalization ability in natural image processing. However, directly transferring it to magnetic resonance images faces two key challenges: the inability to adapt to the specificity of medical anatomical structures and the limitations brought about by the privacy and scarcity of medical data. To address these issues, this paper proposes a Self-Supervised Pretrained Transformer (SSPFormer) for MRI images, which effectively learns domain-specific feature representations of medical images by leveraging unlabeled raw imaging data. To tackle the domain gap and data scarcity, we introduce inverse frequency projection masking, which prioritizes the reconstruction of high-frequency anatomical regions to enforce structure-aware representation learning. Simultaneously, to enhance robustness against real-world MRI artifacts, we employ frequency-weighted FFT noise enhancement that injects physiologically realistic noise into the Fourier domain. Together, these strategies enable the model to learn domain-invariant and artifact-robust features directly from raw scans. Through extensive experiments on segmentation, super-resolution, and denoising tasks, the proposed SSPFormer achieves state-of-the-art performance, fully verifying its ability to capture fine-grained MRI image fidelity and adapt to clinical application requirements.
摘要：预训练的 Transformer 在自然图像处理中表现出卓越的泛化能力。然而，直接将其转移到磁共振图像面临着两个关键挑战：无法适应医学解剖结构的特殊性以及医疗数据的隐私性和稀缺性带来的限制。为了解决这些问题，本文提出了一种用于 MRI 图像的自监督预训练变换器（SSPFormer），它通过利用未标记的原始成像数据有效地学习医学图像的特定领域特征表示。为了解决域差距和数据稀缺问题，我们引入了逆频率投影掩蔽，它优先考虑高频解剖区域的重建，以实施结构感知的表示学习。同时，为了增强针对现实世界 MRI 伪影的鲁棒性，我们采用频率加权 FFT 噪声增强技术，将生理学真实噪声注入傅里叶域。这些策略共同使模型能够直接从原始扫描中学习域不变和工件稳健的特征。通过对分割、超分辨率和去噪任务的大量实验，所提出的SSPFormer实现了最先进的性能，充分验证了其捕捉细粒度MRI图像保真度和适应临床应用需求的能力。

Title: Moaw: Unleashing Motion Awareness for Video Diffusion Models

Authors: Tianqi Zhang, Ziyi Wang, Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Zhengyang Huang, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12761
Pdf URL: https://arxiv.org/pdf/2601.12761
Copy Paste: [[2601.12761]] Moaw: Unleashing Motion Awareness for Video Diffusion Models(https://arxiv.org/abs/2601.12761)
Keywords: generation, generative
Abstract: Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between the two networks, these features can be naturally adapted in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.
摘要：在大规模数据集上训练的视频扩散模型可以自然地捕获跨帧共享特征的对应关系。最近的工作利用这一特性来完成零样本设置中的光流预测和跟踪等任务。受这些发现的启发，我们研究了监督训练是否可以更充分地利用视频扩散模型的跟踪能力。为此，我们提出了 Moaw，这是一个框架，可以释放视频扩散模型的运动意识，并利用它来促进运动传输。具体来说，我们训练运动感知的扩散模型，将其模式从图像到视频生成转变为视频到密集跟踪。然后，我们构建一个运动标记数据集来识别编码最强运动信息的特征，并将它们注入结构相同的视频生成模型中。由于两个网络之间的同质性，这些特征可以以零样本的方式自然地适应，从而无需额外的适配器即可实现运动传输。我们的工作为桥接生成建模和运动理解提供了一种新的范例，为更加统一和可控的视频学习框架铺平了道路。

Title: Generalizable and Animatable 3D Full-Head Gaussian Avatar from a Single Image

Authors: Shuling Zhao, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12770
Pdf URL: https://arxiv.org/pdf/2601.12770
Copy Paste: [[2601.12770]] Generalizable and Animatable 3D Full-Head Gaussian Avatar from a Single Image(https://arxiv.org/abs/2601.12770)
Keywords: generative
Abstract: Building 3D animatable head avatars from a single image is an important yet challenging problem. Existing methods generally collapse under large camera pose variations, compromising the realism of 3D avatars. In this work, we propose a new framework to tackle the novel setting of one-shot 3D full-head animatable avatar reconstruction in a single feed-forward pass, enabling real-time animation and simultaneous 360$^\circ$ rendering views. To facilitate efficient animation control, we model 3D head avatars with Gaussian primitives embedded on the surface of a parametric face model within the UV space. To obtain knowledge of full-head geometry and textures, we leverage rich 3D full-head priors within a pretrained 3D generative adversarial network (GAN) for global full-head feature extraction and multi-view supervision. To increase the fidelity of the 3D reconstruction of the input image, we take advantage of the symmetric nature of the UV space and human faces to fuse local fine-grained input image features with the global full-head textures. Extensive experiments demonstrate the effectiveness of our method, achieving high-quality 3D full-head modeling as well as real-time animation, thereby improving the realism of 3D talking avatars.
摘要：从单个图像构建 3D 可动画头部头像是一个重要但具有挑战性的问题。现有方法通常会在较大的相机姿势变化下崩溃，从而损害 3D 化身的真实感。在这项工作中，我们提出了一种新的框架来解决单次前馈通道中一次性 3D 全头动画化身重建的新颖设置，从而实现实时动画和同步 360$^\circ$ 渲染视图。为了促进高效的动画控制，我们使用嵌入 UV 空间内参数化面部模型表面的高斯基元来建模 3D 头部头像。为了获得全头几何和纹理的知识，我们利用预训练的 3D 生成对抗网络 (GAN) 中丰富的 3D 全头先验来进行全局全头特征提取和多视图监督。为了提高输入图像 3D 重建的保真度，我们利用 UV 空间和人脸的对称性质，将局部细粒度输入图像特征与全局全头纹理融合。大量的实验证明了我们方法的有效性，实现了高质量的3D全头建模以及实时动画，从而提高了3D说话化身的真实感。

Title: A Generalist Foundation Model for Total-body PET/CT Enables Diagnostic Reporting and System-wide Metabolic Profiling

Authors: Wei Chen, Liang Wu, Shuyi Lu, Yuanyuan Sun, Wenkai Bi, Zilong Yuan, Yaoyao He, Feng Wang, Junchi Ma, Shuyong Liu, Zhaoping Cheng, Xiaoyan Hu, Jianfeng Qiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12820
Pdf URL: https://arxiv.org/pdf/2601.12820
Copy Paste: [[2601.12820]] A Generalist Foundation Model for Total-body PET/CT Enables Diagnostic Reporting and System-wide Metabolic Profiling(https://arxiv.org/abs/2601.12820)
Keywords: generation
Abstract: Total-body PET/CT enables system-wide molecular imaging, but heterogeneous anatomical and metabolic signals, approximately 2 m axial coverage, and structured radiology semantics challenge existing medical AI models that assume single-modality inputs, localized fields of view, and coarse image-text alignment. We introduce SDF-HOLO (Systemic Dual-stream Fusion Holo Model), a multimodal foundation model for holistic total-body PET/CT, pre-trained on more than 10,000 patients. SDF-HOLO decouples CT and PET representation learning with dual-stream encoders and couples them through a cross-modal interaction module, allowing anatomical context to refine PET aggregation while metabolic saliency guides subtle morphological reasoning. To model long-range dependencies across the body, hierarchical context modeling combines efficient local windows with global attention. To bridge voxels and clinical language, we use anatomical segmentation masks as explicit semantic anchors and perform voxel-mask-text alignment during pre-training. Across tumor segmentation, low-dose lesion detection, and multilingual diagnostic report generation, SDF-HOLO outperforms strong task-specific and clinical-reference baselines while reducing localization errors and hallucinated findings. Beyond focal interpretation, the model enables system-wide metabolic profiling and reveals tumor-associated fingerprints of inter-organ metabolic network interactions, providing a scalable computational foundation for total-body PET/CT diagnostics and system-level precision oncology.
摘要：全身 PET/CT 可实现全系统分子成像，但异质解剖和代谢信号、约 2 m 轴向覆盖范围和结构化放射学语义对现有的医疗 AI 模型提出了挑战，这些模型假设单模态输入、局部视野和粗略图像文本对齐。我们推出 SDF-HOLO（系统双流融合 Holo 模型），这是一种用于整体全身 PET/CT 的多模式基础模型，已在超过 10,000 名患者身上进行了预训练。 SDF-HOLO 将 CT 和 PET 表示学习与双流编码器解耦，并通过跨模态交互模块将它们耦合起来，允许解剖背景细化 PET 聚合，同时代谢显着性指导微妙的形态推理。为了对整个身体的远程依赖性进行建模，分层上下文建模将高效的局部窗口与全局注意力结合起来。为了桥接体素和临床语言，我们使用解剖分割掩模作为显式语义锚，并在预训练期间执行体素-掩模-文本对齐。在肿瘤分割、低剂量病变检测和多语言诊断报告生成方面，SDF-HOLO 的性能优于强大的特定任务和临床参考基线，同时减少了定位错误和幻觉结果。除了焦点解释之外，该模型还能够进行全系统代谢分析，并揭示器官间代谢网络相互作用的肿瘤相关指纹，为全身 PET/CT 诊断和系统级精准肿瘤学提供可扩展的计算基础。

Title: Generating Cyclic Conformers with Flow Matching in Cremer-Pople Coordinates

Authors: Luca Schaufelberger, Aline Hartgers, Kjell Jorner
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2601.12859
Pdf URL: https://arxiv.org/pdf/2601.12859
Copy Paste: [[2601.12859]] Generating Cyclic Conformers with Flow Matching in Cremer-Pople Coordinates(https://arxiv.org/abs/2601.12859)
Keywords: generation, generative
Abstract: Cyclic molecules are ubiquitous across applications in chemistry and biology. Their restricted conformational flexibility provides structural pre-organization that is key to their function in drug discovery and catalysis. However, reliably sampling the conformer ensembles of ring systems remains challenging. Here, we introduce PuckerFlow, a generative machine learning model that performs flow matching on the Cremer-Pople space, a low-dimensional internal coordinate system capturing the relevant degrees of freedom of rings. Our approach enables generation of valid closed rings by design and demonstrates strong performance in generating conformers that are both diverse and precise. We show that PuckerFlow outperforms other conformer generation methods on nearly all quantitative metrics and illustrate the potential of PuckerFlow for ring systems relevant to chemical applications, particularly in catalysis and drug discovery. This work enables efficient and reliable conformer generation of cyclic structures, paving the way towards modeling structure-property relationships and the property-guided generation of rings across a wide range of applications in chemistry and biology.
摘要：环状分子在化学和生物学的应用中无处不在。它们有限的构象灵活性提供了结构预组织，这对于它们在药物发现和催化中的功能至关重要。然而，对环系统的构象异构体整体进行可靠采样仍然具有挑战性。在这里，我们介绍 PuckerFlow，一种生成机器学习模型，它在 Cremer-Pople 空间上执行流匹配，Cremer-Pople 空间是一个低维内部坐标系，捕获环的相关自由度。我们的方法能够通过设计生成有效的闭环，并在生成多样化和精确的构象异构体方面表现出强大的性能。我们证明 PuckerFlow 在几乎所有定量指标上都优于其他构象异构体生成方法，并说明了 PuckerFlow 在与化学应用相关的环系统中的潜力，特别是在催化和药物发现方面。这项工作能够高效、可靠地生成环状结构的构象异构体，为化学和生物学领域广泛应用的结构-性质关系建模和性质引导的环生成铺平了道路。
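
To make the low-dimensional ring description concrete, the sketch below uses the standard Cremer-Pople relation for a six-membered ring, where three puckering coordinates (q2, phi2, q3) determine the out-of-plane displacement of every ring atom. It illustrates the coordinate system only and is not PuckerFlow code; the function name is hypothetical.

```python
from math import cos, pi, sqrt


def ring6_displacements(q2, phi2, q3):
    """Out-of-plane displacements z_j of a six-membered ring.

    Uses the Cremer-Pople relation
        z_j = sqrt(2/N) * q2 * cos(phi2 + 2*pi*2*j/N) + sqrt(1/N) * q3 * (-1)**j
    with N = 6 and atoms indexed j = 0..5.
    """
    n = 6
    return [
        sqrt(2.0 / n) * q2 * cos(phi2 + 2.0 * pi * 2 * j / n)
        + sqrt(1.0 / n) * q3 * (-1) ** j
        for j in range(n)
    ]
```

Setting q2 = 0 gives the alternating up/down pattern of a chair, and for any input the squared displacements sum to q2² + q3², the squared total puckering amplitude.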

Title: Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation

Authors: Zhenxuan Lu, Zhihua Xu, Zhijing Yang, Feng Gao, Yongyi Lu, Keze Wang, Tianshui Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12876
Pdf URL: https://arxiv.org/pdf/2601.12876
Copy Paste: [[2601.12876]] Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation(https://arxiv.org/abs/2601.12876)
Keywords: generation
Abstract: Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos while retaining the original mouth movements. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. Capitalizing on the advanced capabilities of audio-driven talking head generation (AD-THG) models in synthesizing precise lip movements, our research introduces a novel integration of these models with SPFEM. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which utilizes AD-THG models to generate frames with accurately synchronized lip movements from audio inputs and SPFEM-altered images. However, increasing the number of frames generated by AD-THG models tends to compromise the realism and expression fidelity of the images. To counter this, we develop an adjacent frame learning strategy that finetunes AD-THG models to predict sequences of consecutive frames. This strategy enables the models to incorporate information from neighboring frames, significantly improving image quality during testing. Our extensive experimental evaluations demonstrate that this framework effectively preserves mouth shapes during expression manipulations, highlighting the substantial benefits of integrating AD-THG with SPFEM.
摘要：保留语音的面部表情操纵（SPFEM）是一种创新技术，旨在改变图像和视频中的面部表情，同时保留原始的嘴巴动作。尽管取得了进步，但由于面部表情和嘴型之间复杂的相互作用，SPFEM 仍然难以实现精确的唇形同步。利用音频驱动头部说话生成 (AD-THG) 模型在合成精确嘴唇运动方面的先进功能，我们的研究引入了这些模型与 SPFEM 的新颖集成。我们提出了一个新的框架，Talking Head Facial Expression Manipulation (THFEM)，它利用 AD-THG 模型根据音频输入和 SPFEM 改变的图像生成具有精确同步嘴唇运动的帧。然而，增加 AD-THG 模型生成的帧数往往会损害图像的真实感和表达保真度。为了解决这个问题，我们开发了一种相邻帧学习策略，该策略可以微调 AD-THG 模型以预测连续帧的序列。这种策略使模型能够合并来自相邻帧的信息，从而显着提高测试期间的图像质量。我们广泛的实验评估表明，该框架在表达操作过程中有效地保留了嘴部形状，突出了 AD-THG 与 SPFEM 集成的实质性好处。

Title: TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents

Authors: Chan Naseeb, Adeel Ashraf Cheema, Hassan Sami, Tayyab Afzal, Muhammad Omair, Usman Habib
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12895
Pdf URL: https://arxiv.org/pdf/2601.12895
Copy Paste: [[2601.12895]] TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents(https://arxiv.org/abs/2601.12895)
Keywords: generative
Abstract: The proliferation of sophisticated generative AI models has significantly escalated the threat of synthetic manipulations in identity documents, particularly through face swapping and text inpainting attacks. This paper presents TwoHead-SwinFPN, a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with Feature Pyramid Network (FPN) and UNet-style decoder, enhanced with Convolutional Block Attention Module (CBAM) for improved feature representation. The model employs a dual-head architecture for joint optimization of detection and segmentation tasks, utilizing uncertainty-weighted multi-task learning. Extensive experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31% accuracy, 90.78% AUC for classification, and 57.24% mean Dice score for localization. The proposed method achieves an F1-score of 88.61% for binary classification while maintaining computational efficiency suitable for real-world deployment through FastAPI implementation. Our comprehensive evaluation includes ablation studies, cross-device generalization analysis, and detailed performance assessment across 10 languages and 3 acquisition devices.
摘要：复杂的生成人工智能模型的激增大大加剧了身份证件中合成操纵的威胁，特别是通过面部交换和文本修复攻击。本文提出了 TwoHead-SwinFPN，这是一种统一的深度学习架构，可同时执行 ID 文档中操作区域的二元分类和精确定位。我们的方法将 Swin Transformer 主干与特征金字塔网络 (FPN) 和 UNet 式解码器集成在一起，并通过卷积块注意力模块 (CBAM) 进行增强，以改进特征表示。该模型采用双头架构，利用不确定性加权多任务学习来联合优化检测和分割任务。 FantasyIDiap 数据集上的大量实验证明了其卓越的性能，分类准确率为 84.31%，AUC 为 90.78%，定位平均 Dice 得分为 57.24%。所提出的方法在二元分类方面实现了 88.61% 的 F1 分数，同时通过 FastAPI 实现保持适合实际部署的计算效率。我们的综合评估包括消融研究、跨设备泛化分析以及跨 10 种语言和 3 种采集设备的详细性能评估。
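
Two quantities in this abstract have compact definitions that are easy to state in code: the Dice score used for localization and an uncertainty-weighted sum of task losses. The sketch below assumes flat binary masks and one common form of uncertainty weighting, exp(-s_i) * L_i + s_i with a learnable log-variance s_i per task. The abstract does not say which variant the paper uses, so that form and both function names are assumptions.

```python
from math import exp


def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flat binary (0/1) masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def uncertainty_weighted(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    `log_vars` holds one log-variance s_i per task; in training these would
    be learnable parameters, here they are plain floats.
    """
    return sum(exp(-s) * loss + s for loss, s in zip(losses, log_vars))
```

A larger s_i shrinks the weight on task i, while the additive s_i term keeps the log-variances from growing without bound.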

Title: Dual-Stream Collaborative Transformer for Image Captioning

Authors: Jun Wan, Jun Liu, Zhihui Lai, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12926
Pdf URL: https://arxiv.org/pdf/2601.12926
Copy Paste: [[2601.12926]] Dual-Stream Collaborative Transformer for Image Captioning(https://arxiv.org/abs/2601.12926)
Keywords: generation
Abstract: Current region feature-based image captioning methods have progressed rapidly and achieved remarkable performance. However, they are still prone to generating irrelevant descriptions due to the lack of contextual information and the over-reliance on generated partial descriptions for predicting the remaining words. In this paper, we propose a Dual-Stream Collaborative Transformer (DSCT) to address this issue by introducing the segmentation feature. The proposed DSCT consolidates and then fuses the region and segmentation features to guide the generation of caption sentences. It contains multiple Pattern-Specific Mutual Attention Encoders (PSMAEs) and Dynamic Nomination Decoders (DNDs). The PSMAE effectively highlights and consolidates the private information of two representations by querying each other. The DND dynamically searches for the most relevant learning blocks to the input textual representations and exploits the homogeneous features between the consolidated region and segmentation features to generate more accurate and descriptive caption sentences. To the best of our knowledge, this is the first study to explore how to fuse different pattern-specific features in a dynamic way to bypass their semantic inconsistencies and spatial misalignment issues for image captioning. The experimental results from popular benchmark datasets demonstrate that our DSCT outperforms the state-of-the-art image captioning models in the literature.

Title: StyMam: A Mamba-Based Generator for Artistic Style Transfer

Authors: Zhou Hong, Rongsheng Hu, Yicheng Di, Xiaolong Xu, Ning Dong, Yihua Shao, Run Ling, Yun Wang, Juqin Wang, Zhanjie Zhang, Ao Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12954
Pdf URL: https://arxiv.org/pdf/2601.12954
Copy Paste: [[2601.12954]] StyMam: A Mamba-Based Generator for Artistic Style Transfer(https://arxiv.org/abs/2601.12954)
Keywords: generative
Abstract: Image style transfer aims to integrate the visual patterns of a specific artistic style into a content image while preserving its content structure. Existing methods mainly rely on generative adversarial networks (GANs) or Stable Diffusion (SD). GAN-based approaches using CNNs or Transformers struggle to jointly capture local and global dependencies, leading to artifacts and disharmonious patterns. SD-based methods reduce such issues but often fail to preserve content structures and suffer from slow inference. To address these issues, we revisit GANs and propose a Mamba-based generator, termed StyMam, to produce high-quality stylized images without introducing artifacts and disharmonious patterns. Specifically, the generator combines a residual dual-path strip scanning mechanism with a channel-reweighted spatial attention module. The former efficiently captures local texture features, while the latter models global dependencies. Finally, extensive qualitative and quantitative experiments demonstrate that the proposed method outperforms state-of-the-art algorithms in both quality and speed.

Title: Early Prediction of Type 2 Diabetes Using Multimodal data and Tabular Transformers

Authors: Sulaiman Khan, Md. Rafiul Biswas, Zubair Shah
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12981
Pdf URL: https://arxiv.org/pdf/2601.12981
Copy Paste: [[2601.12981]] Early Prediction of Type 2 Diabetes Using Multimodal data and Tabular Transformers(https://arxiv.org/abs/2601.12981)
Keywords: generative
Abstract: This study introduces a novel approach for early Type 2 Diabetes Mellitus (T2DM) risk prediction using a tabular transformer (TabTrans) architecture to analyze longitudinal patient data. By processing patients' longitudinal health records and bone-related tabular data, our model captures complex, long-range dependencies in disease progression that conventional methods often overlook. We validated our TabTrans model on a retrospective Qatar BioBank (QBB) cohort of 1,382 subjects, comprising 725 men (146 diabetic, 579 healthy) and 657 women (133 diabetic, 524 healthy). The study integrated electronic health records (EHR) with dual-energy X-ray absorptiometry (DXA) data. To address class imbalance, we employed SMOTE and SMOTE-ENN resampling techniques. The proposed model's performance is evaluated against conventional machine learning (ML) and generative AI models, including Claude 3.5 Sonnet (Anthropic's constitutional AI), GPT-4 (OpenAI's generative pre-trained transformer), and Gemini Pro (Google's multimodal language model). Our TabTrans model demonstrated superior predictive performance, achieving ROC AUC $\geq$ 79.7% for T2DM prediction compared to both generative AI models and conventional ML approaches. Feature interpretation analysis identified key risk indicators, with visceral adipose tissue (VAT) mass and volume, ward bone mineral density (BMD) and bone mineral content (BMC), T and Z-scores, and L1-L4 scores emerging as the most important predictors associated with diabetes development in Qatari adults. These findings demonstrate the significant potential of TabTrans for analyzing complex tabular healthcare data, providing a powerful tool for proactive T2DM management and personalized clinical interventions in the Qatari population. Index Terms: tabular transformers, multimodal data, DXA data, diabetes, T2DM, feature interpretation, tabular data
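
Illustrative note (not from the paper): the abstract names SMOTE as its remedy for class imbalance. The sketch below shows only the core SMOTE idea, interpolating between a minority-class row and one of its nearest minority neighbours. The function name `smote_sample`, the four placeholder features, and the random toy data are assumptions made for illustration; only the class counts come from the abstract, and the study's actual resampling code is not described there.

```python
import numpy as np

def smote_sample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])

# Class counts taken from the abstract: 146 + 133 diabetic, 579 + 524 healthy.
n_diabetic, n_healthy = 146 + 133, 579 + 524
# Placeholder features (random), standing in for real EHR/DXA columns.
toy_minority = np.random.default_rng(1).normal(size=(n_diabetic, 4))
synthetic = smote_sample(toy_minority, n_new=n_healthy - n_diabetic)
```

With these counts the minority class has 279 rows against 1,103 healthy rows, so 824 synthetic rows would be needed to balance the two classes. In practice such resampling is normally applied to the training split only.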

Title: Prototype Learning-Based Few-Shot Segmentation for Low-Light Crack on Concrete Structures

Authors: Yulun Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13059
Pdf URL: https://arxiv.org/pdf/2601.13059
Copy Paste: [[2601.13059]] Prototype Learning-Based Few-Shot Segmentation for Low-Light Crack on Concrete Structures(https://arxiv.org/abs/2601.13059)
Keywords: generation
Abstract: Crack detection is critical for concrete infrastructure safety, but real-world cracks often appear in low-light environments like tunnels and bridge undersides, degrading computer vision segmentation accuracy. Pixel-level annotation of low-light crack images is extremely time-consuming, yet most deep learning methods require large, well-illuminated datasets. We propose a dual-branch prototype learning network integrating Retinex theory with few-shot learning for low-light crack segmentation. Retinex-based reflectance components guide illumination-invariant global representation learning, while metric learning reduces dependence on large annotated datasets. We introduce a cross-similarity prior mask generation module that computes high-dimensional similarities between query and support features to capture crack location and structure, and a multi-scale feature enhancement module that fuses multi-scale features with the prior mask to alleviate spatial inconsistency. Extensive experiments on multiple benchmarks demonstrate consistent state-of-the-art performance under low-light conditions. Code: this https URL.

Title: Recursive Meta-Distillation: An Axiomatic Framework for Iterative Knowledge Refinement

Authors: Aaron R. Flouro, Shawn P. Chadwick
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.13100
Pdf URL: https://arxiv.org/pdf/2601.13100
Copy Paste: [[2601.13100]] Recursive Meta-Distillation: An Axiomatic Framework for Iterative Knowledge Refinement(https://arxiv.org/abs/2601.13100)
Keywords: generation
Abstract: Recent work in probability-domain knowledge distillation has established axiomatic frameworks for temperature scaling, multi-teacher aggregation, and bias-variance trade-offs in single-stage settings. However, the mathematical behavior of recursive or multi-generation distillation remains poorly understood, with prior approaches relying primarily on empirical heuristics. In this work, we introduce an axiomatic and operator-theoretic framework for recursive meta-distillation, formalizing iterative knowledge distillation as a sequence of probability-distribution operators with explicit anchoring to base teachers. We define structural axioms for valid meta-teacher construction and prove the existence of non-trivial operator families satisfying these axioms without specifying particular algorithms or loss functions. Under mild realizability and convexity assumptions, we show that anchored recursive distillation induces contraction in KL divergence, yielding geometric convergence to base teacher distributions and a unique, globally attractive fixed point. The contribution is foundational rather than algorithmic: the framework characterizes when recursive distillation is mathematically well-posed and convergent rather than error-accumulating, independent of model architecture, optimization details, or specific operator instantiations. These results provide a theoretical basis for understanding stability, bias-variance behavior, and failure modes in iterative and multi-teacher distillation under capacity constraints.
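
Illustrative note (not from the paper): the abstract states that anchored recursive distillation contracts KL divergence and converges geometrically to the base teacher, without fixing a particular operator. The sketch below uses one hypothetical anchored operator, a convex mixture q_next = (1 - alpha) * p_base + alpha * q, to show the behaviour numerically. For this operator, convexity of KL in its first argument gives KL(q_next || p_base) <= alpha * KL(q || p_base). The operator, `alpha`, and the random distributions are assumptions.

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

def anchored_step(q, p_base, alpha):
    """Hypothetical anchored operator: pull the current meta-teacher q
    toward the base teacher p_base with anchoring weight (1 - alpha)."""
    return (1.0 - alpha) * p_base + alpha * q

rng = np.random.default_rng(0)
p_base = rng.dirichlet(np.ones(10))  # base teacher distribution (toy)
q = rng.dirichlet(np.ones(10))       # initial meta-teacher (toy)
alpha = 0.5

divergences = [kl(q, p_base)]
for _ in range(8):
    q = anchored_step(q, p_base, alpha)
    divergences.append(kl(q, p_base))

# Contraction ratio between consecutive generations; convexity of KL in its
# first argument bounds each ratio by alpha.
ratios = [b / a for a, b in zip(divergences, divergences[1:])]
```

Every entry of `ratios` stays at or below `alpha`, so the divergence shrinks at least geometrically and the base teacher is the unique fixed point of this particular operator.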

Title: PhaseMark: A Post-hoc, Optimization-Free Watermarking of AI-generated Images in the Latent Frequency Domain

Authors: Sung Ju Lee, Nam Ik Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13128
Pdf URL: https://arxiv.org/pdf/2601.13128
Copy Paste: [[2601.13128]] PhaseMark: A Post-hoc, Optimization-Free Watermarking of AI-generated Images in the Latent Frequency Domain(https://arxiv.org/abs/2601.13128)
Keywords: generation
Abstract: The proliferation of hyper-realistic images from Latent Diffusion Models (LDMs) demands robust watermarking, yet existing post-hoc methods are prohibitively slow due to iterative optimization or inversion processes. We introduce PhaseMark, a single-shot, optimization-free framework that directly modulates the phase in the VAE latent frequency domain. This approach makes PhaseMark thousands of times faster than optimization-based techniques while achieving state-of-the-art resilience against severe attacks, including regeneration, without degrading image quality. We analyze four modulation variants, revealing a clear performance-quality trade-off. PhaseMark demonstrates a new paradigm where efficient, resilient watermarking is achieved by exploiting intrinsic latent properties.
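
Illustrative note (not from the paper): the abstract describes modulating phase in the latent frequency domain in a single pass, but does not spell out the four modulation variants. The sketch below is a generic phase-coding example rather than PhaseMark itself. It sets the phase of a few chosen frequency bins to +pi/2 or -pi/2 while leaving their magnitudes untouched, then reads the bits back from the sign of the phase. The array standing in for a VAE latent channel, the bin positions, and the function names are assumptions.

```python
import numpy as np

def embed_bits(latent, bits, positions):
    """Set the phase of chosen frequency bins to +pi/2 (bit 1) or -pi/2 (bit 0),
    keeping each bin's magnitude unchanged."""
    spec = np.fft.rfft2(latent)
    for bit, (r, c) in zip(bits, positions):
        phase = np.pi / 2 if bit else -np.pi / 2
        spec[r, c] = np.abs(spec[r, c]) * np.exp(1j * phase)
    return np.fft.irfft2(spec, s=latent.shape)

def read_bits(latent, positions):
    """Recover bits from the sign of the phase at the same bins."""
    spec = np.fft.rfft2(latent)
    return [int(np.angle(spec[r, c]) > 0) for r, c in positions]

rng = np.random.default_rng(0)
latent = rng.normal(size=(32, 32))  # stand-in for one VAE latent channel
bits = [1, 0, 1, 1, 0, 0, 1, 0]
# For width 32, columns 1..15 of the half-spectrum carry no
# conjugate-symmetry constraint, so their phases can be set freely.
positions = [(3, c) for c in range(2, 10)]
marked = embed_bits(latent, bits, positions)
decoded = read_bits(marked, positions)
```

Embedding here costs one forward and one inverse FFT with no iterative optimization. That is the property the abstract emphasizes, although robustness to attacks such as regeneration is outside what this toy example can show.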

Title: FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference

Authors: Chaeyoung Jung, Youngjoon Jang, Seungwoo Lee, Joon Son Chung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.13143
Pdf URL: https://arxiv.org/pdf/2601.13143
Copy Paste: [[2601.13143]] FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference(https://arxiv.org/abs/2601.13143)
Keywords: generation
Abstract: In this work, we present FastAV, the first token pruning framework tailored for audio-visual large language models (AV-LLMs). While token pruning has been actively explored in standard large language models (LLMs) and vision-language models (LVLMs), its application to AV-LLMs has received little attention, even though multimodal integration substantially increases their token demands. To address this gap, we introduce a pruning strategy that utilizes attention weights to identify tokens emphasized at different stages and estimates their importance. Building on this analysis, FastAV applies a two-stage pruning strategy: (1) global pruning in intermediate layers to remove broadly less influential tokens, and (2) fine pruning in later layers considering the impact on next token generation. Notably, our method does not rely on full attention maps, which makes it fully compatible with efficient attention mechanisms such as FlashAttention. Extensive experiments demonstrate that FastAV reduces FLOPs by more than 40% on two representative AV-LLMs, while preserving or even improving model performance.
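
Illustrative note (not from the paper): the abstract describes ranking tokens by attention weights without materializing full attention maps. The sketch below shows one simple way this could look: score every token against a single query vector (for example, the last token's query), then keep the top-scoring fraction in their original order. Using the token states directly as keys, the `keep_ratio` value, and the random placeholder tokens are all assumptions; FastAV's actual two-stage criteria are not specified in the abstract.

```python
import numpy as np

def prune_tokens(hidden, query, keep_ratio=0.6):
    """Keep the tokens that receive the most attention from one query vector.
    hidden: (num_tokens, dim) token states; query: (dim,) query vector."""
    scores = hidden @ query / np.sqrt(hidden.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # softmax over tokens
    n_keep = max(1, int(round(keep_ratio * len(hidden))))
    # Indices of the top-n_keep weights, re-sorted to preserve token order.
    keep = np.sort(np.argsort(weights)[-n_keep:])
    return hidden[keep], keep

rng = np.random.default_rng(0)
tokens = rng.normal(size=(50, 16))  # placeholder audio-visual token states
query = rng.normal(size=16)         # placeholder query vector
pruned, kept_idx = prune_tokens(tokens, query, keep_ratio=0.6)
```

Because only one row of attention scores is computed, this kind of scoring does not require the full token-by-token attention matrix, which is the compatibility point the abstract raises for kernels such as FlashAttention.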

Title: LAViG-FLOW: Latent Autoregressive Video Generation for Fluid Flow Simulations

Authors: Vittoria De Pellegrini, Tariq Alkhalifah
Subjects: cs.LG, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2601.13190
Pdf URL: https://arxiv.org/pdf/2601.13190
Copy Paste: [[2601.13190]] LAViG-FLOW: Latent Autoregressive Video Generation for Fluid Flow Simulations(https://arxiv.org/abs/2601.13190)
Keywords: generation
Abstract: Modeling and forecasting subsurface multiphase fluid flow fields underpin applications ranging from geological CO2 sequestration (GCS) operations to geothermal production. This is essential for ensuring both operational performance and long-term safety. While high-fidelity multiphase simulators are widely used for this purpose, they become prohibitively expensive once many forward runs are required for inversion and uncertainty quantification. To tackle this challenge, we propose LAViG-FLOW, a latent autoregressive video generation diffusion framework that explicitly learns the coupled evolution of saturation and pressure fields. Each state variable is compressed by a dedicated 2D autoencoder, and a Video Diffusion Transformer (VDiT) models their coupled distribution across time. We first train the model on a given time horizon to learn their coupled relationship and then fine-tune it autoregressively so it can extrapolate beyond the observed time window. Evaluated on an open-source CO2 sequestration dataset, LAViG-FLOW generates saturation and pressure fields that stay consistent across time while running orders of magnitude faster than traditional numerical solvers.

Title: A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms

Authors: Yapeng Li, Jiakuo Yu, Zhixin Liu, Xinnan Liu, Jing Yu, Songze Li, Tonghua Su
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.13243
Pdf URL: https://arxiv.org/pdf/2601.13243
Copy Paste: [[2601.13243]] A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms(https://arxiv.org/abs/2601.13243)
Keywords: generation
Abstract: Large Language Models (LLMs) are increasingly deployed as reasoning systems, where reasoning paradigms - such as Chain-of-Thought (CoT) and multi-agent systems (MAS) - play a critical role, yet their relative effectiveness and cost-accuracy trade-offs remain poorly understood. In this work, we conduct a comprehensive and unified evaluation of reasoning paradigms, spanning direct single-model generation, CoT-augmented single-model reasoning, and representative MAS workflows, characterizing their reasoning performance across a diverse suite of closed-form benchmarks. Beyond overall performance, we probe role-specific capability demands in MAS using targeted role isolation analyses, and analyze cost-accuracy trade-offs to identify which MAS workflows offer a favorable balance between cost and accuracy, and which incur prohibitive overhead for marginal gains. We further introduce MIMeBench, a new open-ended benchmark that targets two foundational yet underexplored semantic capabilities - semantic abstraction and contrastive discrimination - thereby providing an alternative evaluation axis beyond closed-form accuracy and enabling fine-grained assessment of semantic competence that is difficult to capture with existing benchmarks. Our results show that increased structural complexity does not consistently lead to improved reasoning performance, with its benefits being highly dependent on the properties and suitability of the reasoning paradigm itself. The code is released at this https URL.

Title: Enginuity: Building an Open Multi-Domain Dataset of Complex Engineering Diagrams

Authors: Ethan Seefried, Prahitha Movva, Naga Harshita Marupaka, Tilak Kasturi, Tirthankar Ghosal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13299
Pdf URL: https://arxiv.org/pdf/2601.13299
Copy Paste: [[2601.13299]] Enginuity: Building an Open Multi-Domain Dataset of Complex Engineering Diagrams(https://arxiv.org/abs/2601.13299)
Keywords: generation
Abstract: We propose Enginuity - the first open, large-scale, multi-domain engineering diagram dataset with comprehensive structural annotations designed for automated diagram parsing. By capturing hierarchical component relationships, connections, and semantic elements across diverse engineering domains, our proposed dataset would enable multimodal large language models to address critical downstream tasks including structured diagram parsing, cross-modal information retrieval, and AI-assisted engineering simulation. Enginuity would be transformative for AI for Scientific Discovery by enabling artificial intelligence systems to comprehend and manipulate the visual-structural knowledge embedded in engineering diagrams, breaking down a fundamental barrier that currently prevents AI from fully participating in scientific workflows where diagram interpretation, technical drawing analysis, and visual reasoning are essential for hypothesis generation, experimental design, and discovery.

Title: Spherical Geometry Diffusion: Generating High-quality 3D Face Geometry via Sphere-anchored Representations

Authors: Junyi Zhang, Yiming Wang, Yunhong Lu, Qichao Wang, Wenzhe Qian, Xiaoyin Xu, David Gu, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13371
Pdf URL: https://arxiv.org/pdf/2601.13371
Copy Paste: [[2601.13371]] Spherical Geometry Diffusion: Generating High-quality 3D Face Geometry via Sphere-anchored Representations(https://arxiv.org/abs/2601.13371)
Keywords: generation, generative
Abstract: A fundamental challenge in text-to-3D face generation is achieving high-quality geometry. The core difficulty lies in the arbitrary and intricate distribution of vertices in 3D space, making it challenging for existing models to establish clean connectivity and resulting in suboptimal geometry. To address this, our core insight is to simplify the underlying geometric structure by constraining the distribution onto a simple and regular manifold, a topological sphere. Building on this, we first propose the Spherical Geometry Representation, a novel face representation that anchors geometric signals to uniform spherical coordinates. This guarantees a regular point distribution, from which the mesh connectivity can be robustly reconstructed. Critically, this canonical sphere can be seamlessly unwrapped into a 2D map, creating a perfect synergy with powerful 2D generative models. We then introduce Spherical Geometry Diffusion, a conditional diffusion framework built upon this 2D map. It enables diverse and controllable generation by jointly modeling geometry and texture, where the geometry explicitly conditions the texture synthesis process. Our method's effectiveness is demonstrated through its success in a wide range of tasks: text-to-3D generation, face reconstruction, and text-based 3D editing. Extensive experiments show that our approach substantially outperforms existing methods in geometric quality, textual fidelity, and inference efficiency.
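
Illustrative note (not from the paper): the abstract says the canonical sphere can be unwrapped into a 2D map, but does not state which parameterization is used. The sketch below assumes a plain longitude/latitude (equirectangular) unwrapping to show how points on a unit sphere map to 2D coordinates and back. The function names and random sample points are hypothetical.

```python
import numpy as np

def sphere_to_uv(points):
    """Map unit-sphere points (x, y, z) to (u, v) in [0, 1] x [0, 1]
    using longitude/latitude (an equirectangular unwrapping)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    lon = np.arctan2(y, x)                  # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(z, -1.0, 1.0))  # latitude in [-pi/2, pi/2]
    u = (lon + np.pi) / (2 * np.pi)
    v = (lat + np.pi / 2) / np.pi
    return np.stack([u, v], axis=1)

def uv_to_sphere(uv):
    """Inverse mapping from (u, v) back to unit-sphere coordinates."""
    lon = uv[:, 0] * 2 * np.pi - np.pi
    lat = uv[:, 1] * np.pi - np.pi / 2
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=1,
    )

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)  # project onto the unit sphere
uv = sphere_to_uv(pts)
roundtrip = uv_to_sphere(uv)
```

Under such a mapping, each pixel of the 2D map corresponds to a fixed direction on the sphere, which is what lets a 2D generative model operate on a regular grid.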

Title: Leveraging Transformer Decoder for Automotive Radar Object Detection

Authors: Changxu Zhang, Zhaoze Wang, Tai Fei, Christopher Grimm, Yi Jin, Claas Tebruegge, Ernst Warsitz, Markus Gardill
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2601.13386
Pdf URL: https://arxiv.org/pdf/2601.13386
Copy Paste: [[2601.13386]] Leveraging Transformer Decoder for Automotive Radar Object Detection(https://arxiv.org/abs/2601.13386)
Keywords: generation
Abstract: In this paper, we present a Transformer-based architecture for 3D radar object detection that uses a novel Transformer Decoder as the prediction head to directly regress 3D bounding boxes and class scores from radar feature representations. To bridge multi-scale radar features and the decoder, we propose Pyramid Token Fusion (PTF), a lightweight module that converts a feature pyramid into a unified, scale-aware token sequence. By formulating detection as a set prediction problem with learnable object queries and positional encodings, our design models long-range spatial-temporal correlations and cross-feature interactions. This approach eliminates dense proposal generation and heuristic post-processing such as extensive non-maximum suppression (NMS) tuning. We evaluate the proposed framework on the RADDet dataset, where it achieves significant improvements over state-of-the-art radar-only baselines.

Title: Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics

Authors: Peter A. Massih, Eric Cosatto
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13401
Pdf URL: https://arxiv.org/pdf/2601.13401
Copy Paste: [[2601.13401]] Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics(https://arxiv.org/abs/2601.13401)
Keywords: generation
Abstract: Current Vision-Language Models (VLMs) fail at quantitative spatial reasoning because their architectures destroy pixel-level information required for counting and measurements. Vision encoders compress images through patch embeddings, reducing spatial indexing and losing the precise pixel-level tracking required for accurate counting. We present two contributions to address this fundamental limitation. First, we introduce SQuID (Satellite Quantitative Intelligence Dataset), a benchmark of 2,000 satellite image Question-Answer pairs with both numerical range and categorical answers, designed to evaluate quantitative spatial reasoning. The dataset spans three difficulty tiers with annotations automatically generated from human labels and their learned variability. Second, we propose QVLM (Quantitative Vision-Language Model), a code-generation architecture that maintains pixel precision by decoupling language understanding from visual analysis. Instead of encoding images into embeddings, QVLM generates executable code that first calls a segmentation model to obtain pixel-level masks, then operates directly on these masks, preserving spatial indexing throughout the reasoning process. Our experiments show that QVLM using GPT-5 as coder achieves 42.0% accuracy on SQuID compared to 28.1% for a VLM prompted with image-question pairs. Our work reveals that decoupling language understanding from visual analysis enables better accuracy on quantitative spatial reasoning tasks.
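
Illustrative note (not from the paper): the abstract says QVLM generates code that operates directly on pixel-level masks. The sketch below shows the kind of mask arithmetic such generated code might perform: an exact pixel count converted to ground area, and a count of connected regions. The hand-built mask, the `metres_per_pixel` value, and both function names are assumptions. In the described system the mask would come from a segmentation model, and the code would be produced by the language model.

```python
import numpy as np

def mask_area_m2(mask, metres_per_pixel):
    """Ground area covered by a boolean mask, from an exact pixel count."""
    return int(mask.sum()) * metres_per_pixel ** 2

def count_objects(mask):
    """Count 4-connected components in a boolean mask with a flood fill."""
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    count = 0
    for r0 in range(h):
        for c0 in range(w):
            if not mask[r0, c0] or seen[r0, c0]:
                continue
            count += 1  # found an unvisited object; flood-fill it
            stack = [(r0, c0)]
            seen[r0, c0] = True
            while stack:
                r, c = stack.pop()
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] and not seen[nr, nc]:
                        seen[nr, nc] = True
                        stack.append((nr, nc))
    return count

mask = np.zeros((8, 8), dtype=bool)  # stand-in for a segmentation model's output
mask[1:3, 1:3] = True                # object 1: 4 pixels
mask[5:7, 4:8] = True                # object 2: 8 pixels
n_objects = count_objects(mask)
area = mask_area_m2(mask, metres_per_pixel=0.5)
```

Both quantities are computed from exact pixel indices rather than from patch embeddings, which is the distinction the abstract draws between QVLM and a directly prompted VLM.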

Title: Diffusion Representations for Fine-Grained Image Classification: A Marine Plankton Case Study

Authors: A. Nieto Juscafresa, Á. Mazcuñán Herreros, J. Sullivan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13416
Pdf URL: https://arxiv.org/pdf/2601.13416
Copy Paste: [[2601.13416]] Diffusion Representations for Fine-Grained Image Classification: A Marine Plankton Case Study(https://arxiv.org/abs/2601.13416)
Keywords: generation, generative
Abstract: Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each pair. We evaluate this in a real-world plankton-monitoring setting with practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and Macro F1 under substantial distribution shift.

Title: BladeSDF : Unconditional and Conditional Generative Modeling of Representative Blade Geometries Using Signed Distance Functions

Authors: Ashish S. Nair, Sandipp Krishnan Ravi, Itzel Salgado, Changjie Sun, Sayan Ghosh, Liping Wang
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2601.13445
Pdf URL: https://arxiv.org/pdf/2601.13445
Copy Paste: [[2601.13445]] BladeSDF : Unconditional and Conditional Generative Modeling of Representative Blade Geometries Using Signed Distance Functions(https://arxiv.org/abs/2601.13445)
Keywords: generation, generative
Abstract: Generative AI has emerged as a transformative paradigm in engineering design, enabling automated synthesis and reconstruction of complex 3D geometries while preserving feasibility and performance relevance. This paper introduces a domain-specific implicit generative framework for turbine blade geometry using DeepSDF, addressing critical gaps in performance-aware modeling and manufacturable design generation. The proposed method leverages a continuous signed distance function (SDF) representation to reconstruct and generate smooth, watertight geometries with quantified accuracy. It establishes an interpretable, near-Gaussian latent space that aligns with blade-relevant parameters, such as taper and chord ratios, enabling controlled exploration and unconditional synthesis through interpolation and Gaussian sampling. In addition, a compact neural network maps engineering descriptors, such as maximum directional strains, to latent codes, facilitating the generation of performance-informed geometry. The framework achieves high reconstruction fidelity, with surface distance errors concentrated within $1\%$ of the maximum blade dimension, and demonstrates robust generalization to unseen designs. By integrating constraints, objectives, and performance metrics, this approach advances beyond traditional 2D-guided or unconstrained 3D pipelines, offering a practical and interpretable solution for data-driven turbine blade modeling and concept generation.

Title: MN-TSG:Continuous Time Series Generation with Irregular Observations

Authors: Xu Zhang, Junwei Deng, Chang Xu, Hao Li, Jiang Bian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13534
Pdf URL: https://arxiv.org/pdf/2601.13534
Copy Paste: [[2601.13534]] MN-TSG:Continuous Time Series Generation with Irregular Observations(https://arxiv.org/abs/2601.13534)
Keywords: generation
Abstract: Time series generation (TSG) plays a critical role in a wide range of domains, such as healthcare. However, most existing methods assume regularly sampled observations and fixed output resolutions, which are often misaligned with real-world scenarios where data are irregularly sampled and sparsely observed. This mismatch is particularly problematic in applications such as clinical monitoring, where irregular measurements must support downstream tasks requiring continuous and high-resolution time series. Neural Controlled Differential Equations (NCDEs) have shown strong potential for modeling irregular time series, yet they still face challenges in capturing complex dynamic temporal patterns and supporting continuous TSG. To address these limitations, we propose MN-TSG, a novel framework that explores Mixture-of-Experts (MoE)-based NCDEs and integrates them with existing TSG models for irregular and continuous generation tasks. The core of MN-TSG lies in a MoE-NCDE architecture with dynamically parameterized expert functions and a decoupled design that facilitates more effective optimization of MoE dynamics. Furthermore, we leverage existing TSG models to learn the joint distribution over the mixture of experts and the generated time series. This enables the framework not only to generate new samples, but also to produce appropriate expert configurations tailored to each sample, thereby supporting refined continuous TSG. Extensive experiments on ten public and synthetic datasets demonstrate the effectiveness of MN-TSG, consistently outperforming strong TSG baselines on both irregular-to-regular and irregular-to-continuous generation tasks.
摘要：时间序列生成 (TSG) 在医疗保健等广泛领域发挥着关键作用。然而，大多数现有方法假设定期采样观测值和固定输出分辨率，这通常与数据不规则采样和稀疏观测的现实场景不一致。这种不匹配在临床监测等应用中尤其成问题，其中不规则测量必须支持需要连续和高分辨率时间序列的下游任务。神经控制微分方程 (NCDE) 在建模不规则时间序列方面显示出强大的潜力，但它们在捕获复杂的动态时间模式和支持连续 TSG 方面仍然面临挑战。为了解决这些限制，我们提出了 MN-TSG，这是一种新颖的框架，它探索基于专家混合 (MoE) 的 NCDE，并将其与现有的 TSG 模型集成，以执行不规则和连续的生成任务。 MN-TSG 的核心在于 MoE-NCDE 架构，具有动态参数化专家功能和解耦设计，有助于更有效地优化 MoE 动态。此外，我们利用现有的 TSG 模型来学习专家混合和生成的时间序列的联合分布。这使得该框架不仅能够生成新样本，还能生成针对每个样本量身定制的适当专家配置，从而支持精细化的连续 TSG。对 10 个公共和合成数据集的广泛实验证明了 MN-TSG 的有效性，在不规则到规则和不规则到连续生成任务上始终优于强大的 TSG 基线。

Title: DiffFace-Edit: A Diffusion-Based Facial Dataset for Forgery-Semantic Driven Deepfake Detection Analysis

Authors: Feng Ding, Wenhui Yi, Xinan He, Mengyao Xiao, Jianfeng Xu, Jianqiang Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13551
Pdf URL: https://arxiv.org/pdf/2601.13551
Copy Paste: [[2601.13551]] DiffFace-Edit: A Diffusion-Based Facial Dataset for Forgery-Semantic Driven Deepfake Detection Analysis(https://arxiv.org/abs/2601.13551)
Keywords: generative
Abstract: Generative models now produce imperceptible, fine-grained manipulated faces, posing significant privacy risks. However, existing AI-generated face datasets generally lack focus on samples with fine-grained regional manipulations. Furthermore, no researchers have yet studied the real impact of splice attacks, which occur between real and manipulated samples, on detectors. We refer to these as detector-evasive samples. Based on this, we introduce the DiffFace-Edit dataset, which has the following advantages: 1) It contains over two million AI-generated fake images. 2) It features edits across eight facial regions (e.g., eyes, nose) and includes a richer variety of editing combinations, such as single-region and multi-region edits. Additionally, we specifically analyze the impact of detector-evasive samples on detection models. We conduct a comprehensive analysis of the dataset and propose a cross-domain evaluation that combines IMDL methods. Dataset will be available at this https URL.
摘要：生成模型现在可以生成难以察觉的、细粒度的操纵面孔，带来重大的隐私风险。然而，现有的人工智能生成的人脸数据集通常缺乏对具有细粒度区域操作的样本的关注。此外，还没有研究人员研究过真实样本和操纵样本之间发生的拼接攻击对检测器的真正影响。我们将这些称为探测器规避样本。基于此，我们引入了 DiffFace-Edit 数据集，它具有以下优点：1）它包含超过 200 万张 AI 生成的假图像。 2）它具有跨八个面部区域（例如眼睛、鼻子）的编辑功能，并且包括更丰富的编辑组合，例如单区域和多区域编辑。此外，我们还专门分析了检测器规避样本对检测模型的影响。我们对数据集进行了全面分析，并提出了结合 IMDL 方法的跨领域评估。数据集将在此 https URL 中提供。

Title: Multi-objective fluorescent molecule design with a data-physics dual-driven generative framework

Authors: Yanheng Li, Zhichen Pu, Lijiang Yang, Zehao Zhou, Yi Qin Gao
Subjects: cs.LG, cs.AI, physics.chem-ph, q-bio.BM
Abstract URL: https://arxiv.org/abs/2601.13564
Pdf URL: https://arxiv.org/pdf/2601.13564
Copy Paste: [[2601.13564]] Multi-objective fluorescent molecule design with a data-physics dual-driven generative framework(https://arxiv.org/abs/2601.13564)
Keywords: generative
Abstract: Designing fluorescent small molecules with tailored optical and physicochemical properties requires navigating vast, underexplored chemical space while satisfying multiple objectives and constraints. Conventional generate-score-screen approaches become impractical under such realistic design specifications, owing to their low search efficiency, unreliable generalizability of machine-learning prediction, and the prohibitive cost of quantum chemical calculation. Here we present LUMOS, a data-and-physics driven framework for inverse design of fluorescent molecules. LUMOS couples generator and predictor within a shared latent representation, enabling direct specification-to-molecule design and efficient exploration. Moreover, LUMOS combines neural networks with a fast time-dependent density functional theory (TD-DFT) calculation workflow to build a suite of complementary predictors spanning different trade-offs in speed, accuracy, and generalizability, enabling reliable property prediction across diverse scenarios. Finally, LUMOS employs a property-guided diffusion model integrated with multi-objective evolutionary algorithms, enabling de novo design and molecular optimization under multiple objectives and constraints. Across comprehensive benchmarks, LUMOS consistently outperforms baseline models in terms of accuracy, generalizability and physical plausibility for fluorescence property prediction, and demonstrates superior performance in multi-objective scaffold- and fragment-level molecular optimization. Further validation using TD-DFT and molecular dynamics (MD) simulations demonstrates that LUMOS can generate valid fluorophores that meet various target specifications. Overall, these results establish LUMOS as a data-physics dual-driven framework for general fluorophore inverse design.
摘要：设计具有定制光学和物理化学性质的荧光小分子需要探索广阔的、尚未开发的化学空间，同时满足多个目标和约束。在如此现实的设计规范下，传统的生成-评分-筛选方法变得不切实际，因为它们的搜索效率低、机器学习预测的通用性不可靠以及量子化学计算的成本高昂。在这里，我们介绍 LUMOS，一种用于荧光分子逆向设计的数据和物理驱动框架。 LUMOS 在共享潜在表示中耦合生成器和预测器，从而实现直接规范到分子设计和高效探索。此外，LUMOS 将神经网络与快速含时密度泛函理论 (TD-DFT) 计算工作流程相结合，构建一套涵盖速度、准确性和泛化性不同权衡的互补预测器，从而实现跨不同场景的可靠属性预测。最后，LUMOS 采用与多目标进化算法集成的属性引导扩散模型，实现多目标和约束下的从头设计和分子优化。在综合基准测试中，LUMOS 在荧光特性预测的准确性、通用性和物理合理性方面始终优于基线模型，并在多目标支架和片段级分子优化方面表现出卓越的性能。使用 TD-DFT 和分子动力学 (MD) 模拟进行的进一步验证表明，LUMOS 可以生成满足各种目标规格的有效荧光团。总体而言，这些结果将 LUMOS 确立为通用荧光团逆设计的数据物理双驱动框架。

Title: Diffusion In Diffusion: Breaking the Autoregressive Bottleneck in Block Diffusion Models

Authors: Linrui Ma, Yufei Cui, Kai Han, Yunhe Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13599
Pdf URL: https://arxiv.org/pdf/2601.13599
Copy Paste: [[2601.13599]] Diffusion In Diffusion: Breaking the Autoregressive Bottleneck in Block Diffusion Models(https://arxiv.org/abs/2601.13599)
Keywords: generative
Abstract: Block diffusion language models, operating as semi-autoregressive paradigms, combine the strengths of both autoregressive and diffusion paradigms. However, their strict unidirectional block dependencies introduce irreversibility and sacrifice the global planning capabilities for which diffusion models are renowned. In order to address these issues, we propose Diffusion in Diffusion, a draft-then-refine framework designed to overcome the irreversibility and myopia problems inherent in block diffusion models. Our approach first employs block diffusion to generate rapid drafts using small blocks, then refines these drafts through global bidirectional diffusion with a larger bidirectional receptive field. We utilise snapshot confidence remasking to identify the most critical tokens that require modification, and apply mix-scale training to expand the block diffusion model's global capabilities. Empirical results demonstrate that our approach sets a new benchmark for discrete diffusion models on the OpenWebText dataset. Using just 26% of the fine-tuning budget of baseline models, we reduce generative perplexity from 25.7 to 21.9, significantly narrowing the performance gap with autoregressive models.
摘要：作为半自回归范式运行的块扩散语言模型结合了自回归和扩散范式的优点。然而，它们严格的单向块依赖性引入了不可逆性，并牺牲了扩散模型著名的全局规划能力。为了解决这些问题，我们提出了 Diffusion in Diffusion，这是一个先草拟后细化的框架，旨在克服块扩散模型固有的不可逆性和近视问题。我们的方法首先采用块扩散来使用小块生成快速草稿，然后通过具有更大双向感受野的全局双向扩散来细化这些草稿。我们利用快照置信度重新屏蔽来识别需要修改的最关键令牌，并应用混合规模训练来扩展块扩散模型的全局能力。实证结果表明，我们的方法为 OpenWebText 数据集上的离散扩散模型设定了新的基准。仅使用基线模型微调预算的 26%，我们将生成复杂度从 25.7 降低到 21.9，显着缩小了与自回归模型的性能差距。

Title: ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch

Authors: Zheng Liu, Honglin Lin, Chonghan Qin, Xiaoyang Wang, Xin Gao, Yu Li, Mengzhang Cai, Yun Zhu, Zhanping Zhong, Qizhi Pei, Zhuoshi Pan, Xiaoran Shang, Bin Cui, Conghui He, Wentao Zhang, Lijun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13606
Pdf URL: https://arxiv.org/pdf/2601.13606
Copy Paste: [[2601.13606]] ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch(https://arxiv.org/abs/2601.13606)
Keywords: generation
Abstract: Chart reasoning is a critical capability for Vision Language Models (VLMs). However, the development of open-source models is severely hindered by the lack of high-quality training data. Existing datasets suffer from a dual challenge: synthetic charts are often simplistic and repetitive, while the associated QA pairs are prone to hallucinations and lack the reasoning depth required for complex tasks. To bridge this gap, we propose ChartVerse, a scalable framework designed to synthesize complex charts and reliable reasoning data from scratch. (1) To address the bottleneck of simple patterns, we first introduce Rollout Posterior Entropy (RPE), a novel metric that quantifies chart complexity. Guided by RPE, we develop a complexity-aware chart coder to autonomously synthesize diverse, high-complexity charts via executable programs. (2) To guarantee reasoning rigor, we develop truth-anchored inverse QA synthesis. Diverging from standard generation, we adopt an answer-first paradigm: we extract deterministic answers directly from the source code, generate questions conditioned on these anchors, and enforce strict consistency verification. To further elevate difficulty and reasoning depth, we filter samples based on model fail-rate and distill high-quality Chain-of-Thought (CoT) reasoning. We curate ChartVerse-SFT-600K and ChartVerse-RL-40K using Qwen3-VL-30B-A3B-Thinking as the teacher. Experimental results demonstrate that ChartVerse-8B achieves state-of-the-art performance, notably surpassing its teacher and rivaling the stronger Qwen3-VL-32B-Thinking.
摘要：图表推理是视觉语言模型 (VLM) 的一项关键功能。然而，高质量训练数据的缺乏严重阻碍了开源模型的发展。现有数据集面临双重挑战：合成图表通常过于简单且重复，而相关的 QA 对很容易产生幻觉，并且缺乏复杂任务所需的推理深度。为了弥补这一差距，我们提出了 ChartVerse，这是一个可扩展的框架，旨在从头开始合成复杂的图表和可靠的推理数据。 (1) 为了解决简单模式的瓶颈，我们首先引入 Rollout 后验熵 (RPE)，这是一种量化图表复杂性的新颖指标。在 RPE 的指导下，我们开发了复杂性感知图表编码器，通过可执行程序自动合成多样化、高复杂性的图表。 (2) 为了保证推理的严谨性，我们开发了真相锚定的逆向问答合成。与标准生成不同，我们采用答案优先的范式：我们直接从源代码中提取确定性答案，根据这些锚生成条件问题，并强制执行严格的一致性验证。为了进一步提升难度和推理深度，我们根据模型失败率过滤样本，并提炼出高质量的思想链（CoT）推理。我们以 Qwen3-VL-30B-A3B-Thinking 为老师来策划 ChartVerse-SFT-600K 和 ChartVerse-RL-40K。实验结果表明，ChartVerse-8B 实现了最先进的性能，明显超越了其老师，并可与更强大的 Qwen3-VL-32B-Thinking 相媲美。

Title: Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation

Authors: Boyuan Cao, Xingbo Yao, Chenhui Wang, Jiaxin Ye, Yujie Wei, Hongming Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13683
Pdf URL: https://arxiv.org/pdf/2601.13683
Copy Paste: [[2601.13683]] Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation(https://arxiv.org/abs/2601.13683)
Keywords: generation, generative
Abstract: Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been adopted to reduce computational cost; unfortunately, the resulting linear diffusion transformer (LiT) models often come at the expense of generative performance, frequently producing over-smoothed attention weights that limit expressiveness. In this work, we introduce Dynamic Differential Linear Attention (DyDiLA), a novel linear attention formulation that enhances the effectiveness of LiTs by mitigating the over-smoothing issue and improving generation quality. Specifically, the novelty of DyDiLA lies in three key designs: (i) a dynamic projection module, which facilitates the decoupling of token representations by learning with dynamically assigned knowledge; (ii) a dynamic measure kernel, which provides a better similarity measurement to capture fine-grained semantic distinctions between tokens by dynamically assigning kernel functions for token processing; and (iii) a token differential operator, which enables more robust query-to-key retrieval by calculating the differences between the tokens and their corresponding information redundancy produced by the dynamic measure kernel. To capitalize on DyDiLA, we introduce a refined LiT, termed DyDi-LiT, that systematically incorporates our advancements. Extensive experiments show that DyDi-LiT consistently outperforms current state-of-the-art (SOTA) models across multiple metrics, underscoring its strong practical potential.
摘要：扩散 Transformer（DiT）已成为高保真图像生成的强大架构，但自注意力的二次成本构成了主要的可扩展性瓶颈。为了解决这个问题，采用了线性注意力机制来降低计算成本；不幸的是，由此产生的线性扩散 Transformer（LiT）模型通常以牺牲生成性能为代价，经常产生过度平滑的注意力权重，从而限制了表达能力。在这项工作中，我们引入了动态微分线性注意力（DyDiLA），这是一种新颖的线性注意力公式，它通过缓解过度平滑问题和提高生成质量来增强 LiT 的有效性。具体来说，DyDiLA 的新颖性在于三个关键设计：（i）动态投影模块，通过动态分配的知识进行学习，有助于解耦 token 表示； (ii) 动态测量内核，它提供了更好的相似性测量，通过动态分配用于标记处理的内核函数来捕获标记之间的细粒度语义区别； (iii)令牌差分算子，它通过计算令牌之间的差异以及动态测量内核产生的相应信息冗余来实现更鲁棒的查询到键检索。为了利用 DyDiLA，我们引入了一种改进的 LiT，称为 DyDi-LiT，它系统地整合了我们的进步。大量实验表明，DyDi-LiT 在多个指标上始终优于当前最先进的 (SOTA) 模型，凸显了其强大的实际潜力。

Title: Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs

Authors: Yujin Jo, Sangyoon Bae, Taesup Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13707
Pdf URL: https://arxiv.org/pdf/2601.13707
Copy Paste: [[2601.13707]] Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs(https://arxiv.org/abs/2601.13707)
Keywords: generation
Abstract: Hallucinations in large vision-language models (LVLMs) often arise when language priors dominate over visual evidence, causing object misidentification and visually inconsistent descriptions. We address this issue by framing hallucination mitigation as contrastive guidance, steering generation toward visually grounded and semantically faithful text. This approach regulates the model's internal behavior by reducing over-dependence on language priors and contrasting visually grounded with language-only representations. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths in a single forward computation. This integration enables computationally efficient guidance directly embedded in the model's representation contextualization. To correct approximation bias introduced by the single-pass formulation, we further apply an orthogonalized correction that removes components aligned with the language-only path, selectively amplifying visual contributions. Experiments on the CHAIR and POPE benchmarks show that ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost. Our method establishes a principled and efficient alternative, reducing latency by up to 2x compared to prior contrastive decoding methods that require multiple forward passes.
摘要：当语言先验支配视觉证据时，大型视觉语言模型（LVLM）中经常会出现幻觉，导致对象错误识别和视觉上不一致的描述。我们通过将幻觉缓解作为对比指导来解决这个问题，引导生成视觉基础和语义忠实的文本。这种方法通过减少对语言先验的过度依赖并将基于视觉的表示与仅语言表示进行对比来调节模型的内部行为。我们提出了注意力空间对比指导（ACG），这是一种单通道机制，在自注意力层中运行，以在单个前向计算中构建视觉语言和仅语言注意力路径。这种集成可以将计算效率高的指导直接嵌入到模型的表示情境化中。为了纠正单通道公式引入的近似偏差，我们进一步应用正交校正，删除与仅语言路径对齐的组件，选择性地放大视觉贡献。 CHAIR 和 POPE 基准测试表明，ACG 实现了最先进的忠实度和字幕质量，同时显着降低了计算成本。我们的方法建立了一种有原则且高效的替代方案，与之前需要多次前向传递的对比解码方法相比，延迟最多减少了 2 倍。
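
Illustrative sketch (not from the paper): the ACG abstract describes an orthogonalized correction that removes components aligned with the language-only path. One generic way to express "removing the aligned component" is vector rejection, shown here in plain Python. The helper name `remove_aligned_component` and the toy vectors are assumptions made for illustration; the abstract does not say which tensors the correction acts on or how it is scaled.

```python
def remove_aligned_component(v, u, eps=1e-12):
    """Return v minus its projection onto u (hypothetical helper)."""
    uu = sum(x * x for x in u)
    if uu < eps:
        # u is (numerically) zero, so there is no direction to remove.
        return list(v)
    scale = sum(a * b for a, b in zip(v, u)) / uu
    return [a - scale * b for a, b in zip(v, u)]


# Toy stand-ins for a vision-language feature and a language-only feature.
vision_language = [3.0, 4.0]
language_only = [1.0, 0.0]
corrected = remove_aligned_component(vision_language, language_only)
```

After the correction, the result has zero inner product with the language-only direction, which is the sense in which the remaining signal is "orthogonalized".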

Title: Who Should Have Surgery? A Comparative Study of GenAI vs Supervised ML for CRS Surgical Outcome Prediction

Authors: Sayeed Shafayet Chowdhury, Snehasis Mukhopadhyay, Shiaofen Fang, Vijay R. Ramakrishnan
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2601.13710
Pdf URL: https://arxiv.org/pdf/2601.13710
Copy Paste: [[2601.13710]] Who Should Have Surgery? A Comparative Study of GenAI vs Supervised ML for CRS Surgical Outcome Prediction(https://arxiv.org/abs/2601.13710)
Keywords: generative
Abstract: Artificial intelligence has reshaped medical imaging, yet the use of AI on clinical data for prospective decision support remains limited. We study pre-operative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS), defining success as a more than 8.9-point reduction in SNOT-22 at 6 months (MCID). In a prospectively collected cohort where all patients underwent surgery, we ask whether models using only pre-operative clinical data could have identified those who would have poor outcomes, i.e. those who should have avoided surgery. We benchmark supervised ML (logistic regression, tree ensembles, and an in-house MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity), giving each the same structured inputs and constraining outputs to binary recommendations with confidence. Our best ML model (MLP) achieves 85% accuracy with superior calibration and decision-curve net benefit. GenAI models underperform on discrimination and calibration in the zero-shot setting. Notably, GenAI justifications align with clinician heuristics and the MLP's feature importance, repeatedly highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and psychology/pain comorbidities. We provide a reproducible tabular-to-GenAI evaluation protocol and subgroup analyses. Findings support an ML-first, GenAI-augmented workflow: deploy calibrated ML for primary triage of surgical candidacy, with GenAI as an explainer to enhance transparency and shared decision-making.
摘要：人工智能重塑了医学影像，但人工智能在临床数据上用于前瞻性决策支持的应用仍然有限。我们研究了对慢性鼻窦炎 (CRS) 有临床意义的改善的术前预测，将成功定义为 6 个月时 SNOT-22 降低超过 8.9 分 (MCID)。在一个前瞻性收集的队列中，所有患者都接受了手术，我们询问仅使用术前临床数据的模型是否可以识别那些结果不佳的患者，即那些应该避免手术的患者。我们将监督式 ML（逻辑回归、树集成和内部 MLP）与生成式 AI（ChatGPT、Claude、Gemini、Perplexity）进行基准测试，为每种方法提供相同的结构化输入，并将输出约束为带置信度的二元推荐。我们最好的 ML 模型 (MLP) 凭借卓越的校准和决策曲线净效益实现了 85% 的准确度。 GenAI 模型在零样本设置的辨别和校准方面表现不佳。值得注意的是，GenAI 的理由与临床医生启发法和 MLP 的特征重要性相一致，反复强调基线 SNOT-22、CT/内窥镜检查严重程度、息肉表型和心理/疼痛合并症。我们提供可重复的表格到 GenAI 评估方案和亚组分析。研究结果支持机器学习优先、GenAI 增强的工作流程：部署经过校准的机器学习对手术候选者进行初步分类，并使用 GenAI 作为解释器来提高透明度和共享决策。
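
Illustrative sketch (not from the paper): the CRS abstract defines success as a SNOT-22 reduction of more than 8.9 points and reports decision-curve net benefit. The snippet below shows how such a label and the standard net-benefit formula, TP/n - (FP/n) * p_t / (1 - p_t), could be computed. The function names, the example scores, and the threshold value 0.2 are assumptions chosen for illustration, not values from the study.

```python
MCID = 8.9  # threshold stated in the abstract (SNOT-22 points)


def is_success(baseline, followup, mcid=MCID):
    """True when the SNOT-22 reduction exceeds the MCID (hypothetical helper)."""
    return (baseline - followup) > mcid


def net_benefit(y_true, y_pred, threshold):
    """Decision-curve net benefit at probability threshold p_t (0 < p_t < 1)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Made-up labels and binary recommendations for four patients.
labels = [1, 1, 0, 0]
recommendations = [1, 0, 1, 0]
nb = net_benefit(labels, recommendations, threshold=0.2)
```

Net benefit weighs false positives by the odds of the chosen threshold, so the same predictions can look better or worse depending on how costly an unnecessary operation is judged to be.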

Title: Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search

Authors: Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2601.13719
Pdf URL: https://arxiv.org/pdf/2601.13719
Copy Paste: [[2601.13719]] Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search(https://arxiv.org/abs/2601.13719)
Keywords: generation
Abstract: Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions that rely on naive chunking strategies with retrieval-augmented generation typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
摘要：由于上下文窗口极长，长视频理解对视觉语言模型提出了重大挑战。现有的解决方案依赖于带有检索增强生成的朴素分块策略，通常会遭受信息碎片化和全局一致性丧失的困扰。我们提出了 HAVEN，这是一个用于长视频理解的统一框架，它通过将视听实体内聚性和分层视频索引与代理搜索相结合，实现连贯和全面的推理。首先，我们通过集成视觉和听觉流中的实体级表示来保持语义一致性，同时将内容组织成跨越全局摘要、场景、片段和实体级别的结构化层次结构。然后，我们采用代理搜索机制来实现跨这些层的动态检索和推理，从而促进连贯的叙述重建和细粒度的实体跟踪。大量实验表明，我们的方法实现了良好的时间连贯性、实体一致性和检索效率，在 LVBench 上建立了新的最先进技术，总体准确率为 84.1%。值得注意的是，它在挑战性推理类别中表现出色，达到了80.1%。这些结果凸显了结构化、多模态推理对于对长视频进行全面且上下文一致的理解的有效性。

Title: Orthogonium: A Unified, Efficient Library of Orthogonal and 1-Lipschitz Building Blocks

Authors: Thibaut Boissin (IRIT-MISFIT), Franck Mamalet, Valentin Lafargue (ANITI, IMT), Mathieu Serrurier (IRIT-MISFIT)
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2601.13776
Pdf URL: https://arxiv.org/pdf/2601.13776
Copy Paste: [[2601.13776]] Orthogonium: A Unified, Efficient Library of Orthogonal and 1-Lipschitz Building Blocks(https://arxiv.org/abs/2601.13776)
Keywords: generative
Abstract: Orthogonal and 1-Lipschitz neural network layers are essential building blocks in robust deep learning architectures, crucial for certified adversarial robustness, stable generative models, and reliable recurrent networks. Despite significant advancements, existing implementations remain fragmented, limited, and computationally demanding. To address these issues, we introduce Orthogonium, a unified, efficient, and comprehensive PyTorch library providing orthogonal and 1-Lipschitz layers. Orthogonium provides access to standard convolution features, including support for strides, dilation, grouping, and transposed convolutions, while maintaining strict mathematical guarantees. Its optimized implementations reduce overhead on large-scale benchmarks such as ImageNet. Moreover, rigorous testing within the library has uncovered critical errors in existing implementations, emphasizing the importance of standardized and reliable tools. Orthogonium thus significantly lowers adoption barriers, enabling scalable experimentation and integration across diverse applications requiring orthogonality and robust Lipschitz constraints. Orthogonium is available at this https URL.
摘要：正交和 1-Lipschitz 神经网络层是稳健深度学习架构中的重要构建块，对于经过认证的对抗稳健性、稳定的生成模型和可靠的循环网络至关重要。尽管取得了显著的进步，但现有的实现仍然分散、有限且计算要求高。为了解决这些问题，我们引入了 Orthogonium，这是一个统一、高效且全面的 PyTorch 库，提供正交层和 1-Lipschitz 层。 Orthogonium 提供对标准卷积功能的访问，包括对跨步、膨胀、分组和转置的支持，同时保持严格的数学保证。其优化的实现减少了 ImageNet 等大规模基准测试的开销。此外，库内的严格测试发现了现有实施中的严重错误，强调了标准化和可靠工具的重要性。因此，Orthogonium 显著降低了采用障碍，从而能够在需要正交性和强大 Lipschitz 约束的不同应用程序中进行可扩展的实验和集成。 Orthogonium 可通过此 https URL 获取。

Title: Principled Latent Diffusion for Graphs via Laplacian Autoencoders

Authors: Antoine Siraudin, Christopher Morris
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.13780
Pdf URL: https://arxiv.org/pdf/2601.13780
Copy Paste: [[2601.13780]] Principled Latent Diffusion for Graphs via Laplacian Autoencoders(https://arxiv.org/abs/2601.13780)
Keywords: generation
Abstract: Graph diffusion models achieve state-of-the-art performance in graph generation but suffer from quadratic complexity in the number of nodes -- and much of their capacity is wasted modeling the absence of edges in sparse graphs. Inspired by latent diffusion in other modalities, a natural idea is to compress graphs into a low-dimensional latent space and perform diffusion there. However, unlike images or text, graph generation requires nearly lossless reconstruction, as even a single error in decoding an adjacency matrix can render the entire sample invalid. This challenge has remained largely unaddressed. We propose LG-Flow, a latent graph diffusion framework that directly overcomes these obstacles. A permutation-equivariant autoencoder maps each node into a fixed-dimensional embedding from which the full adjacency is provably recoverable, enabling near-lossless reconstruction for both undirected graphs and DAGs. The dimensionality of this latent representation scales linearly with the number of nodes, eliminating the quadratic bottleneck and making it feasible to train larger and more expressive models. In this latent space, we train a Diffusion Transformer with flow matching, enabling efficient and expressive graph generation. Our approach achieves competitive results against state-of-the-art graph diffusion models, while achieving up to $1000\times$ speed-up.
摘要：图扩散模型在图生成方面实现了最先进的性能，但受到节点数量的二次复杂性的影响，并且其大部分容量都浪费在对稀疏图中缺少边的建模上。受到其他模态中潜在扩散的启发，一个自然的想法是将图压缩到低维潜在空间并在那里进行扩散。然而，与图像或文本不同，图形生成需要几乎无损的重建，因为即使解码邻接矩阵时出现一个错误也可能导致整个样本无效。这一挑战基本上仍未得到解决。我们提出了 LG-Flow，一种直接克服这些障碍的潜在图扩散框架。排列等变自动编码器将每个节点映射到固定维度的嵌入中，从中证明可以恢复完整的邻接关系，从而实现无向图和 DAG 的近乎无损重建。这种潜在表示的维度与节点数量成线性比例，消除了二次瓶颈，使得训练更大、更具表现力的模型成为可能。在这个潜在空间中，我们训练具有流匹配的扩散 Transformer，从而实现高效且富有表现力的图形生成。我们的方法取得了与最先进的图扩散模型相比具有竞争力的结果，同时实现了高达 $1000\times$ 的加速。

Title: PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval

Authors: Gabriele Serussi, David Vainshtein, Jonathan Kouchly, Dotan Di Castro, Chaim Baskin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13797
Pdf URL: https://arxiv.org/pdf/2601.13797
Copy Paste: [[2601.13797]] PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval(https://arxiv.org/abs/2601.13797)
Keywords: generation
Abstract: Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.
摘要：组合视频检索（CoVR）旨在基于查询视频和修改文本来检索视频。当前的 CoVR 方法无法充分利用现代视觉语言模型 (VLM)，要么使用过时的架构，要么需要计算成本高昂的微调和缓慢的字幕生成。我们引入了 PREGEN（PRE GENeration extract），这是一个高效且强大的 CoVR 框架，可以克服这些限制。我们的方法独特地将冻结的、预先训练的 VLM 与轻量级编码模型配对，从而无需任何 VLM 微调。我们将查询视频和修改文本输入 VLM，并从每一层提取最终标记的隐藏状态。然后，在这些池表示上训练一个简单的编码器，创建语义丰富且紧凑的嵌入用于检索。 PREGEN 显着提升了现有技术水平，超越了标准 CoVR 基准上的所有现有方法，Recall@1 大幅提升，分别为 +27.23 和 +69.59。我们的方法展示了跨不同 VLM 主干的鲁棒性，并对更复杂的文本修改表现出强大的零样本泛化能力，突出了其有效性和语义功能。

Title: Inverting Self-Organizing Maps: A Unified Activation-Based Framework

Authors: Alessandro Londei, Matteo Benati, Denise Lanzieri, Vittorio Loreto
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2601.13851
Pdf URL: https://arxiv.org/pdf/2601.13851
Copy Paste: [[2601.13851]] Inverting Self-Organizing Maps: A Unified Activation-Based Framework(https://arxiv.org/abs/2601.13851)
Keywords: generative
Abstract: Self-Organizing Maps provide topology-preserving projections of high-dimensional data and have been widely used for visualization, clustering, and vector quantization. In this work, we show that the activation pattern of a SOM (the squared distances to its prototypes) can be inverted to recover the exact input under mild geometric conditions. This follows from a classical fact in Euclidean distance geometry: a point in $D$ dimensions is uniquely determined by its distances to $D{+}1$ affinely independent references. We derive the corresponding linear system and characterize the conditions under which the inversion is well-posed. Building upon this mechanism, we introduce the Manifold-Aware Unified SOM Inversion and Control (MUSIC) update rule, which enables controlled, semantically meaningful trajectories in latent space. MUSIC modifies squared distances to selected prototypes while preserving others, resulting in a deterministic geometric flow aligned with the SOM's piecewise-linear structure. Tikhonov regularization stabilizes the update rule and ensures smooth motion on high-dimensional datasets. Unlike variational or probabilistic generative models, MUSIC does not rely on sampling, latent priors, or encoder-decoder architectures. If no perturbation is applied, inversion recovers the exact input; when a target cluster or prototype is specified, MUSIC produces coherent semantic variations while remaining on the data manifold. This leads to a new perspective on data augmentation and controllable latent exploration based solely on prototype geometry. We validate the approach using synthetic Gaussian mixtures, MNIST, and the Faces in the Wild dataset. Across all settings, MUSIC produces smooth, interpretable trajectories that reveal the underlying geometry of the learned manifold, illustrating the advantages of SOM-based inversion over unsupervised clustering.
摘要：自组织映射提供高维数据的拓扑保持投影，并已广泛用于可视化、聚类和矢量量化。在这项工作中，我们展示了 SOM 的激活模式（到其原型的平方距离）可以反转，以在温和的几何条件下恢复精确的输入。这源自欧几里得距离几何中的一个经典事实：$D$ 维度中的点由其到 $D{+}1$ 仿射独立参考的距离唯一确定。我们推导了相应的线性系统并描述了反演适定的条件。在此机制的基础上，我们引入了流形感知统一 SOM 反转和控制 (MUSIC) 更新规则，该规则可以在潜在空间中实现受控的、语义上有意义的轨迹。 MUSIC 修改所选原型的平方距离，同时保留其他原型，从而产生与 SOM 的分段线性结构一致的确定性几何流。吉洪诺夫正则化稳定了更新规则并确保高维数据集上的平滑运动。与变分或概率生成模型不同，MUSIC 不依赖于采样、潜在先验或编码器-解码器架构。如果不施加扰动，反演将恢复精确的输入；当指定目标集群或原型时，MUSIC 会产生连贯的语义变化，同时保留在数据流形上。这带来了仅基于原型几何的数据增强和可控潜在探索的新视角。我们使用合成高斯混合、MNIST 和 Faces in the Wild 数据集验证了该方法。在所有设置中，MUSIC 都会生成平滑、可解释的轨迹，揭示学习流形的基础几何形状，说明基于 SOM 的反演相对于无监督聚类的优势。
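
Illustrative sketch (not the authors' code): the SOM inversion abstract relies on the fact that a point in $D$ dimensions is determined by its squared distances to $D{+}1$ affinely independent references. Subtracting the equation for the first prototype $w_0$ from the others gives the linear system $2(w_i - w_0)^\top x = \|w_i\|^2 - \|w_0\|^2 - (d_i - d_0)$. The plain-Python snippet below solves that system with Gaussian elimination. All function names and the toy prototypes are assumptions for illustration; the paper's own derivation and its Tikhonov-regularized update may differ.

```python
def squared_distances(x, prototypes):
    """Activation pattern: squared Euclidean distance to every prototype."""
    return [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in prototypes]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("prototypes are not affinely independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def invert_activations(d, prototypes):
    """Recover x from squared distances d to exactly D+1 prototypes in R^D."""
    w0, d0 = prototypes[0], d[0]
    norm0 = sum(v * v for v in w0)
    a, b = [], []
    for w, di in zip(prototypes[1:], d[1:]):
        a.append([2.0 * (wi - w0i) for wi, w0i in zip(w, w0)])
        b.append(sum(v * v for v in w) - norm0 - (di - d0))
    return solve_linear(a, b)


# Toy example in D = 2 with three affinely independent prototypes.
prototypes = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
x_true = [0.3, -0.7]
x_rec = invert_activations(squared_distances(x_true, prototypes), prototypes)
```

With more than $D{+}1$ prototypes the system becomes overdetermined, and a least-squares or regularized solve would be the natural replacement for the exact elimination used here.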

Title: Multi-Objective Hierarchical Optimization with Large Language Models

Authors: Andrej Schwanke, Lyubomir Ivanov, David Salinas, Frank Hutter, Arber Zela
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.13892
Pdf URL: https://arxiv.org/pdf/2601.13892
Copy Paste: [[2601.13892]] Multi-Objective Hierarchical Optimization with Large Language Models(https://arxiv.org/abs/2601.13892)
Keywords: generative
Abstract: Despite their widespread adoption in various domains, especially due to their powerful reasoning capabilities, Large Language Models (LLMs) are not yet the off-the-shelf choice to drive multi-objective optimization. Conventional strategies rank high in benchmarks due to their intrinsic capabilities to handle numerical inputs and careful modelling choices that balance exploration and Pareto-front exploitation, as well as handle multiple (conflicting) objectives. In this paper, we close this gap by leveraging LLMs as surrogate models and candidate samplers inside a structured hierarchical search strategy. By adaptively partitioning the input space into disjoint hyperrectangular regions and ranking them with a composite score function, we restrict the generative process of the LLM to specific, high-potential sub-spaces, making the problem easier to solve because the LLM only has to reason locally rather than about the global structure of the problem. We show that under standard regularity assumptions, our algorithm generates candidate solutions that converge to the true Pareto set in Hausdorff distance. Empirically, it consistently outperforms the global LLM-based multi-objective optimizer and is on par with standard evolutionary and Bayesian optimization algorithms on synthetic and real-world benchmarks.
摘要：尽管大型语言模型（LLM）在各个领域得到广泛采用，特别是由于其强大的推理能力，但它还不是驱动多目标优化的现成选择。传统策略在基准测试中排名靠前，因为它们具有处理数值输入和谨慎建模选择的内在能力，可以平衡探索和帕累托前沿开发，以及处理多个（冲突）目标。在本文中，我们通过利用 LLM 作为结构化分层搜索策略中的替代模型和候选采样器来缩小这一差距。通过自适应地将输入空间划分为不相交的超矩形区域并使用复合评分函数对它们进行排名，我们将 LLM 的生成过程限制在特定的高潜力子空间，从而使问题更容易解决，因为 LLM 不必推理问题的全局结构，而只需局部推理。我们证明，在标准正则性假设下，我们的算法生成收敛于豪斯多夫距离中的真实帕累托集的候选解。根据经验，它始终优于基于 LLM 的全局多目标优化器，并且在合成和现实世界基准上与标准进化和贝叶斯优化算法相当。
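
Illustrative sketch (not from the paper): the multi-objective optimization abstract states convergence to the true Pareto set in Hausdorff distance. For finite point sets, that distance is the larger of the two directed "farthest nearest-neighbour" distances, which the snippet below computes in plain Python. The function names and the example point sets are assumptions for illustration; the paper's result concerns the general set-valued setting, not this finite approximation.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Made-up candidate set and reference set in a 2-D decision space.
candidates = [(0.0, 0.0), (1.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
gap = hausdorff(candidates, reference)
```

The distance is driven by the single worst-covered point, so a candidate set that misses one region of the reference set is penalized even when every candidate lies exactly on it.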

Title: VTONGuard: Automatic Detection and Authentication of AI-Generated Virtual Try-On Content

Authors: Shengyi Wu, Yan Hong, Shengyao Chen, Zheng Wang, Xianbing Sun, Jiahui Zhan, Jun Lan, Jianfu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13951
Pdf URL: https://arxiv.org/pdf/2601.13951
Copy Paste: [[2601.13951]] VTONGuard: Automatic Detection and Authentication of AI-Generated Virtual Try-On Content(https://arxiv.org/abs/2601.13951)
Keywords: generative
Abstract: With the rapid advancement of generative AI, virtual try-on (VTON) systems are becoming increasingly common in e-commerce and digital entertainment. However, the growing realism of AI-generated try-on content raises pressing concerns about authenticity and responsible use. To address this, we present VTONGuard, a large-scale benchmark dataset containing over 775,000 real and synthetic try-on images. The dataset covers diverse real-world conditions, including variations in pose, background, and garment styles, and provides both authentic and manipulated examples. Based on this benchmark, we conduct a systematic evaluation of multiple detection paradigms under unified training and testing protocols. Our results reveal each method's strengths and weaknesses and highlight the persistent challenge of cross-paradigm generalization. To further advance detection, we design a multi-task framework that integrates auxiliary segmentation to enhance boundary-aware feature learning, achieving the best overall performance on VTONGuard. We expect this benchmark to enable fair comparisons, facilitate the development of more robust detection models, and promote the safe and responsible deployment of VTON technologies in practice.

Title: Likelihood-Separable Diffusion Inference for Multi-Image MRI Super-Resolution

Authors: Samuel W. Remedios, Zhangxing Bian, Shuwen Wei, Aaron Carass, Jerry L. Prince, Blake E. Dewey
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14030
Pdf URL: https://arxiv.org/pdf/2601.14030
Copy Paste: [[2601.14030]] Likelihood-Separable Diffusion Inference for Multi-Image MRI Super-Resolution(https://arxiv.org/abs/2601.14030)
Keywords: super-resolution, generative
Abstract: Diffusion models are the current state of the art for solving inverse problems in imaging. Their impressive generative capability allows them to approximate sampling from a prior distribution, which, alongside a known likelihood function, permits posterior sampling without retraining the model. While recent methods have made strides in advancing the accuracy of posterior sampling, most focus on single-image inverse problems. However, for modalities such as magnetic resonance imaging (MRI), it is common to acquire multiple complementary measurements, each low-resolution along a different axis. In this work, we generalize common diffusion-based single-image inverse problem solvers to multi-image super-resolution (MISR) MRI. We show that the DPS likelihood correction allows an exactly-separable gradient decomposition across independently acquired measurements, enabling MISR without constructing a joint operator, modifying the diffusion model, or increasing network function evaluations. We derive MISR versions of DPS, DMAP, DPPS, and diffusion-based PnP/ADMM, and demonstrate substantial gains over SISR across $4\times/8\times/16\times$ anisotropic degradations. Our results achieve state-of-the-art super-resolution of anisotropic MRI volumes and, critically, enable reconstruction of near-isotropic anatomy from routine 2D multi-slice acquisitions, which are otherwise highly degraded in orthogonal views.
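Illustrative note (not taken from the paper): the separability claim can be sketched with a worked equation, assuming each measurement follows a linear Gaussian model and the noise is independent across measurements. The symbols $A_k$, $\sigma_k$, $K$, and $\hat{x}_0(x_t)$ (the denoiser's estimate of the clean image) are notation assumed here, and practical DPS implementations typically replace the $1/(2\sigma_k^2)$ factor with a tuned step size.

```latex
% Assumed measurement model (notation not from the abstract):
%   y_k = A_k x_0 + n_k,  n_k ~ N(0, \sigma_k^2 I),  k = 1, ..., K,
%   with noise independent across k.
\begin{align}
\log p(y_1, \dots, y_K \mid x_0)
  &= \sum_{k=1}^{K} \log p(y_k \mid x_0), \\
\nabla_{x_t} \log p(y_1, \dots, y_K \mid x_t)
  &\approx -\sum_{k=1}^{K} \frac{1}{2\sigma_k^2}\,
     \nabla_{x_t} \bigl\lVert y_k - A_k\, \hat{x}_0(x_t) \bigr\rVert_2^2 .
\end{align}
```

Every term in the sum shares the same $\hat{x}_0(x_t)$, so under these assumptions one denoiser evaluation and one backward pass per step cover all $K$ measurements, which is consistent with the claim that no joint operator or extra network evaluations are needed.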

Title: Human detectors are surprisingly powerful reward models

Authors: Kumar Ashutosh, XuDong Wang, Xi Yin, Kristen Grauman, Adam Polyak, Ishan Misra, Rohit Girdhar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14037
Pdf URL: https://arxiv.org/pdf/2601.14037
Copy Paste: [[2601.14037]] Human detectors are surprisingly powerful reward models(https://arxiv.org/abs/2601.14037)
Keywords: generation
Abstract: Video generation models have recently achieved impressive visual fidelity and temporal coherence. Yet, they continue to struggle with complex, non-rigid motions, especially when synthesizing humans performing dynamic actions such as sports and dance. Generated videos often exhibit missing or extra limbs, distorted poses, or physically implausible actions. In this work, we propose a remarkably simple reward model, HuDA, to quantify and improve the human motion in generated videos. HuDA integrates human detection confidence for appearance quality, and a temporal prompt alignment score to capture motion realism. We show that this simple reward function, which leverages off-the-shelf models without any additional training, outperforms specialized models finetuned with manually annotated data. Using HuDA for Group Relative Policy Optimization (GRPO) post-training of video models, we significantly enhance video generation, especially when generating complex human motions, outperforming state-of-the-art models like Wan 2.1, with a win rate of 73%. Finally, we demonstrate that HuDA improves generation quality beyond just humans; for instance, it significantly improves the generation of animal videos and human-object interactions.
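Illustrative note (not taken from the paper): a minimal sketch of the group-wise advantage normalization commonly used in GRPO, where several samples are generated for the same prompt and each reward is standardized against its group. The abstract does not say how HuDA combines its two scores, so the weighted sum in `combined_reward`, its weight `w`, and both function names are hypothetical.

```python
from statistics import mean, pstdev


def combined_reward(detection_conf: float, alignment: float, w: float = 0.5) -> float:
    """Hypothetical blend of the two signals; the weighting is an assumption."""
    return w * detection_conf + (1.0 - w) * alignment


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Samples scoring above the group mean receive positive advantages and are reinforced; samples below the mean are pushed down. No learned value function is involved in this normalization.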

Title: Federated Balanced Learning

Authors: Jiaze Li, Haoran Xu, Wanyi Wu, Changwei Wang, Shuaiguang Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Youyang Qu, Longxiang Gao, Xudong Yang, Lumin Xing
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.14042
Pdf URL: https://arxiv.org/pdf/2601.14042
Copy Paste: [[2601.14042]] Federated Balanced Learning(https://arxiv.org/abs/2601.14042)
Keywords: generation
Abstract: Federated learning is a paradigm of joint learning in which clients collaborate by sharing model parameters instead of data. However, in the non-iid setting, the global model experiences client drift, which can seriously affect the final performance of the model. Previous methods tend to correct the global model that has already deviated based on the loss function or gradient, overlooking the impact of the client samples. In this paper, we rethink the role of the client side and propose Federated Balanced Learning (FBL) to prevent this issue from the beginning through sample balance on the client side. Technically, FBL allows unbalanced data on the client side to achieve sample balance through knowledge filling and knowledge sampling using edge-side generation models, under the limitation of a fixed number of data samples on clients. Furthermore, we design a Knowledge Alignment Strategy to bridge the gap between synthetic and real data, and a Knowledge Drop Strategy to regularize our method. Meanwhile, we scale our method to real and complex scenarios, allowing different clients to adopt various methods, and extend our framework to further improve performance. Numerous experiments show that our method outperforms state-of-the-art baselines. The code will be released upon acceptance.
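Illustrative note (not taken from the paper): the parameter-sharing step that the abstract's first sentence refers to can be sketched as the standard FedAvg aggregation, a sample-size-weighted average of client parameters. This is the generic baseline, not FBL's balancing method; the function name and the flat-list parameter representation are assumptions made for brevity.

```python
def federated_average(client_params, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(n * params[j] for params, n in zip(client_params, client_sizes)) / total
        for j in range(dim)
    ]
```

With non-iid client data, each local update pulls the parameters toward a different optimum, so this average can drift away from the solution that centralized training would reach. That is the client-drift problem the abstract describes.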

Title: LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems

Authors: Badri N. Patro, Vijay S. Agneeswaran
Subjects: cs.LG, cs.AI, cs.CV, cs.MA, eess.IV
Abstract URL: https://arxiv.org/abs/2601.14053
Pdf URL: https://arxiv.org/pdf/2601.14053
Copy Paste: [[2601.14053]] LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems(https://arxiv.org/abs/2601.14053)
Keywords: generation, generative
Abstract: The field of artificial intelligence has undergone a revolution from foundational Transformer architectures to reasoning-capable systems approaching human-level performance. We present LLMOrbit, a comprehensive circular taxonomy navigating the landscape of large language models spanning 2019-2025. This survey examines over 50 models across 15 organizations through eight interconnected orbital dimensions, documenting architectural innovations, training methodologies, and efficiency patterns defining modern LLMs, generative AI, and agentic systems. We identify three critical crises: (1) data scarcity (9-27T tokens depleted by 2026-2028), (2) exponential cost growth ($3M to $300M+ in 5 years), and (3) unsustainable energy consumption (22x increase), establishing the scaling wall limiting brute-force approaches. Our analysis reveals six paradigms breaking this wall: (1) test-time compute (o1, DeepSeek-R1 achieve GPT-4 performance with 10x inference compute), (2) quantization (4-8x compression), (3) distributed edge computing (10x cost reduction), (4) model merging, (5) efficient training (ORPO reduces memory 50%), and (6) small specialized models (Phi-4 14B matches larger models). Three paradigm shifts emerge: (1) post-training gains (RLHF, GRPO, pure RL contribute substantially, DeepSeek-R1 achieving 79.8% MATH), (2) efficiency revolution (MoE routing 18x efficiency, Multi-head Latent Attention 8x KV cache compression enables GPT-4-level performance at <$0.30/M tokens), and (3) democratization (open-source Llama 3 88.6% MMLU surpasses GPT-4 86.4%). We provide insights into techniques (RLHF, PPO, DPO, GRPO, ORPO), trace evolution from passive generation to tool-using agents (ReAct, RAG, multi-agent systems), and analyze post-training innovations.

Title: POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion

Authors: Andrea Rigo, Luca Stornaiuolo, Weijie Wang, Mauro Martino, Bruno Lepri, Nicu Sebe
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14056
Pdf URL: https://arxiv.org/pdf/2601.14056
Copy Paste: [[2601.14056]] POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion(https://arxiv.org/abs/2601.14056)
Keywords: generation, generative
Abstract: We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. While prior methods improve spatial adherence using 2D cues or iterative copy-warp-paste strategies, they often distort object geometry and fail to preserve consistency across edits. To address these limitations, we introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff), a novel formulation for jointly enforcing 3D geometric constraints and instance-level semantic binding within a unified diffusion process. Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes through Blended Latent Diffusion, allowing one-shot synthesis of complex multi-object scenes. We further propose a warping-free generative editing pipeline that supports object insertion, removal, and transformation via regeneration rather than pixel deformation. To preserve object identity and consistency across edits, we condition the diffusion process on reference images using IP-Adapter, enabling coherent object appearance throughout interactive 3D editing while maintaining global scene coherence. Experimental results demonstrate that POCI-Diff produces high-quality images consistent with the specified 3D layouts and edits, outperforming state-of-the-art methods in both visual fidelity and layout adherence while eliminating warping-induced geometric artifacts.

Title: Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration

Authors: Yongcong Ye, Kai Zhang, Yanghai Zhang, Enhong Chen, Longfei Li, Jun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14060
Pdf URL: https://arxiv.org/pdf/2601.14060
Copy Paste: [[2601.14060]] Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration(https://arxiv.org/abs/2601.14060)
Keywords: generation
Abstract: Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models for target image description generation, approaches that often fail to capture complementary visual information and complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information Extraction, which not only extracts global image features but also uses a pre-trained mapping network to convert the image into a pseudo token, combining it with the modification text and the objects most likely to be added. (2) Semantic Information Extraction, which involves using a pre-trained captioning model to generate multiple captions for the reference image, followed by leveraging an LLM to generate the modified captions and the objects most likely to be added. (3) Complementary Information Retrieval, which integrates information extracted from both the query and database images to retrieve the target image, enabling the system to efficiently handle retrieval queries in a variety of situations. Extensive experiments on three public datasets (CIRR, CIRCO, and FashionIQ) demonstrate that CVSI significantly outperforms existing state-of-the-art methods. Our code is available at this https URL.

Title: Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing

Authors: Xiaolu Liu, Yicong Li, Qiyuan He, Jiayin Zhu, Wei Ji, Angela Yao, Jianke Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14103
Pdf URL: https://arxiv.org/pdf/2601.14103
Copy Paste: [[2601.14103]] Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing(https://arxiv.org/abs/2601.14103)
Keywords: generation, generative
Abstract: Textured 3D morphing seeks to generate smooth and plausible transitions between two 3D assets, preserving both structural coherence and fine-grained appearance. This ability is crucial not only for advancing 3D generation research but also for practical applications in animation, editing, and digital content creation. Existing approaches either operate directly on geometry, limiting them to shape-only morphing while neglecting textures, or extend 2D interpolation strategies into 3D, which often causes semantic ambiguity, structural misalignment, and texture blurring. These challenges underscore the need to jointly preserve geometric consistency, texture alignment, and robustness throughout the transition process. To address this, we propose Interp3D, a novel training-free framework for textured 3D morphing. It harnesses generative priors and adopts a progressive alignment principle to ensure both geometric fidelity and texture coherence. Starting from semantically aligned interpolation in condition space, Interp3D enforces structural consistency via SLAT (Structured Latent)-guided structure interpolation, and finally transfers appearance details through fine-grained texture fusion. For comprehensive evaluations, we construct a dedicated dataset, Interp3DData, with graded difficulty levels and assess generation results in terms of fidelity, transition smoothness, and plausibility. Both quantitative metrics and human studies demonstrate the significant advantages of our proposed approach over previous methods. Source code is available at this https URL.

Title: The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning

Authors: Renmiao Chen, Yida Lu, Shiyao Cui, Xuan Ouyang, Victor Shea-Jay Huang, Shumin Zhang, Chengwei Pan, Han Qiu, Minlie Huang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2601.14127
Pdf URL: https://arxiv.org/pdf/2601.14127
Copy Paste: [[2601.14127]] The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning(https://arxiv.org/abs/2601.14127)
Keywords: generation
Abstract: As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19 MLLMs reveal a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are superficial, often driven by misunderstanding or evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature suggests a possible risk that models may over-focus on task solving while neglecting safety constraints. Our code and data are available at this https URL.
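Illustrative note (not taken from the paper): the attention-entropy signal mentioned in the abstract can be sketched as the Shannon entropy of a single attention distribution. How the authors aggregate over heads, layers, and tokens is not stated in the abstract, so this function covers only the per-distribution quantity; its name and the renormalization step are assumptions.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]  # renormalize defensively
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

Lower entropy means attention is concentrated on fewer positions. That matches the abstract's reading that unsafe generations may over-focus on task solving.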

Title: One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion

Authors: Yitong Dong, Qi Zhang, Minchao Jiang, Zhiqiang Wu, Qingnan Fan, Ying Feng, Huaqi Zhang, Hujun Bao, Guofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14161
Pdf URL: https://arxiv.org/pdf/2601.14161
Copy Paste: [[2601.14161]] One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion(https://arxiv.org/abs/2601.14161)
Keywords: restoration, generation, generative
Abstract: We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations in recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained by low-resolution inputs due to computational costs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, resulting in inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone, and endows Gaussians with additional features to store high-frequency details. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process. We introduce a unified training strategy that enables joint optimization of the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method can maintain superior generation quality across multiple datasets.

Title: Attention-Based Offline Reinforcement Learning and Clustering for Interpretable Sepsis Treatment

Authors: Punit Kumar, Vaibhav Saran, Divyesh Patel, Nitin Kulkarni, Alina Vereshchaka
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.14228
Pdf URL: https://arxiv.org/pdf/2601.14228
Copy Paste: [[2601.14228]] Attention-Based Offline Reinforcement Learning and Clustering for Interpretable Sepsis Treatment(https://arxiv.org/abs/2601.14228)
Keywords: generation
Abstract: Sepsis remains one of the leading causes of mortality in intensive care units, where timely and accurate treatment decisions can significantly impact patient outcomes. In this work, we propose an interpretable decision support framework. Our system integrates four core components: (1) a clustering-based stratification module that categorizes patients into low, intermediate, and high-risk groups upon ICU admission, using clustering with statistical validation; (2) a synthetic data augmentation pipeline leveraging variational autoencoders (VAE) and diffusion models to enrich underrepresented trajectories such as fluid or vasopressor administration; (3) an offline reinforcement learning (RL) agent trained using Advantage Weighted Regression (AWR) with a lightweight attention encoder and supported by an ensemble of models for conservative, safety-aware treatment recommendations; and (4) a rationale generation module powered by a multi-modal large language model (LLM), which produces natural-language justifications grounded in clinical context and retrieved expert knowledge. Evaluated on the MIMIC-III and eICU datasets, our approach achieves high treatment accuracy while providing clinicians with interpretable and robust policy recommendations.
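Illustrative note (not taken from the paper): a minimal sketch of the generic Advantage Weighted Regression objective, in which logged actions are imitated with weights that grow exponentially with their estimated advantage. The temperature `beta`, the clipping value `max_weight`, and the function names are assumptions. The paper's attention encoder and ensemble are not modeled.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Per-sample weights exp(A / beta), capped at max_weight."""
    cap = math.log(max_weight)
    # Clamp the exponent so large advantages cannot overflow math.exp.
    return [math.exp(min(a / beta, cap)) for a in advantages]


def awr_policy_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Weighted negative log-likelihood of the logged actions."""
    weights = awr_weights(advantages, beta, max_weight)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

Because the policy is only ever fit to actions that appear in the dataset, this objective suits offline settings such as retrospective ICU records.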

Title: Q-learning with Adjoint Matching

Authors: Qiyang Li, Sergey Levine
Subjects: cs.LG, cs.AI, cs.RO, stat.ML
Abstract URL: https://arxiv.org/abs/2601.14234
Pdf URL: https://arxiv.org/pdf/2601.14234
Copy Paste: [[2601.14234]] Q-learning with Adjoint Matching(https://arxiv.org/abs/2601.14234)
Keywords: generative
Abstract: We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

Title: Soft Tail-dropping for Adaptive Visual Tokenization

Authors: Zeyuan Chen, Kai Zhang, Zhuowen Tu, Yuanjun Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14246
Pdf URL: https://arxiv.org/pdf/2601.14246
Copy Paste: [[2601.14246]] Soft Tail-dropping for Adaptive Visual Tokenization(https://arxiv.org/abs/2601.14246)
Keywords: generation, generative
Abstract: We present Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens per image according to its structural complexity and level of detail. STAT encodes an image into a sequence of discrete codes together with per-token keep probabilities. Beyond standard autoencoder objectives, we regularize these keep probabilities to be monotonically decreasing along the sequence and explicitly align their distribution with an image-level complexity measure. As a result, STAT produces length-adaptive 1D visual tokens that are naturally compatible with causal 1D autoregressive (AR) visual generative models. On ImageNet-1k, equipping vanilla causal AR models with STAT yields competitive or superior visual generation quality compared to other probabilistic model families, while also exhibiting favorable scaling behavior that has been elusive in prior vanilla AR visual generation attempts.
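Illustrative note (not taken from the paper): one way to picture the two ideas in the abstract, monotonically decreasing keep probabilities and a per-image token count. The hinge-style penalty and the hard threshold below are assumptions chosen for clarity, not the losses or decision rule used by STAT, and both function names are hypothetical.

```python
def monotonic_penalty(keep_probs):
    """Sum of violations where a later token has a higher keep probability."""
    return sum(
        max(0.0, later - earlier)
        for earlier, later in zip(keep_probs, keep_probs[1:])
    )


def tail_drop_length(keep_probs, threshold=0.5):
    """Count leading tokens kept; stop at the first probability below threshold."""
    for i, p in enumerate(keep_probs):
        if p < threshold:
            return i
    return len(keep_probs)
```

Once the probabilities are monotone, truncation always removes a suffix of the sequence. The kept prefix is then a valid input for a causal autoregressive model.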

Title: OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Authors: Pengze Zhang, Yanze Wu, Mengtian Li, Xu Bai, Songtao Zhao, Fulong Ye, Chong Mou, Xinghui Li, Zhuowei Chen, Qian He, Mingyuan Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14250
Pdf URL: https://arxiv.org/pdf/2601.14250
Copy Paste: [[2601.14250]] OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer(https://arxiv.org/abs/2601.14250)
Keywords: generation
Abstract: Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.

Title: Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

Authors: Hongyuan Chen, Xingyu Chen, Youjia Zhang, Zexiang Xu, Anpei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14253
Pdf URL: https://arxiv.org/pdf/2601.14253
Copy Paste: [[2601.14253]] Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis(https://arxiv.org/abs/2601.14253)
Keywords: generation
Abstract: We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. Project page is available at this https URL.

Title: VideoMaMa: Mask-Guided Video Matting via Generative Prior

Authors: Sangbeom Lim, Seoung Wug Oh, Jiahui Huang, Heeji Yoon, Seungryong Kim, Joon-Young Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14255
Pdf URL: https://arxiv.org/pdf/2601.14255
Copy Paste: [[2601.14255]] VideoMaMa: Mask-Guided Video Matting via Generative Prior(https://arxiv.org/abs/2601.14255)
Keywords: generative
Abstract: Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.

Title: Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Authors: Matthew Gwilliam, Xiao Wang, Xuefeng Hu, Zhenheng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14256
Pdf URL: https://arxiv.org/pdf/2601.14256
Copy Paste: [[2601.14256]] Implicit Neural Representation Facilitates Unified Universal Vision Encoding(https://arxiv.org/abs/2601.14256)
Keywords: generation, generative
Abstract: Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at this https URL.