2026-02-27

Title: Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials

Authors: Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22251
Pdf URL: https://arxiv.org/pdf/2602.22251
Copy Paste: [[2602.22251]] Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials(https://arxiv.org/abs/2602.22251)
Keywords: generation, generative
Abstract: General-purpose 3D chemical modeling encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom-1, the first foundation model that unifies generative and predictive learning of 3D molecules and materials. Zatom-1 is a Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries. This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use joint generative pretraining as a universal initialization for downstream multi-task prediction of properties, energies, and forces. Empirically, Zatom-1 matches or outperforms specialized baselines on both generative and predictive benchmarks, while reducing the generative inference time by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between chemical domains from joint generative pretraining: modeling materials during pretraining improves molecular property prediction accuracy.
摘要：通用 3D 化学建模涵盖分子和材料，需要生成和预测能力。然而，大多数现有的人工智能方法都是针对单个领域（分子或材料）和单个任务（生成或预测）进行优化，这限制了表示共享和传输。我们推出了 Zatom-1，这是第一个统一 3D 分子和材料的生成学习和预测学习的基础模型。 Zatom-1 是一款经过多模态流匹配目标训练的 Transformer，可联合建模离散原子类型和连续 3D 几何形状。这种方法支持可扩展的预训练，随着模型容量的增加可预测增益，同时实现快速稳定的采样。我们使用联合生成预训练作为属性、能量和力的下游多任务预测的通用初始化。根据经验，Zatom-1 在生成基准和预测基准上均匹配或优于专门基准，同时将生成推理时间缩短了一个数量级以上。我们的实验证明了联合生成预训练在化学域之间的积极预测转移：预训练期间的材料建模提高了分子特性预测的准确性。
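
Note: as a rough illustration of the continuous half of a flow matching objective like the one the Zatom-1 abstract mentions, the sketch below uses the common straight-line path between a noise sample and a data sample. The path choice, the function names, and the toy coordinates are assumptions made for illustration and are not taken from the paper; the discrete atom-type part of the objective is omitted.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along the linear path."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a predicted velocity and the linear-path target."""
    xt = interpolate(x0, x1, t)
    pred = predict_velocity(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy usage: one atom's 3D coordinates and an oracle that already knows the target.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
data = [0.0, 0.0, 1.1]  # hypothetical coordinates, not from any dataset
oracle = lambda x, t: target_velocity(noise, data)
sample = euler_sample(oracle, noise, steps=10)
```

In a real model the oracle would be replaced by a trained network; with a straight-line path, a small number of Euler steps can already land close to the data sample, which is one common argument for fast sampling with flow-based models.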

Title: BrepCoder: A Unified Multimodal Large Language Model for Multi-task B-rep Reasoning

Authors: Mingi Kim, Yongjun Kim, Jungwoo Kang, Hyungki Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.22284
Pdf URL: https://arxiv.org/pdf/2602.22284
Copy Paste: [[2602.22284]] BrepCoder: A Unified Multimodal Large Language Model for Multi-task B-rep Reasoning(https://arxiv.org/abs/2602.22284)
Keywords: generation
Abstract: Recent advancements in deep learning have actively addressed complex challenges within Computer-Aided Design (CAD). However, most existing approaches rely on task-specific models requiring structural modifications for new tasks, and they predominantly focus on point clouds or images rather than the industry-standard Boundary Representation (B-rep) format. To address these limitations, we propose BrepCoder, a unified Multimodal Large Language Model (MLLM) that performs diverse CAD tasks from B-rep inputs. By leveraging the code generation capabilities of Large Language Models (LLMs), we convert CAD modeling sequences into Python-like code and align them with B-rep. We then adopt a two-stage training strategy: first, pre-training on reverse engineering to learn geometric features and design logic; second, effectively extending the model to various downstream tasks such as completion, error correction, and CAD-QA. Consequently, by interpreting B-rep as structural code, BrepCoder achieves superior generalization across diverse tasks, demonstrating its potential as a general-purpose CAD agent.
摘要：深度学习的最新进展积极解决了计算机辅助设计 (CAD) 这个 http URL 中的复杂挑战，大多数现有方法依赖于需要对新任务进行结构修改的特定于任务的模型，并且它们主要关注点云或图像，而不是行业标准的边界表示 (B-rep) 格式。为了解决这些限制，我们提出了 BrepCoder，这是一种统一的多模态大型语言模型 (MLLM)，可以根据 B-rep 输入执行各种 CAD 任务。通过利用大型语言模型 (LLM) 的代码生成功能，我们将 CAD 建模序列转换为类似 Python 的代码，并将其与 B-rep 对齐。然后我们采取两阶段的训练策略：首先，逆向工程预训练，学习几何特征和设计逻辑。其次，将模型有效扩展到各种下游任务，例如完成、纠错和CAD-QA。因此，通过将 B-rep 解释为结构代码，BrepCoder 在不同的任务中实现了卓越的泛化，展示了其作为通用 CAD 代理的潜力。

Title: MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation

Authors: Raiyan Jahangir, Nafiz Imtiaz Khan, Amritanand Sudheerkumar, Vladimir Filkov
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2602.22462
Pdf URL: https://arxiv.org/pdf/2602.22462
Copy Paste: [[2602.22462]] MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation(https://arxiv.org/abs/2602.22462)
Keywords: generation
Abstract: Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality. MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.
摘要：乳腺X线筛查筛查量大、时间敏感、文件记录繁重。放射科医生必须将微妙的视觉发现转化为一致的 BI-RADS 评估、乳腺密度类别和结构化叙述报告。虽然最近的视觉语言模型 (VLM) 支持图像到文本的报告，但许多依赖于封闭的云系统或紧密耦合的架构，这限制了隐私、可重复性和适应性。我们推出了 MammoWise，这是一种本地多模型管道，可将开源 VLM 转换为乳房 X 光检查报告生成器和多任务分类器。 MammoWise 支持任何 Ollama 托管的 VLM 和乳腺 X 线摄影数据集，并支持零样本、少样本和思维链提示，并具有可选的多模式检索增强生成 (RAG)，使用向量数据库来实现特定案例的上下文。我们在 VinDr-Mammo 和 DMID 数据集上评估 MedGemma、LLaVA-Med 和 Qwen2.5-VL，评估报告质量（BERTScore、ROUGE-L）、BI-RADS 分类、乳腺密度和主要发现。报告生成始终强劲，并通过少量提示和 RAG 得到改进。分类是可行的，但对模型和数据集的选择敏感。 MedGemma 的参数高效微调 (QLoRA) 提高了可靠性，实现 0.7545 的 BI-RADS 准确度、0.8840 的密度准确度和 0.9341 的钙化准确度，同时保持报告质量。 MammoWise 提供了一个实用且可扩展的框架，用于在统一且可重复的工作流程中部署用于乳腺 X 光检查报告的本地 VLM。
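
Note: ROUGE-L, one of the report-quality metrics named in the MammoWise abstract, scores a generated report by the longest common subsequence (LCS) it shares with a reference. The sketch below is a minimal version that assumes lowercase whitespace tokenization and a balanced F-measure; it is not the evaluation code used in the paper, and the example sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between a candidate and a reference string."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

# Hypothetical report snippets, for illustration only.
score = rouge_l("benign calcification in left breast",
                "benign calcification left breast")
```

Because the LCS only requires tokens to appear in the same order, not contiguously, the extra word "in" lowers precision without breaking the match.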

Title: Space Syntax-guided Post-training for Residential Floor Plan Generation

Authors: Zhuoyang Jiang, Dongqing Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2602.22507
Pdf URL: https://arxiv.org/pdf/2602.22507
Copy Paste: [[2602.22507]] Space Syntax-guided Post-training for Residential Floor Plan Generation(https://arxiv.org/abs/2602.22507)
Keywords: generation, generative
Abstract: Pre-trained generative models for residential floor plans are typically optimized to fit large-scale data distributions, which can under-emphasize critical architectural priors such as the configurational dominance and connectivity of domestic public spaces (e.g., living rooms and foyers). This paper proposes Space Syntax-guided Post-training (SSPT), a post-training paradigm that explicitly injects space syntax knowledge into floor plan generation via a non-differentiable oracle. The oracle converts RPLAN-style layouts into rectangle-space graphs through greedy maximal-rectangle decomposition and door-mediated adjacency construction, and then computes integration-based measurements to quantify public space dominance and functional hierarchy. To enable consistent evaluation and diagnosis, we further introduce SSPT-Bench (Eval-8), an out-of-distribution benchmark that post-trains models using conditions capped at $\leq 7$ rooms while evaluating on 8-room programs, together with a unified metric suite for dominance, stability, and profile alignment. SSPT is instantiated with two strategies: (i) iterative retraining via space-syntax filtering and diffusion fine-tuning, and (ii) reinforcement learning via PPO with space-syntax rewards. Experiments show that both strategies improve public-space dominance and restore clearer functional hierarchy compared to distribution-fitted baselines, while PPO achieves stronger gains with substantially higher compute efficiency and reduced variance. SSPT provides a scalable pathway for integrating architectural theory into data-driven plan generation and is compatible with other generative backbones given a post-hoc evaluation oracle.
摘要：住宅平面图的预训练生成模型通常经过优化以适应大规模数据分布，这可能会低估关键的建筑先验，例如国内公共空间（例如客厅和门厅）的配置优势和连通性。本文提出了空间句法引导的后训练（SSPT），这是一种后训练范式，通过不可微的预言机将空间句法知识明确地注入平面图生成中。预言机通过贪婪最大矩形分解和门介导的邻接构造将 RPLAN 式布局转换为矩形空间图，然后计算基于集成的测量以量化公共空间主导地位和功能层次结构。为了实现一致的评估和诊断，我们进一步引入了 SSPT-Bench (Eval-8)，这是一种分布外基准，它使用上限为 $\leq 7$ 房间的条件对模型进行后期训练，同时评估 8 房间项目，以及用于优势、稳定性和配置文件对齐的统一指标套件。 SSPT 通过两种策略进行实例化：(i) 通过空间语法过滤和扩散微调进行迭代再训练，以及 (ii) 通过具有空间语法奖励的 PPO 进行强化学习。实验表明，与分布拟合基线相比，这两种策略都提高了公共空间的主导地位并恢复了更清晰的功能层次结构，而 PPO 通过大幅提高的计算效率和减少的方差获得了更大的收益。 SSPT 提供了一种可扩展的途径，用于将架构理论集成到数据驱动的计划生成中，并且在给定事后评估预言机的情况下与其他生成主干兼容。
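
Note: the SSPT abstract describes an oracle that computes integration-based measurements on a room-adjacency graph. The sketch below shows one classic space syntax formulation (mean depth, relative asymmetry, and integration as its reciprocal) on a hand-written graph. The exact normalization used in the paper is not stated in the abstract, so this formula, the function names, and the toy floor plan are all assumptions.

```python
from collections import deque

def depths_from(graph, root):
    """Breadth-first topological depth of every space reachable from root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth

def integration(graph, root):
    """Reciprocal of relative asymmetry, RA = 2 * (MD - 1) / (k - 2)."""
    k = len(graph)
    if k < 3:
        raise ValueError("need at least three spaces")
    depth = depths_from(graph, root)
    if len(depth) != k:
        raise ValueError("graph must be connected")
    mean_depth = sum(depth.values()) / (k - 1)
    ra = 2.0 * (mean_depth - 1.0) / (k - 2)
    return float("inf") if ra == 0 else 1.0 / ra

# Hypothetical door-connected layout: each key lists the spaces it opens onto.
plan = {
    "living": ["foyer", "kitchen", "hall"],
    "foyer": ["living"],
    "kitchen": ["living"],
    "hall": ["living", "bed1", "bed2"],
    "bed1": ["hall"],
    "bed2": ["hall"],
}
scores = {room: integration(plan, room) for room in plan}
```

Under this measure the living room scores higher than either bedroom, which is the kind of public-space dominance the abstract says the oracle is meant to quantify.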

Title: DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

Authors: Zhechao Wang, Yiming Zeng, Lufan Ma, Zeqing Fu, Chen Bai, Ziyao Lin, Cheng Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22549
Pdf URL: https://arxiv.org/pdf/2602.22549
Copy Paste: [[2602.22549]] DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation(https://arxiv.org/abs/2602.22549)
Keywords: generation
Abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.
摘要：不同驾驶场景的合成是验证自动驾驶系统的鲁棒性和通用性的关键数据增强技术。当前的方法将高清 (HD) 地图和 3D 边界框聚合为扩散模型中的几何条件，以生成条件场景。然而，当控制条件独立变化时，隐式条件间依赖性会导致生成失败。此外，这些方法在语义和结构方面都存在细节不足的问题。具体来说，简短且视图不变的字幕限制了语义上下文，导致背景建模较弱。同时，具有均匀空间权重的标准去噪损失忽略了前景结构细节，导致视觉扭曲和模糊。为了应对这些挑战，我们提出了 DrivePTS，它包含三项关键创新。首先，我们的框架采用渐进式学习策略来减轻几何条件之间的相互依赖性，并通过明确的互信息约束得到加强。其次，利用视觉语言模型生成跨六个语义方面的多视图分层描述，提供细粒度的文本指导。第三，引入频率引导结构损失来增强模型对高频元素的敏感性，提高前景结构保真度。大量实验表明，我们的 DrivePTS 在生成多样化驾驶场景方面实现了最先进的保真度和可控性。值得注意的是，DrivePTS 成功生成了先前方法失败的罕见场景，凸显了其强大的泛化能力。

Title: Autoregressive Visual Decoding from EEG Signals

Authors: Sicheng Dai, Hongwang Xiao, Shan Yu, Qiwei Ye
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22555
Pdf URL: https://arxiv.org/pdf/2602.22555
Copy Paste: [[2602.22555]] Autoregressive Visual Decoding from EEG Signals(https://arxiv.org/abs/2602.22555)
Keywords: generation, generative
Abstract: Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limits their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a "next-scale prediction" strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.
摘要：脑电图（EEG）信号因其成本效益和高时间分辨率而成为解码视觉信息的流行媒介。然而，当前的方法在弥合脑电图和图像数据之间的模态差距方面面临着重大挑战。这些方法通常依赖于涉及多个阶段的复杂适应过程，因此很难保持一致性和管理复合错误。此外，大规模扩散模型带来的计算开销限制了它们在现实世界脑机接口（BCI）应用中的实用性。在这项工作中，我们提出了 AVDE，这是一种轻量级且高效的脑电图信号视觉解码框架。首先，我们利用预先训练的脑电图模型 LaBraM，并通过对比学习对其进行微调，以对齐脑电图和图像表示。其次，我们采用基于“下一个尺度预测”策略的自回归生成框架：使用预先训练的 VQ-VAE 将图像编码为多尺度标记图，并训练变压器以自回归预测更精细尺度的标记，从 EEG 嵌入作为最粗略的表示。这种设计能够实现相干生成，同时保留输入脑电图信号和重建图像之间的直接连接。对两个数据集的实验表明，AVDE 在图像检索和重建任务中都优于以前最先进的方法，同时仅使用 10% 的参数。此外，中间输出的可视化表明，AVDE 的生成过程反映了人类视觉感知的层次性质。这些结果凸显了自回归模型作为实际 BCI 应用的高效且可解释工具的潜力。

Title: Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

Authors: Dian Xie, Shitong Shao, Lichen Bai, Zikai Zhou, Bojun Cheng, Shuo Yang, Jun Wu, Zeke Xie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22570
Pdf URL: https://arxiv.org/pdf/2602.22570
Copy Paste: [[2602.22570]] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation(https://arxiv.org/abs/2602.22570)
Keywords: generation
Abstract: Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method, which can significantly improve human preference scores in the conventional evaluation framework but does not actually work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scale can compete with most of the studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work would strongly motivate the community to rethink the evaluation paradigm and future directions of this field.
摘要：无分类器引导（CFG）帮助扩散模型在各个领域实现了良好的条件生成。最近，随着生成质量和人类偏好的提高，出现了更多的扩散引导方法。然而，这些新兴的扩散引导方法真的能够取得扎实而显着的改进吗？在本文中，我们重新思考扩散指导的最新进展。我们的工作主要包括四个贡献。首先，我们揭示了一个关键的评估陷阱，即常见的人类偏好模型对大指导尺度表现出强烈的偏见。即使图像质量严重受损（例如，过饱和和伪影），由于强大的语义对齐，简单地增加 CFG 尺度也可以轻松提高定量评估分数。其次，我们引入了一种新颖的制导感知评估（GA-Eval）框架，该框架采用有效的制导尺度校准，通过识别与 CFG 效应正交和平行的效应，实现当前制导方法和 CFG 之间的公平比较。第三，出于评估陷阱的动机，我们设计了超越扩散指导（TDG）方法，该方法可以在传统评估框架中显着提高人类偏好分数，但实际上在实践中不起作用。第四，在广泛的实验中，我们在传统的评估框架和提出的 GA-Eval 框架内对最近的八种扩散引导方法进行了实证评估。值得注意的是，简单地增加 CFG 尺度就可以与大多数研究的扩散引导方法竞争，而所有方法都比标准 CFG 的获胜率下降严重。我们的工作将强烈激励社区重新思考该领域的评估范式和未来方向。
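
Note: the guidance scale discussed in this abstract enters sampling through a simple extrapolation between the unconditional and conditional noise predictions. The sketch below uses the convention in which a scale of 1 recovers the plain conditional prediction; some papers write the same rule with an offset of one, so the convention, the function name, and the toy vectors are assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the conditional prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy two-dimensional "noise predictions", for illustration only.
eps_u = [0.0, 1.0]
eps_c = [1.0, 1.0]
no_guidance = cfg_combine(eps_u, eps_c, 0.0)    # equals eps_u
conditional = cfg_combine(eps_u, eps_c, 1.0)    # equals eps_c
strong = cfg_combine(eps_u, eps_c, 7.5)         # extrapolates beyond eps_c
```

Only the components where the two predictions disagree are amplified, which is why a large scale strengthens prompt alignment and, pushed far enough, can also produce the oversaturation the abstract mentions.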

Title: GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

Authors: Tianyu Chen, Wei Xiang, Kang Han, Yu Lu, Di Wu, Gaowen Liu, Ramana Rao Kompella
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22571
Pdf URL: https://arxiv.org/pdf/2602.22571
Copy Paste: [[2602.22571]] GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views(https://arxiv.org/abs/2602.22571)
Keywords: generative
Abstract: Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-scale inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with a generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
摘要：与按场景优化相比，前馈 3D 重建具有显着的运行时间优势，而按场景优化的推理速度仍然很慢，并且在稀疏视图下通常很脆弱。然而，现有的前馈方法仍然具有进一步提高性能的潜力，特别是对于域外数据，并且一旦引入生成先验，就很难保留二级推理时间。这些限制源于现有前馈管道中的一次性预测范式：模型受到容量的严格限制，缺乏推理时间的细化，并且不适合连续注入生成先验。我们引入了 GIFSplat，这是一个纯粹的前馈迭代细化框架，用于从稀疏的未定视图进行 3D 高斯分布。少量的仅前向残差更新使用渲染证据逐步细化当前的3D场景，在效率和质量之间实现良好的平衡。此外，我们从增强的新颖渲染中将冻结扩散先验提炼为高斯级线索，无需梯度反向传播或不断增加的视图集扩展，从而在保持前馈效率的同时，能够利用生成先验进行每个场景的适应。在 DL3DV、RealEstate10K 和 DTU 中，GIFSplat 始终优于最先进的前馈基线，将 PSNR 提高高达 +2.1 dB，并且无需相机姿势或任何测试时梯度优化即可保持秒级推理时间。
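
Note: to put the reported PSNR gain of up to +2.1 dB in perspective, the sketch below converts a PSNR difference into the corresponding ratio of mean squared errors, using the standard definition PSNR = 10 * log10(MAX^2 / MSE). The baseline MSE value and the assumption that both methods are scored against the same peak value are illustrative choices, not figures from the paper.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

baseline_mse = 0.01                      # hypothetical baseline error
ratio = mse_ratio_for_gain(2.1)          # roughly 0.62
improved_mse = baseline_mse * ratio
gain = psnr(improved_mse) - psnr(baseline_mse)
```

In other words, a 2.1 dB improvement corresponds to cutting the mean squared error by a little over a third, regardless of the baseline level.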

Title: TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion

Authors: Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen, Muhan Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2602.22586
Pdf URL: https://arxiv.org/pdf/2602.22586
Copy Paste: [[2602.22586]] TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion(https://arxiv.org/abs/2602.22586)
Keywords: generation
Abstract: Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical-language diffusion model built on masked diffusion language models (MDLMs). TabDLM models textual and categorical features through masked diffusion, while modeling numerical features with a continuous diffusion process through learned specialized numeric token embeddings; bidirectional attention then captures cross-modality interactions within a single model. Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
摘要：合成表格数据生成由于其对数据增强、基础模型和隐私的重要性而引起了越来越多的关注。然而，现实世界的表格数据集越来越多地包含自由格式的文本字段（例如评论或临床笔记）以及结构化数字和分类属性。通过不同模态的联合建模生成此类异构表仍然具有挑战性。现有的方法大致分为两类：基于扩散的方法和基于法学硕士的方法。扩散模型可以捕获连续或离散空间中数字和分类特征的复杂依赖性，但将它们扩展到开放式文本并非易事，并且通常会导致文本质量下降。相比之下，基于 LLM 的生成器自然会生成流畅的文本，但它们的离散标记化可能会扭曲精确或大范围的数值，从而阻碍数字和语言的准确建模。在这项工作中，我们提出了 TabDLM，这是一种通过基于掩码扩散语言模型 (MDLM) 构建的联合数值语言扩散模型来生成自由形式表格数据的统一框架。 TabDLM 通过掩码扩散对文本和分类特征进行建模，同时通过学习的专门数字标记嵌入，使用连续扩散过程对数字特征进行建模；然后，双向注意力捕获单个模型内的跨模态交互。对不同基准的大量实验证明了 TabDLM 与强大的基于扩散和 LLM 的基准相比的有效性。

Title: Causal Motion Diffusion Models for Autoregressive Motion Generation

Authors: Qing Yu, Akihisa Watanabe, Kent Fujiwara
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22594
Pdf URL: https://arxiv.org/pdf/2602.22594
Copy Paste: [[2602.22594]] Causal Motion Diffusion Models for Autoregressive Motion Generation(https://arxiv.org/abs/2602.22594)
Keywords: generation
Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
摘要：运动扩散模型的最新进展极大地提高了人体运动合成的真实性。然而，现有方法要么依赖于双向生成的全序列扩散模型，这限制了时间因果关系和实时适用性，要么依赖于不稳定和累积误差的自回归模型。在这项工作中，我们提出了因果运动扩散模型（CMDM），这是一个基于因果扩散变换器的自回归运动生成的统一框架，该变换器在语义对齐的潜在空间中运行。 CMDM 建立在运动语言对齐因果 VAE (MAC-VAE) 的基础上，它将运动序列编码为时间因果潜在表示。在此潜在表示之上，使用因果扩散强迫来训练自回归扩散变换器，以在运动帧上执行时间排序的去噪。为了实现快速推理，我们引入了具有因果不确定性的逐帧采样计划，其中每个后续帧都是根据部分去噪的先前帧来预测的。由此产生的框架支持高质量的文本到运动生成、流合成和交互式速率的长视野运动生成。 HumanML3D 和 SnapMoGen 上的实验表明，CMDM 在语义保真度和时间平滑度方面均优于现有的扩散和自回归模型，同时大幅减少了推理延迟。

Title: BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

Authors: Yuci Han, Charles Toth, John E. Anderson, William J. Shuart, Alper Yilmaz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22596
Pdf URL: https://arxiv.org/pdf/2602.22596
Copy Paste: [[2602.22596]] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model(https://arxiv.org/abs/2602.22596)
Keywords: generative
Abstract: We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
摘要：我们提出了 BetterScene，这是一种使用极其稀疏、无约束的照片来增强各种现实世界场景的新颖视图合成 (NVS) 质量的方法。 BetterScene 利用在数十亿帧上预训练的可投入生产的稳定视频扩散 (SVD) 模型作为强大的骨干，旨在减少伪影并在推理时恢复视图一致的细节。传统方法已经开发了类似的基于扩散的解决方案来解决新颖视图合成的这些挑战。尽管有了显着的改进，这些方法通常依赖于现成的预训练扩散先验，并且仅对 UNet 模块进行微调，同时保持其他组件冻结，即使在合并深度或语义条件等几何感知正则化时，这仍然会导致细节和伪影不一致。为了解决这个问题，我们研究了扩散模型的潜在空间，并引入了两个组件：(1) 时间等方差正则化和 (2) 视觉基础模型对齐表示，两者都应用于 SVD 管道内的变分自动编码器 (VAE) 模块。 BetterScene 集成了前馈 3D 高斯泼溅 (3DGS) 模型，将特征渲染为 SVD 增强器的输入，并生成连续、无伪影、一致的新颖视图。我们对具有挑战性的 DL3DV-10K 数据集进行评估，并展示了与最先进的方法相比的卓越性能。

Title: Transformers converge to invariant algorithmic cores

Authors: Joshua S. Schiffman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22600
Pdf URL: https://arxiv.org/pdf/2602.22600
Copy Paste: [[2602.22600]] Transformers converge to invariant algorithmic cores(https://arxiv.org/abs/2602.22600)
Keywords: generation
Abstract: Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.
摘要：大型语言模型展现出复杂的功能，但理解它们内部的工作方式仍然是一个核心挑战。一个根本的障碍是训练选择的是行为，而不是电路，因此许多权重配置可以实现相同的功能。哪些内部结构反映了计算，哪些是特定训练运行的意外情况？这项工作提取了算法核心：任务性能所需且充分的紧凑子空间。独立训练的 Transformer 学习不同的权重，但收敛到相同的核心。马尔可夫链变换器将 3D 核心嵌入到几乎正交的子空间中，但恢复相同的跃迁谱。模加法变压器在 grokking 时发现紧凑的循环算子，随后膨胀，产生记忆到泛化转变的预测模型。 GPT-2 语言模型通过单个轴控制主谓一致，当翻转时，会在整个代际范围内反转语法数字。这些结果揭示了在训练运行和规模中持续存在的低维不变量，这表明变压器计算是围绕紧凑的共享算法结构组织的。机械可解释性可以受益于针对这些不变量（计算本质）而不是特定于实现的细节。

Title: LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals

Authors: Ziqi Zhao, Abhijit Mishra, Shounak Roychowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22607
Pdf URL: https://arxiv.org/pdf/2602.22607
Copy Paste: [[2602.22607]] LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals(https://arxiv.org/abs/2602.22607)
Keywords: generation
Abstract: We present LoR-LUT, a unified low-rank formulation for compact and interpretable 3D lookup table (LUT) generation. Unlike conventional 3D-LUT-based techniques that rely on fusion of basis LUTs, which are usually dense tensors, our unified approach extends the current framework by jointly using residual corrections, which are in fact low-rank tensors, together with a set of basis LUTs. The approach described here improves the perceptual quality of an image, primarily owing to its novel use of residual corrections. At the same time, we retain the same trilinear interpolation complexity while using significantly fewer network, residual-correction, and LUT parameters. The experimental results obtained from LoR-LUT, which is trained on the MIT-Adobe FiveK dataset, reproduce expert-level retouching characteristics with high perceptual fidelity and a sub-megabyte model size. Furthermore, we introduce an interactive visualization tool, termed LoR-LUT Viewer, which transforms an input image into the LUT-adjusted output image via a number of sliders that control different parameters. The tool provides an effective way to enhance interpretability and user confidence in the visual results. Overall, our proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer.
摘要：我们提出了 LoR-LUT，这是一种用于紧凑且可解释的 3D 查找表 (LUT) 生成的统一低秩公式。与依赖于基础 LUT（通常是密集张量）融合的传统基于 3D-LUT 的技术不同，我们的统一方法通过联合使用残差校正（实际上是低秩张量）和一组基础 LUT 来扩展当前框架。这里描述的方法提高了图像的现有感知质量，这主要归功于该技术对残差校正的新颖使用。同时，我们使用更少数量的网络、残差校正和 LUT 参数，实现了相同水平的三线性插值复杂度。 LoR-LUT 在 MIT-Adobe FiveK 数据集上进行训练，获得的实验结果以高感知保真度和亚兆字节模型大小再现了专家级修饰特征。此外，我们还引入了一种交互式可视化工具，称为 LoR-LUT Viewer，它通过控制不同参数的多个滑动条将输入图像转换为经过 LUT 调整的输出图像。该工具提供了一种有效的方法来增强可视化结果的可解释性和用户信心。总的来说，我们提出的公式为未来基于 LUT 的图像增强和风格迁移提供了一个紧凑、可解释且高效的方向。
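
Note: the trilinear interpolation step that the LoR-LUT abstract refers to is the standard way of applying a 3D LUT to a pixel. The sketch below shows only that lookup on a plain nested-list LUT with channels in [0, 1]; the data layout and function names are assumptions, and the paper's low-rank residual construction of the LUT itself is not shown.

```python
def make_identity_lut(n):
    """n x n x n grid whose entry at (r, g, b) is the same color, unchanged."""
    step = 1.0 / (n - 1)
    return [[[(r * step, g * step, b * step) for b in range(n)]
             for g in range(n)] for r in range(n)]

def apply_lut(lut, rgb):
    """Trilinearly interpolate the LUT at an RGB value with channels in [0, 1]."""
    n = len(lut)
    idx, frac = [], []
    for c in rgb:
        pos = min(max(c, 0.0), 1.0) * (n - 1)   # position in grid units
        i = min(int(pos), n - 2)                # lower corner of the enclosing cell
        idx.append(i)
        frac.append(pos - i)
    out = [0.0, 0.0, 0.0]
    # Blend the eight corners of the cell, weighted by proximity along each axis.
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                w = ((frac[0] if dr else 1.0 - frac[0])
                     * (frac[1] if dg else 1.0 - frac[1])
                     * (frac[2] if db else 1.0 - frac[2]))
                corner = lut[idx[0] + dr][idx[1] + dg][idx[2] + db]
                for ch in range(3):
                    out[ch] += w * corner[ch]
    return tuple(out)

lut = make_identity_lut(5)
pixel = apply_lut(lut, (0.2, 0.5, 0.9))   # an identity LUT leaves the color unchanged
```

The per-pixel cost is always eight corner reads regardless of how the LUT was produced, which is consistent with the abstract's claim that interpolation complexity stays the same while the parameter count drops.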

Title: Instruction-based Image Editing with Planning, Reasoning, and Generation

Authors: Liya Ji, Chenyang Qi, Qifeng Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22624
Pdf URL: https://arxiv.org/pdf/2602.22624
Copy Paste: [[2602.22624]] Instruction-based Image Editing with Planning, Reasoning, and Generation(https://arxiv.org/abs/2602.22624)
Keywords: generation
Abstract: Editing images via instructions provides a natural way to generate interactive content, but it is highly challenging because it demands strong scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single-modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that gives instruction-based image editing models the intelligence to handle more complex cases. To achieve this goal, we decompose the instruction editing task into multi-modality chain-of-thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model reasons out appropriate sub-prompts given the provided instruction and the capabilities of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts these hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.

Title: CRAG: Can 3D Generative Models Help 3D Assembly?

Authors: Zeyu Jiang, Sihang Li, Siqi Tan, Chenyang Xu, Juexiao Zhang, Julia Galway-Witham, Xue Wang, Scott A. Williams, Radu Iovita, Chen Feng, Jing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22629
Pdf URL: https://arxiv.org/pdf/2602.22629
Copy Paste: [[2602.22629]] CRAG: Can 3D Generative Models Help 3D Assembly?(https://arxiv.org/abs/2602.22629)
Keywords: generation, generative
Abstract: Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.

Title: Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

Authors: Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22654
Pdf URL: https://arxiv.org/pdf/2602.22654
Copy Paste: [[2602.22654]] Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache(https://arxiv.org/abs/2602.22654)
Keywords: generation
Abstract: Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by $+$0.031 ImageReward at 4.87$\times$ speedup and even surpassing the full-step baseline by $+$0.028 ImageReward at 3.54$\times$ speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at this https URL.
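
The key-timestep selection described in the abstract can be pictured as a shortest-path dynamic program. The sketch below is a minimal illustration under a simplifying assumption: it uses a pairwise cost `cost[i][j]` (the error of running full computation at timestep `i` and next at `j`, predicting everything in between) instead of the paper's Path-Aware Cost Tensor, whose exact definition the abstract does not give. The function name `select_key_timesteps` and the budget `k` are hypothetical.

```python
import math


def select_key_timesteps(cost, k):
    """Pick k key timesteps (always including the first and last) with minimal total cost.

    cost[i][j] for i < j is an assumed, precomputed error of jumping from
    key timestep i directly to key timestep j.
    """
    n = len(cost)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of timesteps")
    # best[m][j]: minimal cost of a path that ends at j and uses m key timesteps.
    best = [[math.inf] * n for _ in range(k + 1)]
    prev = [[-1] * n for _ in range(k + 1)]
    best[1][0] = 0.0
    for m in range(2, k + 1):
        for j in range(1, n):
            for i in range(j):
                cand = best[m - 1][i] + cost[i][j]
                if cand < best[m][j]:
                    best[m][j] = cand
                    prev[m][j] = i
    # Backtrack from the final timestep to recover the chosen sequence.
    path, j = [], n - 1
    for m in range(k, 0, -1):
        path.append(j)
        j = prev[m][j]
    return path[::-1], best[k][n - 1]


if __name__ == "__main__":
    # Toy cost: skipping more intermediate steps is quadratically more expensive.
    toy = [[(j - i - 1) ** 2 if j > i else 0 for j in range(5)] for i in range(5)]
    print(select_key_timesteps(toy, 3))
```

The table has O(k n) states with O(n) transitions each, so the search costs O(k n^2) and runs once, offline, before inference.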

Title: Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing

Authors: Renyu Yang, Jian Jin, Lili Meng, Meiqin Liu, Yilin Wang, Balu Adsumilli, Weisi Lin
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2602.22659
Pdf URL: https://arxiv.org/pdf/2602.22659
Copy Paste: [[2602.22659]] Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing(https://arxiv.org/abs/2602.22659)
Keywords: quality assessment
Abstract: Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at this https URL.

Title: Forecasting Antimicrobial Resistance Trends Using Machine Learning on WHO GLASS Surveillance Data: A Retrieval-Augmented Generation Approach for Policy Decision Support

Authors: Md Tanvir Hasan Turja
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2602.22673
Pdf URL: https://arxiv.org/pdf/2602.22673
Copy Paste: [[2602.22673]] Forecasting Antimicrobial Resistance Trends Using Machine Learning on WHO GLASS Surveillance Data: A Retrieval-Augmented Generation Approach for Policy Decision Support(https://arxiv.org/abs/2602.22673)
Keywords: generation
Abstract: Antimicrobial resistance (AMR) is a growing global crisis projected to cause 10 million deaths per year by 2050. While the WHO Global Antimicrobial Resistance and Use Surveillance System (GLASS) provides standardized surveillance data across 44 countries, few studies have applied machine learning to forecast population-level resistance trends from this data. This paper presents a two-component framework for AMR trend forecasting and evidence-grounded policy decision support. We benchmark six models -- Naive, Linear Regression, Ridge Regression, XGBoost, LightGBM, and LSTM -- on 5,909 WHO GLASS observations across six WHO regions (2021-2023). XGBoost achieved the best performance with a test MAE of 7.07% and R-squared of 0.854, outperforming the naive baseline by 83.1%. Feature importance analysis identified the prior-year resistance rate as the dominant predictor (50.5% importance), while regional MAE ranged from 4.16% (European Region) to 10.14% (South-East Asia Region). We additionally implemented a Retrieval-Augmented Generation (RAG) pipeline combining a ChromaDB vector store of WHO policy documents with a locally deployed Phi-3 Mini language model, producing source-attributed, hallucination-constrained policy answers. Code and data are available at this https URL
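
The evaluation quantities quoted in the abstract (MAE, R-squared, and improvement over a naive baseline) can be computed as in the sketch below. It assumes the reported improvement is a relative reduction in MAE, which the abstract does not state explicitly, and the numbers in the usage lines are made-up toy values, not the paper's data.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_improvement(baseline_error, model_error):
    """Fraction by which the model reduces the baseline error (assumed definition)."""
    return (baseline_error - model_error) / baseline_error


if __name__ == "__main__":
    observed = [10.0, 20.0, 30.0, 40.0]   # toy resistance rates in percent
    predicted = [12.0, 18.0, 33.0, 39.0]  # toy model output
    print(mae(observed, predicted), r_squared(observed, predicted))
```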

Title: SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

Authors: Zhuohang Jiang, Xu Yuan, Haohao Qu, Shanru Lin, Kanglong Liu, Wenqi Fan, Qing Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22683
Pdf URL: https://arxiv.org/pdf/2602.22683
Copy Paste: [[2602.22683]] SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses(https://arxiv.org/abs/2602.22683)
Keywords: generation
Abstract: The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.

Title: No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

Authors: Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-Eui Yoon
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2602.22689
Pdf URL: https://arxiv.org/pdf/2602.22689
Copy Paste: [[2602.22689]] No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings(https://arxiv.org/abs/2602.22689)
Keywords: generation, generative
Abstract: Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.

Title: Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

Authors: Hao Yu, Shuning Jia, Guanghao Li, Wenhao Jiang, Chun Yuan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.22703
Pdf URL: https://arxiv.org/pdf/2602.22703
Copy Paste: [[2602.22703]] Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning(https://arxiv.org/abs/2602.22703)
Keywords: generation
Abstract: Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in-domain and out-of-domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains: $+26.5\%$ on in-domain data, $+8.0\%$ on out-of-domain data, and $+39.0\%$ on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All code is released at this https URL to ensure reproducibility.

Title: IRSDE-Despeckle: A Physics-Grounded Diffusion Model for Generalizable Ultrasound Despeckling

Authors: Shuoqi Chen, Yujia Wu, Geoffrey P. Luke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22717
Pdf URL: https://arxiv.org/pdf/2602.22717
Copy Paste: [[2602.22717]] IRSDE-Despeckle: A Physics-Grounded Diffusion Model for Generalizable Ultrasound Despeckling(https://arxiv.org/abs/2602.22717)
Keywords: restoration
Abstract: Ultrasound imaging is widely used for real-time, noninvasive diagnosis, but speckle and related artifacts reduce image quality and can hinder interpretation. We present a diffusion-based ultrasound despeckling method built on the Image Restoration Stochastic Differential Equations framework. To enable supervised training, we curate large paired datasets by simulating ultrasound images from speckle-free magnetic resonance images using the Matlab UltraSound Toolbox. The proposed model reconstructs speckle-suppressed images while preserving anatomically meaningful edges and contrast. On a held-out simulated test set, our approach consistently outperforms classical filters and recent learning-based despeckling baselines. We quantify prediction uncertainty via cross-model variance and show that higher uncertainty correlates with higher reconstruction error, providing a practical indicator of difficult or failure-prone regions. Finally, we evaluate sensitivity to simulation probe settings and observe domain shift, motivating diversified training and adaptation for robust clinical deployment.
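
Cross-model variance as an uncertainty estimate can be sketched in a few lines: run several models on the same input and take the per-pixel variance of their outputs. The example below assumes each prediction is already flattened to a list of pixel intensities; the function name and the choice of population variance are illustrative assumptions, not details taken from the paper.

```python
from statistics import fmean, pvariance


def ensemble_mean_and_variance(predictions):
    """Per-pixel mean and variance across models.

    predictions: list of equally sized flat pixel lists, one list per model.
    Returns (mean_image, variance_image) as flat lists.
    """
    per_pixel = list(zip(*predictions))  # regroup values by pixel position
    mean_image = [fmean(values) for values in per_pixel]
    variance_image = [pvariance(values) for values in per_pixel]
    return mean_image, variance_image


if __name__ == "__main__":
    # Two toy "models" agree on pixel 0 and disagree on pixel 1.
    mean_img, var_img = ensemble_mean_and_variance([[0.0, 1.0], [0.0, 3.0]])
    print(mean_img, var_img)
```

Pixels where the models disagree receive high variance, which is the signal the abstract reports as correlating with reconstruction error.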

Title: SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

Authors: Fengming Liu, Tat-Jen Cham, Chuanxia Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22745
Pdf URL: https://arxiv.org/pdf/2602.22745
Copy Paste: [[2602.22745]] SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation(https://arxiv.org/abs/2602.22745)
Keywords: generation
Abstract: Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' capabilities to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the DSRs specified in prompts, which is a step forward from prior works that rely on VLMs for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released in Link.
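
For orientation, the sketch below shows the standard DPO loss for a single preference pair. It is not the paper's zeroth-order regularized variant, whose form the abstract does not specify. It also assumes log-likelihoods are directly available, whereas diffusion-based video models typically need a surrogate for them. In the setting described above, the preferred and rejected samples would be ranked by a metric such as DSR-SCORE.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : log-likelihoods under the model being fine-tuned
    ref_logp_* : log-likelihoods under the frozen reference model
    beta       : strength of the implicit KL regularization
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # The policy raised the preferred sample and lowered the rejected one.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the preferred sample's likelihood relative to the rejected one.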

Title: Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval

Authors: Yuan-Chih Chen, Chun-Shien Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22759
Pdf URL: https://arxiv.org/pdf/2602.22759
Copy Paste: [[2602.22759]] Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval(https://arxiv.org/abs/2602.22759)
Keywords: restoration, generation
Abstract: Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.

Title: SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

Authors: Ling Wang, Hao-Xiang Guo, Xinzhou Wang, Fuchun Sun, Kai Sun, Pengkun Liu, Hang Xiao, Zhong Wang, Guangyuan Fu, Eric Li, Yang Liu, Yikai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22785
Pdf URL: https://arxiv.org/pdf/2602.22785
Copy Paste: [[2602.22785]] SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation(https://arxiv.org/abs/2602.22785)
Keywords: generation
Abstract: We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at this https URL.
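
An entropic Optimal Transport objective of the kind mentioned in the abstract is commonly solved with Sinkhorn iterations. The sketch below is a generic Sinkhorn solver, not SceneTransporter's implementation. The uniform marginals, the regularization `eps`, the iteration count, and the reading of rows as image patches and columns as part latents are all assumptions. How the resulting plan gates cross-attention inside the denoising loop is not shown.

```python
import math


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations.

    cost[i][j]: assumed cost of assigning row item i (e.g. an image patch)
    to column item j (e.g. a part-level latent).
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # row marginal
    b = [1.0 / m] * m  # column marginal
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    plan = sinkhorn_plan([[0.0, 1.0], [1.0, 0.0]])
    for row in plan:
        print(row)
```

A smaller `eps` pushes the plan toward a hard, near one-to-one assignment, at the price of slower convergence and possible underflow in `exp(-c / eps)`.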

Title: GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation

Authors: Hanliang Du, Zhangji Lu, Zewei Cai, Qijian Tang, Qifeng Yu, Xiaoli Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22800
Pdf URL: https://arxiv.org/pdf/2602.22800
Copy Paste: [[2602.22800]] GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation(https://arxiv.org/abs/2602.22800)
Keywords: restoration
Abstract: Atmospheric turbulence causes significant image degradation due to pixel displacement (tilt) and blur, particularly in long-range imaging applications. In this paper, we propose a novel framework for atmospheric turbulence mitigation, GSTurb, which integrates optical flow-guided tilt correction and Gaussian splatting for modeling non-isoplanatic blur. The framework employs Gaussian parameters to represent tilt and blur, and optimizes them across multiple frames to enhance restoration. Experimental results on the ATSyn-static dataset demonstrate the effectiveness of our method, achieving a peak PSNR of 27.67 dB and SSIM of 0.8735. Compared to the state-of-the-art method, GSTurb improves PSNR by 1.3 dB (a 4.5% increase) and SSIM by 0.048 (a 5.8% increase). Additionally, on real datasets, including the TSRWGAN Real-World and CLEAR datasets, GSTurb outperforms existing methods, showing significant improvements in both qualitative and quantitative performance. These results highlight that combining optical flow-guided tilt correction with Gaussian splatting effectively enhances image restoration under both synthetic and real-world turbulence conditions. The code for this method will be available at this https URL.

Title: PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

Authors: Mingde Yao, Zhiyuan You, Tam-King Man, Menglu Wang, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22809
Pdf URL: https://arxiv.org/pdf/2602.22809
Copy Paste: [[2602.22809]] PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning(https://arxiv.org/abs/2602.22809)
Keywords: generative
Abstract: With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is this https URL.

Title: A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

Authors: Chong Wang, Yabin Zhang, Yunhe Gao, Maya Varma, Clemence Mottez, Faidra Patsatzi, Jiaming Liu, Jin Long, Jean-Benoit Delbrouck, Sergios Gatidis, Akshay S. Chaudhari, Curtis P. Langlotz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22843
Pdf URL: https://arxiv.org/pdf/2602.22843
Copy Paste: [[2602.22843]] A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling(https://arxiv.org/abs/2602.22843)
Keywords: generation
Abstract: Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieves comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.

Title: MEDNA-DFM: A Dual-View FiLM-MoE Model for Explainable DNA Methylation Prediction

Authors: Yi He (1 and 4), Yina Cao (2), Jixiu Zhai (3 and 4), Di Wang (1 and 4), Junxiao Kong (4), Tianchi Lu (4 and 5) ((1) Cuiying Honors College, Lanzhou University, Lanzhou, Gansu, China, (2) School of Management, Lanzhou University, Lanzhou, Gansu, China, (3) Shanghai Innovation Institute, Shanghai, China, (4) School of Mathematics and Statistics, Lanzhou University, Lanzhou, Gansu, China, (5) Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22850
Pdf URL: https://arxiv.org/pdf/2602.22850
Copy Paste: [[2602.22850]] MEDNA-DFM: A Dual-View FiLM-MoE Model for Explainable DNA Methylation Prediction(https://arxiv.org/abs/2602.22850)
Keywords: generation
Abstract: Accurate computational identification of DNA methylation is essential for understanding epigenetic regulation. Although deep learning excels in this binary classification task, its "black-box" nature impedes biological insight. We address this by introducing a high-performance model, MEDNA-DFM, alongside mechanism-inspired signal purification algorithms. Our investigation demonstrates that MEDNA-DFM effectively captures conserved methylation patterns, achieving robust distinction across diverse species. Validation on external independent datasets confirms that the model's generalization is driven by conserved intrinsic motifs (e.g., GC content) rather than phylogenetic proximity. Furthermore, our algorithms extract motifs with significantly higher reliability than prior studies. Finally, empirical evidence from a Drosophila 6mA case study prompted us to propose a "sequence-structure synergy" hypothesis, suggesting that the GAGG core motif and an upstream A-tract element function cooperatively. We further validated this hypothesis via in silico mutagenesis, confirming that the ablation of either or both elements significantly degrades the model's recognition capabilities. This work provides a powerful tool for methylation prediction and demonstrates how explainable deep learning can drive both methodological innovation and the generation of biological hypotheses.

Title: From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Authors: Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22859
Pdf URL: https://arxiv.org/pdf/2602.22859
Copy Paste: [[2602.22859]] From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models(https://arxiv.org/abs/2602.22859)
Keywords: generation
Abstract: As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating that DPE is a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at this https URL.

Title: Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins

Authors: Haofan Wu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Le Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22919
Pdf URL: https://arxiv.org/pdf/2602.22919
Copy Paste: [[2602.22919]] Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins(https://arxiv.org/abs/2602.22919)
Keywords: generative
Abstract: A clinically actionable Cardiac Digital Twin (CDT) should reconstruct individualised cardiac anatomy and physiology, update its internal state from multimodal signals, and enable a broad range of downstream simulations beyond isolated tasks. However, existing CDT frameworks remain limited to task-specific predictors rather than building a patient-specific, manipulable virtual heart. In this work, we introduce Chain of Flow (COF), a foundational ECG-driven generative framework that reconstructs full 4D cardiac structure and motion from a single cardiac cycle. The method integrates cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics. We evaluate Chain of Flow on diverse cohorts and demonstrate accurate recovery of cardiac anatomy, chamber-wise function, and dynamic motion patterns. The reconstructed 4D hearts further support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. By enabling full 4D organ reconstruction directly from ECG, COF transforms cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual hearts. Code will be released after review.

Title: OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality

Authors: Federico Nesti, Gianluca D'Amico, Mauro Marinoni, Giorgio Buttazzo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22920
Pdf URL: https://arxiv.org/pdf/2602.22920
Copy Paste: [[2602.22920]] OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality(https://arxiv.org/abs/2602.22920)
Keywords: generation
Abstract: Although deep learning has significantly advanced the perception capabilities of intelligent transportation systems, railway applications continue to suffer from a scarcity of high-quality, annotated data for safety-critical tasks like obstacle detection. While photorealistic simulators offer a solution, they often struggle with the "sim-to-real" gap; conversely, simple image-masking techniques lack the spatio-temporal coherence required to obtain augmented single- and multi-frame scenes with the correct appearance and dimensions. This paper introduces a multi-modal augmented reality framework designed to bridge this gap by integrating photorealistic virtual objects into real-world railway sequences from the OSDaR23 dataset. Utilizing Unreal Engine 5 features, our pipeline leverages LiDAR point-clouds and INS/GNSS data to ensure accurate object placement and temporal stability across RGB frames. This paper also proposes a segmentation-based refinement strategy for INS/GNSS data to significantly improve the realism of the augmented sequences, as confirmed by the comparative study presented in the paper. Carefully designed augmented sequences are collected to produce OSDaR-AR, a public dataset designed to support the development of next-generation railway perception systems. The dataset is available at the following page: this https URL

Title: MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

Authors: Wenhui Tan, Xiaoyi Yu, Jiaze Li, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22932
Pdf URL: https://arxiv.org/pdf/2602.22932
Copy Paste: [[2602.22932]] MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding(https://arxiv.org/abs/2602.22932)
Keywords: generation
Abstract: Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.
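Illustrative sketch (not from the paper): the query-to-frame selection step described in the abstract above can be pictured roughly as follows. Random vectors stand in for CLIP embeddings, and `select_key_frames` is a hypothetical hand-written rule (softmax over each frame's best query score, then top-k) used in place of the learned sampler, which the abstract does not specify.

```python
import numpy as np

def query_frame_similarity(query_emb, frame_emb):
    """Cosine similarity matrix of shape (num_queries, num_frames)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    return q @ f.T

def select_key_frames(similarity, k, temperature=0.1):
    """Hypothetical stand-in for the learned sampler.

    Scores each frame by its best-matching query, turns the scores into
    sampling weights with a softmax, and keeps the k heaviest frames.
    """
    scores = similarity.max(axis=0) / temperature
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    top = np.argsort(weights)[::-1][:k]
    return np.sort(top), weights  # frame indices in temporal order

# Assumed toy inputs: 3 queries, 50 frames, 16-dimensional embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 16))
frames = rng.normal(size=(50, 16))

sim = query_frame_similarity(queries, frames)
key_frames, weights = select_key_frames(sim, k=8)
```

In this sketch only the 8 selected frames would be passed on to the MLLM; the temperature and k are arbitrary choices for illustration.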

Title: ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

Authors: Jiayu Chen, Ruoyu Lin, Zihao Zheng, Jingxin Li, Maoliang Li, Guojie Luo, Xiang Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22948
Pdf URL: https://arxiv.org/pdf/2602.22948
Copy Paste: [[2602.22948]] ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization(https://arxiv.org/abs/2602.22948)
Keywords: generation
Abstract: Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.
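Illustrative sketch (not from the paper): the abstract above relies on attention entropy as its analysis signal. The snippet below shows only the standard Shannon-entropy computation over attention rows; how ToProVAR maps entropy to token, layer, or scale sparsity is not stated in the abstract and is not reproduced here. The function name and toy matrices are assumptions.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of each attention row.

    attn has shape (..., num_queries, num_keys) and each row sums to 1.
    Low entropy means attention is concentrated on few keys; high
    entropy means it is spread out.
    """
    return -(attn * np.log(attn + eps)).sum(axis=-1)

# Two toy 4x4 attention maps: fully diffuse and fully peaked.
uniform = np.full((4, 4), 0.25)
peaked = np.eye(4)

h_uniform = attention_entropy(uniform)  # each row equals log(4)
h_peaked = attention_entropy(peaked)    # each row is approximately 0
```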

Title: MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis

Authors: Feng Guo, Jiaxiang Liu, Yang Li, Qianqian Shi, Mingkun Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.22955
Pdf URL: https://arxiv.org/pdf/2602.22955
Copy Paste: [[2602.22955]] MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis(https://arxiv.org/abs/2602.22955)
Keywords: generation
Abstract: Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi-model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis-related semantics beyond mask-only annotations. Building upon this dataset, we further construct MM-NeuroOnco-Bench, a manually annotated evaluation benchmark with a rejection-aware setting to reduce biases inherent in closed-ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis-related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM-NeuroOnco, we further propose NeuroOnco-GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine-tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: this https URL

Title: UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

Authors: Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, Song-Hai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.22960
Pdf URL: https://arxiv.org/pdf/2602.22960
Copy Paste: [[2602.22960]] UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models(https://arxiv.org/abs/2602.22960)
Keywords: generation
Abstract: World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.

Title: DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis

Authors: Xinglong Luo, Ao Luo, Zhengning Wang, Yueqi Yang, Chaoyu Feng, Lei Lei, Bing Zeng, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.23022
Pdf URL: https://arxiv.org/pdf/2602.23022
Copy Paste: [[2602.23022]] DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis(https://arxiv.org/abs/2602.23022)
Keywords: generation
Abstract: Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Our code is available at this https URL.

Title: RhythmBERT: A Self-Supervised Language Model Based on Latent Representations of ECG Waveforms for Heart Disease Detection

Authors: Xin Wang, Burcu Ozek, Aruna Mohan, Amirhossein Ravari, Or Zilbershot, Fatemeh Afghah
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.23060
Pdf URL: https://arxiv.org/pdf/2602.23060
Copy Paste: [[2602.23060]] RhythmBERT: A Self-Supervised Language Model Based on Latent Representations of ECG Waveforms for Heart Disease Detection(https://arxiv.org/abs/2602.23060)
Keywords: generative
Abstract: Electrocardiogram (ECG) analysis is crucial for diagnosing heart disease, but most self-supervised learning methods treat ECG as a generic time series, overlooking physiologic semantics and rhythm-level structure. Existing contrastive methods utilize augmentations that distort morphology, whereas generative approaches employ fixed-window segmentation, which misaligns cardiac cycles. To address these limitations, we propose RhythmBERT, a generative ECG language model that considers ECG as a language paradigm by encoding P, QRS, and T segments into symbolic tokens via autoencoder-based latent representations. These discrete tokens capture rhythm semantics, while complementary continuous embeddings retain fine-grained morphology, enabling a unified view of waveform structure and rhythm. RhythmBERT is pretrained on approximately 800,000 unlabeled ECG recordings with a masked prediction objective, allowing it to learn contextual representations in a label-efficient manner. Evaluations show that despite using only a single lead, RhythmBERT achieves comparable or superior performance to strong 12-lead baselines. This generalization extends from prevalent conditions such as atrial fibrillation to clinically challenging cases such as subtle ST-T abnormalities and myocardial infarction. Our results suggest that considering ECG as structured language offers a scalable and physiologically aligned pathway for advancing cardiac analysis.

Title: Benchmarking Temporal Web3 Intelligence: Lessons from the FinSurvival 2025 Challenge

Authors: Oshani Seneviratne, Fernando Spadea, Adrien Pavao, Aaron Micah Green, Kristin P. Bennett
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.23159
Pdf URL: https://arxiv.org/pdf/2602.23159
Copy Paste: [[2602.23159]] Benchmarking Temporal Web3 Intelligence: Lessons from the FinSurvival 2025 Challenge(https://arxiv.org/abs/2602.23159)
Keywords: generation
Abstract: Temporal Web analytics increasingly relies on large-scale, longitudinal data to understand how users, content, and systems evolve over time. A rapidly growing frontier is the Temporal Web3: decentralized platforms whose behavior is recorded as immutable, time-stamped event streams. Despite the richness of this data, the field lacks shared, reproducible benchmarks that capture real-world temporal dynamics, specifically censoring and non-stationarity, across extended horizons. This absence slows methodological progress and limits the transfer of techniques between Web3 and broader Web domains. In this paper, we present the FinSurvival Challenge 2025 as a case study in benchmarking temporal Web3 intelligence. Using 21.8 million transaction records from the Aave v3 protocol, the challenge operationalized 16 survival prediction tasks to model user behavior. We detail the benchmark design and the winning solutions, highlighting how domain-aware temporal feature construction significantly outperformed generic modeling approaches. Furthermore, we distill lessons for next-generation temporal benchmarks, arguing that Web3 systems provide a high-fidelity sandbox for studying temporal challenges, such as churn, risk, and evolution, that are fundamental to the wider Web.

Title: MetaOthello: A Controlled Study of Multiple World Models in Transformers

Authors: Aviral Chawla, Galen Hall, Juniper Lovato
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.23164
Pdf URL: https://arxiv.org/pdf/2602.23164
Copy Paste: [[2602.23164]] MetaOthello: A Controlled Study of Multiple World Models in Transformers(https://arxiv.org/abs/2602.23164)
Keywords: generative
Abstract: Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single transformer organizes multiple, potentially conflicting "world models". Previous experiments on Othello-playing neural networks test world-model learning but focus on a single game with a single set of rules. We introduce MetaOthello, a controlled suite of Othello variants with shared syntax but different rules or tokenizations, and train small GPTs on mixed-variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a mostly shared board-state representation that transfers causally across variants. Linear probes trained on one variant can intervene on another's internal state with effectiveness approaching that of matched probes. For isomorphic games with token remapping, representations are equivalent up to a single orthogonal rotation that generalizes across layers. When rules partially overlap, early layers maintain game-agnostic representations while a middle layer identifies game identity, and later layers specialize. MetaOthello offers a path toward understanding not just whether transformers learn world models, but how they organize many at once.
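Illustrative sketch (not from the paper): the abstract above reports that representations of isomorphic games agree up to one orthogonal rotation. One standard way to estimate such a map between two activation matrices is the orthogonal Procrustes solution, shown below on synthetic data. Whether the authors use this estimator is an assumption; `fit_rotation` and the toy matrices are hypothetical.

```python
import numpy as np

def fit_rotation(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F (orthogonal Procrustes).

    A and B are (num_samples, hidden_dim) activation matrices collected
    on matched inputs. R may include a reflection.
    """
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Synthetic check: B is A under a known orthogonal map Q.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
B = A @ Q

R = fit_rotation(A, B)
residual = np.linalg.norm(A @ R - B)
```

On real activations the residual would not be zero; its size relative to the norm of B gives a rough measure of how close the two representations are to being rotations of each other.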

Title: DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

Authors: Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, Kris Kitani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.23165
Pdf URL: https://arxiv.org/pdf/2602.23165
Copy Paste: [[2602.23165]] DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation(https://arxiv.org/abs/2602.23165)
Keywords: generation
Abstract: Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.

Title: Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration

Authors: Xiaole Tang, Xiaoyi He, Jiayi Xu, Xiang Gu, Jian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.23169
Pdf URL: https://arxiv.org/pdf/2602.23169
Copy Paste: [[2602.23169]] Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration(https://arxiv.org/abs/2602.23169)
Keywords: restoration
Abstract: Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods. Notably, BaryIR generalizes well to unseen degradations (e.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.
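Illustrative sketch (not from the paper): the Wasserstein barycenter named in the abstract above is the distribution that minimizes the average Wasserstein distance to a set of source distributions. For one-dimensional Gaussians and the 2-Wasserstein distance it has a closed form, which makes the idea easy to check numerically. BaryIR learns the barycenter in a feature space; the toy case and function names below are assumptions for illustration only.

```python
import numpy as np

def w2_squared(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def gaussian_barycenter(means, stds, weights=None):
    """W2 barycenter of 1D Gaussians: weighted mean of means and of stds."""
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    if weights is None:
        weights = np.full(len(means), 1.0 / len(means))
    return float(weights @ means), float(weights @ stds)

def average_cost(m, s, means, stds):
    """Average squared W2 distance from N(m, s^2) to every source."""
    return float(np.mean([w2_squared(m, s, mi, si) for mi, si in zip(means, stds)]))

# Two toy "degraded feature" distributions: N(0, 1) and N(4, 3^2).
means, stds = [0.0, 4.0], [1.0, 3.0]
bary_mean, bary_std = gaussian_barycenter(means, stds)

cost_at_barycenter = average_cost(bary_mean, bary_std, means, stds)
cost_when_shifted = average_cost(bary_mean + 0.5, bary_std, means, stds)
```

Moving away from the barycenter in any direction raises the average cost, which is the minimization property the abstract refers to.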

Title: Efficient Real-Time Adaptation of ROMs for Unsteady Flows Using Data Assimilation

Authors: Ismaël Zighed, Andrea Nóvoa, Luca Magri, Taraneh Sayadi
Subjects: cs.LG, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2602.23188
Pdf URL: https://arxiv.org/pdf/2602.23188
Copy Paste: [[2602.23188]] Efficient Real-Time Adaptation of ROMs for Unsteady Flows Using Data Assimilation(https://arxiv.org/abs/2602.23188)
Keywords: generation
Abstract: We propose an efficient retraining strategy for a parameterized Reduced Order Model (ROM) that attains accuracy comparable to full retraining while requiring only a fraction of the computational time and relying solely on sparse observations of the full system. The architecture employs an encode-process-decode structure: a Variational Autoencoder (VAE) to perform dimensionality reduction, and a transformer network to evolve the latent states and model the dynamics. The ROM is parameterized by an external control variable, the Reynolds number in the Navier-Stokes setting, with the transformer exploiting attention mechanisms to capture both temporal dependencies and parameter effects. The probabilistic VAE enables stochastic sampling of trajectory ensembles, providing predictive means and uncertainty quantification through the first two moments. After initial training on a limited set of dynamical regimes, the model is adapted to out-of-sample parameter regions using only sparse data. Its probabilistic formulation naturally supports ensemble generation, which we employ within an ensemble Kalman filtering framework to assimilate data and reconstruct full-state trajectories from minimal observations. We further show that, for the dynamical system considered, the dominant source of error in out-of-sample forecasts stems from distortions of the latent manifold rather than changes in the latent dynamics. Consequently, retraining can be limited to the autoencoder, allowing for a lightweight, computationally efficient, real-time adaptation procedure with very sparse fine-tuning data.
摘要：我们为参数化降阶模型（ROM）提出了一种有效的再训练策略，其精度可与完全再训练相媲美，同时只需要一小部分计算时间，并且仅依赖于整个系统的稀疏观察。该架构采用编码-处理-解码结构：变分自动编码器（VAE）用于执行降维，变压器网络用于演化潜在状态并对动态进行建模。 ROM 由外部控制变量（纳维-斯托克斯设置中的雷诺数）参数化，变压器利用注意力机制来捕获时间依赖性和参数效应。概率 VAE 能够对轨迹集合进行随机采样，通过前两个时刻提供预测手段和不确定性量化。在对一组有限的动态机制进行初始训练后，该模型仅使用稀疏数据适应样本外参数区域。它的概率公式自然支持集成生成，我们在集成卡尔曼滤波框架中使用它来同化数据并从最小的观察中重建全状态轨迹。我们进一步表明，对于所考虑的动力系统，样本外预测的主要误差源源于潜在流形的扭曲，而不是潜在动态的变化。因此，重新训练可以仅限于自动编码器，从而允许使用非常稀疏的微调数据进行轻量级、计算高效、实时的自适应过程。
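
The ensemble Kalman filtering step mentioned in the abstract above can be sketched generically. This is a minimal stochastic EnKF analysis update for a single scalar observation of one state component, not the authors' implementation: the function name `enkf_update`, the choice to observe one component directly, and the use of perturbed observations are all assumptions made for illustration.

```python
import random
from typing import List


def enkf_update(
    ensemble: List[List[float]],
    obs_index: int,
    obs_value: float,
    obs_var: float,
    rng: random.Random,
) -> List[List[float]]:
    """One EnKF analysis step with a scalar observation of state[obs_index].

    Assumes the ensemble has at least two members and that the observed
    component has nonzero sample variance or obs_var > 0.
    """
    n = len(ensemble)
    dim = len(ensemble[0])
    means = [sum(m[j] for m in ensemble) / n for j in range(dim)]
    y_mean = means[obs_index]
    # Sample covariance between each component and the observed component.
    cov = [
        sum((m[j] - means[j]) * (m[obs_index] - y_mean) for m in ensemble) / (n - 1)
        for j in range(dim)
    ]
    # Kalman gain for a scalar observation.
    gain = [c / (cov[obs_index] + obs_var) for c in cov]
    updated = []
    for m in ensemble:
        # Each member assimilates its own noisy copy of the observation.
        perturbed = obs_value + rng.gauss(0.0, obs_var ** 0.5)
        innovation = perturbed - m[obs_index]
        updated.append([m[j] + gain[j] * innovation for j in range(dim)])
    return updated


if __name__ == "__main__":
    prior = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
    print(enkf_update(prior, 0, 1.5, 0.0, random.Random(0)))
```

Because the gain uses cross-covariances, observing one component also corrects the unobserved ones, which is what makes reconstruction from sparse observations possible.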

Title: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Authors: Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2602.23200
Pdf URL: https://arxiv.org/pdf/2602.23200
Copy Paste: [[2602.23200]] InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models(https://arxiv.org/abs/2602.23200)
Keywords: generation
Abstract: Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Previous work proposed quantization methods that focus on compressing the KV cache while maintaining its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy. InnerQ applies group-wise quantization while grouping the cache matrices over their inner dimension. Unlike previous work that groups over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding up to 22% speedup over previous work and up to 88% over half-precision vector-matrix multiplication. To preserve fidelity under aggressive compression, InnerQ incorporates (i) hybrid quantization, selecting symmetric or asymmetric quantization per group based on local statistics; (ii) high-precision windows for both the most recent tokens and the attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the query to avoid runtime overhead. Our evaluation experiments on Llama models show that InnerQ maintains few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
摘要：在解码过程中减少大型语言模型 (LLM) 的硬件占用对于高效的长序列生成至关重要。关键瓶颈是键值 (KV) 缓存，其大小随序列长度变化，并且很容易占据模型的内存占用量。之前的工作提出了量化方法，重点是压缩 KV 缓存，同时保留其信息。我们引入了 InnerQ，这是一种硬件感知的 KV 缓存量化方案，可以在不牺牲准确性的情况下降低解码延迟。 InnerQ 应用分组量化，同时根据内部维度对缓存矩阵进行分组。与之前在外部维度上进行分组的工作不同，InnerQ 将反量化与向量矩阵乘法对齐，并实现跨 GPU 计算单元的比例因子重用。这减少了内存访问并加速了反量化，与之前的工作相比，速度提高了 22\%$，与半精度向量矩阵乘法相比，速度提高了 88\%$。为了在激进的压缩下保持保真度，InnerQ 结合了 (i) 混合量化，根据本地统计数据选择每组的对称或非对称量化； (ii) 最新令牌和注意力池令牌的高精度窗口，以减少异常值泄漏； (iii) 键缓存的每通道标准化，在预填充期间计算一次并折叠到查询中以避免运行时开销。我们对 Llama 模型的评估实验表明，InnerQ 保持了与非量化 KV 缓存相当的几次 GSM8K 性能，并超越了之前的 KV 缓存量化方法。
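
To make the "hybrid quantization" idea in the abstract above concrete, here is a minimal sketch of group-wise quantization that picks symmetric or asymmetric codes per group. It is not InnerQ: the abstract does not state the selection statistic, so the rule used here (symmetric when the group's range is roughly centered on zero) is an assumption, as are the function names, the 4-bit default, and the plain-list layout.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float, str]  # (codes, scale, offset, mode)


def quantize_group(values: List[float], bits: int = 4) -> Group:
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0.0:
        # Constant group: the offset alone reproduces every value.
        return [0] * len(values), 1.0, lo, "asymmetric"
    # Hypothetical selection rule: zero-centered ranges use symmetric codes.
    if abs(lo + hi) <= 0.1 * span:
        qmax = (1 << (bits - 1)) - 1
        scale = max(abs(lo), abs(hi)) / qmax
        return [round(v / scale) for v in values], scale, 0.0, "symmetric"
    levels = (1 << bits) - 1
    scale = span / levels
    return [round((v - lo) / scale) for v in values], scale, lo, "asymmetric"


def dequantize_group(group: Group) -> List[float]:
    codes, scale, offset, _mode = group
    return [c * scale + offset for c in codes]


def quantize_row(row: List[float], group_size: int, bits: int = 4) -> List[Group]:
    # Split one vector into contiguous groups, each with its own scale.
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


if __name__ == "__main__":
    row = [-1.0, -0.5, 0.5, 1.0, 2.0, 2.5, 3.0, 4.0]
    for group in quantize_row(row, group_size=4):
        print(group[3], dequantize_group(group))
```

Each group stores one scale (and one offset in the asymmetric case), so the reconstruction error per value is bounded by half of that group's scale.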

Title: ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Authors: Junhu Fu, Shuyu Liang, Wutong Li, Chen Ma, Peng Huang, Kehao Wang, Ke Chen, Shengli Lin, Pinghong Zhou, Zeju Li, Yuanyuan Wang, Yi Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.23203
Pdf URL: https://arxiv.org/pdf/2602.23203
Copy Paste: [[2602.23203]] ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation(https://arxiv.org/abs/2602.23203)
Keywords: generation
Abstract: Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamic-consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, breaking through the coarse guidance of diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics. ColoDiff presents an effort in controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic representation and mitigating data scarcity in clinical settings.
摘要：结肠镜检查视频生成提供动态、信息丰富的数据，这对于诊断肠道疾病至关重要，特别是在数据稀缺的情况下。高质量视频生成需要时间一致性和对临床属性的精确控制，但面临着不规则肠道结构、多样化疾病表现和各种成像方式的挑战。为此，我们提出了 ColoDiff，一种基于扩散的框架，可生成动态一致且内容感知的结肠镜检查视频，旨在缓解数据短缺并协助临床分析。在帧间级别，我们的 TimeStream 模块通过跨帧标记化机制将时间依赖性与视频序列解耦，从而在肠道结构不规则的情况下实现复杂的动态建模。在帧内级别，我们的内容感知模块结合了噪声注入嵌入和可学习原型，以实现对临床属性的精确控制，突破了扩散模型的粗略指导。此外，ColoDiff 采用非马尔可夫采样策略，可将实时生成的步骤减少 90% 以上。 ColoDiff 在三个公共数据集和一个医院数据库中进行评估，基于生成指标和下游任务，包括疾病诊断、模态区分、肠道准备评分和病变分割。大量实验表明 ColoDiff 生成的视频具有平滑的过渡和丰富的动态。 ColoDiff 在可控结肠镜检查视频生成方面做出了努力，揭示了合成视频在补充真实表现和缓解临床环境中数据稀缺方面的潜力。

Title: Through BrokenEyes: How Eye Disorders Impact Face Detection?

Authors: Prottay Kumar Adhikary
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.23212
Pdf URL: https://arxiv.org/pdf/2602.23212
Copy Paste: [[2602.23212]] Through BrokenEyes: How Eye Disorders Impact Face Detection?(https://arxiv.org/abs/2602.23212)
Keywords: generation
Abstract: Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy) and to analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.
摘要：视力障碍极大地影响了数百万人的生活，改变了视觉信息的处理和感知方式。在这项工作中，使用 BrokenEyes 系统开发了一个计算框架来模拟五种常见的眼部疾病：年龄相关性黄斑变性、白内障、青光眼、屈光不正和糖尿病视网膜病变，并分析它们对深度学习模型中类神经特征表示的影响。利用人类和非人类数据集的组合，在正常和特定疾病条件下训练的模型揭示了特征图的严重破坏，特别是对于白内障和青光眼，这与这些条件下已知的神经处理挑战相一致。激活能和余弦相似度等评估指标量化了这些扭曲的严重性，提供了对退化视觉输入和学习表征之间相互作用的见解。

Title: Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

Authors: Chenhe Du, Xuanyu Tian, Qing Wu, Muyu Liu, Jingyi Yu, Hongjiang Wei, Yuyao Zhang
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2602.23214
Pdf URL: https://arxiv.org/pdf/2602.23214
Copy Paste: [[2602.23214]] Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction(https://arxiv.org/abs/2602.23214)
Keywords: generative
Abstract: Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, theoretically guaranteeing asymptotic convergence to the exact data manifold. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.
摘要：即插即用扩散先验（PnPDP）框架已成为通过将预训练生成模型视为模块化先验来解决成像逆问题的强大范例。然而，我们发现了流行的 PnP 求解器（例如，基于 HQS 或近似梯度）的一个关键缺陷：它们充当无记忆算子，仅根据瞬时梯度更新估计。缺乏历史跟踪不可避免地会导致不消失的稳态偏差，其中重建无法严格满足严重腐败下的物理测量。为了解决这个问题，我们提出了双耦合 PnP 扩散，它恢复经典的对偶变量以提供积分反馈，理论上保证渐近收敛到精确的数据流形。然而，这种严格的几何耦合带来了第二个挑战：累积的对偶残差表现出光谱颜色的结构化伪影，违反了扩散先验的加性高斯白噪声（AWGN）假设，导致严重的幻觉。为了弥补这一差距，我们引入了频谱均质化（SH），这是一种频域适应机制，可将这些结构化残差调制为统计上兼容的伪 AWGN 输入。这有效地将求解器的严格优化轨迹与降噪器的有效统计流形保持一致。 CT 和 MRI 重建的大量实验表明，我们的方法解决了偏差与幻觉的权衡，实现了最先进的保真度，并显着加速了收敛。

Title: MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

Authors: Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.23228
Pdf URL: https://arxiv.org/pdf/2602.23228
Copy Paste: [[2602.23228]] MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction(https://arxiv.org/abs/2602.23228)
Keywords: generation
Abstract: With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
摘要：随着数字娱乐的爆炸性增长，自动视频摘要已成为内容索引、个性化推荐和高效媒体归档等应用不可或缺的一部分。电影和电视剧等长视频的自动概要生成对现有视觉语言模型 (VLM) 提出了重大挑战。虽然精通单图像字幕，但这些通用模型经常在长时间的环境中表现出严重的失败，主要是缺乏 ID 一致的角色识别和支离破碎的叙事连贯性。为了克服这些限制，我们提出了 MovieTeller，这是一种通过工具增强渐进抽象生成电影概要的新颖框架。我们的核心贡献是一个免培训、工具增强、基于事实的生成过程。我们的框架不需要进行昂贵的模型微调，而是以即插即用的方式直接利用现成的模型。我们首先调用专门的人脸识别模型作为外部“工具”来建立事实基础——精确的角色身份及其相应的边界框。然后将这些基础注入到提示中以引导 VLM 的推理，确保生成的场景描述锚定到可验证的事实。此外，我们的渐进式抽象管道将全长电影的摘要分解为多阶段过程，有效缓解了当前 VLM 的上下文长度限制。实验表明，与端到端基线相比，我们的方法在事实准确性、角色一致性和整体叙事连贯性方面取得了显着改善。

Title: Large Multimodal Models as General In-Context Classifiers

Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.23229
Pdf URL: https://arxiv.org/pdf/2602.23229
Copy Paste: [[2602.23229]] Large Multimodal Models as General In-Context Classifiers(https://arxiv.org/abs/2602.23229)
Keywords: generative
Abstract: Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers and as a flexible alternative to specialized models.
摘要：我们应该使用哪种多模态模型进行分类？先前的研究表明，答案在于类似 CLIP 的对比视觉语言模型 (VLM)，因为它们在零样本分类中具有出色的性能。相比之下，大型多模态模型（LMM）更适合复杂的任务。在这项工作中，我们认为这个答案忽视了 LMM 的一项重要功能：上下文学习。我们在不同的数据集上对最先进的 LMM 进行封闭世界分类的基准测试，发现虽然它们的零样本性能低于 CLIP，但具有一些上下文示例的 LMM 可以匹配甚至超过具有基于缓存的适配器的对比 VLM（它们的“上下文”等效项）。我们将此分析扩展到开放世界环境，LMM 的生成性质使它们更适合该任务。在这种具有挑战性的场景中，只要提供不完美的上下文信息，LMM 就会陷入困境。为了解决这个问题，我们提出了 CIRCLE，这是一种简单的免训练方法，它将伪标签分配给上下文中的示例，并使用可用上下文本身迭代地完善它们。通过大量实验，我们表明 CIRCLE 为开放世界分类建立了稳健的基线，超越了 VLM 同类产品，并突显了 LMM 作为统一分类器和专业模型的灵活替代方案的潜力。

Title: Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Authors: Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu, Wei-Shi Zheng, Nicu Sebe
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2602.23259
Pdf URL: https://arxiv.org/pdf/2602.23259
Copy Paste: [[2602.23259]] Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving(https://arxiv.org/abs/2602.23259)
Keywords: generative
Abstract: With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of "only driving like the expert" suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
摘要：随着模仿学习（IL）和大规模驾驶数据集的进步，端到端自动驾驶（E2E-AD）最近取得了巨大进展。目前，基于IL的方法已成为主流范式：模型依赖于专家给出的标准驾驶行为，并学习最小化其行为与专家行为之间的差异。然而，“只像专家一样驾驶”这一目标的泛化能力有限：当遇到专家演示分布之外的罕见或未见的长尾场景时，模型往往会在缺乏先验经验的情况下做出不安全的决策。这就提出了一个基本问题：E2E-AD 系统能否在没有任何专家行动监督的情况下做出可靠的决策？受此启发，我们提出了一个名为风险感知世界模型预测控制（RaWMPC）的统一框架，通过稳健控制来解决这种泛化困境，而不依赖于专家演示。实际上，RaWMPC 利用世界模型来预测多个候选行动的后果，并通过明确的风险评估选择低风险行动。为了赋予世界模型预测危险驾驶行为结果的能力，我们设计了一种风险感知交互策略，系统地将世界模型暴露于危险行为，使灾难性结果可预测，从而可以避免。此外，为了在测试时生成低风险的候选动作，我们引入了一种自我评估蒸馏方法，将风险规避能力从训练有素的世界模型中提取到生成动作提案网络中，而无需任何专家演示。大量实验表明，RaWMPC 在分布内和分布外场景中均优于最先进的方法，同时提供卓越的决策可解释性。

Title: Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling

Authors: Jasmine Bayrooti, Weiwei Kong, Natalia Ponomareva, Carlos Esteves, Ameesh Makadia, Amanda Prorok
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2602.23262
Pdf URL: https://arxiv.org/pdf/2602.23262
Copy Paste: [[2602.23262]] Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling(https://arxiv.org/abs/2602.23262)
Keywords: super-resolution, generation, generative
Abstract: Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees essential. While differential privacy (DP) provides a principled framework for such guarantees, standard DP finetuning (e.g., with DP-SGD) often results in severe degradation of image quality, particularly in high-frequency textures, due to the indiscriminate addition of noise across all model parameters. In this work, we propose a spectral DP framework based on the hypothesis that the most privacy-sensitive portions of an image are often low-frequency components in the wavelet space (e.g., facial features and object shapes) while high-frequency components are largely generic and public. Based on this hypothesis, we propose the following two-stage framework for DP image generation with coarse image intermediaries: (1) DP finetune an autoregressive spectral image tokenizer model on the low-resolution wavelet coefficients of the sensitive images, and (2) perform high-resolution upsampling using a publicly pretrained super-resolution model. By restricting the privacy budget to the global structures of the image in the first stage, and leveraging the post-processing property of DP for detail refinement, we achieve promising trade-offs between privacy and utility. Experiments on the MS-COCO and MM-CelebA-HQ datasets show that our method generates images with improved quality and style capture relative to other leading DP image frameworks.
摘要：在敏感图像数据集上训练的生成模型存在记忆和复制单个训练示例的风险，因此强有力的隐私保证至关重要。虽然差分隐私 (DP) 为此类保证提供了原则性框架，但标准 DP 微调（例如使用 DP-SGD）通常会导致图像质量严重下降，特别是在高频纹理中，因为所有模型参数中都会不加区别地添加噪声。在这项工作中，我们提出了一个基于以下假设的谱DP框架：图像中对隐私最敏感的部分通常是小波空间中的低频分量（例如面部特征和物体形状），而高频分量主要是通用和公共的。基于这一假设，我们提出了以下两阶段框架，用于使用粗糙图像中介生成 DP 图像：（1）DP 在敏感图像的低分辨率小波系数上微调自回归光谱图像标记器模型，以及（2）使用公开预训练的超分辨率模型执行高分辨率上采样。通过在第一阶段将隐私预算限制在图像的全局结构，并利用 DP 的后处理特性进行细节细化，我们在隐私和实用性之间实现了有希望的权衡。 MS-COCO 和 MM-CelebA-HQ 数据集上的实验表明，相对于其他领先的 DP 图像框架，我们的方法生成的图像质量和风格捕捉都有所提高。

Title: ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Authors: Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty, Vishnu Suresh Lokhande
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.23295
Pdf URL: https://arxiv.org/pdf/2602.23295
Copy Paste: [[2602.23295]] ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation(https://arxiv.org/abs/2602.23295)
Keywords: generation, generative
Abstract: Large datasets hinder efficient model training and often contain redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which are often rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold-consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, L2 distance between real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
摘要：近年来，大型数据集阻碍了有效的模型训练，同时还包含冗余概念。数据集蒸馏的目的是合成紧凑的数据集，保留大规模训练集的知识，同时大幅减少存储和计算。扩散模型的最新进展通过利用预先训练的生成先验实现了免训练蒸馏；然而，现有的指导战略仍然有限。当前基于分数的方法要么执行无引导的去噪，要么依赖于对实例原型质心（IPC 质心）的简单的基于模式的指导，这通常是初级的且次优的。我们提出了Manifold-Guided Distillation (ManifoldGD)，这是一种基于扩散的免训练框架，在每个去噪时间步长中集成了多种一致的指导。我们的方法采用通过 VAE 潜在特征的分层、分裂聚类计算的 IPC，产生 IPC 的多尺度核心集，该 IPC 可以捕获粗略的语义模式和精细的类内变异性。使用提取的 IPC 质心的局部邻域，我们为每个扩散去噪时间步创建潜在流形。在每个去噪步骤中，我们将模式对齐向量投影到估计的潜在流形的局部切线空间上，从而限制生成轨迹保持流形忠实，同时保持语义一致性。该公式提高了代表性、多样性和图像保真度，无需任何模型重新训练。实证结果表明，在 FID、真实和合成数据集嵌入之间的 l2 距离以及分类精度方面，与现有的免训练和基于训练的基线相比，ManifoldGD 取得了一致的成果，将 ManifoldGD 确立为第一个几何感知的免训练数据蒸馏框架。
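
The tangent-space projection described in the abstract above can be illustrated with a small geometric sketch. The abstract does not say how the local tangent space is estimated, so this example swaps in a simple stand-in: it assumes the tangent space at a point is spanned by the offsets to nearby centroids and orthonormalizes them with Gram-Schmidt. The function names and the toy 3D data are hypothetical.

```python
import math
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def tangent_basis(
    center: Sequence[float],
    neighbors: Sequence[Sequence[float]],
    tol: float = 1e-9,
) -> List[List[float]]:
    """Orthonormal basis for the span of (neighbor - center) offsets."""
    basis: List[List[float]] = []
    for p in neighbors:
        v = [a - b for a, b in zip(p, center)]
        # Gram-Schmidt: remove components along the basis found so far.
        for b in basis:
            coeff = _dot(v, b)
            v = [vi - coeff * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(_dot(v, v))
        if norm > tol:  # skip offsets that add no new direction
            basis.append([vi / norm for vi in v])
    return basis


def project_onto_tangent(
    vector: Sequence[float], basis: Sequence[Sequence[float]]
) -> List[float]:
    """Keep only the part of `vector` that lies in the tangent space."""
    out = [0.0] * len(vector)
    for b in basis:
        coeff = _dot(vector, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out


if __name__ == "__main__":
    # Neighbors lie in the xy-plane, so the off-plane z component is dropped.
    basis = tangent_basis([0.0, 0.0, 0.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 2.0, 0.0]])
    print(project_onto_tangent([1.0, 2.0, 3.0], basis))
```

Discarding the normal component is what keeps a guidance step from pushing a sample off the estimated manifold.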

Title: PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

Authors: Yiqing Wang, Chunming He, Ming-Chen Lu, Mercy Pawar, Leslie Niziol, Maria Woodward, Sina Farsiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2602.23297
Pdf URL: https://arxiv.org/pdf/2602.23297
Copy Paste: [[2602.23297]] PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM(https://arxiv.org/abs/2602.23297)
Keywords: generation
Abstract: Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy utilizing DINOv3 and our refined BERT, optimized by a suite of four complementary loss functions. These losses are designed to capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse these aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming other state-of-the-art methods. Notably, our framework achieves superior robustness without the need for massive data collection or exhaustive computational resources. Our code will be made public upon acceptance.
摘要：医学诊断需要视觉表现和临床元数据的有效综合。然而，现有方法通常将元数据视为孤立的标签，未能利用临床描述中嵌入的丰富语义知识。我们提出了 PRIMA（风险集成图像元数据对齐的预训练），这是一个将特定领域知识集成到多模态表示学习中的框架。我们首先通过检索增强生成 (RAG) 策划风险与疾病相关性的专家语料库，以完善临床 ModernBERT，将诊断先验嵌入到文本编码器中。为了弥补模态差距，我们引入了一种利用 DINOv3 和我们改进的 BERT 的双编码器预训练策略，并通过一组四个互补损失函数进行了优化。这些损失旨在捕获多粒度语义对齐并通过软标签处理临床相关性的模糊性。最后，我们利用 Qwen-3 融合这些对齐的特征以进行精确的疾病分类。大量实验表明，PRIMA 有效地将像素级特征与抽象临床专业知识相协调，显着优于其他最先进的方法。值得注意的是，我们的框架无需大量数据收集或详尽的计算资源即可实现卓越的稳健性。我们的代码将在接受后公开。

Title: A Proper Scoring Rule for Virtual Staining

Authors: Samuel Tonks, Steve Hood, Ryan Musso, Ceridwen Hopely, Steve Titus, Minh Doan, Iain Styles, Alexander Krull
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2602.23305
Pdf URL: https://arxiv.org/pdf/2602.23305
Copy Paste: [[2602.23305]] A Proper Scoring Rule for Virtual Staining(https://arxiv.org/abs/2602.23305)
Keywords: generative
Abstract: Generative virtual staining (VS) models for high-throughput screening (HTS) can provide an estimated posterior distribution of possible biological feature values for each input and cell. However, when evaluating a VS model, the true posterior is unavailable. Existing evaluation protocols only check the accuracy of the marginal distribution over the dataset rather than the predicted posteriors. We introduce information gain (IG) as a cell-wise evaluation framework that enables direct assessment of predicted posteriors. IG is a strictly proper scoring rule and comes with a sound theoretical motivation allowing for interpretability, and for comparing results across models and features. We evaluate diffusion- and GAN-based models on an extensive HTS dataset using IG and other metrics and show that IG can reveal substantial performance differences other metrics cannot.
摘要：用于高通量筛选 (HTS) 的生成虚拟染色 (VS) 模型可以提供每个输入和细胞的可能生物特征值的估计后验分布。然而，在评估 VS 模型时，真实的后验是不可用的。现有的评估协议仅检查数据集上边缘分布的准确性，而不是预测后验。我们引入信息增益（IG）作为细胞级评估框架，可以直接评估预测的后验。 IG 是一个严格正确的评分规则，并具有合理的理论动机，允许可解释性以及比较模型和特征之间的结果。我们使用 IG 和其他指标在广泛的 HTS 数据集上评估基于扩散和基于 GAN 的模型，并表明 IG 可以揭示其他指标无法揭示的实质性性能差异。
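
The abstract above calls IG a strictly proper scoring rule but does not give its formula. As a generic illustration of strict propriety only (not the paper's IG), the sketch below uses the logarithmic score on a discrete distribution: the expected score is highest when the predicted distribution equals the true one. The function name and the example probabilities are hypothetical.

```python
import math
from typing import List


def expected_log_score(true_probs: List[float], predicted_probs: List[float]) -> float:
    """Expected log score of a prediction when outcomes follow true_probs.

    Assumes predicted_probs is strictly positive wherever true_probs is.
    """
    return sum(
        p * math.log(q)
        for p, q in zip(true_probs, predicted_probs)
        if p > 0.0
    )


if __name__ == "__main__":
    truth = [0.7, 0.2, 0.1]
    candidates = {
        "honest": truth,
        "overconfident": [0.9, 0.05, 0.05],
        "uniform": [1 / 3, 1 / 3, 1 / 3],
    }
    for name, probs in candidates.items():
        print(name, expected_log_score(truth, probs))
```

A strictly proper rule rewards reporting the true posterior, which is the property that lets a score assess predicted posteriors directly rather than only dataset-level marginals.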

Title: ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Authors: Tianjun Yao, Yongqiang Chen, Yujia Zheng, Pan Li, Zhiqiang Shen, Kun Zhang
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2602.23320
Pdf URL: https://arxiv.org/pdf/2602.23320
Copy Paste: [[2602.23320]] ParamMem: Augmenting Language Agents with Parametric Reflective Memory(https://arxiv.org/abs/2602.23320)
Keywords: generation
Abstract: Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive correlation between reflective diversity and task success, further motivating the need for diverse reflection signals. We introduce ParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Building on this module, we propose ParamAgent, a reflection-based agent framework that integrates parametric memory with episodic and cross-sample memory. Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. Further analysis reveals that ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without reliance on a stronger external model, highlighting the potential of ParamMem as an effective component for enhancing language agents.
摘要：自我反思使语言代理能够迭代地完善解决方案，但通常会产生限制推理性能的重复输出。最近的研究试图通过各种方法来解决这一限制，其中增加反思多样性已显示出希望。我们的实证分析揭示了反射多样性与任务成功之间存在很强的正相关性，进一步激发了对多样化反射信号的需求。我们引入了 ParamMem，这是一种参数化存储模块，可将跨样本反射模式编码为模型参数，从而通过温度控制采样实现多种反射生成。在此模块的基础上，我们提出了 ParamAgent，一个基于反射的代理框架，它将参数内存与情景和跨样本内存集成在一起。关于代码生成、数学推理和多跳问答的广泛实验证明了对最先进基线的持续改进。进一步的分析表明，ParamMem 具有样本效率，能够跨模型尺度从弱到强的迁移，并且支持自我改进，而不依赖于更强的外部模型，凸显了 ParamMem 作为增强语言代理的有效组件的潜力。

Title: SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Authors: Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2602.23359
Pdf URL: https://arxiv.org/pdf/2602.23359
Copy Paste: [[2602.23359]] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation(https://arxiv.org/abs/2602.23359)
Keywords: generation
Abstract: We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
摘要：我们将遮挡推理视为 3D 布局条件生成的一个基本但被忽视的方面。它对于合成具有深度一致的几何形状和比例的部分遮挡的对象至关重要。虽然现有方法可以生成遵循输入布局的真实场景，但它们通常无法对精确的对象间遮挡进行建模。我们提出了 SeeThrough3D，这是一种用于 3D 布局条件生成的模型，可显式模拟遮挡。我们引入了遮挡感知 3D 场景表示 (OSCR)，其中对象被描绘为放置在虚拟环境中的半透明 3D 框，并从所需的摄像机视角进行渲染。透明度对隐藏对象区域进行编码，使模型能够推理遮挡，而渲染的视点在生成过程中提供显式的相机控制。我们通过引入一组从我们渲染的 3D 表示派生的视觉标记来调节基于预训练流的文本到图像图像生成模型。此外，我们应用屏蔽自注意力将每个对象边界框准确地绑定到其相应的文本描述，从而能够准确生成多个对象而无需对象属性混合。为了训练模型，我们构建了一个包含具有强对象间遮挡的多种多对象场景的合成数据集。 SeeThrough3D 可以有效地推广到看不见的对象类别，并通过逼真的遮挡和一致的相机控制实现精确的 3D 布局控制。