2026-01-16

Title: NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration

Authors: Subhajit Sanyal, Srinivas Soumitri Miriyala, Akshay Janardan Bankar, Sravanth Kodavanti, Harshit, Abhishek Ameta, Shreyas Pandith, Amit Satish Unde
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.09823
Pdf URL: https://arxiv.org/pdf/2601.09823
Copy Paste: [[2601.09823]] NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration(https://arxiv.org/abs/2601.09823)
Keywords: restoration, super-resolution, generation, generative
Abstract: Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the diffusion trajectory, which disrupts the underlying latent manifold and limits generalization beyond a single task. We introduce NanoSD, a family of Pareto-optimal diffusion foundation models distilled from Stable Diffusion 1.5 through network surgery, feature-wise generative distillation, and structured architectural scaling jointly applied to the U-Net and the VAE encoder-decoder. This full-pipeline co-design preserves the generative prior while producing models that occupy distinct operating points along the accuracy-latency-size frontier (e.g., 130M-315M parameters, achieving real-time inference down to 20ms on mobile-class NPUs). We show that parameter reduction alone does not correlate with hardware efficiency, and we provide an analysis revealing how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation, outperforming prior lightweight diffusion models in both perceptual quality and practical deployability. NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices.

Title: ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning

Authors: Po-han Li, Shenghui Chen, Ufuk Topcu, Sandeep Chinchali
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2601.09851
Pdf URL: https://arxiv.org/pdf/2601.09851
Copy Paste: [[2601.09851]] ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning(https://arxiv.org/abs/2601.09851)
Keywords: generative
Abstract: Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by $7\%$ in VQA accuracy without increasing processing load.

Title: VibrantSR: Sub-Meter Canopy Height Models from Sentinel-2 Using Generative Flow Matching

Authors: Kiarie Ndegwa, Andreas Gros, Tony Chang, David Diaz, Vincent A. Landau, Nathan E. Rutenbeck, Luke J. Zachmann, Guy Bayes, Scott Conway
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.09866
Pdf URL: https://arxiv.org/pdf/2601.09866
Copy Paste: [[2601.09866]] VibrantSR: Sub-Meter Canopy Height Models from Sentinel-2 Using Generative Flow Matching(https://arxiv.org/abs/2601.09866)
Keywords: super-resolution, generative
Abstract: We present VibrantSR (Vibrant Super-Resolution), a generative super-resolution framework for estimating 0.5 meter canopy height models (CHMs) from 10 meter Sentinel-2 imagery. Unlike approaches based on aerial imagery that are constrained by infrequent and irregular acquisition schedules, VibrantSR leverages globally available Sentinel-2 seasonal composites, enabling consistent monitoring at a seasonal-to-annual cadence. Evaluated across 22 EPA Level 3 eco-regions in the western United States using spatially disjoint validation splits, VibrantSR achieves a Mean Absolute Error of 4.39 meters for canopy heights >= 2 m, outperforming Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m) satellite-based benchmarks. While aerial-based VibrantVS (2.71 m MAE) retains an accuracy advantage, VibrantSR enables operational forest monitoring and carbon accounting at continental scales without reliance on costly and temporally infrequent aerial acquisitions.
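The headline metric above is a Mean Absolute Error restricted to canopy heights >= 2 m. A minimal sketch of such a thresholded MAE follows; the function name `masked_mae` is hypothetical, and applying the 2 m threshold to the reference heights (rather than the predictions) is an assumption, since the abstract does not say which side is masked.

```python
def masked_mae(predicted, reference, min_height=2.0):
    """Mean absolute error over pixels whose reference height is >= min_height.

    Assumption: the threshold is applied to the reference heights.
    """
    pairs = [(p, r) for p, r in zip(predicted, reference) if r >= min_height]
    if not pairs:
        raise ValueError("no reference pixels at or above min_height")
    return sum(abs(p - r) for p, r in pairs) / len(pairs)


# Heights in meters; the first pixel is excluded because its reference is < 2 m.
error = masked_mae([1.0, 5.0, 12.0], [0.5, 7.0, 10.0])
```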

Title: MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation

Authors: Yang Xing, Jiong Wu, Savas Ozdemir, Ying Zhang, Yang Yang, Wei Shao, Kuang Gong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.09879
Pdf URL: https://arxiv.org/pdf/2601.09879
Copy Paste: [[2601.09879]] MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation(https://arxiv.org/abs/2601.09879)
Keywords: generation
Abstract: Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we propose MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.

Title: Transition Matching Distillation for Fast Video Generation

Authors: Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, Arash Vahdat
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.09881
Pdf URL: https://arxiv.org/pdf/2601.09881
Copy Paste: [[2601.09881]] Transition Matching Distillation for Fast Video Generation(https://arxiv.org/abs/2601.09881)
Keywords: generation
Abstract: Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence.

Title: In-Context Operator Learning on the Space of Probability Measures

Authors: Frank Cole, Dixi Wang, Yineng Chen, Yulong Lu, Rongjie Lai
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2601.09979
Pdf URL: https://arxiv.org/pdf/2601.09979
Copy Paste: [[2601.09979]] In-Context Operator Learning on the Space of Probability Measures(https://arxiv.org/abs/2601.09979)
Keywords: generative
Abstract: We introduce in-context operator learning on probability measure spaces for optimal transport (OT). The goal is to learn a single solution operator that maps a pair of distributions to the OT map, using only few-shot samples from each distribution as a prompt and without gradient updates at inference. We parameterize the solution operator and develop scaling-law theory in two regimes. In the nonparametric setting, when tasks concentrate on a low-intrinsic-dimension manifold of source-target pairs, we establish generalization bounds that quantify how in-context accuracy scales with prompt size, intrinsic task dimension, and model capacity. In the parametric setting (e.g., Gaussian families), we give an explicit architecture that recovers the exact OT map in context and provide finite-sample excess-risk bounds. Our numerical experiments on synthetic transports and generative-modeling benchmarks validate the framework.
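To make the Gaussian case concrete: for two one-dimensional Gaussians with means m1, m2 and standard deviations s1, s2, the OT map under quadratic cost is the affine map T(x) = m2 + (s2 / s1) * (x - m1). The sketch below is a plain plug-in estimator that builds this map from a few samples of each distribution. It is not the architecture proposed in the paper; the function name `gaussian_ot_map_1d` is hypothetical and the inputs are assumed to be roughly Gaussian.

```python
import statistics


def gaussian_ot_map_1d(source_samples, target_samples):
    """Plug-in estimate of the 1D Gaussian OT map from few-shot samples.

    Assumption: both sample sets come from (approximately) Gaussian laws,
    so the quadratic-cost OT map is affine.
    """
    m1 = statistics.fmean(source_samples)
    m2 = statistics.fmean(target_samples)
    s1 = statistics.pstdev(source_samples)
    s2 = statistics.pstdev(target_samples)
    if s1 == 0:
        raise ValueError("source samples have zero variance")
    scale = s2 / s1

    def transport(x):
        # Shift to the source mean, rescale, then shift to the target mean.
        return m2 + scale * (x - m1)

    return transport


# The "prompt" is a handful of samples from each distribution.
T = gaussian_ot_map_1d([-1.0, 0.0, 1.0], [8.0, 10.0, 12.0])
mapped = T(1.0)
```

No gradient step is involved: the map is recomputed from each new pair of sample sets, which is the in-context behavior the abstract describes for the parametric regime.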

Title: FaTRQ: Tiered Residual Quantization for LLM Vector Search in Far-Memory-Aware ANNS Systems

Authors: Tianqi Zhang, Flavio Ponzina, Tajana Rosing
Subjects: cs.LG, cs.AR, cs.IR
Abstract URL: https://arxiv.org/abs/2601.09985
Pdf URL: https://arxiv.org/pdf/2601.09985
Copy Paste: [[2601.09985]] FaTRQ: Tiered Residual Quantization for LLM Vector Search in Far-Memory-Aware ANNS Systems(https://arxiv.org/abs/2601.09985)
Keywords: generation
Abstract: Approximate Nearest-Neighbor Search (ANNS) is a key technique in retrieval-augmented generation (RAG), enabling rapid identification of the most relevant high-dimensional embeddings from massive vector databases. Modern ANNS engines accelerate this process using prebuilt indexes and store compressed vector-quantized representations in fast memory. However, they still rely on a costly second-pass refinement stage that reads full-precision vectors from slower storage like SSDs. For modern text and multimodal embeddings, these reads now dominate the latency of the entire query. We propose FaTRQ, a far-memory-aware refinement system using tiered memory that eliminates the need to fetch full vectors from storage. It introduces a progressive distance estimator that refines coarse scores using compact residuals streamed from far memory. Refinement stops early once a candidate is provably outside the top-k. To support this, we propose tiered residual quantization, which encodes residuals as ternary values stored efficiently in far memory. A custom accelerator is deployed in a CXL Type-2 device to perform low-latency refinement locally. Together, these techniques let FaTRQ improve storage efficiency by 2.4$\times$ and throughput by up to 9$\times$ over a SOTA GPU ANNS system.
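The early-stopping idea can be illustrated with a much simpler, classic technique: partial-distance early abandoning. The squared Euclidean distance accumulated over a prefix of the dimensions is a lower bound on the full distance, so a candidate can be dropped as soon as that bound exceeds the current k-th best distance. The sketch below is not FaTRQ's estimator or its ternary residual encoding; the function name `top_k_early_abandon` and the `chunk` parameter (standing in for one streamed read) are hypothetical.

```python
import heapq


def top_k_early_abandon(query, candidates, k, chunk=2):
    """Exact top-k by squared L2 distance with partial-distance early abandoning.

    Returns (results, chunks_read), where results is a list of
    (candidate_index, squared_distance) sorted by distance.
    """
    heap = []  # max-heap of the current best k, stored as (-distance, index)
    chunks_read = 0
    for idx, vec in enumerate(candidates):
        threshold = -heap[0][0] if len(heap) == k else float("inf")
        partial = 0.0
        abandoned = False
        for start in range(0, len(vec), chunk):
            stop = start + chunk
            for q, v in zip(query[start:stop], vec[start:stop]):
                partial += (q - v) ** 2
            chunks_read += 1
            # The partial sum can only grow, so it is a valid lower bound.
            if partial > threshold:
                abandoned = True
                break
        if abandoned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (-partial, idx))
        else:
            heapq.heapreplace(heap, (-partial, idx))
    ordered = sorted((-neg_dist, idx) for neg_dist, idx in heap)
    return [(idx, dist) for dist, idx in ordered], chunks_read


query = [0.0, 0.0, 0.0, 0.0]
candidates = [
    [1.0, 0.0, 0.0, 0.0],
    [10.0, 10.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [5.0, 5.0, 5.0, 5.0],  # abandoned after its first chunk
]
results, chunks_read = top_k_early_abandon(query, candidates, k=2)
```

In this toy run the last candidate is rejected after reading only half of its dimensions, which is the kind of saving a progressive estimator aims for when reads from far memory are the bottleneck.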

Title: DW-DGAT: Dynamically Weighted Dual Graph Attention Network for Neurodegenerative Disease Diagnosis

Authors: Chengjia Liang, Zhenjiong Wang, Chao Chen, Ruizhi Zhang, Songxi Liang, Hai Xie, Haijun Lei, Zhongwei Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.10001
Pdf URL: https://arxiv.org/pdf/2601.10001
Copy Paste: [[2601.10001]] DW-DGAT: Dynamically Weighted Dual Graph Attention Network for Neurodegenerative Disease Diagnosis(https://arxiv.org/abs/2601.10001)
Keywords: generation, generative
Abstract: Parkinson's disease (PD) and Alzheimer's disease (AD) are the two most prevalent and incurable neurodegenerative diseases (NDs) worldwide, for which early diagnosis is critical to delay their progression. However, the high dimensionality of multi-metric data with diverse structural forms, the heterogeneity of neuroimaging and phenotypic data, and class imbalance collectively pose significant challenges to early ND diagnosis. To address these challenges, we propose a dynamically weighted dual graph attention network (DW-DGAT) that integrates: (1) a general-purpose data fusion strategy to merge three structural forms of multi-metric data; (2) a dual graph attention architecture based on brain regions and inter-sample relationships to extract both micro- and macro-level features; and (3) a class weight generation mechanism combined with two stable and effective loss functions to mitigate class imbalance. Rigorous experiments, based on the Parkinson Progression Marker Initiative (PPMI) and Alzheimer's Disease Neuroimaging Initiative (ADNI) studies, demonstrate the state-of-the-art performance of our approach.

Title: Continuous-Depth Transformers with Learned Control Dynamics

Authors: Peter Jemley
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2601.10007
Pdf URL: https://arxiv.org/pdf/2601.10007
Copy Paste: [[2601.10007]] Continuous-Depth Transformers with Learned Control Dynamics(https://arxiv.org/abs/2601.10007)
Keywords: generation
Abstract: We present a hybrid transformer architecture that replaces discrete middle layers with a continuous-depth Neural Ordinary Differential Equation (ODE) block, enabling inference-time control over generation attributes via a learned steering signal. Unlike standard transformers that process representations through fixed discrete layers, our approach treats depth as a continuous variable governed by a learned vector field $F_\theta(H, \tau, u)$, where $u$ is a low-dimensional control signal injected via explicit concatenation. We validate the architecture through four experiments: (1) gradient flow stability with zero exploding/vanishing gradient events, (2) semantic steering achieving 98\%/88\% accuracy for positive/negative sentiment control, (3) continuous interpolation validated by a negligible 0.068\% trajectory divergence between fixed and adaptive solvers, and (4) efficiency benchmarking demonstrating latency parity with standard discrete baselines. Additionally, we show that adaptive ODE solvers reveal geometric structure in the learned dynamics: the control signal partitions the vector field into distinct dynamical regimes with different curvature characteristics. The adjoint method enables $O(1)$ memory training regardless of integration depth. Our results demonstrate that continuous-depth dynamics with learned control signals provide a viable, efficient mechanism for steerable language generation.
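As a rough illustration of treating depth as a continuous variable, the sketch below integrates dH/dtau = F(H, tau, u) from tau = 0 to tau = 1 with a fixed-step fourth-order Runge-Kutta solver. The vector field `toy_field` is a hand-written stand-in for the learned $F_\theta$ (a linear contraction whose rate is set by the control signal u); it and the function name `rk4_integrate` are hypothetical and only show how a control signal can steer the trajectory.

```python
import math


def rk4_integrate(field, h0, u, steps=10, t0=0.0, t1=1.0):
    """Integrate dH/dtau = field(H, tau, u) with fixed-step RK4."""
    h = list(h0)
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = field(h, t, u)
        k2 = field([x + 0.5 * dt * k for x, k in zip(h, k1)], t + 0.5 * dt, u)
        k3 = field([x + 0.5 * dt * k for x, k in zip(h, k2)], t + 0.5 * dt, u)
        k4 = field([x + dt * k for x, k in zip(h, k3)], t + dt, u)
        h = [
            x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(h, k1, k2, k3, k4)
        ]
        t += dt
    return h


def toy_field(h, tau, u):
    # Stand-in for the learned vector field: the control u sets the decay rate.
    return [-u * x for x in h]


weak = rk4_integrate(toy_field, [1.0, 2.0], u=0.5)
strong = rk4_integrate(toy_field, [1.0, 2.0], u=2.0)
exact_weak = [math.exp(-0.5), 2.0 * math.exp(-0.5)]
```

For this toy field the exact solution is H(1) = H(0) * exp(-u), so the same initial state ends up in different places depending on u, and the fixed-step result can be checked against the closed form.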

Title: CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

Authors: Chengzhuo Tong, Mingkun Chang, Shenglong Zhang, Yuran Wang, Cheng Liang, Zhizheng Zhao, Ruichuan An, Bohan Zeng, Yang Shi, Yifan Dai, Ziming Zhao, Guanbin Li, Pengfei Wan, Yuanxing Zhang, Wentao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10061
Pdf URL: https://arxiv.org/pdf/2601.10061
Copy Paste: [[2601.10061]] CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation(https://arxiv.org/abs/2601.10061)
Keywords: generation
Abstract: Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we enable independent encoding operation for each frame. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.

Title: Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks

Authors: Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10090
Pdf URL: https://arxiv.org/pdf/2601.10090
Copy Paste: [[2601.10090]] Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks(https://arxiv.org/abs/2601.10090)
Keywords: generation
Abstract: In this paper, we propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task, thereby improving the performance of dataset distillation. Deep neural networks achieve remarkable performance but have time- and storage-consuming training processes. Dataset distillation has been proposed to generate compact, high-quality distilled datasets, enabling effective model training while maintaining downstream performance. Existing approaches typically focus on features extracted from the original dataset, overlooking task-specific information, which leads to a target gap between the distillation objective and the downstream task. We propose incorporating characteristics that benefit downstream training into dataset distillation to bridge this gap. Focusing on the downstream task of image classification, we introduce the concept of difficulty and propose DGS as a plug-in post-stage sampling module. Following a specific target difficulty distribution, the final distilled dataset is sampled from image pools generated by existing methods. We also propose difficulty-aware guidance (DAG) to explore the effect of difficulty in the generation process. Extensive experiments across multiple settings demonstrate the effectiveness of the proposed methods. They also highlight the broader potential of difficulty for diverse downstream tasks.
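One simple way to sample a pool according to a target difficulty distribution is to bin items by a difficulty score and draw a per-bin quota. The sketch below does that; it is an illustration, not the DGS procedure from the paper. The function name `difficulty_guided_sample`, the assumption that difficulty is a score in [0, 1) (for example, one minus a classifier's confidence), and the equal-width binning are all hypothetical.

```python
import random


def difficulty_guided_sample(pool, target_probs, n_samples, seed=0):
    """Sample item ids so that difficulty bins follow target_probs.

    pool: list of (item_id, difficulty) with difficulty assumed in [0, 1).
    target_probs: desired share of samples per equal-width difficulty bin.
    Quotas are rounded, so the total can differ slightly from n_samples.
    """
    rng = random.Random(seed)
    n_bins = len(target_probs)
    bins = [[] for _ in range(n_bins)]
    for item_id, difficulty in pool:
        index = min(int(difficulty * n_bins), n_bins - 1)
        bins[index].append(item_id)
    selected = []
    for index, prob in enumerate(target_probs):
        # Never request more items than the bin actually holds.
        quota = min(round(prob * n_samples), len(bins[index]))
        selected.extend(rng.sample(bins[index], quota))
    return selected


# Toy pool: item i has difficulty i / 100, so ids 0-33 are easy, 67-99 are hard.
pool = [(i, i / 100) for i in range(100)]
chosen = difficulty_guided_sample(pool, [0.2, 0.5, 0.3], n_samples=10)
```

Because the sampler only needs a pool and a score per item, it can sit after any existing generator, which matches the plug-in, post-stage role the abstract describes.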

Title: Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text

Authors: Piyush Singh Pasi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.10096
Pdf URL: https://arxiv.org/pdf/2601.10096
Copy Paste: [[2601.10096]] Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text(https://arxiv.org/abs/2601.10096)
Keywords: generation
Abstract: Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely heavily on machine translation, while advances in multilingual text modeling remain underutilized. We introduce METAL, a lightweight alignment method that learns only a few linear layers using English text alone to map multilingual text embeddings into a multimodal space. Despite its simplicity, METAL matches baseline performance in English (94.9 percent Recall at 10) and achieves strong zero-shot transfer (89.5 percent Recall at 10 averaged across 11 languages, 10 unseen) on XTD text-to-image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, METAL generalizes to audio-text retrieval and cross-lingual text-to-image generation. We release code and checkpoints, as well as multilingual evaluation datasets including MSCOCO Multilingual 30K, AudioCaps Multilingual, and Clotho Multilingual, to facilitate further research.

Title: FlowAct-R1: Towards Interactive Humanoid Video Generation

Authors: Lizhen Wang, Yongming Zhu, Zhipeng Ge, Youwei Zheng, Longhao Zhang, Tianshu Hu, Shiyang Qin, Mingshuang Luo, Jiaxu Zhang, Xin Chen, Yulong Wang, Zerong Zheng, Jianwen Jiang, Chao Liang, Weifeng Chen, Xing Wang, Yuan Zhang, Mingyuan Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10103
Pdf URL: https://arxiv.org/pdf/2601.10103
Copy Paste: [[2601.10103]] FlowAct-R1: Towards Interactive Humanoid Video Generation(https://arxiv.org/abs/2601.10103)
Keywords: generation
Abstract: Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon an MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.

Title: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Authors: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10129
Pdf URL: https://arxiv.org/pdf/2601.10129
Copy Paste: [[2601.10129]] LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning(https://arxiv.org/abs/2601.10129)
Keywords: generation
Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.

Title: RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation

Authors: Yue Chang, Rufeng Chen, Zhaofan Zhang, Yi Chen, Sihong Xie
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2601.10168
Pdf URL: https://arxiv.org/pdf/2601.10168
Copy Paste: [[2601.10168]] RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation(https://arxiv.org/abs/2601.10168)
Keywords: generation
Abstract: Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by leveraging structured semantic representations. A 3DSG is constructed from multiple images of a scene, where objects are represented as nodes and relationships as edges. However, existing works on open-vocabulary 3DSG generation suffer from both low object-level recognition accuracy and low speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG, which mitigates aggregation noise through re-shot guided uncertainty estimation and supports object-level Retrieval-Augmented Generation (RAG) via reliable low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy to accelerate cross-image object aggregation with adaptive granularity. Experiments on the Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing the mapping time by two-thirds compared to the vanilla version.

Title: From Physical Degradation Models to Task-Aware All-in-One Image Restoration

Authors: Hu Gao, Xiaoning Lei, Xichen Xu, Xingjian Wang, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.10192
Pdf URL: https://arxiv.org/pdf/2601.10192
Copy Paste: [[2601.10192]] From Physical Degradation Models to Task-Aware All-in-One Image Restoration(https://arxiv.org/abs/2601.10192)
Keywords: restoration
Abstract: All-in-one image restoration aims to adaptively handle multiple restoration tasks with a single trained model. Although existing methods achieve promising results by introducing prompt information or leveraging large models, the added learning modules increase system complexity and hinder real-time applicability. In this paper, we adopt a physical degradation modeling perspective and predict a task-aware inverse degradation operator for efficient all-in-one image restoration. The framework consists of two stages. In the first stage, the predicted inverse operator produces an initial restored image together with an uncertainty perception map that highlights regions difficult to reconstruct, ensuring restoration reliability. In the second stage, the restoration is further refined under the guidance of this uncertainty map. The same inverse operator prediction network is used in both stages, with task-aware parameters introduced after operator prediction to adapt to different degradation tasks. Moreover, by accelerating the convolution of the inverse operator, the proposed method achieves efficient all-in-one image restoration. The resulting tightly integrated architecture, termed OPIR, is extensively validated through experiments, demonstrating superior all-in-one restoration performance while remaining highly competitive on task-aligned restoration.

Title: ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation

Authors: Kim Youwang, Lee Hyoseok, Subin Park, Gerard Pons-Moll, Tae-Hyun Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.10200
Pdf URL: https://arxiv.org/pdf/2601.10200
Copy Paste: [[2601.10200]] ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation(https://arxiv.org/abs/2601.10200)
Keywords: generative
Abstract: We introduce ELITE, an Efficient Gaussian head avatar synthesis method that works from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in the wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.

Title: Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation

Authors: Dong-Yu Chen, Yixin Guo, Shuojin Yang, Tai-Jiang Mu, Shi-Min Hu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2601.10214
Pdf URL: https://arxiv.org/pdf/2601.10214
Copy Paste: [[2601.10214]] Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation(https://arxiv.org/abs/2601.10214)
Keywords: generation
Abstract: Camera control has been extensively studied in conditioned video generation; however, precisely altering the camera trajectory while faithfully preserving the video content remains a challenging task. The mainstream approach to achieving precise camera control is to warp a 3D representation according to the target trajectory. However, such methods fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the Inpainting Trap, resulting in subject inconsistency and degraded generation quality. To address this problem, we propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging the depth video from explicit 3D representation as camera-control guidance, our method can faithfully reproduce the dynamic scene of an input video under novel camera trajectories. Specifically, we design a View-Content Dual-Stream Condition mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into the pretrained video generation model. This geometric guidance signal enables VDMs to comprehend camera movements and leverage their 3D understanding capabilities, thereby facilitating precise camera control and consistent content generation. Next, we introduce a lightweight LoRA-based video diffusion adapter to train our framework, fully preserving the knowledge priors of VDMs. Additionally, we construct a large-scale multi-camera synchronized dataset named MultiCam-WarpData using Unreal Engine 5, containing 8K videos across 1K dynamic scenes. Extensive experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality. Our code and dataset will be publicly available.

Title: ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

Authors: Xueyun Tian, Wei Li, Bingbing Xu, Heng Dong, Yuanzhuo Wang, Huawei Shen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2601.10323
Pdf URL: https://arxiv.org/pdf/2601.10323
Copy Paste: [[2601.10323]] ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding(https://arxiv.org/abs/2601.10323)
Keywords: generation
Abstract: Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.

Title: Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Authors: Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.10332
Pdf URL: https://arxiv.org/pdf/2601.10332
Copy Paste: [[2601.10332]] Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders(https://arxiv.org/abs/2601.10332)
Keywords: generation
Abstract: Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.

Title: EvoMorph: Counterfactual Explanations for Continuous Time-Series Extrinsic Regression Applied to Photoplethysmography

Authors: Mesut Ceylan, Alexis Tabin, Patrick Langer, Elgar Fleisch, Filipe Barata
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.10356
Pdf URL: https://arxiv.org/pdf/2601.10356
Copy Paste: [[2601.10356]] EvoMorph: Counterfactual Explanations for Continuous Time-Series Extrinsic Regression Applied to Photoplethysmography(https://arxiv.org/abs/2601.10356)
Keywords: generation
Abstract: Wearable devices enable continuous, population-scale monitoring of physiological signals, such as photoplethysmography (PPG), creating new opportunities for data-driven clinical assessment. Time-series extrinsic regression (TSER) models increasingly leverage PPG signals to estimate clinically relevant outcomes, including heart rate, respiratory rate, and oxygen saturation. For clinical reasoning and trust, however, single point estimates alone are insufficient: clinicians must also understand whether predictions are stable under physiologically plausible variations and to what extent realistic, attainable changes in physiological signals would meaningfully alter a model's prediction. Counterfactual explanations (CFE) address these "what-if" questions, yet existing time series CFE generation methods are largely restricted to classification, overlook waveform morphology, and often produce physiologically implausible signals, limiting their applicability to continuous biomedical time series. To address these limitations, we introduce EvoMorph, a multi-objective evolutionary framework for generating physiologically plausible and diverse CFE for TSER applications. EvoMorph optimizes morphology-aware objectives defined on interpretable signal descriptors and applies transformations to preserve the waveform structure. We evaluated EvoMorph on three PPG datasets (heart rate, respiratory rate, and oxygen saturation) against a nearest-unlike-neighbor baseline. In addition, in a case study, we evaluated EvoMorph as a tool for uncertainty quantification by relating counterfactual sensitivity to bootstrap-ensemble uncertainty and data-density measures. Overall, EvoMorph enables the generation of physiologically-aware counterfactuals for continuous biomedical signals and supports uncertainty-aware interpretability, advancing trustworthy model analysis for clinical time-series applications.

Title: Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs

Authors: Ningyu Sun, Zhaolin Cai, Zitong Xu, Peihang Chen, Huiyu Duan, Yichao Yan, Xiongkuo Min, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.10369
Pdf URL: https://arxiv.org/pdf/2601.10369
Copy Paste: [[2601.10369]] Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs(https://arxiv.org/abs/2601.10369)
Keywords: generative, quality assessment
Abstract: Text-guided human pose editing has gained significant traction in AIGC applications. However, it remains plagued by structural anomalies and generative artifacts. Existing evaluation metrics often isolate authenticity detection from quality assessment, failing to provide fine-grained insights into pose-specific inconsistencies. To address these limitations, we introduce HPE-Bench, a specialized benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models, offering both authenticity labels and multi-dimensional quality scores. Furthermore, we propose a unified framework based on layer-selective multimodal large language models (MLLMs). By employing contrastive LoRA tuning and a novel layer sensitivity analysis (LSA) mechanism, we identify the optimal feature layer for pose evaluation. Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging the gap between forensic detection and quality assessment.

Title: Towards Efficient Low-rate Image Compression with Frequency-aware Diffusion Prior Refinement

Authors: Yichong Xia, Yimin Zhou, Jinpeng Wang, Bin Chen
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2601.10373
Pdf URL: https://arxiv.org/pdf/2601.10373
Copy Paste: [[2601.10373]] Towards Efficient Low-rate Image Compression with Frequency-aware Diffusion Prior Refinement(https://arxiv.org/abs/2601.10373)
Keywords: generative
Abstract: Recent advancements in diffusion-based generative priors have enabled visually plausible image compression at extremely low bit rates. However, existing approaches suffer from slow sampling processes and suboptimal bit allocation due to fragmented training paradigms. In this work, we propose Accelerate Diffusion-based Image Compression via Consistency Prior Refinement (DiffCR), a novel compression framework for efficient and high-fidelity image reconstruction. At the heart of DiffCR is a Frequency-aware Skip Estimation (FaSE) module that refines the $\epsilon$-prediction prior from a pre-trained latent diffusion model and aligns it with compressed latents at different timesteps via Frequency Decoupling Attention (FDA). Furthermore, a lightweight consistency estimator enables fast two-step decoding by preserving the semantic trajectory of diffusion sampling. Without updating the backbone diffusion model, DiffCR achieves substantial bitrate savings (27.2% BD-rate (LPIPS) and 65.1% BD-rate (PSNR)) and over $10\times$ speed-up compared to SOTA diffusion-based compression baselines.

Title: Global Context Compression with Interleaved Vision-Text Transformation

Authors: Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.10378
Pdf URL: https://arxiv.org/pdf/2601.10378
Copy Paste: [[2601.10378]] Global Context Compression with Interleaved Vision-Text Transformation(https://arxiv.org/abs/2601.10378)
Keywords: generation
Abstract: Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3$\times$ speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS. Our codes and datasets will be public to support further studies.

Title: Discrete Feynman-Kac Correctors

Authors: Mohsin Hasan, Viktor Ohanesian, Artem Gazizov, Yoshua Bengio, Alán Aspuru-Guzik, Roberto Bondesan, Marta Skreta, Kirill Neklyudov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.10403
Pdf URL: https://arxiv.org/pdf/2601.10403
Copy Paste: [[2601.10403]] Discrete Feynman-Kac Correctors(https://arxiv.org/abs/2601.10403)
Keywords: generation
Abstract: Discrete diffusion models have recently emerged as a promising alternative to the autoregressive approach for generating discrete sequences. Sample generation via gradual denoising or demasking processes allows them to capture hierarchical non-sequential interdependencies in the data. These custom processes, however, do not assume a flexible control over the distribution of generated samples. We propose Discrete Feynman-Kac Correctors, a framework that allows for controlling the generated distribution of discrete masked diffusion models at inference time. We derive Sequential Monte Carlo (SMC) algorithms that, given a trained discrete diffusion model, control the temperature of the sampled distribution (i.e. perform annealing), sample from the product of marginals of several diffusion processes (e.g. differently conditioned processes), and sample from the product of the marginal with an external reward function, producing likely samples from the target distribution that also have high reward. Notably, our framework does not require any training of additional models or fine-tuning of the original model. We illustrate the utility of our framework in several applications including: efficient sampling from the annealed Boltzmann distribution of the Ising model, improving the performance of language models for code generation and amortized learning, as well as reward-tilted protein sequence generation.
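The temperature-control idea in the Discrete Feynman-Kac Correctors abstract can be illustrated on a single categorical distribution. To sample from the annealed target p^beta / Z using only samples from p, weight each sample by p(x)^(beta - 1) and resample. This is a minimal one-step importance-resampling sketch, not the paper's SMC algorithm for masked diffusion; the function names are hypothetical.

```python
import random


def annealed_distribution(probs, beta):
    """Exact annealed target: probs[i] ** beta, renormalized to sum to 1."""
    powered = [p ** beta for p in probs]
    z = sum(powered)
    return [w / z for w in powered]


def anneal_by_resampling(probs, beta, num_particles, rng):
    """Approximate samples from the annealed target without evaluating Z.

    Step 1: draw particles from the base distribution `probs`.
    Step 2: weight each particle by probs[x] ** (beta - 1), the ratio
            target / proposal up to a constant.
    Step 3: resample particles in proportion to those weights.
    """
    states = list(range(len(probs)))
    particles = rng.choices(states, weights=probs, k=num_particles)
    weights = [probs[x] ** (beta - 1.0) for x in particles]
    return rng.choices(particles, weights=weights, k=num_particles)
```

With `beta > 1` the resampled population concentrates on high-probability states (lower temperature); with `beta < 1` it spreads out.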

Title: CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning

Authors: Yuanjie Zhao, Junnan Qiu, Yue Ding, Jie Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.10407
Pdf URL: https://arxiv.org/pdf/2601.10407
Copy Paste: [[2601.10407]] CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning(https://arxiv.org/abs/2601.10407)
Keywords: generation
Abstract: Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to backdoor attacks. Existing attack strategies typically struggle against safety-constrained algorithms (e.g., CQL) due to inefficient random poisoning and the use of easily detectable Out-of-Distribution (OOD) triggers. In this paper, we propose CS-GBA (Critical Sample-based Gradient-guided Backdoor Attack), a novel framework designed to achieve high stealthiness and destructiveness under a strict budget. Leveraging the theoretical insight that samples with high Temporal Difference (TD) errors are pivotal for value function convergence, we introduce an adaptive Critical Sample Selection strategy that concentrates the attack budget on the most influential transitions. To evade OOD detection, we propose a Correlation-Breaking Trigger mechanism that exploits the physical mutual exclusivity of state features (e.g., 95th percentile boundaries) to remain statistically concealed. Furthermore, we replace the conventional label inversion with a Gradient-Guided Action Generation mechanism, which searches for worst-case actions within the data manifold using the victim Q-network's gradient. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms state-of-the-art baselines, achieving high attack success rates against representative safety-constrained algorithms with a minimal 5% poisoning budget, while maintaining the agent's performance in clean environments.
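The sample-selection step the CS-GBA abstract mentions, ranking transitions by Temporal Difference error and keeping the top fraction allowed by the budget, can be sketched with a tabular Q-function. The one-step TD target, the nested-dict Q-table, and the function names below are illustrative assumptions; the abstract does not specify how the paper's adaptive strategy works. The same ranking is what prioritized replay uses to find influential transitions.

```python
def td_errors(transitions, q, gamma=0.99):
    """Absolute one-step TD error for each (s, a, r, s_next, done) tuple.

    `q` is a tabular Q-function stored as q[state][action] -> value.
    """
    errors = []
    for s, a, r, s_next, done in transitions:
        # No bootstrapping from terminal transitions.
        bootstrap = 0.0 if done else gamma * max(q[s_next].values())
        errors.append(abs(r + bootstrap - q[s][a]))
    return errors


def select_critical(transitions, q, budget=0.05, gamma=0.99):
    """Indices of the highest-TD-error transitions, limited to `budget`.

    `budget` is the fraction of the dataset to keep; at least one index
    is always returned.
    """
    errors = td_errors(transitions, q, gamma)
    k = max(1, int(len(transitions) * budget))
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return ranked[:k]
```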

Title: DeFlow: Decoupling Manifold Modeling and Value Maximization for Offline Policy Extraction

Authors: Zhancun Mu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.10471
Pdf URL: https://arxiv.org/pdf/2601.10471
Copy Paste: [[2601.10471]] DeFlow: Decoupling Manifold Modeling and Value Maximization for Offline Policy Extraction(https://arxiv.org/abs/2601.10471)
Keywords: generation, generative
Abstract: We present DeFlow, a decoupled offline RL framework that leverages flow matching to faithfully capture complex behavior manifolds. Optimizing generative policies is computationally prohibitive, typically necessitating backpropagation through ODE solvers. We address this by learning a lightweight refinement module within an explicit, data-derived trust region of the flow manifold, rather than sacrificing the iterative generation capability via single-step distillation. This way, we bypass solver differentiation and eliminate the need for balancing loss terms, ensuring stable improvement while fully preserving the flow's iterative expressivity. Empirically, DeFlow achieves superior performance on the challenging OGBench benchmark and demonstrates efficient offline-to-online adaptation.

Title: Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure

Authors: Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.10551
Pdf URL: https://arxiv.org/pdf/2601.10551
Copy Paste: [[2601.10551]] Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure(https://arxiv.org/abs/2601.10551)
Keywords: generation
Abstract: Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.

Title: Inference-time Physics Alignment of Video Generative Models with Latent World Models

Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.10553
Pdf URL: https://arxiv.org/pdf/2601.10553
Copy Paste: [[2601.10553]] Inference-time Physics Alignment of Video Generative Models with Latent World Models(https://arxiv.org/abs/2601.10553)
Keywords: generation, generative
Abstract: State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.
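Summary: State-of-the-art video generative models produce promising visual content but often violate basic physical principles, which limits their usefulness. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physical plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physical plausibility of video generation as an inference-time alignment problem. In particular, we use the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search over and steer multiple candidate denoising trajectories, which makes it possible to scale test-time compute for better generation performance. Empirically, our approach substantially improves physical plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, as validated by a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physical plausibility of video generation, beyond this specific instantiation or parameterization.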
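
The "search" half of the idea in this abstract can be illustrated with plain best-of-N selection: draw several candidate trajectories, score each with a reward, and keep the highest-scoring one. The sketch below is only a toy under stated assumptions. `toy_sampler` and `toy_reward` are hypothetical stand-ins (a 1-D noisy motion and a constant-velocity score), not the paper's video model or a VJEPA-2 reward, and the steering of intermediate denoising steps is not shown.

```python
import random


def best_of_n(sample_fn, reward_fn, n, seed=0):
    """Draw n candidates and return the one with the highest reward."""
    rng = random.Random(seed)
    candidates = [sample_fn(rng) for _ in range(n)]
    rewards = [reward_fn(c) for c in candidates]
    best = max(range(n), key=rewards.__getitem__)
    return candidates[best], rewards[best]


def toy_sampler(rng, steps=10):
    """Hypothetical stand-in for one generated 'video': noisy 1-D positions."""
    x, trajectory = 0.0, []
    for _ in range(steps):
        x += 1.0 + rng.gauss(0.0, 0.5)  # unit velocity plus noise
        trajectory.append(x)
    return trajectory


def toy_reward(trajectory):
    """Hypothetical stand-in for a physics reward: penalize velocity changes."""
    velocity = [b - a for a, b in zip(trajectory, trajectory[1:])]
    return -sum((b - a) ** 2 for a, b in zip(velocity, velocity[1:]))


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, reward = best_of_n(toy_sampler, toy_reward, n=n, seed=0)
        print(f"n={n:2d}  best reward={reward:.3f}")
```

Because the same seed reproduces the first candidates, the best reward can only stay the same or improve as `n` grows, which is the sense in which extra test-time compute buys better samples in this toy setting.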

Title: RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation

Authors: Peng Chen, Xiaobao Wei, Yi Yang, Naiming Yao, Hui Chen, Feng Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.10606
Pdf URL: https://arxiv.org/pdf/2601.10606
Copy Paste: [[2601.10606]] RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation(https://arxiv.org/abs/2601.10606)
Keywords: generation
Abstract: Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
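Summary: Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model two-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, methods based on 3D Gaussian Splatting (3DGS) have achieved efficient and realistic rendering, but they remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that uses 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships (blood and non-blood, equal and unequal) into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset of speech-mesh-image triplets annotated with social relationships. Extensive experiments show that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.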

Title: CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

Authors: Chengfeng Zhao, Jiazhi Shu, Yubo Zhao, Tianyu Huang, Jiahao Lu, Zekai Gu, Chengwei Ren, Zhiyang Dou, Qing Shuai, Yuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.10632
Pdf URL: https://arxiv.org/pdf/2601.10632
Copy Paste: [[2601.10632]] CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos(https://arxiv.org/abs/2601.10632)
Keywords: generation, generative
Abstract: In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.
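Summary: In this paper, we find that the generation of 3D human motions and of 2D human videos is intrinsically coupled. 3D motions provide a structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which makes it necessary to couple their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. We then design a dual-branch diffusion model that couples the human motion and video generation processes through mutual feature interaction and 3D-2D cross-attention. In addition, we curate the CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations that covers diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method on both 3D human motion generation and video generation.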

Title: Single-Stage Huffman Encoder for ML Compression

Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.10673
Pdf URL: https://arxiv.org/pdf/2601.10673
Copy Paste: [[2601.10673]] Single-Stage Huffman Encoder for ML Compression(https://arxiv.org/abs/2601.10673)
Keywords: generation
Abstract: Training and serving Large Language Models (LLMs) require partitioning data across multiple accelerators, where collective operations are frequently bottlenecked by network bandwidth. Lossless compression using Huffman codes is an effective way to alleviate the issue; however, its three-stage design, which requires on-the-fly frequency analysis, codebook generation, and transmission of the codebook along with the data, introduces computational, latency, and data overheads that are prohibitive for latency-sensitive scenarios such as die-to-die communication. This paper proposes a single-stage Huffman encoder that eliminates these overheads by using fixed codebooks derived from the average probability distribution of previous data batches. Through our analysis of the Gemma 2B model, we demonstrate that tensors exhibit high statistical similarity across layers and shards. Using this approach we achieve compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, enabling efficient on-the-fly compression.
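Summary: Training and serving Large Language Models (LLMs) require partitioning data across multiple accelerators, where collective operations are frequently bottlenecked by network bandwidth. Lossless compression with Huffman codes is an effective way to alleviate the problem. However, its three-stage design requires on-the-fly frequency analysis, codebook generation, and transmission of the codebook along with the data, which introduces computational, latency, and data overheads that are prohibitive for latency-sensitive scenarios such as die-to-die communication. This paper proposes a single-stage Huffman encoder that eliminates these overheads by using fixed codebooks derived from the average probability distribution of previous data batches. Through an analysis of the Gemma 2B model, we show that tensors exhibit high statistical similarity across layers and shards. With this approach we achieve compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, enabling efficient on-the-fly compression.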
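
To make the fixed-codebook idea concrete, the following minimal sketch builds one Huffman code from the average symbol distribution of earlier batches and reuses it, unchanged, on a new batch, then compares the bit count with a code built for that batch alone. It is an illustration under stated assumptions rather than the paper's encoder: the tiny 4-symbol alphabet, the add-one smoothing, and all function names are hypothetical choices made for this example.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap items are (weight, unique tiebreak, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def average_distribution(batches, alphabet):
    """Average the per-batch symbol probabilities (add-one smoothed)."""
    alphabet = list(alphabet)
    avg = {s: 0.0 for s in alphabet}
    for batch in batches:
        counts = Counter(batch)
        total = len(batch) + len(alphabet)
        for s in alphabet:
            # Smoothing guarantees every symbol gets a codeword.
            avg[s] += (counts[s] + 1) / total / len(batches)
    return avg


def encoded_bits(data, lengths):
    """Total number of bits needed to encode data with the given code."""
    return sum(lengths[s] for s in data)


if __name__ == "__main__":
    previous = [[0, 0, 0, 1, 1, 2, 3, 0], [0, 0, 1, 1, 1, 2, 0, 0]]
    fixed = huffman_code_lengths(average_distribution(previous, range(4)))
    new_batch = [0, 0, 1, 0, 2, 0, 1, 3]
    per_batch = huffman_code_lengths(Counter(new_batch))
    print("fixed codebook bits:    ", encoded_bits(new_batch, fixed))
    print("per-batch codebook bits:", encoded_bits(new_batch, per_batch))
```

A per-batch Huffman code is optimal for that batch, so the fixed codebook can never use fewer bits. The gap stays small when new batches resemble the earlier ones, and the fixed codebook needs no frequency pass and no codebook transmission at encode time.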

Title: On the origin of neural scaling laws: from random graphs to natural language

Authors: Maissam Barkeshli, Alberto Alfarano, Andrey Gromov
Subjects: cs.LG, cond-mat.dis-nn, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2601.10684
Pdf URL: https://arxiv.org/pdf/2601.10684
Copy Paste: [[2601.10684]] On the origin of neural scaling laws: from random graphs to natural language(https://arxiv.org/abs/2601.10684)
Keywords: generative
Abstract: Scaling laws have played a major role in the modern AI revolution, providing practitioners predictive power over how the model performance will improve with increasing data, compute, and number of model parameters. This has spurred an intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws even in the absence of power law structure in the data correlations. We further consider dialing down the complexity of natural language systematically, by training on sequences sampled from increasingly simplified generative language models, from 4,2,1-layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdős-Rényi and scale-free Barabási-Albert ensembles. Finally, we revisit conventional scaling laws for language modeling, demonstrating that several essential results can be reproduced using 2-layer transformers with context length of 50, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute optimal curves as compared with current practice in published literature, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.
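Summary: Scaling laws have played a major role in the modern AI revolution, giving practitioners predictive power over how model performance improves with increasing data, compute, and number of model parameters. This has spurred intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power-law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We show that this simplified setting already gives rise to neural scaling laws, even when the data correlations have no power-law structure. We further consider systematically dialing down the complexity of natural language by training on sequences sampled from increasingly simplified generative language models, from 4-, 2-, and 1-layer transformer language models down to language bigrams, which reveals a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdős-Rényi and scale-free Barabási-Albert ensembles. Finally, we revisit conventional scaling laws for language modeling: we show that several essential results can be reproduced using 2-layer transformers with a context length of 50, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute-optimal curves compared with current practice in the published literature, and provide preliminary evidence that maximal update parameterization may be more parameter-efficient than standard parameterization.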
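
As a rough picture of the kind of training data this abstract describes, the sketch below samples an Erdős-Rényi graph and draws a random walk on it, so that each consecutive pair of nodes in the walk is a bigram a model could be trained to predict. It is a minimal sketch under assumptions: the undirected, unweighted graph, the uniform choice of next node, and the function names are illustrative, and the paper's actual graph construction and tokenization may differ.

```python
import random
from collections import Counter


def erdos_renyi(n, p, rng):
    """Sample an undirected G(n, p) graph as an adjacency list."""
    adj = {v: [] for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # include each possible edge with probability p
                adj[u].append(v)
                adj[v].append(u)
    return adj


def random_walk(adj, length, rng):
    """Uniform random walk of `length` nodes, started at a non-isolated node."""
    starts = [v for v, nbrs in adj.items() if nbrs]
    if not starts:
        raise ValueError("graph has no edges, so no walk is possible")
    node = rng.choice(starts)
    walk = [node]
    for _ in range(length - 1):
        # In an undirected graph every visited node has at least one neighbor.
        node = rng.choice(adj[node])
        walk.append(node)
    return walk


def bigram_counts(walks):
    """Count (current node, next node) pairs across all walks."""
    counts = Counter()
    for walk in walks:
        counts.update(zip(walk, walk[1:]))
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    graph = erdos_renyi(n=30, p=0.2, rng=rng)
    walks = [random_walk(graph, length=50, rng=rng) for _ in range(100)]
    print("distinct bigrams observed:", len(bigram_counts(walks)))
```

Only bigrams that correspond to edges of the graph can appear, so the edge probability `p` and node count `n` act as simple knobs on how complex the sequence distribution is.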