2025-10-10

Title: ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

Authors: Lingcheng Kong, Jiateng Wei, Hanzhang Shen, Huan Wang
Subjects: cs.LG, cs.CL, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2510.07356
Pdf URL: https://arxiv.org/pdf/2510.07356
Copy Paste: [[2510.07356]] ConCuR: Conciseness Makes State-of-the-Art Kernel Generation(https://arxiv.org/abs/2510.07356)
Keywords: generation
Abstract: GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data, as most high-quality kernels are proprietary and not open-source. This challenge prevents us from leveraging supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by a critical observation that concise yet informative reasoning traces result in robust generation of high-performance kernels. Using this pipeline, we construct our dataset ConCuR and introduce our model KernelCoder, which is the first model trained on a curated dataset consisting of PyTorch, reasoning, and CUDA kernel pairs, to our knowledge. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks. The observations, metrics, and our data collection and curation pipeline can help obtain better data in the kernel generation task in the future.
摘要：法学硕士的 GPU 内核生成最近经历了快速发展，利用了测试时间扩展和强化学习技术。然而，内核生成的一个关键挑战是高质量数据的稀缺，因为大多数高质量内核都是专有的而不是开源的。这一挑战使我们无法利用监督微调来使 LLM 与内核生成任务保持一致。为了应对这一挑战，我们开发了一个管道，可以生成和管理具有推理轨迹的高质量 CUDA 内核，其动机是通过严格观察得出，简洁而信息丰富的推理轨迹可以稳健地生成高性能内核。使用此管道，我们构建数据集 ConCuR 并引入模型 KernelCoder，据我们所知，这是在由 PyTorch、推理和 CUDA 内核对组成的精选数据集上训练的第一个模型。在 KernelBench 设置中，我们的模型比现有的顶级模型 QwQ-32B 实现了显着改进，并且优于所有针对内核生成进行微调的开源模型以及 DeepSeek-V3.1-Think 和 Claude-4-sonnet 等前沿模型。最后，我们证明平均推理长度可以作为评估内核生成任务难度的指标。观察结果、指标以及我们的数据收集和管理管道可以帮助将来在内核生成任务中获得更好的数据。

Title: DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis

Authors: Nithin C. Babu, Aniruddha Mahapatra, Harsh Rangwani, Rajiv Soundararajan, Kuldeep Kulkarni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07441
Pdf URL: https://arxiv.org/pdf/2510.07441
Copy Paste: [[2510.07441]] DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis(https://arxiv.org/abs/2510.07441)
Keywords: generative
Abstract: Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. (i) While the emphasis is on subject-centric prompts or static camera scenes, camera motion essential for producing cinematic shots and existing metrics under dynamic motion are largely unexplored. (ii) These benchmarks typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlooks video-level evaluation, which is vital to selecting the better video among the candidate videos generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we obtain interpretable error maps based on the VBench motion smoothness metric. We observe that while the VBench motion smoothness metric shows promising alignment with human judgments, it fails in two cases: occlusions/disocclusions arising from camera and foreground object movements. Building on this, we propose a new background consistency metric that leverages object error maps to correct these two failure cases in a principled manner. Our second innovation is the introduction of a foreground consistency metric that tracks points and their neighbors within each object instance to assess object fidelity. Extensive experiments demonstrate that our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level (an improvement of more than 2 percentage points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.
摘要：现有的文本到视频 (T2V) 评估基准（例如 VBench 和 EvalCrafter）存在两个限制。 (i) 虽然重点是以主题为中心的提示或静态摄像机场景，但对于制作电影镜头至关重要的摄像机运动以及动态运动下的现有指标在很大程度上尚未得到探索。 (ii) 这些基准通常将视频级分数聚合成单个模型级分数，以对生成模型进行排名。然而，这种聚合忽略了视频级评估，这对于在为给定提示生成的候选视频中选择更好的视频至关重要。为了解决这些差距，我们引入了 DynamicEval，这是一个基准，由系统策划的提示组成，强调动态摄像机运动，并与 10 个 T2V 模型生成的 3k 视频中的视频对上的 45k 人类注释相结合。 DynamicEval 评估视频质量的两个关键维度：背景场景一致性和前景对象一致性。为了背景场景的一致性，我们根据 Vbench 运动平滑度度量获得可解释的误差图。我们观察到，虽然 Vbench 运动平滑度指标显示出与人类判断的良好一致性，但它在两种情况下会失败：相机和前景物体运动引起的遮挡/去遮挡。在此基础上，我们提出了一种新的背景一致性指标，利用对象错误图以有原则的方式纠正两个失败案例。我们的第二个创新是引入了前景一致性指标，该指标可以跟踪每个对象实例中的点及其邻居，以评估对象保真度。大量实验表明，我们提出的指标在视频级别和模型级别都与人类偏好实现了更强的相关性（改进超过 2%），将 DynamicEval 建立为评估动态相机运动下的 T2V 模型的更全面的基准。
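
Illustrative sketch (not from the paper): the abstract distinguishes video-level from model-level agreement with human preferences. One simple way to quantify video-level agreement on annotated video pairs is the fraction of pairs where the metric prefers the same video as the annotator. The function name, the data layout, and the choice to skip ties are assumptions made for illustration; the paper's own correlation measure is not given in the abstract.

```python
def pairwise_agreement(scores, preferences):
    """Fraction of human-annotated pairs where the metric picks the same winner.

    scores:      dict mapping a video id to its metric score (hypothetical layout).
    preferences: list of (video_a, video_b, human_winner) tuples.
    Pairs with tied metric scores are skipped (an assumption of this sketch).
    """
    agree = 0
    counted = 0
    for a, b, winner in preferences:
        if scores[a] == scores[b]:
            continue  # the metric expresses no preference for this pair
        counted += 1
        predicted = a if scores[a] > scores[b] else b
        agree += int(predicted == winner)
    return agree / counted if counted else float("nan")
```

Model-level ranking would instead average each model's scores before comparing, which is exactly the aggregation the abstract argues hides per-video differences.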

Title: Black-box Detection of LLM-generated Text Using Generalized Jensen-Shannon Divergence

Authors: Shuangyi Chen, Ashish Khisti
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2510.07500
Pdf URL: https://arxiv.org/pdf/2510.07500
Copy Paste: [[2510.07500]] Black-box Detection of LLM-generated Text Using Generalized Jensen-Shannon Divergence(https://arxiv.org/abs/2510.07500)
Keywords: generation
Abstract: We study black-box detection of machine-generated text under practical constraints: the scoring model (proxy LM) may mismatch the unknown source model, and per-input contrastive generation is costly. We propose SurpMark, a reference-based detector that summarizes a passage by the dynamics of its token surprisals. SurpMark quantizes surprisals into interpretable states, estimates a state-transition matrix for the test text, and scores it via a generalized Jensen-Shannon (GJS) gap between the test transitions and two fixed references (human vs. machine) built once from historical corpora. We prove a principled discretization criterion and establish the asymptotic normality of the decision statistic. Empirically, across multiple datasets, source models, and scenarios, SurpMark consistently matches or surpasses baselines; our experiments corroborate the statistic's asymptotic normality, and ablations validate the effectiveness of the proposed discretization.
摘要：我们在实际约束下研究机器生成文本的黑盒检测：评分模型（代理 LM）可能与未知源模型不匹配，并且每个输入的对比生成成本高昂。我们提出了 SurpMark，一种基于参考的检测器，它通过其标记惊喜的动态来总结一段段落。 SurpMark 将意外量化为可解释的状态，估计测试文本的状态转换矩阵，并通过测试转换和从历史语料库一次构建的两个固定参考（人类与机器）之间的广义 Jensen-Shannon (GJS) 差距对其进行评分。我们证明了原则离散化准则并建立了决策统计量的渐近正态性。根据经验，在多个数据集、源模型和场景中，SurpMark 始终匹配或超过基线；我们的实验证实了统计量的渐近正态性，并且消融验证了所提出的离散化的有效性。
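
Illustrative sketch (not the authors' implementation): the pipeline described above, which quantizes surprisals into states, estimates a transition matrix, and compares it with two fixed references, can be outlined in a few lines. The bin edges, the additive smoothing, the uniform sum over rows, and the particular weighted form of the Jensen-Shannon divergence are all assumptions; the paper's exact GJS definition and discretization criterion may differ.

```python
import math

def quantize(surprisals, edges):
    """Map each surprisal to a state index; `edges` are ascending cut points."""
    return [sum(s > e for e in edges) for s in surprisals]

def transition_counts(states, k):
    """Count first-order transitions between the k states."""
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    return counts

def row_normalize(counts, alpha=1.0):
    """Turn counts into row-stochastic probabilities with additive smoothing."""
    k = len(counts)
    return [[(c + alpha) / (sum(row) + alpha * k) for c in row] for row in counts]

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gjs(p, q, w=0.5):
    """Weighted Jensen-Shannon divergence (one common generalized form)."""
    m = [w * pi + (1 - w) * qi for pi, qi in zip(p, q)]
    return w * kl(p, m) + (1 - w) * kl(q, m)

def gjs_gap(test_T, human_T, machine_T, w=0.5):
    """Positive when the test transitions sit closer to the machine reference."""
    d_human = sum(gjs(t, h, w) for t, h in zip(test_T, human_T))
    d_machine = sum(gjs(t, m, w) for t, m in zip(test_T, machine_T))
    return d_human - d_machine
```

In use, the surprisals would come from the proxy LM (negative log-probabilities of each token), and the two reference matrices would be built once from historical human and machine corpora, as the abstract describes.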

Title: MLLM4TS: Leveraging Vision and Multimodal Language Models for General Time-Series Analysis

Authors: Qinghua Liu, Sam Heshmati, Zheda Mai, Zubin Abraham, John Paparrizos, Liu Ren
Subjects: cs.LG, cs.AI, cs.CV, cs.DB
Abstract URL: https://arxiv.org/abs/2510.07513
Pdf URL: https://arxiv.org/pdf/2510.07513
Copy Paste: [[2510.07513]] MLLM4TS: Leveraging Vision and Multimodal Language Models for General Time-Series Analysis(https://arxiv.org/abs/2510.07513)
Keywords: generative
Abstract: Effective analysis of time series data presents significant challenges due to the complex temporal dependencies and cross-channel interactions in multivariate data. Inspired by the way human analysts visually inspect time series to uncover hidden patterns, we ask: can incorporating visual representations enhance automated time-series analysis? Recent advances in multimodal large language models have demonstrated impressive generalization and visual understanding capability, yet their application to time series remains constrained by the modality gap between continuous numerical data and discrete natural language. To bridge this gap, we introduce MLLM4TS, a novel framework that leverages multimodal large language models for general time-series analysis by integrating a dedicated vision branch. Each time-series channel is rendered as a horizontally stacked color-coded line plot in one composite image to capture spatial dependencies across channels, and a temporal-aware visual patch alignment strategy then aligns visual patches with their corresponding time segments. MLLM4TS fuses fine-grained temporal details from the numerical data with global contextual information derived from the visual representation, providing a unified foundation for multimodal time-series analysis. Extensive experiments on standard benchmarks demonstrate the effectiveness of MLLM4TS across both predictive tasks (e.g., classification) and generative tasks (e.g., anomaly detection and forecasting). These results underscore the potential of integrating visual modalities with pretrained language models to achieve robust and generalizable time-series analysis.
摘要：由于多元数据中复杂的时间依赖性和跨通道交互，时间序列数据的有效分析提出了重大挑战。受到人类分析师目视检查时间序列以发现隐藏模式的方式的启发，我们问：合并视觉表示可以增强自动化时间序列分析吗？多模态大语言模型的最新进展表现出了令人印象深刻的泛化和视觉理解能力，但它们在时间序列上的应用仍然受到连续数值数据和离散自然语言之间模态差距的限制。为了弥补这一差距，我们引入了 MLLM4TS，这是一种新颖的框架，通过集成专用的视觉分支，利用多模态大型语言模型进行一般时间序列分析。每个时间序列通道在一个合成图像中呈现为水平堆叠的颜色编码线图，以捕获跨通道的空间依赖性，然后时间感知的视觉块对齐策略将视觉块与其相应的时间段对齐。 MLLM4TS 将数值数据中的细粒度时间细节与视觉表示中派生的全局上下文信息融合在一起，为多模态时间序列分析提供统一的基础。标准基准的大量实验证明了 MLLM4TS 在预测任务（例如分类）和生成任务（例如异常检测和预测）中的有效性。这些结果强调了将视觉模式与预训练语言模型相结合以实现稳健且可概括的时间序列分析的潜力。

Title: D2RA: Dual Domain Regeneration Attack

Authors: Pragati Shuddhodhan Meshram, Varun Chandrasekaran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07538
Pdf URL: https://arxiv.org/pdf/2510.07538
Copy Paste: [[2510.07538]] D2RA: Dual Domain Regeneration Attack(https://arxiv.org/abs/2510.07538)
Keywords: generation, generative
Abstract: The growing use of generative models has intensified the need for watermarking methods that ensure content attribution and provenance. While recent semantic watermarking schemes improve robustness by embedding signals in latent or frequency representations, we show they remain vulnerable even under resource-constrained adversarial settings. We present D2RA, a training-free, single-image attack that removes or weakens watermarks without access to the underlying model. By projecting watermarked images onto natural priors across complementary representations, D2RA suppresses watermark signals while preserving visual fidelity. Experiments across diverse watermarking schemes demonstrate that our approach consistently reduces watermark detectability, revealing fundamental weaknesses in current designs. Our code is available at this https URL.
摘要：生成模型的日益广泛使用加剧了对确保内容归属和出处的水印方法的需求。虽然最近的语义水印方案通过在潜在或频率表示中嵌入信号来提高鲁棒性，但我们表明，即使在资源受限的对抗环境下，它们仍然容易受到攻击。我们提出了 D2RA，这是一种无需训练的单图像攻击，可以在不访问底层模型的情况下删除或削弱水印。通过将带水印的图像投影到互补表示的自然先验上，D2RA 可以抑制水印信号，同时保持视觉保真度。不同水印方案的实验表明，我们的方法持续降低了水印的可检测性，揭示了当前设计的根本弱点。我们的代码可以在这个 https URL 上找到。

Title: TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

Authors: Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07550
Pdf URL: https://arxiv.org/pdf/2510.07550
Copy Paste: [[2510.07550]] TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility(https://arxiv.org/abs/2510.07550)
Keywords: generative
Abstract: Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
摘要：尽管视觉保真度令人印象深刻，但现代视频生成模型经常产生违反直观物理定律的序列，例如物体漂浮、传送或以违反因果关系的方式变形。虽然人类可以很容易地发现这种难以置信的情况，但仍然没有可靠的方法来定量评估视频中的物理真实感。在这项工作中，我们探索是否可以训练视频语言模型（VLM）来充当物理合理性的可靠判断者。我们发现现有的 VLM 难以识别物理违规，暴露出其时间和因果推理的基本局限性。为了解决这个问题，我们引入了 TRAVL，这是一种微调方法，它将平衡训练数据集与轨迹感知注意力模块相结合，以改善 VLM 中的运动编码和辨别能力。为了更严格地评估物理推理，我们提出了 ImplausiBench，这是一个包含 300 个视频（150 个真实视频，150 个生成视频）的基准，可消除语言偏见并隔离视觉时间理解。绩效报告采用黄金标准的人类判断和更严格的法学硕士评判指标。 TRAVL 和 ImplausiBench 共同提供了一个统一的框架，用于探索和提高多模态模型中的物理合理性，揭示视觉时间理解中具有挑战性且尚未充分探索的方面。

Title: Cross-Modal Attention Guided Unlearning in Vision-Language Models

Authors: Karuna Bhaila, Aneesh Komanduri, Minh-Hao Van, Xintao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07567
Pdf URL: https://arxiv.org/pdf/2510.07567
Copy Paste: [[2510.07567]] Cross-Modal Attention Guided Unlearning in Vision-Language Models(https://arxiv.org/abs/2510.07567)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks such as Visual Question Answering (VQA), which requires models to infer outputs based on visual and textual context simultaneously. Such inference abilities of large-scale pretrained models are often attributed to the massive scale of pre-training data collected across several domains. However, the models may memorize private and/or sensitive information during training and regurgitate it in inference. Recently, machine unlearning has been leveraged to address the leakage of private data in LLMs. VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text. To address this issue, we explore unlearning for vision-language models, specifically for the VQA task. We explore the role of visual tokens for output generation in VLMs using cross-modal attention and utilize it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL utilizes external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective unlearning solution for VLMs.
摘要：视觉语言模型 (VLM) 在多模态理解和推理任务中表现出了巨大的能力，例如视觉问答 (VQA)，它需要模型同时根据视觉和文本上下文推断输出。大规模预训练模型的这种推理能力通常归因于跨多个领域收集的大规模预训练数据。然而，模型可能会在训练期间记住私人和/或敏感信息，并在推理中反省它。最近，机器学习已被用来解决法学硕士中私人数据的泄露问题。 VLM 给这个过程增加了一层复杂性，因为除了文本之外，查询中的视觉上下文还可能包含敏感信息。为了解决这个问题，我们探索了视觉语言模型的遗忘，特别是 VQA 任务。我们探索了视觉标记在使用跨模态注意力的 VLM 中生成输出的作用，并利用它来制定跨模态注意力引导忘却（CAGUL），这是一个轻量级且高效的 VLM 忘却框架。与计算成本高昂的模型微调方法相比，CAGUL 利用外部模块将未学习的信息编码为对相关查询重要性较低的视觉标记。我们发现转换后的视觉标记不仅可以防止泄漏，还可以保留参考模型行为。实验结果表明，我们的方法在不改变预训练模型参数或产生再训练成本的情况下表现更好或与基于微调的基线相当，使其成为 VLM 实用且有效的去学习解决方案。

Title: Symbolic-Diffusion: Deep Learning Based Symbolic Regression with D3PM Discrete Token Diffusion

Authors: Ryan T. Tymkow, Benjamin D. Schnapp, Mojtaba Valipour, Ali Ghodshi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.07570
Pdf URL: https://arxiv.org/pdf/2510.07570
Copy Paste: [[2510.07570]] Symbolic-Diffusion: Deep Learning Based Symbolic Regression with D3PM Discrete Token Diffusion(https://arxiv.org/abs/2510.07570)
Keywords: generation
Abstract: Symbolic regression refers to the task of finding a closed-form mathematical expression to fit a set of data points. Genetic programming based techniques are the most common algorithms used to tackle this problem, but recently, neural-network based approaches have gained popularity. Most of the leading neural-network based models used for symbolic regression utilize transformer-based autoregressive models to generate an equation conditioned on encoded input points. However, autoregressive generation is limited to generating tokens left-to-right, and future generated tokens are conditioned only on previously generated tokens. Motivated by the desire to generate all tokens simultaneously to produce improved closed-form equations, we propose Symbolic Diffusion, a D3PM-based discrete state-space diffusion model which generates all tokens of the equation at once using discrete token diffusion. Using the bivariate dataset developed for SymbolicGPT, we compared our diffusion-based generation approach to an autoregressive model based on SymbolicGPT, using equivalent encoder and transformer architectures. We demonstrate that our novel approach of using diffusion-based generation for symbolic regression can offer comparable and, by some metrics, improved performance over autoregressive generation in models using similar underlying architectures, opening new research opportunities in neural-network based symbolic regression.
摘要：符号回归是指找到一个封闭形式的数学表达式来拟合一组数据点的任务。基于遗传编程的技术是用于解决此问题的最常见算法，但最近基于神经网络的方法越来越受欢迎。大多数用于符号回归的领先的基于神经网络的模型都利用基于变压器的自回归模型来生成以编码输入点为条件的方程。然而，自回归生成仅限于从左到右生成令牌，并且未来生成的令牌仅以先前生成的令牌为条件。出于同时生成所有标记以生成改进的封闭式方程的愿望，我们提出了符号扩散，这是一种基于 D3PM 的离散状态空间扩散模型，它使用离散标记扩散同时生成方程的所有标记。使用为 SymbolicGPT 开发的双变量数据集，我们使用等效的编码器和转换器架构，将基于扩散的生成方法与基于 SymbolicGPT 的自回归模型进行了比较。我们证明，我们使用基于扩散的生成进行符号回归的新方法可以在使用类似底层架构的模型中提供与自回归生成相当的性能，并且通过某些指标可以提高性能，从而为基于神经网络的符号回归开辟了新的研究机会。
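
Illustrative sketch (not the authors' code): discrete token diffusion corrupts every position of the token sequence independently, which is what allows the reverse model to predict all tokens at once. The example below assumes a uniform transition kernel, for which the cumulative forward process keeps a token with probability equal to the product of (1 - beta_s) over the first t steps and otherwise resamples it uniformly from the vocabulary. Whether the paper uses a uniform or an absorbing kernel is not stated in the abstract, and the token ids and noise schedule here are hypothetical.

```python
import random

def alpha_bar(betas, t):
    """Probability that a token is still untouched after t forward steps."""
    prod = 1.0
    for b in betas[:t]:
        prod *= 1.0 - b
    return prod

def corrupt(tokens, t, betas, vocab_size, rng):
    """Sample x_t from q(x_t | x_0) under a uniform transition kernel.

    Each position is kept with probability alpha_bar(betas, t); otherwise it is
    replaced by a token drawn uniformly from the vocabulary.
    """
    keep = alpha_bar(betas, t)
    return [tok if rng.random() < keep else rng.randrange(vocab_size)
            for tok in tokens]

# Hypothetical example: token ids standing in for an encoded equation.
rng = random.Random(0)
x0 = [3, 7, 1, 4, 7, 2]
betas = [0.05] * 20
x_mid = corrupt(x0, 10, betas, vocab_size=10, rng=rng)
```

A denoising network would then be trained to predict the clean tokens at every position from the corrupted sequence, the timestep, and the encoded data points, in contrast to the left-to-right factorization of an autoregressive decoder.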

Title: LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics

Authors: Chongyu Fan, Changsheng Wang, Yancheng Huang, Soumyadeep Pal, Sijia Liu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2510.07626
Pdf URL: https://arxiv.org/pdf/2510.07626
Copy Paste: [[2510.07626]] LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics(https://arxiv.org/abs/2510.07626)
Keywords: generation, generative
Abstract: Machine unlearning for large language models (LLMs) aims to remove undesired data, knowledge, and behaviors (e.g., for safety, privacy, or copyright) while preserving useful model capabilities. Despite rapid progress over the past two years, research in LLM unlearning remains fragmented, with limited clarity on what constitutes effective unlearning and how it should be rigorously evaluated. In this work, we present a principled taxonomy of twelve recent stateful unlearning methods, grouped into three methodological families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Building on this taxonomy, we revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob), focusing on the WMDP benchmark. Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective, often overstating success while overlooking the model's actual generation behavior. To address this gap, we introduce open question-answering (Open-QA) metrics that better capture generative performance and reveal the inherent UE-UT tradeoff across method families. Furthermore, we demonstrate that robustness requires finer-grained analysis: for example, vulnerabilities differ substantially between in-domain relearning and out-of-domain fine-tuning, even though both fall under model-level attacks. Through this study, we hope to deliver a full-stack revisit of LLM unlearning and actionable guidance for designing and evaluating future methods.
摘要：大型语言模型 (LLM) 的机器取消学习旨在删除不需要的数据、知识和行为（例如，为了安全、隐私或版权），同时保留有用的模型功能。尽管过去两年取得了快速进展，但法学硕士忘却的研究仍然支离破碎，对于什么构成有效忘却以及如何严格评估它的清晰度有限。在这项工作中，我们提出了 12 种最近的状态遗忘方法的原则分类，分为三个方法家族：发散驱动的优化、表示错位和基于拒绝的有针对性的遗忘。在此分类法的基础上，我们重新审视了遗忘有效性 (UE)、效用保留 (UT) 和鲁棒性 (Rob) 的评估，重点关注 WMDP 基准。我们的分析表明，当前的评估以多项选择题 (MCQ) 准确性为主，仅提供了狭隘的视角，往往夸大了成功，而忽视了模型的实际生成行为。为了解决这一差距，我们引入了开放式问答 (Open-QA) 指标，可以更好地捕获生成性能并揭示方法系列之间固有的 UE-UT 权衡。此外，我们证明鲁棒性需要更细粒度的分析：例如，域内重新学习和域外微调之间的漏洞存在很大差异，即使两者都属于模型级攻击。通过这项研究，我们希望对法学硕士的学习进行全栈回顾，并为设计和评估未来方法提供可行的指导。

Title: PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment

Authors: Shashank Gupta, Gregoire Phillips, Alan C. Bovik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07636
Pdf URL: https://arxiv.org/pdf/2510.07636
Copy Paste: [[2510.07636]] PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment(https://arxiv.org/abs/2510.07636)
Keywords: quality assessment
Abstract: Large Multimodal Models (LMMs) have recently enabled considerable advances in the realm of image and video quality assessment, but this progress has yet to be fully explored in the domain of 3D assets. We are interested in using these models to conduct No-Reference Point Cloud Quality Assessment (NR-PCQA), where the aim is to automatically evaluate the perceptual quality of a point cloud in absence of a reference. We begin with the observation that different modalities of data - text descriptions, 2D projections, and 3D point cloud views - provide complementary information about point cloud quality. We then construct PIT-QMM, a novel LMM for NR-PCQA that is capable of consuming text, images and point clouds end-to-end to predict quality scores. Extensive experimentation shows that our proposed method outperforms the state-of-the-art by significant margins on popular benchmarks with fewer training iterations. We also demonstrate that our framework enables distortion localization and identification, which paves a new way forward for model explainability and interactivity. Code and datasets are available at this https URL.
摘要：大型多模态模型 (LMM) 最近在图像和视频质量评估领域取得了相当大的进步，但这一进步尚未在 3D 资产领域得到充分探索。我们有兴趣使用这些模型进行无参考点云质量评估（NR-PCQA），其目的是在没有参考的情况下自动评估点云的感知质量。我们首先观察到不同形式的数据（文本描述、2D 投影和 3D 点云视图）提供了有关点云质量的补充信息。然后，我们构建了 PIT-QMM，这是一种用于 NR-PCQA 的新型 LMM，能够端到端地使用文本、图像和点云来预测质量分数。大量的实验表明，我们提出的方法在流行的基准测试中以更少的训练迭代明显优于最先进的方法。我们还证明了我们的框架能够实现失真定位和识别，这为模型的可解释性和交互性铺平了新的道路。代码和数据集可从此 https URL 获取。

Title: Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection

Authors: Yanjie Pan, Qingdong He, Lidong Wang, Bo Peng, Mingmin Chi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07654
Pdf URL: https://arxiv.org/pdf/2510.07654
Copy Paste: [[2510.07654]] Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection(https://arxiv.org/abs/2510.07654)
Keywords: generation
Abstract: Video virtual try-on aims to replace the clothing of a person in a video with a target garment. Current dual-branch architectures have achieved significant success in diffusion models based on the U-Net; however, adapting them to diffusion models built upon the Diffusion Transformer remains challenging. First, introducing latent space features from the garment reference branch requires adding or modifying the backbone network, leading to a large number of trainable parameters. Second, the latent space features of garments lack inherent temporal characteristics and thus require additional learning. To address these challenges, we propose a novel approach, OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement: specifically, we employ an image-based clothing transfer model to replace the clothing in the initial frame, and then, under the content control of the edited first frame, utilize pose and mask information to guide the temporal prior of the video generation model in synthesizing the remaining frames sequentially. Experiments show that our method achieves superior parameter efficiency and computational efficiency while still maintaining leading performance under these constraints.
摘要：视频虚拟试穿旨在将视频中人物的服装替换为目标服装。当前的双分支架构在基于U-Net的扩散模型中取得了显着的成功；然而，使它们适应基于扩散变压器构建的扩散模型仍然具有挑战性。最初，从服装参考分支引入潜在空间特征需要添加或修改主干网络，从而导致大量可训练参数。随后，服装的潜在空间特征缺乏固有的时间特征，因此需要额外的学习。为了解决这些挑战，我们提出了一种新的方法，OIE（Once is Enough），一种基于第一帧服装替换的虚拟试穿策略：具体来说，我们采用基于图像的服装传输模型来替换初始帧中的服装，然后在编辑的第一帧的内容控制下，利用姿势和掩模信息来指导视频生成模型的时间先验顺序合成剩余帧。实验表明，我们的方法实现了卓越的参数效率和计算效率，同时在这些约束下仍然保持领先的性能。

Title: Controllable Video Synthesis via Variational Inference

Authors: Haoyi Duan, Yunzhi Zhang, Yilun Du, Jiajun Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07670
Pdf URL: https://arxiv.org/pdf/2510.07670
Copy Paste: [[2510.07670]] Controllable Video Synthesis via Variational Inference(https://arxiv.org/abs/2510.07670)
Keywords: generation, generative
Abstract: Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.
摘要：许多视频工作流程受益于不同粒度的用户控件的混合，从精确的 4D 对象轨迹和摄像机路径到粗略的文本提示，而现有的视频生成模型通常针对固定输入格式进行训练。我们开发了一种视频合成方法来满足这一需求，并为指定元素生成具有高可控性的样本，同时保持未指定元素的多样性。我们将任务作为变分推理来近似组合分布，利用多个视频生成主干来共同考虑所有任务约束。为了解决优化挑战，我们将问题分解为退火分布序列上的逐步 KL 散度最小化，并进一步提出了一种上下文条件分解技术，可减少解空间中的模式以规避局部最优。实验表明，与之前的工作相比，我们的方法产生的样本具有更好的可控性、多样性和 3D 一致性。

Title: SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction

Authors: Wenyue Chen, Peng Li, Wangguandong Zheng, Chengfeng Zhao, Mengfei Li, Yaolong Zhu, Zhiyang Dou, Ronggang Wang, Yuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07723
Pdf URL: https://arxiv.org/pdf/2510.07723
Copy Paste: [[2510.07723]] SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction(https://arxiv.org/abs/2510.07723)
Keywords: generation, generative
Abstract: Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and have difficulty handling challenging human poses and reconstructing fine details. In this paper, we propose SyncHuman, a novel framework that combines a 2D multiview generative model and a 3D native generative model for the first time, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging human poses. The multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas the 3D native generative model generates coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with the proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photo-realistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.
摘要：由于固有的模糊性和严重的自遮挡，从单个图像进行逼真的 3D 全身人体重建对于电影和视频游戏中的应用来说是一项关键但具有挑战性的任务。虽然最近的方法利用 SMPL 估计和 SMPL 条件图像生成模型来产生新的视图，但它们受到从 SMPL 网格估计的不准确的 3D 先验的影响，并且难以处理困难的人体姿势和重建精细细节。在本文中，我们提出了 SyncHuman，这是一种新颖的框架，首次结合了 2D 多视图生成模型和 3D 原生生成模型，即使在具有挑战性的人体姿势下，也可以从单视图图像中实现高质量的穿着人体网格重建。多视图生成模型擅长捕捉精细的 2D 细节，但在结构一致性方面存在困难，而 3D 原生生成模型则生成粗糙但结构一致的 3D 形状。通过整合这两种方法的互补优势，我们开发了一个更有效的生成框架。具体来说，我们首先使用提出的像素对齐 2D-3D 同步注意力联合微调多视图生成模型和 3D 原生生成模型，以生成几何对齐的 3D 形状和 2D 多视图图像。为了进一步改善细节，我们引入了一种特征注入机制，可将 2D 多视图图像中的精细细节提升到对齐的 3D 形状上，从而实现准确和高保真度的重建。大量实验表明，SyncHuman 可以实现稳健且逼真的 3D 人体重建，即使对于具有挑战性姿势的图像也是如此。我们的方法在几何精度和视觉保真度方面优于基线方法，为未来 3D 生成模型展示了一个有前途的方向。

Title: GeoGen: A Two-stage Coarse-to-Fine Framework for Fine-grained Synthetic Location-based Social Network Trajectory Generation

Authors: Rongchao Xu, Kunlin Cai, Lin Jiang, Dahai Yu, Zhiqing Hong, Yuan Tian, Guang Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.07735
Pdf URL: https://arxiv.org/pdf/2510.07735
Copy Paste: [[2510.07735]] GeoGen: A Two-stage Coarse-to-Fine Framework for Fine-grained Synthetic Location-based Social Network Trajectory Generation(https://arxiv.org/abs/2510.07735)
Keywords: generation, generative
Abstract: Location-Based Social Network (LBSN) check-in trajectory data are important for many practical applications, like POI recommendation, advertising, and pandemic intervention. However, the high collection costs and ever-increasing privacy concerns prevent us from accessing large-scale LBSN trajectory data. Recent advances in synthetic data generation offer a way around this: generative AI can produce synthetic data that preserves the characteristics of real data while ensuring privacy protection. However, generating synthetic LBSN check-in trajectories remains challenging due to their spatially discrete, temporally irregular nature and the complex spatio-temporal patterns caused by sparse activities and uncertain human mobility. To address this challenge, we propose GeoGen, a two-stage coarse-to-fine framework for large-scale LBSN check-in trajectory generation. In the first stage, we reconstruct spatially continuous, temporally regular latent movement sequences from the original LBSN check-in trajectories and then design a Sparsity-aware Spatio-temporal Diffusion model (S$^2$TDiff) with an efficient denoising network to learn their underlying behavioral patterns. In the second stage, we design Coarse2FineNet, a Transformer-based Seq2Seq architecture equipped with a dynamic context fusion mechanism in the encoder and a multi-task hybrid-head decoder, which generates fine-grained LBSN trajectories based on coarse-grained latent movement sequences by modeling semantic relevance and behavioral uncertainty. Extensive experiments on four real-world datasets show that GeoGen outperforms state-of-the-art models in both fidelity and utility evaluation, e.g., it improves the distance and radius metrics on the FS-TKY dataset by over 69% and 55%, respectively.
摘要：基于位置的社交网络 (LBSN) 签到轨迹数据对于许多实际应用非常重要，例如 POI 推荐、广告和流行病干预。然而，高昂的采集成本和不断增加的隐私问题使我们无法访问大规模的 LBSN 轨迹数据。合成数据生成的最新进展为我们提供了实现这一目标的新机会，即利用生成式人工智能生成合成数据，在保留真实数据特征的同时确保隐私保护。然而，由于其空间离散、时间不规则的性质以及稀疏活动和不确定的人员流动性导致的复杂时空模式，生成合成的 LBSN 签到轨迹仍然具有挑战性。为了应对这一挑战，我们提出了 GeoGen，这是一个用于大规模 LBSN 签入轨迹生成的两阶段从粗到细的框架。在第一阶段，我们从原始 LBSN 签入轨迹重建空间连续、时间规则的潜在运动序列，然后设计一个稀疏感知的时空扩散模型 (S$^2$TDiff) 和一个高效的去噪网络来学习其潜在的行为模式。在第二阶段，我们设计了 Coarse2FineNet，这是一种基于 Transformer 的 Seq2Seq 架构，在编码器和多任务混合头解码器中配备了动态上下文融合机制，它通过对语义相关性和行为不确定性进行建模，基于粗粒度潜在运动序列生成细粒度 LBSN 轨迹。对四个真实世界数据集的广泛实验表明，GeoGen 在保真度和效用评估方面均优于最先进的模型，例如，它在 FS-TKY 数据集上的距离和半径指标分别增加了 69% 和 55% 以上。

Title: A Unified Multi-Task Learning Framework for Generative Auto-Bidding with Validation-Aligned Optimization

Authors: Yiqin Lv, Zhiyu Mou, Miao Xu, Jinghao Chen, Qi Wang, Yixiu Mao, Yun Qu, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng, Xiangyang Ji
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07760
Pdf URL: https://arxiv.org/pdf/2510.07760
Copy Paste: [[2510.07760]] A Unified Multi-Task Learning Framework for Generative Auto-Bidding with Validation-Aligned Optimization(https://arxiv.org/abs/2510.07760)
Keywords: generative
Abstract: In online advertising, heterogeneous advertiser requirements give rise to numerous customized bidding tasks that are typically optimized independently, resulting in extensive computation and limited data efficiency. Multi-task learning offers a principled framework to train these tasks jointly through shared representations. However, existing multi-task optimization strategies are primarily guided by training dynamics and often generalize poorly in volatile bidding environments. To this end, we present Validation-Aligned Multi-task Optimization (VAMO), which adaptively assigns task weights based on the alignment between per-task training gradients and a held-out validation gradient, thereby steering updates toward validation improvement and better matching deployment objectives. We further equip the framework with a periodicity-aware temporal module and couple it with an advanced generative auto-bidding backbone to enhance cross-task transfer of seasonal structure and strengthen bidding performance. Meanwhile, we provide theoretical insights into the proposed method, e.g., convergence guarantee and alignment analysis. Extensive experiments on both simulated and large-scale real-world advertising systems consistently demonstrate significant improvements over typical baselines, illuminating the effectiveness of the proposed approach.
摘要：在在线广告中，异构的广告商需求导致了大量定制的竞价任务，这些任务通常是独立优化的，导致大量的计算和有限的数据效率。多任务学习提供了一个原则框架，可以通过共享表示来联合训练这些任务。然而，现有的多任务优化策略主要以训练动态为指导，并且在不稳定的投标环境中通常泛化能力较差。为此，我们提出了验证对齐多任务优化（VAMO），它根据每个任务训练梯度和保留验证梯度之间的对齐情况自适应地分配任务权重，从而引导更新朝着验证改进和更好地匹配部署目标的方向发展。我们进一步为该框架配备了周期性感知的时间模块，并将其与先进的生成自动投标骨干相结合，以增强季节性结构的跨任务传输并增强投标性能。同时，我们为所提出的方法提供了理论见解，例如收敛保证和对齐分析。对模拟和大规模现实世界广告系统的广泛实验一致证明了相对于典型基线的显着改进，阐明了所提出方法的有效性。
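The weighting idea in the VAMO abstract (task weights driven by how well each task's training gradient aligns with a held-out validation gradient) can be illustrated with a minimal sketch. The abstract does not give the exact rule, so everything here is an assumption made for illustration: cosine similarity as the alignment score, a softmax with a `temperature` parameter to turn scores into weights, and the function names `alignment_weights` and `combined_update`.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot(u, v) / (nu * nv)


def alignment_weights(task_grads, val_grad, temperature=1.0):
    """Softmax over per-task alignment scores (hypothetical weighting rule)."""
    scores = [cosine(g, val_grad) / temperature for g in task_grads]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combined_update(task_grads, weights):
    """Weighted sum of the task gradients, used as the shared update direction."""
    dim = len(task_grads[0])
    return [sum(w * g[i] for w, g in zip(weights, task_grads)) for i in range(dim)]


if __name__ == "__main__":
    # Three toy task gradients: aligned, weakly aligned, and opposed.
    task_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    val_grad = [1.0, 0.2]
    weights = alignment_weights(task_grads, val_grad)
    print(weights, combined_update(task_grads, weights))
```

In this toy case the task whose gradient points along the validation gradient receives the largest weight and the opposed task the smallest, which is the qualitative behavior the abstract describes.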

Title: MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions

Authors: Kaen Kogashi, Anoop Cherian, Meng-Yu Jennifer Kuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07828
Pdf URL: https://arxiv.org/pdf/2510.07828
Copy Paste: [[2510.07828]] MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions(https://arxiv.org/abs/2510.07828)
Keywords: generation
Abstract: Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI -- a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality.
摘要：现实世界的场景通常是多个人以因果、目标导向或合作的方式与多个对象进行交互。然而，现有的 3D 人机交互 (HOI) 基准仅考虑了这些复杂交互的一小部分。为了弥补这一差距，我们提出了 MMHOI——一个大规模的多人多对象交互数据集，由来自 12 个日常场景的图像组成。 MMHOI 为每个人和物体提供完整的 3D 形状和姿势注释，以及 78 个动作类别和 14 个特定于交互的身体部位的标签，为下一代 HOI 研究提供全面的测试平台。在 MMHOI 的基础上，我们提出了 MMHOI-Net，这是一种基于端到端 Transformer 的神经网络，用于联合估计人体物体 3D 几何形状、它们的相互作用和相关动作。我们框架中的一个关键创新是用于建模对象及其交互的结构化双补丁表示，并与动作识别相结合以增强交互预测。在 MMHOI 和最近提出的 CORE4D 数据集上的实验表明，我们的方法在多 HOI 建模中实现了最先进的性能，在准确性和重建质量方面均表现出色。

Title: MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation

Authors: Weisen Jiang, Sinno Jialin Pan
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2510.07835
Pdf URL: https://arxiv.org/pdf/2510.07835
Copy Paste: [[2510.07835]] MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation(https://arxiv.org/abs/2510.07835)
Keywords: generation
Abstract: This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful queries disguised by unseen attack templates, despite LLMs being capable of distinguishing disguised harmful queries in the embedding space. Based on these insights, we propose a two-stage defense approach: (i) pre-generation defense that detects harmful queries before response generation begins, and (ii) mid-generation defense that monitors partial responses during generation to prevent outputting more harmful content. Our MetaDefense trains the LLM to predict the harmfulness of both queries and partial responses using specialized prompts, enabling early termination of potentially harmful interactions. Extensive experiments across multiple LLM architectures (LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct) demonstrate that MetaDefense significantly outperforms existing defense mechanisms, achieving robust defense against harmful queries with seen and unseen attack templates while maintaining competitive performance on benign tasks. Code is available at this https URL.
摘要：本文介绍了 MetaDefense，这是一种用于防御大型语言模型 (LLM) 中基于微调的越狱攻击的新颖框架。我们观察到，尽管 LLM 能够区分嵌入空间中伪装的有害查询，但现有的防御机制无法泛化到由看不见的攻击模板伪装的有害查询。基于这些见解，我们提出了一种两阶段的防御方法：（i）生成前防御，在响应生成开始之前检测有害查询，以及（ii）生成中期防御，在生成期间监视部分响应以防止输出更多有害内容。我们的 MetaDefense 训练法学硕士使用专门的提示来预测查询和部分响应的危害性，从而能够及早终止潜在有害的交互。跨多个 LLM 架构（LLaMA-2-7B、Qwen-2.5-3B-Instruct 和 LLaMA-3.2-3B-Instruct）的大量实验表明，MetaDefense 的性能显着优于现有的防御机制，通过可见和不可见的攻击模板实现对有害查询的强大防御，同时保持良性任务的竞争性能。代码可从此 https URL 获取。

Title: IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries

Authors: Harsh Kavediya, Vighnesh Nayak, Bheeshm Sharma, Balamurugan Palaniappan
Subjects: cs.CV, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2510.07837
Pdf URL: https://arxiv.org/pdf/2510.07837
Copy Paste: [[2510.07837]] IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries(https://arxiv.org/abs/2510.07837)
Keywords: generation
Abstract: Sign language to spoken language audio translation is important to connect the hearing- and speech-challenged humans with others. We consider sign language videos with isolated sign sequences rather than continuous grammatical signing. Such videos are useful in educational applications and sign prompt interfaces. Towards this, we propose IsoSignVid2Aud, a novel end-to-end framework that translates sign language videos with a sequence of possibly non-grammatic continuous signs to speech without requiring intermediate text representation, providing immediate communication benefits while avoiding the latency and cascading errors inherent in multi-stage translation systems. Our approach combines an I3D-based feature extraction module with a specialized feature transformation network and an audio generation pipeline, utilizing a novel Non-Maximal Suppression (NMS) algorithm for the temporal detection of signs in non-grammatic continuous sequences. Experimental results demonstrate competitive performance on ASL-Citizen-1500 and WLASL-100 datasets with Top-1 accuracies of 72.01% and 78.67%, respectively, and audio quality metrics (PESQ: 2.67, STOI: 0.73) indicating intelligible speech output. Code is available at: this https URL.
摘要：手语到口语音频翻译对于将听力和语言障碍的人与其他人联系起来非常重要。我们考虑具有独立手势序列的手语视频，而不是连续的语法手势。此类视频在教育应用和符号提示界面中非常有用。为此，我们提出了 IsoSignVid2Aud，这是一种新颖的端到端框架，可以将手语视频与一系列可能不符合语法的连续手语翻译成语音，而不需要中间文本表示，提供即时的通信优势，同时避免多阶段翻译系统中固有的延迟和级联错误。我们的方法将基于 I3D 的特征提取模块与专门的特征转换网络和音频生成管道相结合，利用新颖的非极大值抑制 (NMS) 算法对非语法连续序列中的符号进行时间检测。实验结果证明了在 ASL-Citizen-1500 和 WLASL-100 数据集上的竞争性能，Top-1 准确率分别为 72.01\% 和 78.67\%，音频质量指标（PESQ：2.67，STOI：0.73）表明语音输出清晰。代码可在以下位置获得：此 https URL。
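The abstract mentions a novel NMS algorithm for locating signs in time but gives no details. For orientation only, the sketch below shows the standard greedy non-maximum suppression over 1-D temporal segments, which is not necessarily the variant the authors propose. The `(start, end, score)` segment format, the `iou_threshold` default, and the function names are assumptions.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two time intervals given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring segment, drop overlapping ones, repeat.

    segments: list of (start, end, score) tuples.
    Returns the kept segments ordered by start time.
    """
    kept = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) <= iou_threshold for k in kept):
            kept.append(seg)
    return sorted(kept, key=lambda s: s[0])


if __name__ == "__main__":
    # Two heavily overlapping detections of one sign, plus a separate later sign.
    detections = [(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (20.0, 30.0, 0.7)]
    print(temporal_nms(detections))
```

Each surviving segment would then correspond to one isolated sign to be classified and voiced.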

Title: GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploration

Authors: Tingfeng Hong, Pingye Ren, Xinlong Xiao, Chao Wang, Chenyi Lei, Wenwu Ou, Han Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.07919
Pdf URL: https://arxiv.org/pdf/2510.07919
Copy Paste: [[2510.07919]] GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploration(https://arxiv.org/abs/2510.07919)
Keywords: generation
Abstract: Overall architecture of the personalized multi-objective ranking system. It comprises: (1) a Feature Center and Prerank Model for initial feature processing and candidate generation; (2) a Multi-Task Learning (MTL) model predicting various user feedback signals; (3) a Multi-Task Fusion (MTF) module (our proposed GRADE framework) that learns personalized weights ($w_1, \dots, w_n$); these weights are then applied to calculate final scores and sorted to generate a blended ranking by the Blended Ranking Model, which ultimately delivers results to users.
摘要：个性化多目标排名系统的总体架构。它包括：（1）用于初始特征处理和候选生成的特征中心和预排序模型； (2) 预测各种用户反馈信号的多任务学习（MTL）模型； (3) 多任务融合 (MTF) 模块（我们提出的 GRADE 框架），用于学习个性化权重 ($w_1, \dots, w_n$)；然后应用这些权重来计算最终分数，并通过混合排名模型进行排序以生成混合排名，最终将结果提供给用户。

Title: TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Authors: Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie, Tat-Seng Chua
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2510.07940
Pdf URL: https://arxiv.org/pdf/2510.07940
Copy Paste: [[2510.07940]] TTOM: Test-Time Optimization and Memorization for Compositional Video Generation(https://arxiv.org/abs/2510.07940)
Keywords: generation
Abstract: Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than intervening directly on latents or attention for each sample, as existing work does, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
摘要：视频基础模型 (VFM) 表现出出色的视觉生成性能，但在合成场景（例如运动、计算能力和空间关系）中表现不佳。在这项工作中，我们引入了测试时优化和记忆（TTOM），这是一种免训练框架，可在推理过程中将 VFM 输出与时空布局对齐，以实现更好的文本图像对齐。我们不是在现有工作中直接干预每个样本的潜在特征或注意力，而是在总体布局注意力目标的指导下集成和优化新参数。此外，我们在流设置中制定视频生成，并通过支持插入、读取、更新和删除等灵活操作的参数化内存机制维护历史优化上下文。值得注意的是，我们发现 TTOM 解开了组合世界知识，显示出强大的可迁移性和泛化性。 T2V-CompBench 和 Vbench 基准测试的实验结果表明 TTOM 是一种有效、实用、可扩展且高效的框架，可实现动态合成视频生成的跨模式对齐。

Title: CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving

Authors: Tianrui Zhang, Yichen Liu, Zilin Guo, Yuxin Guo, Jingcheng Ni, Chenjing Ding, Dan Xu, Lewei Lu, Zehuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07944
Pdf URL: https://arxiv.org/pdf/2510.07944
Copy Paste: [[2510.07944]] CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving(https://arxiv.org/abs/2510.07944)
Keywords: generation, generative
Abstract: Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.
摘要：生成模型已广泛应用于环境模拟和未来状态预测的世界建模。随着自动驾驶技术的进步，人们不仅对在各种控制下生成高保真视频的需求不断增长，而且对生成深度估计等多样化且有意义的信息的需求也不断增长。为了解决这个问题，我们提出了 CVD-STORM，这是一种跨视图视频扩散模型，利用时空重建变分自动编码器 (VAE)，可在各种控制输入下生成具有 4D 重建功能的长期多视图视频。我们的方法首先通过辅助 4D 重建任务微调 VAE，增强其编码 3D 结构和时间动态的能力。随后，我们将此 VAE 集成到视频扩散过程中，以显着提高生成质量。实验结果表明，我们的模型在 FID 和 FVD 指标方面都取得了实质性改进。此外，联合训练的高斯泼溅解码器有效地重建动态场景，为全面的场景理解提供有价值的几何信息。

Title: Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement

Authors: Yidi Liu, Xueyang Fu, Jie Huang, Jie Xiao, Dong Li, Wenlong Zhang, Lei Bai, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.07961
Pdf URL: https://arxiv.org/pdf/2510.07961
Copy Paste: [[2510.07961]] Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement(https://arxiv.org/abs/2510.07961)
Keywords: restoration
Abstract: Ultra-High Definition (UHD) image restoration faces a trade-off between computational efficiency and high-frequency detail retention. While Variational Autoencoders (VAEs) improve efficiency via latent-space processing, their Gaussian constraint often discards degradation-specific high-frequency information, hurting reconstruction fidelity. To overcome this, we propose Latent Harmony, a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware [...] Stage One, we introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradation perturbations, while latent equivariance strengthens high-frequency [...] Two jointly trains this refined VAE with a restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA): an encoder LoRA guided by a fidelity-oriented high-frequency alignment loss to recover authentic details, and a decoder LoRA driven by a perception-oriented loss to synthesize realistic textures. Both LoRA modules are trained via alternating optimization with selective gradient propagation to preserve the pretrained latent [...] inference, a tunable parameter $\alpha$ enables flexible fidelity-perception [...] show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.
摘要：超高清 (UHD) 图像恢复面临着计算效率和高频细节保留之间的权衡。虽然变分自动编码器 (VAE) 通过潜在空间处理提高效率，但其高斯约束通常会丢弃特定于退化的高频信息，从而损害重建保真度。为了克服这个问题，我们提出了Latent Harmony，这是一个两阶段框架，通过联合规范潜在空间并强制执行高频感知的http URL，重新定义了超高清恢复的VAE。第一阶段，我们引入了LH-VAE，它通过视觉语义约束和渐进退化扰动增强了语义鲁棒性，而潜在等方差则增强了高频此http URL。 URL Two 使用高频低阶适应 (HF-LoRA) 联合训练这个精炼的 VAE 和恢复模型：编码器 LoRA 由面向保真度的高频对齐损失引导以恢复真实细节，解码器 LoRA 由面向感知的损失驱动以合成真实纹理。两个 LoRA 模块均通过交替优化和选择性梯度传播进行训练，以保留预先训练的潜在此 http URL 推断，可调参数 {\alpha} 可实现灵活的保真度感知此 http URL 显示潜在和谐在 UHD 和标准分辨率任务中实现了最先进的性能，有效地平衡了效率、感知质量和重建准确性。

Title: Is Architectural Complexity Always the Answer? A Case Study on SwinIR vs. an Efficient CNN

Authors: Chandresh Sutariya, Nitin Singh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07984
Pdf URL: https://arxiv.org/pdf/2510.07984
Copy Paste: [[2510.07984]] Is Architectural Complexity Always the Answer? A Case Study on SwinIR vs. an Efficient CNN(https://arxiv.org/abs/2510.07984)
Keywords: restoration
Abstract: The simultaneous restoration of high-frequency details and suppression of severe noise in low-light imagery presents a significant and persistent challenge in computer vision. While large-scale Transformer models like SwinIR have set the state of the art in performance, their high computational cost can be a barrier for practical applications. This paper investigates the critical trade-off between performance and efficiency by comparing the state-of-the-art SwinIR model against a standard, lightweight Convolutional Neural Network (CNN) on this challenging task. Our experimental results reveal a nuanced but important finding. While the Transformer-based SwinIR model achieves a higher peak performance, with a Peak Signal-to-Noise Ratio (PSNR) of 39.03 dB, the lightweight CNN delivers a surprisingly competitive PSNR of 37.4 dB. Crucially, the CNN reached this performance after converging in only 10 epochs of training, whereas the more complex SwinIR model required 132 epochs. This efficiency is further underscored by the model's size; the CNN is over 55 times smaller than SwinIR. This work demonstrates that a standard CNN can provide a near state-of-the-art result with significantly lower computational overhead, presenting a compelling case for its use in real-world scenarios where resource constraints are a primary concern.
摘要：同时恢复高频细节和抑制低光图像中的严重噪声对计算机视觉提出了重大且持续的挑战。虽然像 SwinIR 这样的大规模 Transformer 模型已经在性能方面达到了最先进的水平，但它们的高计算成本可能成为实际应用的障碍。本文通过在这项具有挑战性的任务中将最先进的 SwinIR 模型与标准的轻量级卷积神经网络 (CNN) 进行比较，研究了性能和效率之间的关键权衡。我们的实验结果揭示了一个微妙但重要的发现。基于 Transformer 的 SwinIR 模型实现了更高的峰值性能，峰值信噪比 (PSNR) 为 39.03 dB，而轻量级 CNN 则提供了令人惊讶的竞争性 PSNR（37.4 dB）。至关重要的是，CNN 仅经过 10 个 epoch 的训练就达到了这一性能，而更复杂的 SwinIR 模型需要 132 个 epoch。模型的大小进一步强调了这种效率； CNN 比 SwinIR 小 55 倍以上。这项工作表明，标准 CNN 可以以显着降低的计算开销提供近乎最先进的结果，为其在资源限制是主要问题的现实场景中的使用提供了令人信服的案例。
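To put the reported PSNR gap in perspective, recall the standard definition PSNR = 10 * log10(MAX^2 / MSE). Under the assumption that both models are scored on the same data with the same peak value, and that the reported figures behave like a PSNR computed from an aggregate MSE, a gap in dB translates directly into a ratio of mean squared errors. The helper names below are illustrative and not taken from the paper.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gap(psnr_high, psnr_low):
    """Ratio MSE_low / MSE_high implied by two PSNR values sharing one peak value."""
    return 10.0 ** ((psnr_high - psnr_low) / 10.0)


if __name__ == "__main__":
    # PSNR values quoted in the abstract: SwinIR 39.03 dB, lightweight CNN 37.4 dB.
    ratio = mse_ratio_from_psnr_gap(39.03, 37.4)
    print(f"CNN MSE is about {ratio:.2f}x the SwinIR MSE")
```

A 1.63 dB gap therefore corresponds to the CNN having roughly 1.46 times the mean squared error of SwinIR, which is the cost set against its smaller size and shorter training.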

Title: Real-Time Motion-Controllable Autoregressive Video Diffusion

Authors: Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, Hanwang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08131
Pdf URL: https://arxiv.org/pdf/2510.08131
Copy Paste: [[2510.08131]] Real-Time Motion-Controllable Autoregressive Video Diffusion(https://arxiv.org/abs/2510.08131)
Keywords: generation
Abstract: Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: this https URL.
摘要：由于双向扩散模型固有的延迟以及缺乏有效的自回归（AR）方法，实时运动可控视频生成仍然具有挑战性。现有的 AR 视频扩散模型仅限于简单的控制信号或文本到视频的生成，并且在几步生成中经常会出现质量下降和运动伪影的问题。为了应对这些挑战，我们提出了 AR-Drag，这是第一个 RL 增强的几步 AR 视频扩散模型，用于通过多种运动控制进行实时图像到视频生成。我们首先微调基本 I2V 模型以支持基本运动控制，然后通过基于轨迹的奖励模型的强化学习进一步改进它。我们的设计通过自推出机制保留了马尔可夫特性，并通过在去噪步骤中选择性地引入随机性来加速训练。大量实验表明，AR-Drag 实现了高视觉保真度和精确的运动对齐，与最先进的运动可控 VDM 相比，显着减少了延迟，同时仅使用 1.3B 参数。其他可视化效果可以在我们的项目页面上找到：此 https URL。

Title: UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

Authors: Shian Du, Menghan Xia, Chang Liu, Quande Liu, Xintao Wang, Pengfei Wan, Xiangyang Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08143
Pdf URL: https://arxiv.org/pdf/2510.08143
Copy Paste: [[2510.08143]] UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution(https://arxiv.org/abs/2510.08143)
Keywords: super-resolution, generation, generative
Abstract: Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.
摘要：级联视频超分辨率已成为一种有前途的技术，可减轻与使用大型基础模型生成高分辨率视频相关的计算负担。然而，现有的研究主要局限于文本到视频的任务，未能利用文本之外的其他生成条件，而这对于确保多模态视频生成的保真度至关重要。我们通过提出 UniMMVSR 来解决这一限制，这是第一个整合混合模式条件（包括文本、图像和视频）的统一生成视频超分辨率框架。我们对潜在视频扩散模型中的条件注入策略、训练方案和数据混合技术进行了全面的探索。一个关键的挑战是设计不同的数据构建和条件利用方法，使模型能够精确地利用所有条件类型，考虑到它们与目标视频的不同相关性。我们的实验表明，UniMMVSR 显着优于现有方法，生成的视频具有出色的细节和对多模态条件的更高程度的符合性。我们还验证了将 UniMMVSR 与基础模型相结合以实现 4K 视频的多模态引导生成的可行性，这是现有技术以前无法实现的壮举。

Title: Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing

Authors: Zhentao Zou, Zhengrong Yue, Kunpeng Du, Binlei Bao, Hanting Li, Haizhen Xie, Guozheng Xu, Yue Zhou, Yali Wang, Jie Hu, Xue Jiang, Xinghao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08157
Pdf URL: https://arxiv.org/pdf/2510.08157
Copy Paste: [[2510.08157]] Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing(https://arxiv.org/abs/2510.08157)
Keywords: generation
Abstract: Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the necessary visual cues to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning where a textual description is followed by a corresponding visual cue, such as a positional mask that defines the intended edited regions or a representation of new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce the Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result. The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.
摘要：使用自然语言进行图像编辑已经非常流行，但由于缺乏明确的推理过程，现有方法难以处理复杂的对象交叉和细粒度的空间关系。虽然思想链 (CoT) 已被探索来增强推理，但纯文本 CoT 或用坐标信息增强的 CoT 从根本上限制了其表示复杂视觉布局的能力，并且缺乏必要的视觉线索来指导细粒度、像素级细节的生成。为了应对这些挑战，我们提出了多模态推理编辑（MURE），这是一种新颖的框架，它将视觉编辑过程从纯粹基于文本的推理转变为一系列交错的文本和视觉原理。我们的框架使用本机多模式、交错的文本图像 CoT 执行图像编辑。这种方法会生成逐步的推理链，其中文本描述后面跟着相应的视觉提示，例如定义预期编辑区域或新内容表示的位置掩码。此外，为了减轻大型语言模型的幻觉现象，我们引入了多模态深度置信度（MMDC）推理范式。该范例在每一步都探索了视觉推理路径树。通过使用奖励模型中的深度置信度分数修剪低质量分支，它可以确保模型始终遵循高质量轨迹实现最终编辑结果。所提出的方法将复杂的编辑任务分解为相互依赖的子任务，在每个阶段实现更高的精度并产生高保真的编辑结果。我们定义了交错文本图像链的公式，并发布了第一个 CoT-Edit-14K 数据集，其中包含 14K 高质量编辑示例。大量的实验表明，我们的方法在三个图像编辑基准测试中取得了显着的改进。

Title: Bidirectional Representations Augmented Autoregressive Biological Sequence Generation: Application in De Novo Peptide Sequencing

Authors: Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Zhi Jin, ZhiQiang Gao, Nanqing Dong, Siqi Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.08169
Pdf URL: https://arxiv.org/pdf/2510.08169
Copy Paste: [[2510.08169]] Bidirectional Representations Augmented Autoregressive Biological Sequence Generation: Application in De Novo Peptide Sequencing(https://arxiv.org/abs/2510.08169)
Keywords: generation, generative
Abstract: Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non-Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non-autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross-decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine-species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at this https URL.
摘要：自回归 (AR) 模型在序列生成中很常见，但由于其单向性质，在许多生物任务中受到限制，例如从头肽测序和蛋白质建模，无法捕获关键的全局双向标记依赖性。非自回归 (NAR) 模型提供整体、双向表示，但面临生成一致性和可扩展性的挑战。为了超越这一点，我们提出了一种混合框架，通过动态集成来自非自回归机制的丰富上下文信息来增强 AR 生成。我们的方法将共享输入编码器与两个解码器结合起来：一个非自回归解码器学习潜在的双向生物特征，一个 AR 解码器利用这些双向特征合成生物序列。新颖的跨解码器注意力模块使 AR 解码器能够迭代查询和集成这些双向特征，从而丰富其预测。这种协同作用是通过量身定制的训练策略来培养的，其中包括用于平衡目标的重要性退火和用于稳定、集中学习的跨解码器梯度阻塞。对从头肽测序的九种物种基准的评估表明，我们的模型大大超过了 AR 和 NAR 基线。它以独特的方式将 AR 稳定性与 NAR 上下文感知相结合，为不同的下游数据提供强大、卓越的性能。这项研究推进了生物序列建模技术，并为增强 AR 模型提供了一种新颖的架构范例，增强了对复杂序列生成的双向理解。代码可从此 https URL 获取。

Title: Expressive Value Learning for Scalable Offline Reinforcement Learning

Authors: Nicolas Espinosa-Dice, Kiante Brantley, Wen Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08218
Pdf URL: https://arxiv.org/pdf/2510.08218
Copy Paste: [[2510.08218]] Expressive Value Learning for Scalable Offline Reinforcement Learning(https://arxiv.org/abs/2510.08218)
Keywords: generative
Abstract: Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we consider the question of how to develop a scalable offline RL approach without relying on distillation or backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. EVOR learns an optimal, regularized Q-function via flow matching during training. At inference time, EVOR extracts the policy via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.
摘要：强化学习 (RL) 是学习做出一系列决策的强大范例。然而，强化学习尚未在机器人技术中得到充分利用，主要是由于其缺乏可扩展性。离线强化学习提供了一条有前途的途径，通过在大型、多样化的数据集上训练智能体，避免在线强化学习昂贵的现实世界交互。将离线强化学习扩展到日益复杂的数据集需要富有表现力的生成模型，例如扩散和流匹配。然而，现有方法通常依赖于时间反向传播 (BPTT)（计算量过高）或策略蒸馏（会引入复合错误并限制更大基础策略的可扩展性）。在本文中，我们考虑如何开发可扩展的离线强化学习方法而不依赖于时间蒸馏或反向传播的问题。我们引入了离线强化学习的表达价值学习（EVOR）：一种可扩展的离线强化学习方法，集成了表达策略和表达价值函数。 EVOR 在训练期间通过流匹配学习最佳的正则化 Q 函数。在推理时，EVOR 通过针对表达值函数的拒绝采样来执行推理时策略提取，从而无需重新训练即可实现高效优化、正则化和计算可扩展搜索。根据经验，我们表明 EVOR 在多种离线 RL 任务上的表现优于基线，证明了将表达价值学习集成到离线 RL 中的好处。
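The inference-time step in the EVOR abstract (extracting a policy by sampling against a learned value function, with no retraining) can be sketched in its simplest form as best-of-N selection: draw several candidate actions from a base policy and keep the one the Q-function scores highest. Whether EVOR uses exactly this acceptance rule is not stated in the abstract, so treat it as an assumption; `sample_action`, `q_value`, and `num_candidates` are hypothetical names, and the toy policy and Q-function exist only to make the sketch runnable.

```python
import random


def extract_action(state, sample_action, q_value, num_candidates=8, rng=None):
    """Pick the candidate action with the highest Q-value (best-of-N selection).

    sample_action(state, rng) -> action   stands in for the base policy
    q_value(state, action)    -> float    stands in for the learned Q-function
    """
    if rng is None:
        rng = random.Random(0)
    candidates = [sample_action(state, rng) for _ in range(num_candidates)]
    return max(candidates, key=lambda action: q_value(state, action))


def toy_policy(state, rng):
    """Toy base policy: uniform actions in [-1, 1]."""
    return rng.uniform(-1.0, 1.0)


def toy_q(state, action):
    """Toy Q-function that prefers actions close to 0.5."""
    return -(action - 0.5) ** 2


if __name__ == "__main__":
    for n in (1, 8, 64):
        action = extract_action(None, toy_policy, toy_q, num_candidates=n,
                                rng=random.Random(0))
        print(n, round(action, 3), round(toy_q(None, action), 4))
```

Raising `num_candidates` spends more inference compute for a better-scoring action, which is one way to read the abstract's "compute-scalable search without retraining".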

Title: Fine-grained text-driven dual-human motion generation via dynamic hierarchical interaction

Authors: Mu Li, Yin Wang, Zhiying Leng, Jiapeng Liu, Frederick W. B. Li, Xiaohui Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08260
Pdf URL: https://arxiv.org/pdf/2510.08260
Copy Paste: [[2510.08260]] Fine-grained text-driven dual-human motion generation via dynamic hierarchical interaction(https://arxiv.org/abs/2510.08260)
Keywords: generation
Abstract: Human interaction is inherently dynamic and hierarchical, where dynamic refers to how motion changes with distance, and the hierarchy runs from individual to inter-individual and ultimately to overall motion. Exploiting these properties is vital for dual-human motion generation, yet existing methods mostly model human interaction as temporally invariant, ignoring distance and hierarchy. To address this, we propose FineDual, a fine-grained, tri-stage dual-human motion generation method that models the dynamic hierarchical interaction from individual to inter-individual. The first stage, Self-Learning Stage, divides the dual-human overall text into individual texts through a Large Language Model, aligning text features and motion features at the individual level. The second stage, Adaptive Adjustment Stage, predicts interaction distance by an interaction distance predictor, modeling human interactions dynamically at the inter-individual level by an interaction-aware graph network. The last stage, Teacher-Guided Refinement Stage, utilizes overall text features as guidance to refine motion features at the overall level, generating fine-grained and high-quality dual-human motion. Extensive quantitative and qualitative evaluations on dual-human motion datasets demonstrate that our proposed FineDual outperforms existing approaches, effectively modeling dynamic hierarchical human interaction.

Title: Adaptive Gradient Calibration for Single-Positive Multi-Label Learning in Remote Sensing Image Scene Classification

Authors: Chenying Liu, Gianmarco Perantoni, Lorenzo Bruzzone, Xiao Xiang Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08269
Pdf URL: https://arxiv.org/pdf/2510.08269
Copy Paste: [[2510.08269]] Adaptive Gradient Calibration for Single-Positive Multi-Label Learning in Remote Sensing Image Scene Classification(https://arxiv.org/abs/2510.08269)
Keywords: generation
Abstract: Multi-label classification (MLC) offers a more comprehensive semantic understanding of Remote Sensing (RS) imagery compared to traditional single-label classification (SLC). However, obtaining complete annotations for MLC is particularly challenging due to the complexity and high cost of the labeling process. As a practical alternative, single-positive multi-label learning (SPML) has emerged, where each image is annotated with only one relevant label, and the model is expected to recover the full set of labels. While scalable, SPML introduces significant supervision ambiguity, demanding specialized solutions for model training. Although various SPML methods have been proposed in the computer vision domain, research in the RS context remains limited. To bridge this gap, we propose Adaptive Gradient Calibration (AdaGC), a novel and generalizable SPML framework tailored to RS imagery. AdaGC adopts a gradient calibration (GC) mechanism combined with Mixup and a dual exponential moving average (EMA) module for robust pseudo-label generation. To maximize AdaGC's effectiveness, we introduce a simple yet theoretically grounded indicator to adaptively trigger GC after an initial warm-up stage based on training dynamics, thereby guaranteeing the effectiveness of GC in mitigating overfitting to label noise. Extensive experiments on two benchmark RS datasets under two distinct label noise types demonstrate that AdaGC achieves state-of-the-art (SOTA) performance while maintaining strong robustness across diverse settings.
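
The AdaGC abstract names Mixup and an exponential moving average (EMA) as ingredients of its pseudo-label generation. As a minimal sketch of those two standard building blocks only, the Python below shows their textbook forms. The function names and the toy values are assumptions made for illustration; the abstract does not specify how AdaGC wires them together, what makes its EMA module "dual", or how gradient calibration is triggered, so none of that is shown.

```python
def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two samples and their label vectors with weight lam in [0, 1]."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def ema_update(running, new, decay):
    """Exponential moving average: keep `decay` of the old value, blend in the new one."""
    return [decay * r + (1.0 - decay) * n for r, n in zip(running, new)]


# Toy usage (hypothetical values): mix two one-hot labelled samples,
# then smooth the resulting soft label into a running estimate.
x, y = mixup([1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0], [0.0, 0.0, 1.0], lam=0.25)
smoothed = ema_update([0.0, 0.0, 0.0], y, decay=0.9)
```

A higher `decay` makes the running estimate change more slowly, which is the usual reason EMA is used to stabilise noisy per-step predictions.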

Title: Bridging the Physics-Data Gap with FNO-Guided Conditional Flow Matching: Designing Inductive Bias through Hierarchical Physical Constraints

Authors: Tsuyoshi Okita
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.08295
Pdf URL: https://arxiv.org/pdf/2510.08295
Copy Paste: [[2510.08295]] Bridging the Physics-Data Gap with FNO-Guided Conditional Flow Matching: Designing Inductive Bias through Hierarchical Physical Constraints(https://arxiv.org/abs/2510.08295)
Keywords: generation, generative
Abstract: Conventional time-series generation often ignores domain-specific physical constraints, limiting statistical and physical consistency. We propose a hierarchical framework that embeds the inherent hierarchy of physical laws (conservation, dynamics, boundary, and empirical relations) directly into deep generative models, introducing a new paradigm of physics-informed inductive bias. Our method combines Fourier Neural Operators (FNOs) for learning physical operators with Conditional Flow Matching (CFM) for probabilistic generation, integrated via time-dependent hierarchical constraints and FNO-guided corrections. Experiments on harmonic oscillators, human activity recognition, and lithium-ion battery degradation show 16.3% higher generation quality, 46% fewer physics violations, and 18.5% improved predictive accuracy over baselines.
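
For readers unfamiliar with Conditional Flow Matching, the sketch below shows the commonly used straight-line variant of the training target: a point is interpolated between a noise sample and a data sample, and the model is asked to predict the constant velocity between them. This is an assumption-laden illustration of generic CFM, not the paper's method. The function names are hypothetical, and the FNO guidance and hierarchical physical constraints described in the abstract are not shown.

```python
def cfm_pair(x0, x1, t):
    """Return the point x_t on the straight path from x0 to x1 and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(velocity_fn, x0, x1, t):
    """Mean squared error between a model's predicted velocity and the target velocity."""
    x_t, u_t = cfm_pair(x0, x1, t)
    v = velocity_fn(x_t, t)
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u_t)) / len(u_t)


# Toy usage (hypothetical values): x0 plays the role of noise, x1 of a data sample.
x0, x1 = [0.0, 0.0], [2.0, 4.0]
oracle = lambda x_t, t: [2.0, 4.0]      # a "perfect" velocity model for this single pair
zero_model = lambda x_t, t: [0.0, 0.0]  # an untrained model that predicts no motion
loss_oracle = cfm_loss(oracle, x0, x1, t=0.5)
loss_zero = cfm_loss(zero_model, x0, x1, t=0.5)
```

In practice `velocity_fn` would be a neural network and the loss would be averaged over random draws of `x0`, `x1`, and `t`.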

Title: LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation

Authors: Yushi Huang, Xingtong Ge, Ruihao Gong, Chengtao Lv, Jun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08318
Pdf URL: https://arxiv.org/pdf/2510.08318
Copy Paste: [[2510.08318]] LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation(https://arxiv.org/abs/2510.08318)
Keywords: generation
Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
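
The cost gap between quadratic and linear attention mentioned in the LinVideo abstract can be illustrated with a generic kernelized linear attention. The sketch below is a minimal pure-Python illustration, assuming the widely used `elu(x) + 1` feature map; it is not LinVideo's layer, and the selective transfer and ADM objective from the abstract are not shown. Keys and values are summarized once into a small matrix `S` and vector `z`, so the work grows linearly with the number of tokens instead of requiring an n-by-n score matrix.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, which keeps every feature strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Kernelized attention computed in O(n) by summarizing keys and values once."""
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k_i) * v_i^T
    z = [0.0] * d                       # running sum of phi(k_i)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom for b in range(dv)])
    return out


def quadratic_reference(Q, K, V):
    """Same kernel attention computed the O(n^2) way, one weight per query-key pair."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / total for j in range(len(V[0]))])
    return out
```

Both functions compute the same result; only the order of the multiplications differs, which is where the linear cost comes from. Note that this kernel attention is not numerically equal to softmax attention, which is consistent with the abstract's point that swapping layers needs a recovery procedure.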

Title: Hyperspectral data augmentation with transformer-based diffusion models

Authors: Mattia Ferrari, Lorenzo Bruzzone
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08363
Pdf URL: https://arxiv.org/pdf/2510.08363
Copy Paste: [[2510.08363]] Hyperspectral data augmentation with transformer-based diffusion models(https://arxiv.org/abs/2510.08363)
Keywords: generation, generative
Abstract: The introduction of new generation hyperspectral satellite sensors, combined with advancements in deep learning methodologies, has significantly enhanced the ability to discriminate detailed land-cover classes at medium-large scales. However, a significant challenge in deep learning methods is the risk of overfitting when training networks with small labeled datasets. In this work, we propose a data augmentation technique that leverages a guided diffusion model. To effectively train the model with a limited number of labeled samples and to capture complex patterns in the data, we implement a lightweight transformer network. Additionally, we introduce a modified weighted loss function and an optimized cosine variance scheduler, which facilitate fast and effective training on small datasets. We evaluate the effectiveness of the proposed method on a forest classification task with 10 different forest types using hyperspectral images acquired by the PRISMA satellite. The results demonstrate that the proposed method outperforms other data augmentation techniques in both average and weighted average accuracy. The effectiveness of the method is further highlighted by the stable training behavior of the model, which addresses a common limitation in the practical application of deep generative models for data augmentation.

Title: Guided Star-Shaped Masked Diffusion

Authors: Viacheslav Meshchaninov, Egor Shibaev, Artem Makoian, Ivan Klimov, Danil Sheshenya, Andrei Malinin, Nikita Balagansky, Daniil Gavrilov, Aibek Alanov, Dmitry Vetrov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.08369
Pdf URL: https://arxiv.org/pdf/2510.08369
Copy Paste: [[2510.08369]] Guided Star-Shaped Masked Diffusion(https://arxiv.org/abs/2510.08369)
Keywords: generation
Abstract: The performance of pre-trained masked diffusion models is often constrained by their sampling procedure, which makes decisions irreversible and struggles in low-step generation regimes. We introduce a novel sampling algorithm that works with pre-trained models and, after a lightweight fine-tuning of a single layer, significantly improves sample quality and efficiency. Our method reformulates the generation process using a star-shaped paradigm, which inherently allows for error correction. To make this process effective, we augment it with a learnable re-masking scheduler that intelligently identifies and revises likely errors. This approach yields a substantial quality boost, particularly when using a small number of sampling steps. We extensively ablate key components of our approach and show its usability in different scenarios. In comprehensive experiments on text and code generation, our sampling algorithm outperforms or matches existing methods.

Title: UniVideo: Unified Understanding, Generation, and Editing for Videos

Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08377
Pdf URL: https://arxiv.org/pdf/2510.08377
Copy Paste: [[2510.08377]] UniVideo: Unified Understanding, Generation, and Editing for Videos(https://arxiv.org/abs/2510.08377)
Keywords: generation
Abstract: Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.

Title: FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts

Authors: Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.08396
Pdf URL: https://arxiv.org/pdf/2510.08396
Copy Paste: [[2510.08396]] FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts(https://arxiv.org/abs/2510.08396)
Keywords: generation
Abstract: Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains -- general knowledge understanding, scientific question answering, mathematical reasoning, and code generation -- demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at this https URL.
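
To make the FlyLoRA description more concrete, the sketch below shows one plausible reading of "a frozen sparse random projection" followed by "rank-wise expert activation". It is an illustration under stated assumptions, not the authors' implementation: selecting the `k` ranks with the largest magnitude is an assumption, as are the function names, the +/-1 entries, and the `nnz_per_row` parameter. Here `A` is the frozen projection of shape r x d_in and `B` is the trainable matrix of shape d_out x r.

```python
import random


def sparse_random_projection(r, d_in, nnz_per_row, seed=0):
    """Frozen r x d_in matrix with a few random +/-1 entries per row (never trained)."""
    rng = random.Random(seed)
    A = [[0.0] * d_in for _ in range(r)]
    for row in A:
        for j in rng.sample(range(d_in), nnz_per_row):
            row[j] = rng.choice((-1.0, 1.0))
    return A


def flylora_style_update(x, A, B, k):
    """Project x with frozen A, keep the k largest-magnitude ranks, then apply B."""
    h = [sum(a * xi for a, xi in zip(row, x)) for row in A]  # projection, length r
    # Assumed routing rule: the projection itself picks the active ranks.
    active = sorted(range(len(h)), key=lambda i: abs(h[i]), reverse=True)[:k]
    # Only the selected columns of B (the active rank-wise "experts") contribute.
    delta = [sum(b_row[i] * h[i] for i in active) for b_row in B]
    return delta, sorted(active)


# Toy usage (hypothetical values) with a hand-written A so the result is easy to follow.
A_demo = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B_demo = [[1.0, 1.0, 1.0], [1.0, 0.0, 2.0]]
delta, active = flylora_style_update([3.0, -1.0, 0.5], A_demo, B_demo, k=2)
```

Because the routing decision falls out of the frozen projection, no separate router parameters are needed in this sketch, which matches the abstract's claim of an implicit router.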

Title: VideoVerse: How Far is Your T2V Generator from a World Model?

Authors: Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08398
Pdf URL: https://arxiv.org/pdf/2510.08398
Copy Paste: [[2510.08398]] VideoVerse: How Far is Your T2V Generator from a World Model?(https://arxiv.org/abs/2510.08398)
Keywords: generation
Abstract: The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical for building "world models", has made existing benchmarks increasingly insufficient for evaluating state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human-preference-aligned, QA-based evaluation pipeline is developed using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis on how far the current T2V generators are from world models.

Title: Biology-driven assessment of deep learning super-resolution imaging of the porosity network in dentin

Authors: Lauren Anderson, Lucas Chatelain, Nicolas Tremblay, Kathryn Grandfield, David Rousseau, Aurélien Gourrier
Subjects: cs.LG, cs.CV, q-bio.TO
Abstract URL: https://arxiv.org/abs/2510.08407
Pdf URL: https://arxiv.org/pdf/2510.08407
Copy Paste: [[2510.08407]] Biology-driven assessment of deep learning super-resolution imaging of the porosity network in dentin(https://arxiv.org/abs/2510.08407)
Keywords: super-resolution, generation, quality assessment
Abstract: The mechanosensory system of teeth is currently believed to rely partly on the stimulation of odontoblast cells by fluid flow through a porosity network extending through dentin. Visualizing the smallest sub-microscopic porosity vessels therefore requires the highest achievable resolution from confocal fluorescence microscopy, the current gold standard. This considerably limits the extent of the field of view to very small sample regions. To overcome this limitation, we tested different deep learning (DL) super-resolution (SR) models to allow faster experimental acquisitions of lower resolution images and restore optimal image quality by post-processing. Three supervised 2D SR models (RCAN, pix2pix, FSRCNN) and one unsupervised model (CycleGAN) were applied to a unique set of experimentally paired high- and low-resolution confocal images acquired with different sampling schemes, resulting in a pixel size increase of x2, x4, and x8. Model performance was quantified using a broad set of similarity and distribution-based image quality assessment (IQA) metrics, which yielded inconsistent results that mostly contradicted our visual perception. This raises the question of the relevance of such generic metrics to efficiently target the specific structure of dental porosity. To resolve this conflicting information, the generated SR images were segmented taking into account the specific scales and morphology of the porosity network and analysed by comparing connected components. Additionally, the capacity of the SR models to preserve 3D porosity connectivity throughout the confocal image stacks was evaluated using graph analysis. This biology-driven assessment allowed a far better mechanistic interpretation of SR performance, highlighting differences in model sensitivity to weak intensity features and the impact of non-linearity in image generation, which explains the failure of standard IQA metrics.

Title: Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Authors: Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08431
Pdf URL: https://arxiv.org/pdf/2510.08431
Copy Paste: [[2510.08431]] Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency(https://arxiv.org/abs/2510.08431)
Keywords: generation
Abstract: This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only 1 to 4 steps, accelerating diffusion sampling by 15x to 50x. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.

Title: Synthetic Series-Symbol Data Generation for Time Series Foundation Models

Authors: Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08445
Pdf URL: https://arxiv.org/pdf/2510.08445
Copy Paste: [[2510.08445]] Synthetic Series-Symbol Data Generation for Time Series Foundation Models(https://arxiv.org/abs/2510.08445)
Keywords: generation
Abstract: Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at this https URL.

Title: SummDiff: Generative Modeling of Video Summarization with Diffusion

Authors: Kwanseok Kim, Jaehoon Hahm, Sumin Kim, Jinhwan Sul, Byunghak Kim, Joonseok Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.08458
Pdf URL: https://arxiv.org/pdf/2510.08458
Copy Paste: [[2510.08458]] SummDiff: Generative Modeling of Video Summarization with Diffusion(https://arxiv.org/abs/2510.08458)
Keywords: generation, generative
Abstract: Video summarization is the task of shortening a video by choosing a subset of frames while preserving its essential moments. Previous works have deterministically regressed to a frame score averaged over multiple raters, ignoring the inherent subjectivity of what constitutes a good summary. We propose a novel problem formulation by framing video summarization as a conditional generation task, allowing a model to learn the distribution of good summaries and to generate multiple plausible summaries that better reflect varying human perspectives. Adopting diffusion models for the first time in video summarization, our proposed method, SummDiff, dynamically adapts to visual contexts and generates multiple candidate summaries conditioned on the input video. Extensive experiments demonstrate that SummDiff not only achieves state-of-the-art performance on various benchmarks but also produces summaries that closely align with individual annotator preferences. Moreover, we provide a deeper insight with novel metrics from an analysis of the knapsack, which is an important last step of generating summaries but has been overlooked in evaluation.
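
The "knapsack" step the SummDiff abstract refers to is, in common video summarization pipelines, a 0/1 knapsack that turns predicted importance scores into a summary under a length budget. The sketch below is a generic dynamic-programming version of that step, written under assumptions: shot-level scores, integer shot lengths, and the function name `select_shots` are all illustrative choices, and the paper's own metrics derived from this step are not reproduced here.

```python
def select_shots(scores, lengths, budget):
    """0/1 knapsack: pick shots that maximize total score with total length <= budget."""
    n = len(scores)
    # best[i][c] = best total score using the first i shots within capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip shot i-1
            if lengths[i - 1] <= c:      # option 2: take it if it fits
                cand = best[i - 1][c - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][c]:
                    best[i][c] = cand
    # Walk the table backwards to recover which shots were taken.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen), best[n][budget]


# Toy usage (hypothetical values): four shots, summary budget of 9 frames.
chosen, total = select_shots([0.9, 0.5, 0.8, 0.3], [5, 3, 4, 2], budget=9)
```

Because the selection is a hard combinatorial choice, small changes in predicted scores can flip which shots are kept, which is one reason this last step can matter for evaluation.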

Title: MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration

Authors: Lu Liu, Chunlei Cai, Shaocheng Shen, Jianfeng Liang, Weimin Ouyang, Tianxiao Ye, Jian Mao, Huiyu Duan, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08508
Pdf URL: https://arxiv.org/pdf/2510.08508
Copy Paste: [[2510.08508]] MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration(https://arxiv.org/abs/2510.08508)
Keywords: restoration, quality assessment
Abstract: Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first Mixture-of-Agents Video Restoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the Restored Video Quality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.

Title: FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control

Authors: Zhiyuan Zhang, Can Wang, Dongdong Chen, Jing Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08527
Pdf URL: https://arxiv.org/pdf/2510.08527
Copy Paste: [[2510.08527]] FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control(https://arxiv.org/abs/2510.08527)
Keywords: generation
Abstract: We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned conditions. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control, and mesh animations.

Title: Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing

Authors: Rishubh Parihar, Or Patashnik, Daniil Ostashev, R. Venkatesh Babu, Daniel Cohen-Or, Kuan-Chieh Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08532
Pdf URL: https://arxiv.org/pdf/2510.08532
Copy Paste: [[2510.08532]] Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing(https://arxiv.org/abs/2510.08532)
Keywords: generative
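Abstract: Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength, which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction-driven editing, from subtle to strong, across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.
Summary: Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. However, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that adds a new dimension of control over edit strength, letting users adjust an edit gradually, smoothly, and continuously from no change to a fully realized result. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength, which is paired with the edit instruction to give explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. To train the model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach to fine-grained control over edit strength for instruction-driven editing, from subtle to strong, across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.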

Title: MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Authors: Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08540
Pdf URL: https://arxiv.org/pdf/2510.08540
Copy Paste: [[2510.08540]] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization(https://arxiv.org/abs/2510.08540)
Keywords: generation
Abstract: While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples across 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
Summary: Although current Multimodal Large Language Models (MLLMs) have shown proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Using a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark of 1,260 samples across 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark show that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting it. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality reflective reasoning traces for the instruction-tuning stage. Because standard Reinforcement Learning fails on complex tasks owing to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy lets the model learn from expert data when rewards are sparse and explore independently once proficient. Applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and generalizes well, with a +5.7% average performance gain on general mathematics and logic tasks. Our work shows that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for more capable MLLMs.
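The abstract describes AHPO only at a high level: use expert supervision while rewards are sparse, then switch to independent exploration. The sketch below is one possible reading, not the authors' algorithm. The function name `hybrid_loss`, the success-rate gate, and the `success_threshold` value are all assumptions introduced for illustration.

```python
def hybrid_loss(rl_loss, sft_loss, rollout_rewards, success_threshold=0.5):
    """Combine an online RL loss with an offline expert (SFT) loss.

    Hypothetical gate: the expert term is switched on only while the
    fraction of rewarded rollouts is below `success_threshold`.
    """
    # Fraction of rollouts that received a positive reward.
    success_rate = sum(r > 0 for r in rollout_rewards) / len(rollout_rewards)
    # Sparse rewards: lean on expert data. Otherwise explore alone.
    use_expert = success_rate < success_threshold
    total = rl_loss + (sft_loss if use_expert else 0.0)
    return total, use_expert


# Mostly failed rollouts: the expert term is added.
sparse_total, sparse_gate = hybrid_loss(1.0, 2.0, [0, 0, 0, 1])
# Mostly successful rollouts: only the RL term remains.
dense_total, dense_gate = hybrid_loss(1.0, 2.0, [1, 1, 0, 1])
```

A hard gate keeps the example short; a smooth weight that decays with the success rate would be an equally plausible reading of "dynamically unifies".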

Title: Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

Authors: Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.08554
Pdf URL: https://arxiv.org/pdf/2510.08554
Copy Paste: [[2510.08554]] Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization(https://arxiv.org/abs/2510.08554)
Keywords: generation
Abstract: Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
Summary: Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because the likelihood is intractable. Pioneering work such as diffu-GRPO estimates token-level likelihoods via one-step unmasking. Although computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet despite this clean mathematical connection, ELBO-based methods have seen limited adoption because likelihood evaluation is prohibitively expensive. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO uses simple yet effective semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
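The variance argument above can be illustrated on a toy double integral. The sketch below is not GDPO or a real ELBO. It assumes a made-up integrand `integrand(t, z)` in which `t` plays the role of a "pivotal" dimension (most of the variance) and `z` a residual random dimension, and it compares plain double Monte Carlo against a semi-deterministic scheme that replaces sampling over `t` with midpoint quadrature.

```python
import numpy as np


def integrand(t, z):
    # Hypothetical stand-in for a per-(t, z) loss term.
    # Most of the variability comes from t, very little from z.
    return 10.0 * t**2 + 0.1 * z


def double_mc(rng, n):
    # Vanilla double Monte Carlo: sample both t and z at random.
    t = rng.uniform(0.0, 1.0, size=n)
    z = rng.standard_normal(n)
    return integrand(t, z).mean()


def semi_deterministic(rng, n):
    # Deterministic midpoint nodes over t, random samples over z only.
    t = (np.arange(n) + 0.5) / n
    z = rng.standard_normal(n)
    return integrand(t, z).mean()


rng = np.random.default_rng(0)
trials, budget = 2000, 8  # 8 integrand evaluations per estimate
est_double = np.array([double_mc(rng, budget) for _ in range(trials)])
est_semi = np.array([semi_deterministic(rng, budget) for _ in range(trials)])

var_double = est_double.var()
var_semi = est_semi.var()
print(f"double MC variance:          {var_double:.4f}")
print(f"semi-deterministic variance: {var_semi:.6f}")
```

With the same budget of eight evaluations, the quadrature over `t` removes the dominant variance term at the cost of a small deterministic bias. That is the trade-off the abstract alludes to, shown here only for this toy integrand.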

Title: VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Authors: Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08555
Pdf URL: https://arxiv.org/pdf/2510.08555
Copy Paste: [[2510.08555]] VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning(https://arxiv.org/abs/2510.08555)
Keywords: generation
Abstract: We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
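Summary: We introduce the task of arbitrary spatio-temporal video completion, in which a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks (including first-frame image-to-video, inpainting, extension, and interpolation) under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments show that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.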
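To make "a continuous fractional position within the latent sequence" concrete, the sketch below applies a standard rotary position embedding (RoPE) at non-integer positions. It is an illustration under stated assumptions, not the VideoCanvas implementation. The mapping `pixel_frame / temporal_stride`, the stride of 4, and the helper names `fractional_position` and `rope` are all hypothetical.

```python
import numpy as np


def fractional_position(pixel_frame, temporal_stride=4):
    # Assumed mapping: a VAE that compresses `temporal_stride` pixel
    # frames into one latent frame places pixel frame p at latent
    # position p / temporal_stride, which may be non-integer.
    return pixel_frame / temporal_stride


def rope(vec, position, base=10000.0):
    # Rotary embedding ("rotate-half" layout) at a real-valued position.
    half = vec.shape[0] // 2
    inv_freq = base ** (-np.arange(half) / half)
    angle = position * inv_freq
    x1, x2 = vec[:half], vec[half:]
    return np.concatenate([
        x1 * np.cos(angle) - x2 * np.sin(angle),
        x1 * np.sin(angle) + x2 * np.cos(angle),
    ])


rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

pos_q = fractional_position(10)  # 2.5: between latent frames 2 and 3
pos_k = fractional_position(4)   # 1.0: exactly on latent frame 1

# RoPE attention scores depend only on the position difference,
# and that still holds for fractional positions.
score = rope(q, pos_q) @ rope(k, pos_k)
score_shifted = rope(q, pos_q + 2.0) @ rope(k, pos_k + 2.0)
```

Because the rotation angle is linear in the position, nothing in RoPE requires integer indices, which is what makes interpolating positions possible without adding parameters.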

Title: MultiCOIN: Multi-Modal COntrollable Video INbetweening

Authors: Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.08561
Pdf URL: https://arxiv.org/pdf/2510.08561
Copy Paste: [[2510.08561]] MultiCOIN: Multi-Modal COntrollable Video INbetweening(https://arxiv.org/abs/2510.08561)
Keywords: generative
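Abstract: Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
Summary: Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing work in this domain cannot generate large, complex, or intricate motions. In particular, it cannot accommodate the diversity of user intents and generally lacks fine control over the details of intermediate frames, leading to misalignment with the creator's intent. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while balancing flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model because of its proven ability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse, user-friendly point-based representation used as the video/noise input. Further, to respect the variety of controls, which operate at different levels of granularity and influence, we separate content controls and motion controls into two branches that encode the required features before guiding the denoising process, resulting in two generators, one for motion and one for content. Finally, we propose a stage-wise training strategy to ensure that the model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments show that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.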

Title: Who Said Neural Networks Aren't Linear?

Authors: Nimrod Berman, Assaf Hallak, Assaf Shocher
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.08570
Pdf URL: https://arxiv.org/pdf/2510.08570
Copy Paste: [[2510.08570]] Who Said Neural Networks Aren't Linear?(https://arxiv.org/abs/2510.08570)
Keywords: generative
Abstract: Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, $f: X \to Y$. Is it possible to identify a pair of non-standard vector spaces for which a conventionally nonlinear function is, in fact, linear? This paper introduces a method that makes such vector spaces explicit by construction. We find that if we sandwich a linear operator $A$ between two invertible neural networks, $f(x)=g_y^{-1}(A g_x(x))$, then the corresponding vector spaces $X$ and $Y$ are induced by newly defined addition and scaling actions derived from $g_x$ and $g_y$. We term this kind of architecture a Linearizer. This framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models using our architecture makes the hundreds of sampling steps collapse into a single step. We further utilize our framework to enforce idempotency (i.e., $f(f(x))=f(x)$) on networks, leading to a globally projective generative model, and to demonstrate modular style transfer.
Summary: Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, $f: X \to Y$. Is it possible to identify a pair of non-standard vector spaces for which a conventionally nonlinear function is in fact linear? This paper introduces a method that makes such vector spaces explicit by construction. We find that if a linear operator $A$ is sandwiched between two invertible neural networks, $f(x)=g_y^{-1}(A g_x(x))$, then the corresponding vector spaces $X$ and $Y$ are induced by newly defined addition and scaling operations derived from $g_x$ and $g_y$. We call this kind of architecture a Linearizer. The framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection, and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models with our architecture collapses the hundreds of sampling steps into a single step. We further use the framework to enforce idempotency (i.e., $f(f(x))=f(x)$) on networks, yielding a globally projective generative model, and to demonstrate modular style transfer.
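The construction $f(x)=g_y^{-1}(A g_x(x))$ can be checked numerically on a toy case. The sketch below is not the paper's code. It swaps the invertible networks for simple elementwise bijections (`sinh` for $g_x$, cubing for $g_y$), and it assumes the induced operations take the natural pulled-back form, for example $x_1 \oplus x_2 = g_x^{-1}(g_x(x_1) + g_x(x_2))$. The abstract only says the operations are "derived from" $g_x$ and $g_y$.

```python
import numpy as np

# Hypothetical invertible maps standing in for the invertible networks.
def g_x(x): return np.sinh(x)
def g_x_inv(u): return np.arcsinh(u)
def g_y(y): return y ** 3
def g_y_inv(v): return np.cbrt(v)

# An arbitrary linear operator sandwiched between the two maps.
A = np.array([[2.0, -1.0],
              [0.5, 3.0]])

def f(x):
    # The Linearizer form from the abstract: f(x) = g_y^{-1}(A g_x(x)).
    return g_y_inv(A @ g_x(x))

# Assumed induced vector-space operations on X and Y.
def add_X(x1, x2): return g_x_inv(g_x(x1) + g_x(x2))
def scale_X(a, x): return g_x_inv(a * g_x(x))
def add_Y(y1, y2): return g_y_inv(g_y(y1) + g_y(y2))
def scale_Y(a, y): return g_y_inv(a * g_y(y))

x1 = np.array([0.3, -1.2])
x2 = np.array([1.5, 0.4])
a = -2.5

# f is additive and homogeneous with respect to the induced operations...
additive = np.allclose(f(add_X(x1, x2)), add_Y(f(x1), f(x2)))
homogeneous = np.allclose(f(scale_X(a, x1)), scale_Y(a, f(x1)))
# ...but not with respect to ordinary vector addition.
additive_in_standard_sense = np.allclose(f(x1 + x2), f(x1) + f(x2))
```

Both checks pass because $g_x$ and $g_y$ cancel against their inverses, leaving only the linear action of $A$. The same cancellation is why tools such as the SVD of $A$ carry over to $f$.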