2025-10-30

Title: Towards Fine-Grained Human Motion Video Captioning

Authors: Guorui Song, Guocun Wang, Zhe Huang, Jing Lin, Xuefei Zhe, Jian Li, Haoqian Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24767
Pdf URL: https://arxiv.org/pdf/2510.24767
Copy Paste: [[2510.24767]] Towards Fine-Grained Human Motion Video Captioning(https://arxiv.org/abs/2510.24767)
Keywords: generative
Abstract: Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.

Title: ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality

Authors: Mingzhi Zhu, Ding Shang, Sai Qian Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24787
Pdf URL: https://arxiv.org/pdf/2510.24787
Copy Paste: [[2510.24787]] ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality(https://arxiv.org/abs/2510.24787)
Keywords: generative
Abstract: Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to $+0.39$ over the best 4-bit baseline, delivers up to $3.36\times$ latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.
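
Illustrative note on the ESCA entry above: the abstract does not describe how its post-training quantization works, so the following is only a minimal sketch of generic symmetric uniform quantization to 4 bits, not ESCA's method. The function names and the toy weights are assumptions made for illustration.

```python
def quantize_symmetric(weights, num_bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4-bit signed values
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from integer levels."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to show the round trip.
weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
```

With a shared scale, each restored value differs from the original by at most half a quantization step, which is the error a tailored PTQ scheme would try to keep from degrading output quality.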

Title: SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing

Authors: Ruiyang Zhang, Jiahao Luo, Xiaoru Feng, Qiufan Pang, Yaodong Yang, Juntao Dai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24820
Pdf URL: https://arxiv.org/pdf/2510.24820
Copy Paste: [[2510.24820]] SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing(https://arxiv.org/abs/2510.24820)
Keywords: generation
Abstract: With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.

Title: Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Authors: Inclusion AI: Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianing Li, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jianping Jiang, Jun Peng, Kaixiang Ji, Kaimeng Ren, Libin Wang, Lixiang Ru, Longhua Tan, Lan Wang, Mochen Bai, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Tianqi Li, Tinghao Liu, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaolong Wang, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24821
Pdf URL: https://arxiv.org/pdf/2510.24821
Copy Paste: [[2510.24821]] Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation(https://arxiv.org/abs/2510.24821)
Keywords: generation, generative
Abstract: We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.
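
Illustrative note on the Ming-Flash-Omni entry above: the abstract reports 6.1 billion active parameters out of 100 billion but does not state the routing scheme. The sketch below assumes plain top-k softmax routing, a common Mixture-of-Experts design, to show how only a few experts run per token. The toy experts and router scores are hypothetical.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [(i, exps[i] / total) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Evaluate only the routed experts and mix their outputs."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Four toy "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(router_logits, k=2)
y = moe_forward(1.0, experts, router_logits, k=2)

# Share of parameters active per token, from the figures in the abstract.
active_fraction = 6.1 / 100
```

Under these figures roughly 6% of the parameters participate in each token's forward pass, which is where the efficiency claim comes from.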

Title: The Generation Phases of Flow Matching: a Denoising Perspective

Authors: Anne Gagneux, Ségolène Martin, Rémi Gribonval, Mathurin Massias
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.24830
Pdf URL: https://arxiv.org/pdf/2510.24830
Copy Paste: [[2510.24830]] The Generation Phases of Flow Matching: a Denoising Perspective(https://arxiv.org/abs/2510.24830)
Keywords: generation, generative
Abstract: Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
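
Illustrative note on the flow matching entry above: the abstract mentions formal connections between flow matching models and denoisers without giving them. One standard form of that connection, assuming the linear interpolation path (an assumption, since the abstract does not name the path), can be written as:

```latex
% Assumed linear path, with x_0 drawn from noise, x_1 from data, and t in [0, 1):
x_t = (1 - t)\,x_0 + t\,x_1
% Flow matching velocity and denoiser (posterior mean of the clean sample):
v(x_t, t) = \mathbb{E}[\,x_1 - x_0 \mid x_t\,], \qquad D(x_t, t) = \mathbb{E}[\,x_1 \mid x_t\,]
% Substituting x_0 = (x_t - t\,x_1)/(1 - t) gives x_1 - x_0 = (x_1 - x_t)/(1 - t), hence
v(x_t, t) = \frac{D(x_t, t) - x_t}{1 - t}
```

Under this assumption a velocity model and a denoiser carry the same information at each time t, which is what makes comparing them on both generation and denoising meaningful.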

Title: VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos

Authors: Qiucheng Wu, Handong Zhao, Zhixin Shu, Jing Shi, Yang Zhang, Shiyu Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.24904
Pdf URL: https://arxiv.org/pdf/2510.24904
Copy Paste: [[2510.24904]] VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos(https://arxiv.org/abs/2510.24904)
Keywords: generative
Abstract: Although recent text-to-video generative models are becoming more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which is crucial in creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, removing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolate camera motion learning from synthetic appearance artifacts, ensuring more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found at this https URL.

Title: Resource-Efficient and Robust Inference of Deep and Bayesian Neural Networks on Embedded and Analog Computing Platforms

Authors: Bernhard Klein
Subjects: cs.LG, cs.AR, cs.NE
Abstract URL: https://arxiv.org/abs/2510.24951
Pdf URL: https://arxiv.org/pdf/2510.24951
Copy Paste: [[2510.24951]] Resource-Efficient and Robust Inference of Deep and Bayesian Neural Networks on Embedded and Analog Computing Platforms(https://arxiv.org/abs/2510.24951)
Keywords: generation
Abstract: While modern machine learning has transformed numerous application domains, its growing computational demands increasingly constrain scalability and efficiency, particularly on embedded and resource-limited platforms. In practice, neural networks must not only operate efficiently but also provide reliable predictions under distributional shifts or unseen data. Bayesian neural networks offer a principled framework for quantifying uncertainty, yet their computational overhead further compounds these challenges. This work advances resource-efficient and robust inference for both conventional and Bayesian neural networks through the joint pursuit of algorithmic and hardware efficiency. The former reduces computation through model compression and approximate Bayesian inference, while the latter optimizes deployment on digital accelerators and explores analog hardware, bridging algorithmic design and physical realization. The first contribution, Galen, performs automatic layer-specific compression guided by sensitivity analysis and hardware-in-the-loop feedback. Analog accelerators offer efficiency gains at the cost of noise; this work models device imperfections and extends noisy training to nonstationary conditions, improving robustness and stability. A second line of work advances probabilistic inference, developing analytic and ensemble approximations that replace costly sampling, integrate into a compiler stack, and optimize embedded inference. Finally, probabilistic photonic computing introduces a paradigm where controlled analog noise acts as an intrinsic entropy source, enabling fast, energy-efficient probabilistic inference directly in hardware. Together, these studies demonstrate how efficiency and reliability can be advanced jointly through algorithm-hardware co-design, laying the foundation for the next generation of trustworthy, energy-efficient machine-learning systems.

Title: Sequences of Logits Reveal the Low Rank Structure of Language Models

Authors: Noah Golowich, Allen Liu, Abhishek Shetty
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2510.24966
Pdf URL: https://arxiv.org/pdf/2510.24966
Copy Paste: [[2510.24966]] Sequences of Logits Reveal the Low Rank Structure of Language Models(https://arxiv.org/abs/2510.24966)
Keywords: generation
Abstract: A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation -- in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
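
Illustrative note on the low-rank logits entry above: the abstract describes matrices of logits, indexed by prompts and responses, that have low approximate rank. The toy sketch below builds a small matrix whose rows are combinations of two basis vectors and counts its numerical rank. It uses Gaussian elimination with a tolerance as a stdlib-only stand-in. A real measurement of approximate rank would more likely threshold singular values, and all values here are hypothetical.

```python
def numerical_rank(matrix, tol=1e-9):
    """Count pivots found by Gaussian elimination with partial pivoting."""
    rows = [[float(v) for v in r] for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # column is (numerically) dependent on earlier pivots
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


# Two hypothetical basis "logit directions" over a 5-token vocabulary.
basis = [[1, 0, 2, -1, 3], [0, 1, 1, 2, -2]]
# Each prompt's logit row is a linear combination of the two directions.
coeffs = [(1, 0), (0, 1), (2, 1), (-1, 3)]
logits = [[a * u + b * v for u, v in zip(*basis)] for a, b in coeffs]
rank = numerical_rank(logits)
```

Here four prompts over five tokens yield a rank-2 matrix, so any row can be written as a linear combination of two others, which is the toy analogue of generating a response from the model's outputs on unrelated prompts.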

Title: PSTF-AttControl: Per-Subject-Tuning-Free Personalized Image Generation with Controllable Face Attributes

Authors: Xiang Liu, Zhaoxiang Liu, Huan Hu, Zipeng Wang, Ping Chen, Zezhou Chen, Kai Wang, Shiguo Lian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25084
Pdf URL: https://arxiv.org/pdf/2510.25084
Copy Paste: [[2510.25084]] PSTF-AttControl: Per-Subject-Tuning-Free Personalized Image Generation with Controllable Face Attributes(https://arxiv.org/abs/2510.25084)
Keywords: generation
Abstract: Recent advancements in personalized image generation have significantly improved facial identity preservation, particularly in fields such as entertainment and social media. However, existing methods still struggle to achieve precise control over facial attributes in a per-subject-tuning-free (PSTF) way. Tuning-based techniques like PreciseControl have shown promise by providing fine-grained control over facial features, but they often require extensive technical expertise and additional training data, limiting their accessibility. In contrast, PSTF approaches simplify the process by enabling image generation from a single facial input, but they lack precise control over facial attributes. In this paper, we introduce a novel PSTF method that enables both precise control over facial attributes and high-fidelity preservation of facial identity. Our approach utilizes a face recognition model to extract facial identity features, which are then mapped into the $W^+$ latent space of StyleGAN2 using the e4e encoder. We further enhance the model with a Triplet-Decoupled Cross-Attention module, which integrates facial identity, attribute features, and text embeddings into the UNet architecture, ensuring clean separation of identity and attribute information. Trained on the FFHQ dataset, our method allows for the generation of personalized images with fine-grained control over facial attributes, without requiring additional fine-tuning or training data for individual identities. We demonstrate that our approach successfully balances personalization with precise facial attribute control, offering a more efficient and user-friendly solution for high-quality, adaptable facial image synthesis. The code is publicly available at this https URL.

Title: Continual Low-Rank Adapters for LLM-based Generative Recommender Systems

Authors: Hyunsik Yoo, Ting-Wei Li, SeongKu Kang, Zhining Liu, Charlie Xu, Qilin Qi, Hanghang Tong
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2510.25093
Pdf URL: https://arxiv.org/pdf/2510.25093
Copy Paste: [[2510.25093]] Continual Low-Rank Adapters for LLM-based Generative Recommender Systems(https://arxiv.org/abs/2510.25093)
Keywords: generative
Abstract: While large language models (LLMs) achieve strong performance in recommendation, they face challenges in continual learning as users, items, and user preferences evolve over time. Existing LoRA-based continual methods primarily focus on preserving performance on previous tasks, but this overlooks the unique nature of recommendation: the goal is not to predict past preferences, and outdated preferences can even harm performance when current interests shift significantly. To address this, we propose PESO (Proximally rEgularized Single evolving lOra), a continual adaptation method for LoRA in recommendation. PESO introduces a proximal regularizer that anchors the current adapter to its most recent frozen state, enabling the model to flexibly balance adaptation and preservation, and to better capture recent user behaviors. Theoretically, we show that this proximal design provides data-aware, direction-wise guidance in the LoRA subspace. Empirically, PESO consistently outperforms existing LoRA-based continual learning methods.
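
Illustrative note on the PESO entry above: the abstract says a proximal regularizer anchors the current adapter to its most recent frozen state but does not give its form. The sketch below assumes the simplest choice, a squared L2 distance to the frozen weights. The paper's regularizer may differ, and all names and numbers are hypothetical.

```python
def proximal_penalty(current, anchor, strength):
    """0.5 * strength * squared L2 distance to the frozen adapter weights."""
    return 0.5 * strength * sum((c - a) ** 2 for c, a in zip(current, anchor))


def regularized_loss(task_loss, current, anchor, strength):
    """Task loss on new data plus the pull back toward the previous adapter."""
    return task_loss + proximal_penalty(current, anchor, strength)


def proximal_grad(current, anchor, strength):
    """Gradient of the penalty: it points away from the anchor."""
    return [strength * (c - a) for c, a in zip(current, anchor)]


# Flattened adapter weights: frozen snapshot vs. the copy being trained.
anchor = [1.0, -2.0]
current = [1.5, -1.0]
loss = regularized_loss(task_loss=0.4, current=current, anchor=anchor, strength=0.2)
grad = proximal_grad(current, anchor, strength=0.2)
```

A larger `strength` favors preservation of the previous adapter and a smaller one favors adaptation to recent behavior, which is the balance the abstract describes.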

Title: The Neural Differential Manifold: An Architecture with Explicit Geometric Structure

Authors: Di Zhang
Subjects: cs.LG, cs.AI, math.DG, math.OC
Abstract URL: https://arxiv.org/abs/2510.25113
Pdf URL: https://arxiv.org/pdf/2510.25113
Copy Paste: [[2510.25113]] The Neural Differential Manifold: An Architecture with Explicit Geometric Structure(https://arxiv.org/abs/2510.25113)
Keywords: generative
Abstract: This paper introduces the Neural Differential Manifold (NDM), a novel neural network architecture that explicitly incorporates geometric structure into its fundamental design. Departing from conventional Euclidean parameter spaces, the NDM re-conceptualizes a neural network as a differentiable manifold where each layer functions as a local coordinate chart, and the network parameters directly parameterize a Riemannian metric tensor at every point. The architecture is organized into three synergistic layers: a Coordinate Layer implementing smooth chart transitions via invertible transformations inspired by normalizing flows, a Geometric Layer that dynamically generates the manifold's metric through auxiliary sub-networks, and an Evolution Layer that optimizes both task performance and geometric simplicity through a dual-objective loss function. This geometric regularization penalizes excessive curvature and volume distortion, providing intrinsic regularization that enhances generalization and robustness. The framework enables natural gradient descent optimization aligned with the learned manifold geometry and offers unprecedented interpretability by endowing internal representations with clear geometric meaning. We analyze the theoretical advantages of this approach, including its potential for more efficient optimization, enhanced continual learning, and applications in scientific discovery and controllable generative modeling. While significant computational challenges remain, the Neural Differential Manifold represents a fundamental shift towards geometrically structured, interpretable, and efficient deep learning systems.

Title: Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective

Authors: Wan Jiang, Jing Yan, Ruixuan Zhang, Xiaojing Chen, Changtao Miao, Zhe Li, Chenhao Lin, Yunfeng Diao, Richang Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25141
Pdf URL: https://arxiv.org/pdf/2510.25141
Copy Paste: [[2510.25141]] Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective(https://arxiv.org/abs/2510.25141)
Keywords: generative
Abstract: The rise of generative Artificial Intelligence (AI) has made detecting AI-generated images a critical challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and rely on empirical heuristics, limiting interpretability and reliability. In this paper, we introduce the Jacobian-Spectral Lower Bound for reconstruction error from a geometric perspective, showing that real images off the reconstruction manifold exhibit a non-trivial error lower bound, while generated images on the manifold have near-zero error. Furthermore, we reveal the limitations of existing methods that rely on static reconstruction error from a single pass. These methods often fail when some real images exhibit lower error than generated ones. This counterintuitive behavior reduces detection accuracy and requires data-specific threshold tuning, limiting their applicability in real-world scenarios. To address these challenges, we propose ReGap, a training-free method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations. This enables measuring error changes before and after editing, improving detection accuracy by enhancing error separation. Experimental results show that our method outperforms existing baselines, exhibits robustness to common post-processing operations and generalizes effectively across diverse conditions.
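
Illustrative note on the ReGap entry above: the abstract contrasts a static reconstruction error from a single pass with a dynamic error measured before and after a controlled edit. The sketch below shows only that bookkeeping. The reconstruction and edit functions are toy placeholders (a projection onto one axis and a fixed perturbation), not the paper's models or editing operations.

```python
import math


def l2_error(x, y):
    """Euclidean distance between an input and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dynamic_reconstruction_error(image, reconstruct, edit):
    """Return (error before edit, error after edit, change in error)."""
    before = l2_error(image, reconstruct(image))
    edited = edit(image)
    after = l2_error(edited, reconstruct(edited))
    return before, after, after - before


def toy_reconstruct(x):
    # Hypothetical "manifold": points whose second coordinate is zero.
    return [x[0], 0.0]


def toy_edit(x):
    # Hypothetical controlled perturbation.
    return [x[0] + 0.5, 0.5 * x[1]]


before, after, delta = dynamic_reconstruction_error(
    [1.0, 0.5], toy_reconstruct, toy_edit)
```

A detector in this style would threshold `delta` rather than `before`. How the two classes separate in practice is a result of the paper and is not reproduced by this toy.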

Title: EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Authors: Xiaoyu Zhou, Jingqi Wang, Yuang Jia, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25146
Pdf URL: https://arxiv.org/pdf/2510.25146
Copy Paste: [[2510.25146]] EA3D: Online Open-World 3D Object Extraction from Streaming Videos(https://arxiv.org/abs/2510.25146)
Keywords: generation
Abstract: Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.

Title: Machine Learning Guided Optimal Transmission Switching to Mitigate Wildfire Ignition Risk

Authors: Weimin Huang, Ryan Piansky, Bistra Dilkina, Daniel K. Molzahn
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2510.25147
Pdf URL: https://arxiv.org/pdf/2510.25147
Copy Paste: [[2510.25147]] Machine Learning Guided Optimal Transmission Switching to Mitigate Wildfire Ignition Risk(https://arxiv.org/abs/2510.25147)
Keywords: generation
Abstract: To mitigate acute wildfire ignition risks, utilities de-energize power lines in high-risk areas. The Optimal Power Shutoff (OPS) problem optimizes line energization statuses to manage wildfire ignition risks through de-energizations while reducing load shedding. OPS problems are computationally challenging Mixed-Integer Linear Programs (MILPs) that must be solved rapidly and frequently in operational settings. For a particular power system, OPS instances share a common structure with varying parameters related to wildfire risks, loads, and renewable generation. This motivates the use of Machine Learning (ML) for solving OPS problems by exploiting shared patterns across instances. In this paper, we develop an ML-guided framework that quickly produces high-quality de-energization decisions by extending existing ML-guided MILP solution methods while integrating domain knowledge on the number of energized and de-energized lines. Results on a large-scale realistic California-based synthetic test system show that the proposed ML-guided method produces high-quality solutions faster than traditional optimization methods.

Title: Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation

Authors: Wenhao Zheng, Chenwei Sun, Wenbo Zhang, Jiancheng Lv, Xianggen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25163
Pdf URL: https://arxiv.org/pdf/2510.25163
Copy Paste: [[2510.25163]] Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation(https://arxiv.org/abs/2510.25163)
Keywords: generation, generative
Abstract: Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at this https URL.
摘要：深度生成模型，例如扩散模型，通过简化的连续性假设在图像生成和音频生成方面显示出了有希望的进展。然而，由于解决远程约束和参数敏感性方面的挑战，用于生成多模态数据（例如参数化 CAD 序列）的生成建模技术的发展仍然落后。在这项工作中，我们提出了一种用于定量约束 CAD 生成的新颖框架，称为目标引导贝叶斯流网络 (TGBFN)。 TGBFN 首次在统一的连续且可微的参数空间而不是离散数据空间中处理 CAD 序列的多模态（即离散命令和连续参数）。此外，TGBFN 渗透参数更新内核并引入引导贝叶斯流来控制 CAD 属性。为了评估 TGBFN，我们构建了一个用于定量约束 CAD 生成的新数据集。对单条件和多条件约束生成任务的广泛比较表明，TGBFN 在生成高保真、条件感知的 CAD 序列方面实现了最先进的性能。该代码可从此 https URL 获取。

Title: Balanced conic rectified flow

Authors: Kim Shin Seong, Mingi Kwon, Jaeseok Jeong, Youngjung Uh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25229
Pdf URL: https://arxiv.org/pdf/2510.25229
Copy Paste: [[2510.25229]] Balanced conic rectified flow(https://arxiv.org/abs/2510.25229)
Keywords: generation, generative
Abstract: Rectified flow is a generative model that learns smooth transport mappings between two distributions through an ordinary differential equation (ODE). Unlike diffusion-based generative models, which require costly numerical integration of a generative ODE to sample images with state-of-the-art quality, rectified flow uses an iterative process called reflow to learn smooth and straight ODE paths. This allows for relatively simple and efficient generation of high-quality images. However, rectified flow still faces several challenges. 1) The reflow process requires a large number of generative pairs to preserve the target distribution, leading to significant computational costs. 2) Since the model is typically trained using only generated image pairs, its performance heavily depends on the 1-rectified flow model, causing it to become biased towards the generated data. In this work, we experimentally expose the limitations of the original rectified flow and propose a novel approach that incorporates real images into the training process. By preserving the ODE paths for real images, our method effectively reduces reliance on large amounts of generated data. Instead, we demonstrate that the reflow process can be conducted efficiently using a much smaller set of generated and real images. In CIFAR-10, we achieved significantly better FID scores, not only in one-step generation but also in full-step simulations, while using only of the generative pairs compared to the original method. Furthermore, our approach induces straighter paths and avoids saturation on generated images during reflow, leading to more robust ODE learning while preserving the distribution of real images.
摘要：整流流是一种生成模型，通过常微分方程 (ODE) 学习两个分布之间的平滑传输映射。基于扩散的生成模型需要对生成 ODE 进行昂贵的数值积分才能以最先进的质量对图像进行采样，与此不同的是，整流流使用称为回流的迭代过程来学习平滑且笔直的 ODE 路径。这允许相对简单且高效地生成高质量图像。然而，整流流仍面临一些挑战。 1）回流过程需要大量的生成对来保留目标分布，从而导致巨大的计算成本。 2) 由于该模型通常仅使用生成的图像对进行训练，因此其性能很大程度上取决于 1 校正流模型，导致其偏向于生成的数据。在这项工作中，我们通过实验暴露了原始校正流程的局限性，并提出了一种将真实图像纳入训练过程的新颖方法。通过保留真实图像的 ODE 路径，我们的方法有效地减少了对大量生成数据的依赖。相反，我们证明可以使用小得多的生成图像和真实图像来有效地进行回流过程。在 CIFAR-10 中，与原始方法相比，我们不仅在一步生成中而且在全步模拟中都取得了明显更好的 FID 分数，同时仅使用生成对。此外，我们的方法引入更直的路径并避免回流期间生成的图像饱和，从而在保留真实图像的分布的同时实现更稳健的 ODE 学习。
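
The straight-path idea behind rectified flow can be illustrated with a toy example. The sketch below is not the paper's method: it only shows the standard linear interpolation between a noise sample and a data sample, the corresponding velocity target, and a fixed-step Euler sampler. The constant-shift coupling and all function names are assumptions chosen so that the ideal velocity field is known in closed form.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return (1.0 - t) * x0 + t * x1

def velocity_target(x0, x1):
    """Regression target for the velocity field along a straight path."""
    return x1 - x0

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Toy coupling (an assumption): each data point is its noise sample plus a
# constant shift, so the ideal velocity is constant and paths are straight.
shift = np.array([2.0, -1.0])
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 2))
x1 = x0 + shift

def constant_velocity(x, t):
    return np.broadcast_to(shift, x.shape)

one_step = euler_sample(x0, constant_velocity, num_steps=1)
many_steps = euler_sample(x0, constant_velocity, num_steps=100)
```

When the paths are exactly straight, as in this toy case, a single Euler step lands on the same point as a fine-grained integration, which is why straighter paths make one-step generation viable.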

Title: DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

Authors: Yinqi Cai, Jichang Li, Zhaolun Li, Weikai Chen, Rushi Lan, Xi Xie, Xiaonan Luo, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25237
Pdf URL: https://arxiv.org/pdf/2510.25237
Copy Paste: [[2510.25237]] DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis(https://arxiv.org/abs/2510.25237)
Keywords: generation, generative
Abstract: Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.
摘要：深度生成模型的最新进展使得操纵面部视频变得更加容易，这引发了人们对其潜在的欺诈和错误信息滥用的严重担忧。现有的检测器通常在域内场景中表现良好，但由于依赖于伪造特定的工件，因此无法泛化不同的操作技术。在这项工作中，我们引入了 DeepShield，这是一种新颖的深度伪造检测框架，可以平衡局部敏感性和全局泛化，以提高针对看不见的伪造品的鲁棒性。 DeepShield 通过两个关键组件增强了 CLIP-ViT 编码器：本地补丁指导 (LPG) 和全局伪造多样化 (GFD)。 LPG 应用时空工件建模和补丁式监督来捕获经常被全局模型忽视的细粒度不一致。 GFD 引入了域特征增强，利用域桥接和边界扩展特征生成来合成不同的伪造品，减轻过度拟合并增强跨域适应性。通过集成用于深度伪造检测的新颖的本地和全局分析，DeepShield 在跨数据集和交叉操作评估方面优于最先进的方法，实现了针对看不见的深度伪造攻击的卓越鲁棒性。

Title: VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations

Authors: Qianqian Qiao, DanDan Zheng, Yihang Bo, Bao Peng, Heng Huang, Longteng Jiang, Huaye Wang, Jingdong Chen, Jun Zhou, Xin Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25238
Pdf URL: https://arxiv.org/pdf/2510.25238
Copy Paste: [[2510.25238]] VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations(https://arxiv.org/abs/2510.25238)
Keywords: quality assessment
Abstract: Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at this https URL.
摘要：视频审美评估是多媒体计算的一个重要领域，它将计算机视觉与人类认知相结合。由于缺乏标准化数据集和鲁棒模型，其进展受到限制，因为视频的时间动态和多模态融合挑战阻碍了基于图像的方法的直接应用。本研究引入了 VADB，这是最大的视频美学数据库，拥有 10,490 个不同的视频，由 37 位专业人士在多个美学维度进行注释，包括整体和特定属性的美学评分、丰富的语言评论和客观标签。我们提出了 VADB-Net，一种具有两阶段训练策略的双模态预训练框架，它在评分任务上优于现有的视频质量评估模型，并支持下游视频美学评估任务。数据集和源代码可从此 https URL 获取。

Title: Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation

Authors: Yuyang Huang, Yabo Chen, Junyu Zhou, Wenrui Dai, Xiaopeng Zhang, Junni Zou, Hongkai Xiong, Qi Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25279
Pdf URL: https://arxiv.org/pdf/2510.25279
Copy Paste: [[2510.25279]] Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation(https://arxiv.org/abs/2510.25279)
Keywords: generation
Abstract: Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales. Remarkably, DPTM can significantly enhance the performance by up to 18.6% in scenarios with large source-target gaps.
摘要：无源域适应（SFDA）是一项具有挑战性的任务，它仅使用预先训练的源模型和未标记的目标数据来解决域转移。现有的 SFDA 方法受到源-目标域差异的根本限制。非生成 SFDA 方法在具有较大域差异的挑战性场景中会遇到不可靠的伪标签，而基于生成的 SFDA 方法由于创建伪源数据时域差异扩大而明显退化。为了解决这一限制，我们提出了一种名为扩散驱动渐进目标操纵（DPTM）的新型基于生成的框架，该框架利用未标记的目标数据作为参考来可靠地生成并逐步完善 SFDA 的伪目标域。具体来说，我们根据伪标签的可靠性将目标样本分为信任集和非信任集，以充分可靠地利用其信息。对于来自非信任集中的样本，我们开发了一种操作策略，将它们在语义上转换为新分配的类别，同时通过潜在扩散模型将它们保持在目标分布中。此外，我们设计了一种渐进细化机制，通过迭代细化逐步减少伪目标域和真实目标域之间的域差异。实验结果表明，DPTM 大幅优于现有方法，并在四个不同规模的主流 SFDA 基准数据集上实现了最先进的性能。值得注意的是，在源目标差距较大的场景下，DPTM 可以显着提升性能高达 18.6%。
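
The trust / non-trust split described above can be sketched as follows. The abstract does not say how pseudo-label reliability is measured, so this example assumes a simple criterion (maximum softmax probability against a fixed threshold); the function name and the threshold value are hypothetical.

```python
import numpy as np

def split_by_confidence(probs, threshold=0.9):
    """Split sample indices by the maximum predicted class probability.

    The confidence criterion is an assumption, not the paper's rule.
    """
    probs = np.asarray(probs, dtype=float)
    pseudo_labels = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    trust_idx = np.flatnonzero(confidence >= threshold)
    non_trust_idx = np.flatnonzero(confidence < threshold)
    return pseudo_labels, trust_idx, non_trust_idx

# Source-model predictions for three unlabeled target samples.
probs = np.array([[0.95, 0.03, 0.02],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.92, 0.03]])
labels, trust, non_trust = split_by_confidence(probs)
```

In this sketch, samples 0 and 2 fall into the trust set and sample 1 into the non-trust set, which is the group the abstract says is subsequently manipulated with a latent diffusion model.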

Title: Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design

Authors: Zongxi Yu, Xiaolong Qian, Shaohua Gao, Qi Jiang, Yao Gao, Kailun Yang, Kaiwei Wang
Subjects: cs.CV, cs.RO, eess.IV, physics.optics
Abstract URL: https://arxiv.org/abs/2510.25314
Pdf URL: https://arxiv.org/pdf/2510.25314
Copy Paste: [[2510.25314]] Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design(https://arxiv.org/abs/2510.25314)
Keywords: restoration
Abstract: Achieving high-fidelity, compact RGBD imaging presents a dual challenge: conventional compact optics struggle with RGB sharpness across the entire depth-of-field, while software-only Monocular Depth Estimation (MDE) is an ill-posed problem reliant on unreliable semantic priors. While deep optics with elements like DOEs can encode depth, they introduce trade-offs in fabrication complexity and chromatic aberrations, compromising simplicity. To address this, we first introduce a novel bio-inspired all-spherical monocentric lens, around which we build the Bionic Monocentric Imaging (BMI) framework, a holistic co-design. This optical design naturally encodes depth into its depth-varying Point Spread Functions (PSFs) without requiring complex diffractive or freeform elements. We establish a rigorous physically-based forward model to generate a synthetic dataset by precisely simulating the optical degradation process. This simulation pipeline is co-designed with a dual-head, multi-scale reconstruction network that employs a shared encoder to jointly recover a high-fidelity All-in-Focus (AiF) image and a precise depth map from a single coded capture. Extensive experiments validate the state-of-the-art performance of the proposed framework. In depth estimation, the method attains an Abs Rel of 0.026 and an RMSE of 0.130, markedly outperforming leading software-only approaches and other deep optics systems. For image restoration, the system achieves an SSIM of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior balance between image fidelity and depth accuracy. This study illustrates that the integration of bio-inspired, fully spherical optics with a joint reconstruction algorithm constitutes an effective strategy for addressing the intrinsic challenges in high-performance compact RGBD imaging. Source code will be publicly available at this https URL.
摘要：实现高保真、紧凑的 RGBD 成像面临双重挑战：传统的紧凑型光学器件在整个景深范围内与 RGB 清晰度作斗争，而纯软件的单目深度估计 (MDE) 是一个依赖于不可靠的语义先验的不适定问题。虽然具有 DOE 等元件的深度光学器件可以对深度进行编码，但它们会在制造复杂性和色差方面进行权衡，从而损害了简单性。为了解决这个问题，我们首先引入一种新颖的仿生全球面单中心镜头，围绕该镜头我们构建了仿生单中心成像（BMI）框架，这是一个整体协同设计。这种光学设计自然地将深度编码到其深度变化的点扩散函数 (PSF) 中，而不需要复杂的衍射或自由形状元件。我们建立了严格的基于物理的正向模型，通过精确模拟光学退化过程来生成合成数据集。该模拟管道与双头、多尺度重建网络共同设计，该网络采用共享编码器从单个编码捕获中联合恢复高保真全焦点 (AiF) 图像和精确的深度图。大量的实验验证了所提出的框架的最先进的性能。在深度估计中，该方法的 Abs Rel 为 0.026，RMSE 为 0.130，明显优于领先的纯软件方法和其他深度光学系统。对于图像恢复，该系统实现了 0.960 的 SSIM 和 0.082 的感知 LPIPS 分数，从而确认了图像保真度和深度精度之间的卓越平衡。这项研究表明，仿生全球面光学器件与联合重建算法的集成构成了解决高性能紧凑型 RGBD 成像固有挑战的有效策略。源代码将在此 https URL 上公开提供。
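
For readers unfamiliar with the depth metrics quoted above, here is a minimal sketch of the commonly used definitions of absolute relative error (Abs Rel) and root-mean-square error (RMSE). The abstract does not state the evaluation protocol (depth units, valid-pixel masking, depth range), so the masking of non-positive ground truth below is an assumption.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def rmse(pred, gt):
    """Root-mean-square depth error over pixels with valid ground truth."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    valid = gt > 0
    return float(np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2)))

# The last pixel has no ground truth (0.0) and is ignored by both metrics.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.1, 1.8, 4.0, 3.0])
abs_rel_value = abs_rel(pred, gt)
rmse_value = rmse(pred, gt)
```

Abs Rel is scale-normalized, so it weights near and far errors comparably, while RMSE is in the same units as the depth map and is dominated by large absolute errors.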

Title: CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices

Authors: Xuchen Feng, Siyu Liao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.25323
Pdf URL: https://arxiv.org/pdf/2510.25323
Copy Paste: [[2510.25323]] CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices(https://arxiv.org/abs/2510.25323)
Keywords: generative
Abstract: Normalizing flows are deep generative models that enable efficient likelihood estimation and sampling through invertible transformations. A key challenge is to design linear layers that enhance expressiveness while maintaining efficient computation of the Jacobian determinant and inverse. We introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition reduces parameter complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$ using $m$ diagonal matrices and $m-1$ circulant matrices while still approximating general linear transformations. By leveraging the Fast Fourier Transform, our approach reduces the time complexity of matrix inversion from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn\log n)$ and that of computing the log-determinant from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn)$, where $n$ is the input dimension. We build upon this layer to develop Circulant-Diagonal Flow (CDFlow), which achieves strong density estimation on natural image datasets and effectively models data with inherent periodic structure. Furthermore, CDFlow significantly accelerates key operations in normalizing flows, providing practical benefits for scalable generative modeling.
摘要：归一化流是深度生成模型，可通过可逆变换实现有效的似然估计和采样。一个关键的挑战是设计线性层来增强表现力，同时保持雅可比行列式和逆矩阵的高效计算。我们引入了一种基于循环矩阵和对角矩阵乘积的新型可逆线性层。这种分解使用 $m$ 对角矩阵和 $m-1$ 循环矩阵将参数复杂度从 $\mathcal{O}(n^2)$ 降低到 $\mathcal{O}(mn)$，同时仍近似一般线性变换。通过利用快速傅里叶变换，我们的方法将矩阵求逆的时间复杂度从 $\mathcal{O}(n^3)$ 降低到 $\mathcal{O}(mn\log n)$，并将计算对数行列式的时间复杂度从 $\mathcal{O}(n^3)$ 降低到 $\mathcal{O}(mn)$，其中 $n$ 是输入维度。我们在此层的基础上开发了循环对角流（CDFlow），它可以对自然图像数据集实现强大的密度估计，并有效地对具有固有周期性结构的数据进行建模。此外，CDFlow 显着加速了标准化流程中的关键操作，为可扩展的生成建模提供了实际好处。
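
The reason circulant factors make inversion and log-determinants cheap is that a circulant matrix is diagonalized by the discrete Fourier transform. The sketch below illustrates this for a single diagonal-circulant-diagonal product; it is an illustration of the general principle, not the authors' layer. The ordering of the factors, the class name, and the initialization are assumptions, and the code makes no attempt to reproduce the complexities stated in the abstract.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column c by x, via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_solve(c, y):
    """Solve C x = y for circulant C with first column c, via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(y) / np.fft.fft(c)))

def circulant_logabsdet(c):
    """log|det C|: the eigenvalues of C are the DFT of its first column."""
    return float(np.sum(np.log(np.abs(np.fft.fft(c)))))

def dense_circulant(c):
    """Dense circulant matrix with first column c (used only for checking)."""
    return np.stack([np.roll(c, k) for k in range(len(c))], axis=1)

class DiagCirculantDiag:
    """Hypothetical layer computing y = D2 @ C @ D1 @ x."""

    def __init__(self, d1, c, d2):
        self.d1, self.c, self.d2 = d1, c, d2

    def forward(self, x):
        return self.d2 * circulant_matvec(self.c, self.d1 * x)

    def inverse(self, y):
        return circulant_solve(self.c, y / self.d2) / self.d1

    def logabsdet(self):
        diag_part = np.sum(np.log(np.abs(self.d1))) + np.sum(np.log(np.abs(self.d2)))
        return float(diag_part) + circulant_logabsdet(self.c)

rng = np.random.default_rng(0)
n = 8
c = 0.1 * rng.normal(size=n)
c[0] += 2.0                              # keeps the DFT of c away from zero
d1 = np.exp(0.3 * rng.normal(size=n))    # strictly positive diagonals
d2 = np.exp(0.3 * rng.normal(size=n))
layer = DiagCirculantDiag(d1, c, d2)
x = rng.normal(size=n)
y = layer.forward(x)
x_recovered = layer.inverse(y)
```

The layer never materializes the n-by-n matrix: forward, inverse, and log-determinant all work on length-n vectors, with the dense matrix built only to verify the result.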

Title: StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

Authors: Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25332
Pdf URL: https://arxiv.org/pdf/2510.25332
Copy Paste: [[2510.25332]] StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA(https://arxiv.org/abs/2510.25332)
Keywords: generation
Abstract: The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at this https URL.
摘要：流视频应用的快速增长需要具有增强的时间动态理解和复杂推理能力的多模态模型。然而，当前的视频问答（VideoQA）数据集面临两个关键限制：1）静态注释机制无法捕获时间视频流中答案的演变性质，2）缺乏明确的推理过程注释限制了模型的可解释性和逻辑推导能力。为了应对这些挑战，我们引入了 StreamingCoT，这是第一个专门为流式 VideoQA 和多模式思想链 (CoT) 任务中的时间演化推理而设计的数据集。我们的框架首先建立一个动态的分层注释架构，该架构生成每秒的密集描述，并通过相似性融合构建时间相关的语义片段，并与受时间演化模式约束的问题答案集配对。我们进一步提出了一种显式推理链生成范式，该范式通过关键帧语义对齐提取时空对象，使用大型语言模型导出基于对象状态转换的推理路径，并通过人工验证的验证确保逻辑连贯性。该数据集为推进流视频理解、复杂时间推理和多模态推理方面的研究奠定了基础。我们的 StreamingCoT 及其构建工具包可以通过此 https URL 访问。
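
One plausible reading of "similarity fusion" over per-second descriptions is to merge consecutive seconds whose description embeddings are similar. The sketch below assumes exactly that: cosine similarity between neighbouring embeddings and a fixed threshold. The function name, the threshold, and the toy 2-D embeddings are hypothetical, and the actual construction toolkit may work differently.

```python
import numpy as np

def fuse_segments(embeddings, threshold=0.8):
    """Group consecutive per-second embeddings into segments.

    A new segment starts whenever the cosine similarity between neighbouring
    seconds drops below `threshold`. Returns (start, end) pairs, end inclusive.
    Assumes at least one embedding and no all-zero rows.
    """
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    segments, start = [], 0
    for i, sim in enumerate(sims):
        if sim < threshold:
            segments.append((start, i))
            start = i + 1
    segments.append((start, len(emb) - 1))
    return segments

# Five seconds of video: the scene changes between second 1 and second 2.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]])
segments = fuse_segments(emb)
```

Here the five seconds collapse into two segments, covering seconds 0 to 1 and 2 to 4.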

Title: More than a Moment: Towards Coherent Sequences of Audio Descriptions

Authors: Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2510.25440
Pdf URL: https://arxiv.org/pdf/2510.25440
Copy Paste: [[2510.25440]] More than a Moment: Towards Coherent Sequences of Audio Descriptions(https://arxiv.org/abs/2510.25440)
Keywords: generation
Abstract: Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
摘要：音频描述 (AD) 传达屏幕上的重要信息，使视障观众能够跟上视频内容。为了有效，AD 必须形成一个连贯的序列，帮助听众可视化正在展开的场景，而不是描述孤立的时刻。然而，大多数自动方法独立生成每个 AD，通常会导致重复、不连贯的描述。为了解决这个问题，我们提出了一种免训练方法 CoherentAD，它首先为每个 AD 时间间隔生成多个候选描述，然后在整个序列中执行自回归选择，以形成连贯且信息丰富的叙述。为了全面评估 AD 序列，我们引入了序列级指标 StoryRecall，它可以衡量预测的 AD 传达真实叙述的效果，以及捕获连续 AD 输出之间的冗余的重复指标。我们的方法产生了连贯的 AD 序列，增强了叙事理解，优于依赖独立生成的先前方法。
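
The two-stage procedure described above (generate several candidates per interval, then select autoregressively) can be sketched with a greedy loop. The abstract does not say how candidates are scored, so the word-novelty score below is a stand-in assumption, as are the function names and the example sentences.

```python
def novelty(candidate, history):
    """Fraction of the candidate's words not already used in earlier selections.

    Hypothetical scoring function, used here only to penalize repetition.
    """
    words = candidate.lower().split()
    seen = {w for h in history for w in h.lower().split()}
    return sum(w not in seen for w in words) / max(len(words), 1)

def select_sequence(candidates_per_interval, score=novelty):
    """Pick one candidate per interval, conditioning on all earlier picks."""
    selected = []
    for candidates in candidates_per_interval:
        best = max(candidates, key=lambda c: score(c, selected))
        selected.append(best)
    return selected

candidates = [
    ["A man enters the kitchen.", "A man walks in."],
    ["A man enters the kitchen.", "He opens the fridge and frowns."],
]
sequence = select_sequence(candidates)
```

Because each choice depends on what was already selected, the repeated first sentence is rejected for the second interval in favour of the candidate that advances the narrative.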

Title: TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting

Authors: Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, Frank Hutter
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2510.25502
Pdf URL: https://arxiv.org/pdf/2510.25502
Copy Paste: [[2510.25502]] TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting(https://arxiv.org/abs/2510.25502)
Keywords: generation
Abstract: Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
摘要：零样本时间序列预测的基础模型在高效的长期预测和可重复性方面面临挑战，现有的纯合成方法在具有挑战性的基准上表现不佳。本文提出了 TempoPFN，这是一种基于线性递归神经网络 (RNN) 的单变量时间序列基础模型，仅在合成数据上进行预训练。该模型使用具有状态编织功能的 GatedDeltaProduct 架构，可在序列长度上进行完全并行的训练，从而无需窗口或汇总技术，同时保持强大的时间状态跟踪。我们全面的合成数据管道通过新颖的增强功能统一了不同的生成器，包括随机微分方程、高斯过程和音频合成。在 Gift-Eval 基准的零样本评估中，TempoPFN 实现了顶级竞争性能，超越了所有现有的纯合成方法，并超越了绝大多数基于真实世界数据训练的模型，同时通过利用完全并行的训练和推理，比现有基线更高效。我们开源完整的数据生成管道和训练代码，为未来的研究提供可重复的基础。

Title: RegionE: Adaptive Region-Aware Generation for Efficient Image Editing

Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Mingzhu Shen, Peng Ye, Bangyin Xiang, Zhibo Wang, Wei Cheng, Gang Yu, Tao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25590
Pdf URL: https://arxiv.org/pdf/2510.25590
Copy Paste: [[2510.25590]] RegionE: Adaptive Region-Aware Generation for Efficient Image Editing(https://arxiv.org/abs/2510.25590)
Keywords: generation
Abstract: Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose RegionE, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57, 2.41, and 2.06. Evaluations by GPT-4o confirmed that semantic and perceptual fidelity were well preserved.
摘要：近年来，基于指令的图像编辑（IIE）受到了广泛的关注。在实践中，IIE 通常只修改图像的特定区域，而其余区域基本上保持不变。尽管这两类区域在生成难度和计算冗余方面存在显着差异，但现有的 IIE 模型并未考虑这种区别，而是在整个图像上应用统一的生成过程。这促使我们提出 RegionE，一种自适应的、区域感知的生成框架，无需额外的训练即可加速 IIE 任务。具体来说，RegionE框架由三个主要组件组成：1）自适应区域划分。我们观察到未编辑区域的轨迹是直的，允许在单个步骤中推断出多步骤去噪预测。因此，在早期去噪阶段，我们根据最终估计结果与参考图像之间的差异将图像划分为编辑区域和未编辑区域。 2）区域感知生成。区分区域后，我们将多步去噪替换为对未编辑区域的一步预测。对于编辑区域，轨迹是弯曲的，需要局部迭代去噪。为了提高局部迭代生成的效率和质量，我们提出了区域指令KV缓存，它在合并全局信息的同时降低了计算成本。 3) 自适应速度衰减缓存。观察到编辑区域中的相邻时间步表现出很强的速度相似性，我们进一步提出了一种自适应速度衰减缓存来加速局部去噪过程。我们将 RegionE 应用于最先进的 IIE 基础模型，包括 Step1X-Edit、FLUX.1 Kontext 和 Qwen-Image-Edit。 RegionE 的加速因子分别为 2.57、2.41 和 2.06。 GPT-4o 的评估证实语义和感知保真度得到了很好的保留。
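
The first component can be illustrated on a toy array. The sketch assumes a rectified-flow style convention in which x_t = (1 - t) * x_clean + t * noise and the model predicts the velocity noise - x_clean; under that assumption a straight trajectory lets the final result be estimated in one step as x_t - t * v. The convention, the function names, the per-pixel threshold, and the use of an exact velocity instead of a model prediction are all assumptions for illustration.

```python
import numpy as np

def one_step_estimate(x_t, velocity, t):
    """Estimate the final clean result from a single velocity evaluation."""
    return x_t - t * velocity

def partition_regions(estimate, reference, threshold=0.1):
    """Boolean mask that is True where the estimate departs from the reference."""
    diff = np.abs(estimate - reference).mean(axis=-1)
    return diff > threshold

# Toy 4x4 RGB "image": the edit changes only the central 2x2 block.
reference = np.zeros((4, 4, 3))
target = reference.copy()
target[1:3, 1:3] = 1.0

rng = np.random.default_rng(0)
noise = rng.normal(size=target.shape)
t = 0.8                                  # an early, still noisy timestep
x_t = (1 - t) * target + t * noise
velocity = noise - target                # exact velocity for this toy case

estimate = one_step_estimate(x_t, velocity, t)
edited_mask = partition_regions(estimate, reference)
```

With the mask in hand, unedited positions could keep the one-step estimate while only the edited positions go through further iterative denoising, which is where the reported speedup would come from.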

Title: BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training

Authors: Mohammadreza Tavasoli Naeini, Ali Bereyhi, Morteza Noshad, Ben Liang, Alfred O. Hero III
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2510.25609
Pdf URL: https://arxiv.org/pdf/2510.25609
Copy Paste: [[2510.25609]] BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training(https://arxiv.org/abs/2510.25609)
Keywords: generation
Abstract: We introduce BOLT-GAN, a simple yet effective modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT). We show that with a Lipschitz continuous discriminator, BOLT-GAN implicitly minimizes a different metric distance than the Earth Mover (Wasserstein) distance and achieves better training stability. Empirical evaluations on four standard image generation benchmarks (CIFAR-10, CelebA-64, LSUN Bedroom-64, and LSUN Church-64) show that BOLT-GAN consistently outperforms WGAN, achieving 10-60% lower Frechet Inception Distance (FID). Our results suggest that BOLT is a broadly applicable principle for enhancing GAN training.
摘要：我们引入了 BOLT-GAN，这是受贝叶斯最佳学习阈值 (BOLT) 启发的 WGAN 框架的简单而有效的修改。我们证明，通过 Lipschitz 连续判别器，BOLT-GAN 隐式地最小化了与 Earth Mover (Wasserstein) 距离不同的度量距离，并实现了更好的训练稳定性。对四个标准图像生成基准（CIFAR-10、CelebA-64、LSUN Bedroom-64 和 LSUN Church-64）的实证评估表明，BOLT-GAN 的性能始终优于 WGAN，Frechet 起始距离 (FID) 降低了 10-60%。我们的结果表明，BOLT 是增强 GAN 训练的广泛适用的原则。

Title: Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation

Authors: Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.25739
Pdf URL: https://arxiv.org/pdf/2510.25739
Copy Paste: [[2510.25739]] Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation(https://arxiv.org/abs/2510.25739)
Keywords: generation
Abstract: Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
摘要：自回归 (AR) 图像生成模型能够生成高保真图像，但由于其固有的顺序、逐个令牌的解码过程，常常会受到推理缓慢的影响。推测性解码采用轻量级草稿模型来近似较大 AR 模型的输出，在不影响质量的情况下加速文本生成方面已显示出前景。然而，它在图像生成中的应用在很大程度上仍未得到充分探索。挑战源于明显更大的采样空间，这使得草稿和目标模型输出之间的对齐变得复杂，再加上图像固有的二维空间结构的使用不充分，从而限制了局部依赖性的建模。为了克服这些挑战，我们引入了 Hawk，这是一种利用图像的空间结构来指导推测模型进行更准确、更高效的预测的新方法。多个文本到图像基准的实验结果表明，与标准 AR 模型相比，速度提高了 1.71 倍，同时保留了图像保真度和多样性。
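
As background for the speculative decoding mentioned above, the sketch below shows the standard accept-or-resample rule for a single drafted token. It is the generic speculative sampling step, not Hawk's spatially guided variant, and the function name and toy distributions are assumptions.

```python
import numpy as np

def speculative_accept(p_target, q_draft, draft_token, rng):
    """Accept or replace one token drafted from q_draft.

    The token is accepted with probability min(1, p/q); otherwise a
    replacement is drawn from the normalized residual max(p - q, 0).
    Returns (token, accepted).
    """
    p = np.asarray(p_target, dtype=float)
    q = np.asarray(q_draft, dtype=float)
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return int(draft_token), True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])    # target model distribution
q = np.array([0.2, 0.5, 0.3])    # draft model distribution

draft = int(rng.choice(len(q), p=q))
token, accepted = speculative_accept(p, q, draft, rng)
```

The rule keeps the output distributed according to the target model regardless of draft quality; a poor draft only lowers the acceptance rate. The abstract's point about the larger sampling space for images corresponds to p and q agreeing less often, so fewer drafted tokens are accepted.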

Title: FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion

Authors: Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2510.25765
Pdf URL: https://arxiv.org/pdf/2510.25765
Copy Paste: [[2510.25765]] FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion(https://arxiv.org/abs/2510.25765)
Keywords: generation, generative
Abstract: Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.
摘要：铰接式 3D 对象是机器人、AR/VR 和动画领域许多应用的核心。最近对此类对象进行建模的方法要么依赖于需要密集视图监督的基于优化的重建管道，要么依赖于产生粗略几何近似且经常忽略表面纹理的前馈生成模型。相比之下，静态对象的开放世界 3D 生成取得了显着的成功，特别是随着 Trellis 等原生 3D 扩散模型的出现。然而，通过训练原生 3D 扩散模型将这些方法扩展到铰接物体面临着重大挑战。在这项工作中，我们提出了 FreeArt3D，这是一种用于生成铰接式 3D 对象的免训练框架。 FreeArt3D 不是在有限的铰接数据上训练新模型，而是将预先训练的静态 3D 扩散模型（例如，Trellis）重新用作强大的形状先验。通过将衔接视为额外的生成维度，它将分数蒸馏采样 (SDS) 扩展到 3D 到 4D 领域。给定在不同关节状态下捕获的一些图像，FreeArt3D 可以联合优化对象的几何形状、纹理和关节参数，而无需特定于任务的训练或访问大规模关节数据集。我们的方法生成高保真几何和纹理，准确预测底层运动结构，并可以很好地概括不同的对象类别。尽管遵循每个实例的优化范例，FreeArt3D 在几分钟内即可完成，并且在质量和多功能性方面显着优于先前最先进的方法。

Title: VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Authors: Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, Xu Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.25772
Pdf URL: https://arxiv.org/pdf/2510.25772
Copy Paste: [[2510.25772]] VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning(https://arxiv.org/abs/2510.25772)
Keywords: generation, generative
Abstract: Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential effect attributes, allowing a single unified model to master the effect imitation without information leakage. In addition, we propose an efficient one-shot effect adaptation mechanism that rapidly boosts generalization to difficult unseen effects from a single user-provided video. Extensive experiments demonstrate that our method effectively imitates various categories of effect information and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.
摘要：视觉效果 (VFX) 对于数字媒体的表达能力至关重要，但其创作仍然是生成人工智能的主要挑战。流行的方法通常依赖于每个效果一个 LoRA 范式，这种范式是资源密集型的，并且从根本上无法推广到看不见的效果，从而限制了可扩展性和创建。为了应对这一挑战，我们推出了 VFXMaster，这是第一个用于 VFX 视频生成的统一的、基于参考的框架。它将效果生成重新定义为上下文学习任务，使其能够将参考视频中的各种动态效果再现到目标内容上。此外，它还展示了对未见效应类别的显着泛化。具体来说，我们设计了一种上下文调节策略，通过参考示例提示模型。上下文注意掩模旨在精确解耦和注入基本效果属性，允许单个统一模型在不泄漏信息的情况下掌握效果模仿。此外，我们提出了一种有效的一次性效果适应机制，以快速提高对单个用户提供的视频中难以看到的效果的泛化能力。大量的实验表明，我们的方法有效地模拟了各种类别的效果信息，并对域外效果表现出出色的泛化能力。为了促进未来的研究，我们将向社区发布我们的代码、模型和综合数据集。