2024-12-04

Title: MALT: Improving Reasoning with Multi-Agent LLM Training

Authors: Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Markian Rybchuk, Philip H. S. Torr, Ivan Laptev, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01928
Pdf URL: https://arxiv.org/pdf/2412.01928
Copy Paste: [[2412.01928]] MALT: Improving Reasoning with Multi-Agent LLM Training(https://arxiv.org/abs/2412.01928)
Keywords: generation
Abstract: Enabling effective collaboration among LLMs is a crucial step toward developing autonomous systems capable of solving complex problems. While LLMs are typically used as single-model generators, where humans critique and refine their outputs, the potential for jointly-trained collaborative models remains largely unexplored. Despite promising results in multi-agent communication and debate settings, little progress has been made in training models to work together on tasks. In this paper, we present a first step toward "Multi-agent LLM training" (MALT) on reasoning problems. Our approach employs a sequential multi-agent setup with heterogeneous LLMs assigned specialized roles: a generator, verifier, and refinement model iteratively solving problems. We propose a trajectory-expansion-based synthetic data generation process and a credit assignment strategy driven by joint outcome-based rewards. This enables our post-training setup to utilize both positive and negative trajectories to autonomously improve each model's specialized capabilities as part of a joint sequential system. We evaluate our approach across MATH, GSM8k, and CQA, where MALT on Llama 3.1 8B models achieves relative improvements of 14.14%, 7.12%, and 9.40% respectively over the same baseline model. This demonstrates an early advance in multi-agent cooperative capabilities for performance on mathematical and common sense reasoning questions. More generally, our work provides a concrete direction for research around multi-agent LLM training approaches.
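Note on the reported numbers: the abstract gives relative, not absolute, improvements. A minimal sketch of the difference, using made-up accuracies (the 40.0 and 46.0 below are illustrative assumptions, not results from the paper):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical accuracies in percent, NOT taken from the paper.
baseline_acc = 40.0
improved_acc = 46.0

absolute_gain = improved_acc - baseline_acc  # gain in percentage points
relative_gain = relative_improvement(baseline_acc, improved_acc)
print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.2f}%")
```

With these placeholder values the absolute gain is 6 points while the relative gain is about 15%, so the two figures should not be compared directly.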

Title: A Novel Generative Multi-Task Representation Learning Approach for Predicting Postoperative Complications in Cardiac Surgery Patients

Authors: Junbo Shen, Bing Xue, Thomas Kannampallil, Chenyang Lu, Joanna Abraham
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.01950
Pdf URL: https://arxiv.org/pdf/2412.01950
Copy Paste: [[2412.01950]] A Novel Generative Multi-Task Representation Learning Approach for Predicting Postoperative Complications in Cardiac Surgery Patients(https://arxiv.org/abs/2412.01950)
Keywords: generative
Abstract: Early detection of surgical complications allows for timely therapy and proactive risk mitigation. Machine learning (ML) can be leveraged to identify and predict patient risks for postoperative complications. We developed and validated the effectiveness of predicting postoperative complications using a novel surgical Variational Autoencoder (surgVAE) that uncovers intrinsic patterns via cross-task and cross-cohort representation learning. This retrospective cohort study used data from the electronic health records of adult surgical patients over four years (2018-2021). Six key postoperative complications for cardiac surgery were assessed: acute kidney injury, atrial fibrillation, cardiac arrest, deep vein thrombosis or pulmonary embolism, blood transfusion, and other intraoperative cardiac events. We compared prediction performances of surgVAE against widely-used ML models and advanced representation learning and generative models under 5-fold cross-validation. 89,246 surgeries (49% male, median (IQR) age: 57 (45-69)) were included, with 6,502 in the targeted cardiac surgery cohort (61% male, median (IQR) age: 60 (53-70)). surgVAE demonstrated superior performance over existing ML solutions across all postoperative complications of cardiac surgery patients, achieving macro-averaged AUPRC of 0.409 and macro-averaged AUROC of 0.831, which were 3.4% and 3.7% higher, respectively, than the best alternative method (by AUPRC scores). Model interpretation using Integrated Gradients highlighted key risk factors based on preoperative variable importance. surgVAE showed excellent discriminatory performance for predicting postoperative complications and addressing the challenges of data complexity, small cohort sizes, and low-frequency positive events. surgVAE enables data-driven predictions of patient risks and prognosis while enhancing the interpretability of patient risk profiles.
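The abstract reports macro-averaged scores across the six complications. As a rough illustration of what macro-averaging means, the sketch below computes AUROC per task with the pairwise-ranking definition and then takes the unweighted mean. The task names, labels, and scores are invented for the example and are not data from the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_average(values):
    """Unweighted mean over tasks, so rare complications count equally."""
    return sum(values) / len(values)


# Hypothetical per-task predictions (illustrative only).
tasks = {
    "task_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "task_b": ([0, 1, 0, 0], [0.2, 0.9, 0.3, 0.1]),
}
per_task = {name: auroc(y, s) for name, (y, s) in tasks.items()}
macro_auroc = macro_average(list(per_task.values()))
print(per_task, macro_auroc)
```

Macro-averaging weights each complication equally regardless of how many positive cases it has, which matters when positive events are infrequent.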

Title: Free Process Rewards without Process Labels

Authors: Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.01981
Pdf URL: https://arxiv.org/pdf/2412.01981
Copy Paste: [[2412.01981]] Free Process Rewards without Process Labels(https://arxiv.org/abs/2412.01981)
Keywords: generation
Abstract: Different from its counterpart outcome reward models (ORMs), which evaluate the entire responses, a process reward model (PRM) scores a reasoning trajectory step by step, providing denser and more fine-grained rewards. However, training a PRM requires labels annotated at every intermediate step, presenting significant challenges for both manual and automatic data collection. This paper aims to address this challenge. Both theoretically and empirically, we show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratios of the policy and reference models, which can be optimized regardless of the specific choice of loss objectives. In experiments, we instantiate our implicit PRMs with various objectives and evaluate their performance on MATH. We show that our implicit PRM outperforms a strong MCTS-based baseline à la Math-Shepherd using less than 1/38 of the training data. Its performance can be further improved with majority voting. We further find that scaling up instructions and responses benefits our implicit PRM, and the latter brings a larger gain. Particularly, we find that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction, the setup that suffers from extreme data scarcity and imbalance. Further, instructions should be relevant to downstream tasks while the diversity of responses does not bring gains. Surprisingly, training on extra Math-Shepherd step labels brings no further improvements to our implicit PRM trained on only outcome data. We hope that our work will encourage a rethinking of PRM training approaches and contribute to making training PRMs more accessible.
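The abstract states that the outcome reward is parameterized as a log-likelihood ratio between the policy and reference models. One way to read this, sketched below, is that the same ratio accumulated token by token can be differenced at step boundaries to give per-step rewards. The scaling factor `beta`, the `step_ends` boundaries, and the function name are assumptions made for illustration; the paper's exact formulation may differ.

```python
from itertools import accumulate


def implicit_step_rewards(policy_logps, ref_logps, step_ends, beta=1.0):
    """Split beta * log(pi / pi_ref) of one response into per-step rewards.

    policy_logps, ref_logps: per-token log-probabilities of the same response.
    step_ends: exclusive token index at which each reasoning step ends.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("log-prob lists must have equal length")
    # Per-token log-likelihood ratios, scaled by beta.
    ratios = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # cumulative[t] is the ratio summed over the first t tokens.
    cumulative = [0.0] + list(accumulate(ratios))
    rewards, prev = [], 0
    for end in step_ends:
        rewards.append(cumulative[end] - cumulative[prev])
        prev = end
    return rewards


# Hypothetical log-probabilities for a 4-token response with 2 steps.
policy = [-0.2, -0.1, -0.9, -0.3]
reference = [-0.5, -0.4, -0.3, -0.3]
steps = implicit_step_rewards(policy, reference, step_ends=[2, 4], beta=0.5)
outcome = 0.5 * (sum(policy) - sum(reference))
print(steps, outcome)
```

By construction the step rewards sum to the response-level reward, so only response-level labels are needed to train the quantity being differenced.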

Title: HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh Quality Assessment

Authors: Armin Shafiee Sarvestani, Sheyang Tang, Zhou Wang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.01986
Pdf URL: https://arxiv.org/pdf/2412.01986
Copy Paste: [[2412.01986]] HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh Quality Assessment(https://arxiv.org/abs/2412.01986)
Keywords: quality assessment
Abstract: Mesh quality assessment (MQA) models play a critical role in the design, optimization, and evaluation of mesh operation systems in a wide variety of applications. Current MQA models, whether model-based methods using topology-aware features or projection-based approaches working on rendered 2D projections, often fail to capture the intricate interactions between texture and 3D geometry. We introduce HybridMQA, a first-of-its-kind hybrid full-reference colored MQA framework that integrates model-based and projection-based approaches, capturing complex interactions between textural information and 3D structures for enriched quality representations. Our method employs graph learning to extract detailed 3D representations, which are then projected to 2D using a novel feature rendering process that precisely aligns them with colored projections. This enables the exploration of geometry-texture interactions via cross-attention, producing comprehensive mesh quality representations. Extensive experiments demonstrate HybridMQA's superior performance across diverse datasets, highlighting its ability to effectively leverage geometry-texture interactions for a thorough understanding of mesh quality. Our implementation will be made publicly available.

Title: NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

Authors: Dar-Yen Chen, Hmrishav Bandyopadhyay, Kai Zou, Yi-Zhe Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02030
Pdf URL: https://arxiv.org/pdf/2412.02030
Copy Paste: [[2412.02030]] NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training(https://arxiv.org/abs/2412.02030)
Keywords: generation, quality assessment
Abstract: We introduce NitroFusion, a fundamentally different approach to single-step diffusion that achieves high-quality generation through a dynamic adversarial framework. While one-step methods offer dramatic speed advantages, they typically suffer from quality degradation compared to their multi-step counterparts. Just as a panel of art critics provides comprehensive feedback by specializing in different aspects like composition, color, and technique, our approach maintains a large pool of specialized discriminator heads that collectively guide the generation process. Each discriminator group develops expertise in specific quality aspects at different noise levels, providing diverse feedback that enables high-fidelity one-step generation. Our framework combines: (i) a dynamic discriminator pool with specialized discriminator groups to improve generation quality, (ii) strategic refresh mechanisms to prevent discriminator overfitting, and (iii) global-local discriminator heads for multi-scale quality assessment together with unconditional/conditional training for balanced generation. Additionally, our framework uniquely supports flexible deployment through bottom-up refinement, allowing users to dynamically choose between 1-4 denoising steps with the same model for direct quality-speed trade-offs. Through comprehensive experiments, we demonstrate that NitroFusion significantly outperforms existing single-step methods across multiple evaluation metrics, particularly excelling in preserving fine details and global consistency.

Title: GNN-based Auto-Encoder for Short Linear Block Codes: A DRL Approach

Authors: Kou Tian, Chentao Yue, Changyang She, Yonghui Li, Branka Vucetic
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2412.02053
Pdf URL: https://arxiv.org/pdf/2412.02053
Copy Paste: [[2412.02053]] GNN-based Auto-Encoder for Short Linear Block Codes: A DRL Approach(https://arxiv.org/abs/2412.02053)
Keywords: generation
Abstract: This paper presents a novel auto-encoder-based end-to-end channel encoding and decoding scheme. It integrates deep reinforcement learning (DRL) and graph neural networks (GNN) in code design by modeling the generation of code parity-check matrices as a Markov Decision Process (MDP), to optimize key coding performance metrics such as error rates and code algebraic properties. An edge-weighted GNN (EW-GNN) decoder is proposed, which operates on the Tanner graph with an iterative message-passing structure. Once trained on a single linear block code, the EW-GNN decoder can be directly used to decode other linear block codes of different code lengths and code rates. An iterative joint training of the DRL-based code designer and the EW-GNN decoder is performed to optimize the end-to-end encoding and decoding process. Simulation results show the proposed auto-encoder significantly surpasses several traditional coding schemes at short block lengths, including low-density parity-check (LDPC) codes with belief propagation (BP) decoding and maximum-likelihood decoding (MLD), and BCH with BP decoding, offering superior error-correction capabilities while maintaining low decoding complexity.
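For readers unfamiliar with the objects named in the abstract, the sketch below shows what a parity-check matrix does and how its Tanner graph is read off. It uses the textbook Hamming (7,4) code as a stand-in; this matrix is an assumption chosen for illustration and is not a code designed by the paper's DRL agent.

```python
# Parity-check matrix of the Hamming (7,4) code: column j (1-based) is the
# binary representation of j, least significant bit in the first row.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def syndrome(H, word):
    """Compute H * word^T over GF(2); all zeros means every check is satisfied."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]


def tanner_edges(H):
    """Edges (check node, variable node) of the Tanner graph defined by H."""
    return [(c, v) for c, row in enumerate(H) for v, bit in enumerate(row) if bit]


codeword = [1, 1, 1, 0, 0, 0, 0]   # a valid codeword of this code
corrupted = codeword.copy()
corrupted[4] ^= 1                  # flip the bit at index 4 (position 5)

print(syndrome(H, codeword))       # all checks satisfied
print(syndrome(H, corrupted))      # non-zero: equals column 4 of H
print(len(tanner_edges(H)))        # one edge per 1 in H
```

A message-passing decoder such as BP exchanges messages along exactly these edges; the abstract's EW-GNN decoder is described as operating on the same graph structure.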

Title: CLERF: Contrastive LEaRning for Full Range Head Pose Estimation

Authors: Ting-Ruen Wei, Haowei Liu, Huei-Chung Hu, Xuyang Wu, Yi Fang, Hsin-Tai Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02066
Pdf URL: https://arxiv.org/pdf/2412.02066
Copy Paste: [[2412.02066]] CLERF: Contrastive LEaRning for Full Range Head Pose Estimation(https://arxiv.org/abs/2412.02066)
Keywords: generative
Abstract: We introduce a novel framework for representation learning in head pose estimation (HPE). Previously, such a scheme was difficult because the sparsity of head pose data made triplet sampling infeasible. Recent progress in 3D generative adversarial networks (3D-aware GANs) has opened the door for easily sampling triplets (anchor, positive, negative). We perform contrastive learning on extensively augmented data including geometric transformations and demonstrate that contrastive learning allows networks to learn genuine features that contribute to accurate HPE. On the other hand, we observe that existing HPE works struggle to predict head poses as accurately when test image rotation matrices are slightly out of the training dataset distribution. Experiments show that our methodology performs on par with state-of-the-art models on standard test datasets and outperforms them when images are slightly rotated or flipped, or show full-range head poses. To the best of our knowledge, we are the first to deliver a true full range HPE model capable of accurately predicting any head pose including upside-down pose. Furthermore, we compared with other existing full-yaw range models and demonstrated superior results.
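The abstract describes contrastive learning over (anchor, positive, negative) triplets. A common objective for such triplets is the margin-based triplet loss, sketched below on plain feature vectors. The choice of this particular loss, the Euclidean distance, and the margin of 0.2 are assumptions for illustration; the abstract does not specify the exact objective.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on Euclidean distances between embeddings."""
    d_pos = math.dist(anchor, positive)   # anchor should be close to positive
    d_neg = math.dist(anchor, negative)   # and far from negative
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings (illustrative only).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]

easy = triplet_loss(anchor, positive, negative=[1.0, 0.0])
hard = triplet_loss(anchor, positive, negative=[0.2, 0.0])
print(easy, hard)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate the margin contribute a training signal.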

Title: AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation

Authors: Zhihang Lin, Mingbao Lin, Wengyi Zhan, Rongrong Ji
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02099
Pdf URL: https://arxiv.org/pdf/2412.02099
Copy Paste: [[2412.02099]] AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation(https://arxiv.org/abs/2412.02099)
Keywords: generation
Abstract: Diffusion models suffer from severe object repetition and local distortion when the inference resolution differs from their pre-trained resolution. We propose AccDiffusion v2, an accurate method for patch-wise higher-resolution diffusion extrapolation without training. Our in-depth analysis in this paper shows that using an identical text prompt for different patches leads to repetitive generation, while the absence of a prompt undermines image details. In response, our AccDiffusion v2 novelly decouples the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of a patch. Further analysis reveals that local distortion arises from inaccurate descriptions in prompts about the local structure of higher-resolution images. To address this issue, AccDiffusion v2, for the first time, introduces auxiliary local structural information through ControlNet during higher-resolution diffusion extrapolation, aiming to mitigate the local distortions. Finally, our analysis indicates that global semantic information is conducive to suppressing both repetitive generation and local distortion. Hence, our AccDiffusion v2 further proposes dilated sampling with window interaction for better global semantic information during higher-resolution diffusion extrapolation. We conduct extensive experiments, including both quantitative and qualitative comparisons, to demonstrate the efficacy of our AccDiffusion v2. The quantitative comparison shows that AccDiffusion v2 achieves state-of-the-art performance in image generation extrapolation without training. The qualitative comparison intuitively illustrates that AccDiffusion v2 effectively suppresses the issues of repetitive generation and local distortion in image generation extrapolation. Our code is available at this https URL.

Title: Evaluating the Impact of Data Augmentation on Predictive Model Performance

Authors: Valdemar Švábenský, Conrad Borchers, Elizabeth B. Cloude, Atsushi Shimada
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2412.02108
Pdf URL: https://arxiv.org/pdf/2412.02108
Copy Paste: [[2412.02108]] Evaluating the Impact of Data Augmentation on Predictive Model Performance(https://arxiv.org/abs/2412.02108)
Keywords: generation
Abstract: In supervised machine learning (SML) research, large training datasets are essential for valid results. However, obtaining primary data in learning analytics (LA) is challenging. Data augmentation can address this by expanding and diversifying data, though its use in LA remains underexplored. This paper systematically compares data augmentation techniques and their impact on prediction performance in a typical LA task: prediction of academic outcomes. Augmentation is demonstrated on four SML models, which we successfully replicated from a previous LAK study based on AUC values. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01 and approximately halving the training time compared to the baseline models. In addition, we compared 99 combinations of chaining 21 techniques, and found minor, although statistically significant, improvements across models when adding noise to SMOTE-ENN (+0.014). Notably, some augmentation techniques significantly lowered predictive performance or increased performance fluctuation related to random chance. This paper's contribution is twofold. First, our empirical findings show that sampling techniques provide the most statistically reliable performance improvements for LA applications of SML, and are computationally more efficient than deep generation methods with complex hyperparameter settings. Second, the LA community may benefit from validating a recent study through independent replication.
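SMOTE-ENN combines SMOTE oversampling with Edited Nearest Neighbours cleaning. The sketch below covers only the SMOTE half: a synthetic minority sample is drawn on the line segment between a minority point and one of its nearest minority neighbours. The function name, the toy points, and the value of `k` are assumptions for illustration, and the ENN cleaning step is deliberately omitted; this is not the study's pipeline.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic minority sample by interpolating toward a neighbour."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # k nearest minority neighbours of `base`, excluding the point itself.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return [b + lam * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical minority-class points (illustrative only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(0))
print(synthetic)
```

Because the new point is a convex combination of two existing minority points, it always stays inside the region spanned by the minority class.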

Title: OmniCreator: Self-Supervised Unified Generation with Universal Editing

Authors: Haodong Chen, Lan Wang, Harry Yang, Ser-Nam Lim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02114
Pdf URL: https://arxiv.org/pdf/2412.02114
Copy Paste: [[2412.02114]] OmniCreator: Self-Supervised Unified Generation with Universal Editing(https://arxiv.org/abs/2412.02114)
Keywords: generation, generative
Abstract: We introduce OmniCreator, a novel framework that can conduct text-prompted unified (image+video) generation as well as editing all in one place. OmniCreator acquires generative and universal editing capabilities in a self-supervised manner, taking original text-video pairs as conditions while utilizing the same video as a denoising target to learn the semantic correspondence between video and text. During inference, when presented with a text prompt and a video, OmniCreator is capable of generating a target that is faithful to both, achieving a universal editing effect that is unconstrained as opposed to existing editing work that primarily focuses on certain editing types or relies on additional controls (e.g., structural conditions, attention features, or DDIM inversion). On the other hand, when presented with a text prompt only, OmniCreator becomes generative, producing high-quality video as a result of the semantic correspondence learned. Importantly, we found that the same capabilities extend to images as is, making OmniCreator a truly unified framework. Further, due to the lack of existing generative video editing benchmarks, we introduce the OmniBench-99 dataset, designed to evaluate the performance of generative video editing models comprehensively. Extensive experiments demonstrate that OmniCreator exhibits substantial superiority over all other models.

Title: Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis

Authors: Yu Yuan, Xijun Wang, Yichen Sheng, Prateek Chennuri, Xingguang Zhang, Stanley Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02168
Pdf URL: https://arxiv.org/pdf/2412.02168
Copy Paste: [[2412.02168]] Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis(https://arxiv.org/abs/2412.02168)
Keywords: generation, generative
Abstract: Image generation today can produce somewhat realistic images from text prompts. However, if one asks the generator to synthesize a particular camera setting such as creating different fields of view using a 24mm lens versus a 70mm lens, the generator will not be able to interpret and generate scene-consistent images. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue of bridging the gap between the data-driven models and the physical world. In this paper, we introduce the concept of Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions for different camera settings. Experimental results show that our method produces significantly more scene-consistent photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.
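To make the 24mm versus 70mm example concrete, the field of view of an ideal rectilinear lens follows from the pinhole relation FOV = 2 * atan(sensor_width / (2 * focal_length)). The sketch below assumes a 36 mm wide full-frame sensor; the sensor size is an assumption and is not stated in the abstract.

```python
import math


def horizontal_fov_deg(focal_length_mm, sensor_width_mm=36.0):
    """Horizontal field of view of an ideal rectilinear lens, in degrees."""
    if focal_length_mm <= 0:
        raise ValueError("focal length must be positive")
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_length_mm)))


fov_24 = horizontal_fov_deg(24.0)   # wide-angle lens
fov_70 = horizontal_fov_deg(70.0)   # short telephoto lens
print(f"24mm: {fov_24:.1f} deg, 70mm: {fov_70:.1f} deg")
```

Under this assumption the 24mm lens covers roughly 74 degrees and the 70mm lens roughly 29 degrees, so the same scene should appear framed about two and a half times more tightly at 70mm.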

Title: LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Authors: Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02193
Pdf URL: https://arxiv.org/pdf/2412.02193
Copy Paste: [[2412.02193]] LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models(https://arxiv.org/abs/2412.02193)
Keywords: generation
Abstract: Open-universe 3D layout generation arranges unlabeled 3D assets conditioned on language instruction. Large language models (LLMs) struggle to generate physically plausible 3D scenes and to adhere to input instructions, particularly in cluttered scenes. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve performance.

Title: 3D representation in 512-Byte:Variational tokenizer is the key for autoregressive 3D generation

Authors: Jinzhi Zhang, Feng Xiong, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02202
Pdf URL: https://arxiv.org/pdf/2412.02202
Copy Paste: [[2412.02202]] 3D representation in 512-Byte:Variational tokenizer is the key for autoregressive 3D generation(https://arxiv.org/abs/2412.02202)
Keywords: generation
Abstract: Autoregressive transformers have revolutionized high-fidelity image generation. One crucial ingredient lies in the tokenizer, which compresses high-resolution image patches into manageable discrete tokens with a scanning or hierarchical order suitable for large language models. Extending these tokenizers to 3D generation, however, presents a significant challenge: unlike image patches that naturally exhibit spatial sequence and multi-scale relationships, 3D data lacks an inherent order, making it difficult to compress into fewer tokens while preserving structural details. To address this, we introduce the Variational Tokenizer (VAT), which transforms unordered 3D data into compact latent tokens with an implicit hierarchy, suited for efficient and high-fidelity coarse-to-fine autoregressive modeling. VAT begins with an in-context transformer, which compresses numerous unordered 3D features into a reduced token set with minimal information loss. This latent space is then mapped to a Gaussian distribution for residual quantization, with token counts progressively increasing across scales. In this way, tokens at different scales naturally establish the interconnections by allocating themselves into different subspaces within the same Gaussian distribution, facilitating discrete modeling of token relationships across scales. During the decoding phase, a high-resolution triplane is utilized to convert these compact latent tokens into detailed 3D shapes. Extensive experiments demonstrate that VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization. Remarkably, VAT achieves up to 250x compression, reducing a 1MB mesh to just 3.9KB with a 96% F-score, and can further compress to 256 int8 tokens, achieving a 2000x reduction while maintaining a 92% F-score.
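The abstract mentions residual quantization across scales. As a rough intuition for the general technique, the toy sketch below quantizes a single scalar in several stages, each stage encoding what the previous stages left over. The scalar setting and the hand-picked codebooks are assumptions for illustration; VAT itself operates on latent token sets, and its codebooks and scale schedule are not described in the abstract.

```python
def residual_quantize(x, codebooks):
    """Quantize scalar x stage by stage; each stage encodes the leftover residual."""
    residual, indices = x, []
    for codebook in codebooks:
        # Pick the codeword closest to what is still unexplained.
        idx = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        indices.append(idx)
        residual -= codebook[idx]
    return indices, residual


def reconstruct(indices, codebooks):
    """Sum the selected codewords from every stage."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))


# Hypothetical coarse-to-fine codebooks (illustrative only).
codebooks = [
    [-1.0, 0.0, 1.0],
    [-0.25, 0.0, 0.25],
    [-0.0625, 0.0, 0.0625],
]
indices, leftover = residual_quantize(0.8, codebooks)
approx = reconstruct(indices, codebooks)
print(indices, approx, leftover)
```

Each stage only needs a small codebook, yet the combined indices describe the value with increasing precision, which is the coarse-to-fine property that makes the representation convenient for autoregressive modeling.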

Title: An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction

Authors: Yaxin Liang, Xinshi Li, Xin Huang, Ziqi Zhang, Yue Yao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02211
Pdf URL: https://arxiv.org/pdf/2412.02211
Copy Paste: [[2412.02211]] An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction(https://arxiv.org/abs/2412.02211)
Keywords: generative
Abstract: This study proposes an automated data mining framework based on autoencoders and experimentally verifies its effectiveness in feature extraction and data dimensionality reduction. Through the encoding-decoding structure, the autoencoder can capture the data's potential characteristics and achieve noise reduction and anomaly detection, providing an efficient and stable solution for the data mining process. The experiment compared the performance of the autoencoder with traditional dimensionality reduction methods (such as PCA, FA, t-SNE, and UMAP). The results showed that the autoencoder performed best in terms of reconstruction error and root mean square error and could better retain data structure and enhance the generalization ability of the model. The autoencoder-based framework not only reduces manual intervention but also significantly improves the automation of data processing. In the future, with the advancement of deep learning and big data technology, the autoencoder method combined with a generative adversarial network (GAN) or graph neural network (GNN) is expected to be more widely used in the fields of complex data processing, real-time data analysis and intelligent decision-making.
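The comparison in the abstract above rests on reconstruction error and root mean square error. As a minimal sketch of how such metrics are typically computed, assuming plain nested lists and two made-up reconstructions (`recon_a`, `recon_b` are hypothetical and not taken from the paper):

```python
import math


def reconstruction_errors(original, reconstructed):
    """Return (MSE, RMSE) over all entries of two equally shaped 2-D lists."""
    squared = [
        (x - x_hat) ** 2
        for row, row_hat in zip(original, reconstructed)
        for x, x_hat in zip(row, row_hat)
    ]
    mse = sum(squared) / len(squared)
    return mse, math.sqrt(mse)


# Hypothetical toy data: two samples with two features each.
data = [[1.0, 2.0], [3.0, 4.0]]
# Hypothetical reconstructions produced by two different reduction methods.
recon_a = [[1.0, 2.0], [3.0, 2.0]]
recon_b = [[1.5, 2.5], [3.5, 4.5]]

errors_a = reconstruction_errors(data, recon_a)
errors_b = reconstruction_errors(data, recon_b)
```

A lower RMSE means the reduced representation retains more of the original data; in this toy case `recon_b` scores better than `recon_a`.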

Title: CubeFormer: A Simple yet Effective Baseline for Lightweight Image Super-Resolution

Authors: Jikai Wang, Huan Zheng, Jianbing Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02234
Pdf URL: https://arxiv.org/pdf/2412.02234
Copy Paste: [[2412.02234]] CubeFormer: A Simple yet Effective Baseline for Lightweight Image Super-Resolution(https://arxiv.org/abs/2412.02234)
Keywords: super-resolution
Abstract: Lightweight image super-resolution (SR) methods aim at increasing the resolution and restoring the details of an image using a lightweight neural network. However, current lightweight SR methods still suffer from inferior performance and unpleasant details. Our analysis reveals that these methods are hindered by constrained feature diversity, which adversely impacts feature representation and detail recovery. To address this issue, we propose a simple yet effective baseline called CubeFormer, designed to enhance feature richness by completing holistic information aggregation. To be specific, we introduce cube attention, which expands 2D attention to 3D space, facilitating exhaustive information interactions, further encouraging comprehensive information extraction and promoting feature variety. In addition, we inject block and grid sampling strategies to construct intra-cube transformer blocks (Intra-CTB) and inter-cube transformer blocks (Inter-CTB), which perform local and global modeling, respectively. Extensive experiments show that our CubeFormer achieves state-of-the-art performance on commonly used SR benchmarks. Our source code and models will be publicly available.

Title: Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models

Authors: Jungwon Park, Jungmin Ko, Dongnam Byun, Jangwon Suh, Wonjong Rhee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02237
Pdf URL: https://arxiv.org/pdf/2412.02237
Copy Paste: [[2412.02237]] Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models(https://arxiv.org/abs/2412.02237)
Keywords: generation, generative
Abstract: Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains somewhat limited. In this study, we present a method for constructing Head Relevance Vectors (HRVs) that align with useful visual concepts. An HRV for a given visual concept is a vector with a length equal to the total number of cross-attention heads, where each element represents the importance of the corresponding head for the given visual concept. We develop and employ an ordered weakening analysis to demonstrate the effectiveness of HRVs as interpretable features. To demonstrate the utility of HRVs, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. We show that misinterpretations of polysemous words in image generation can be corrected in most cases, five challenging attributes in image editing can be successfully modified, and catastrophic neglect in multi-concept generation can be mitigated. Overall, our work provides an advancement in understanding cross-attention layers and introduces new approaches for fine-controlling these layers at the head level.

Title: Fast LiDAR Data Generation with Rectified Flows

Authors: Kazuto Nakashima, Xiaowen Liu, Tomoya Miyawaki, Yumi Iwashita, Ryo Kurazume
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.02241
Pdf URL: https://arxiv.org/pdf/2412.02241
Copy Paste: [[2412.02241]] Fast LiDAR Data Generation with Rectified Flows(https://arxiv.org/abs/2412.02241)
Keywords: restoration, generation, generative
Abstract: Building LiDAR generative models holds promise as powerful data priors for restoration, scene manipulation, and scalable simulation in autonomous mobile robots. In recent years, approaches using diffusion models have emerged, significantly improving training stability and generation quality. Despite the success of diffusion models, generating high-quality samples requires numerous iterations of running neural networks, and the increasing computational cost can pose a barrier to robotics applications. To address this challenge, this paper presents R2Flow, a fast and high-fidelity generative model for LiDAR data. Our method is based on rectified flows that learn straight trajectories, simulating data generation with far fewer sampling steps than diffusion models. We also propose an efficient Transformer-based model architecture for processing the image representation of LiDAR range and reflectance measurements. Our experiments on the unconditional generation of the KITTI-360 dataset demonstrate the effectiveness of our approach in terms of both efficiency and quality.
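To illustrate why straight trajectories allow few sampling steps, here is a toy sketch of fixed-step Euler integration of dx/dt = v(x, t) from t = 0 to t = 1. It is not R2Flow itself: `straight_velocity` is a hypothetical closed-form stand-in for a learned velocity network, and `target` is an assumed fixed data point.

```python
def straight_velocity(x, t, target):
    """Velocity of a straight path reaching `target` at t = 1.

    Hypothetical stand-in for a learned velocity network.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest value is 1 - dt, so (1 - t) never reaches zero
        v = straight_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.0]   # hypothetical starting sample
target = [2.0, 3.0]   # hypothetical data point

one_step = euler_sample(noise, target, num_steps=1)
four_steps = euler_sample(noise, target, num_steps=4)
```

Because the velocity is constant along a straight path, one Euler step lands on the same endpoint as four. A curved trajectory, as is common for diffusion samplers, would need more steps to reach comparable accuracy.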

Title: VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation

Authors: Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02259
Pdf URL: https://arxiv.org/pdf/2412.02259
Copy Paste: [[2412.02259]] VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation(https://arxiv.org/abs/2412.02259)
Keywords: generation
Abstract: Current video generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos. Existing models trained on large-scale data on the back of rich computational resources are unsurprisingly inadequate for maintaining a logical storyline and visual consistency across multiple shots of a cohesive script since they are often trained with a single-shot objective. To this end, we propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation. VGoT is designed with three goals in mind. Multi-Shot Video Generation: We divide the video generation process into a structured, modular sequence, including (1) Script Generation, which translates a curt story into detailed prompts for each shot; (2) Keyframe Generation, responsible for creating visually consistent keyframes faithful to character portrayals; (3) Shot-Level Video Generation, which transforms information from scripts and keyframes into shots; and (4) a Smoothing Mechanism that ensures a consistent multi-shot output. Reasonable Narrative Design: Inspired by cinematic scriptwriting, our prompt generation approach spans five key domains, ensuring logical consistency, character development, and narrative flow across the entire video. Cross-Shot Consistency: We ensure temporal and identity consistency by leveraging identity-preserving (IP) embeddings across shots, which are automatically created from the narrative. Additionally, we incorporate a cross-shot smoothing mechanism, which integrates a reset boundary that effectively combines latent features from adjacent shots, resulting in smooth transitions and maintaining visual coherence throughout the video. Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.

Title: Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

Authors: Jingyu Gong, Chong Zhang, Fengqi Liu, Ke Fan, Qianyu Zhou, Xin Tan, Zhizhong Zhang, Yuan Xie, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02261
Pdf URL: https://arxiv.org/pdf/2412.02261
Copy Paste: [[2412.02261]] Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis(https://arxiv.org/abs/2412.02261)
Keywords: generation
Abstract: Human motion generation is a long-standing problem, and scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data whose quantity is limited. Meanwhile, it is difficult to generalize to diverse scenes when trained only on a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this framework, we disentangle human-scene interaction from motion synthesis during training and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion can be derived through iterative diffusion denoising and implicit policy optimization, thus motion naturalness and interaction plausibility can be maintained simultaneously. The proposed implicit policy optimizes the intermediate noised motion in a GAN Inversion manner to maintain motion continuity and control keyframe poses through the ControlNet branch and motion inpainting. For long-term motion synthesis, we introduce motion blending for stable transitions between multiple sub-tasks, where motions are fused in rotation power space and translation linear space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture, and real scenes from PROX and Replica. Results show that our framework presents better motion naturalness and interaction plausibility than cutting-edge methods. This also indicates the feasibility of utilizing the DIP for motion synthesis in more general tasks and versatile scenes.
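One plausible reading of blending "in rotation power space and translation linear space" is spherical interpolation (slerp) for rotations combined with linear interpolation for translations. The sketch below shows that reading only; the quaternion layout (w, x, y, z), the `blend_pose` helper, and the sample poses are assumptions, not the paper's implementation.

```python
import math


def lerp(p0, p1, w):
    """Linear interpolation between two translation vectors."""
    return [(1.0 - w) * a + w * b for a, b in zip(p0, p1)]


def slerp(q0, q1, w):
    """Spherical interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # flip one quaternion to follow the shorter arc
        q1 = [-c for c in q1]
        dot = -dot
    theta = math.acos(min(dot, 1.0))
    if theta < 1e-8:  # rotations are nearly identical
        return list(q0)
    s = math.sin(theta)
    a = math.sin((1.0 - w) * theta) / s
    b = math.sin(w * theta) / s
    return [a * c0 + b * c1 for c0, c1 in zip(q0, q1)]


def blend_pose(pose_a, pose_b, w):
    """Blend two (translation, rotation) poses with weight w in [0, 1]."""
    return lerp(pose_a[0], pose_b[0], w), slerp(pose_a[1], pose_b[1], w)


# Hypothetical poses: identity rotation vs. a 90-degree turn about the z axis.
pose_a = ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
pose_b = ([2.0, 0.0, 0.0], [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)])

mid_translation, mid_rotation = blend_pose(pose_a, pose_b, 0.5)
```

Halfway through the blend the translation is the midpoint and the rotation is a 45-degree turn about z, so the transition moves at constant angular speed rather than distorting the rotation.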

Title: Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation

Authors: Sepand Dyanatkar, Angran Li, Alexander Dungate
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02262
Pdf URL: https://arxiv.org/pdf/2412.02262
Copy Paste: [[2412.02262]] Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation(https://arxiv.org/abs/2412.02262)
Keywords: generation
Abstract: Climate change's destruction of marine biodiversity is threatening communities and economies around the world which rely on healthy oceans for their livelihoods. The challenge of applying computer vision to niche, real-world domains such as ocean conservation lies in the dynamic and diverse environments where traditional top-down learning struggles with long-tailed distributions, generalization, and domain transfer. Scalable species identification for ocean monitoring is particularly difficult due to the need to adapt models to new environments and identify rare or unseen species. To overcome these limitations, we propose leveraging bottom-up, open-domain learning frameworks as a resilient, scalable solution for image and video analysis in marine applications. Our preliminary demonstration uses pretrained vision-language models (VLMs) combined with retrieval-augmented generation (RAG) as grounding, leaving the door open for numerous architectural, training and engineering optimizations. We validate this approach through a preliminary application in classifying fish from video onboard fishing vessels, demonstrating impressive emergent retrieval and prediction capabilities without domain-specific training or knowledge of the task itself.

Title: Sustainable Self-evolution Adversarial Training

Authors: Wenxuan Wang, Chenglei Wang, Huihui Qi, Menghao Ye, Xuelin Qian, Peng Wang, Yanning Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02270
Pdf URL: https://arxiv.org/pdf/2412.02270
Copy Paste: [[2412.02270]] Sustainable Self-evolution Adversarial Training(https://arxiv.org/abs/2412.02270)
Keywords: generation
Abstract: With the wide application of deep neural network models in various computer vision tasks, there has been a proliferation of adversarial example generation strategies aimed at deeply exploring model security. However, existing adversarial training defense models, which rely on single or limited types of attacks under a one-time learning process, struggle to adapt to the dynamic and evolving nature of attack methods. Therefore, to achieve defense performance improvements for models in long-term applications, we propose a novel Sustainable Self-Evolution Adversarial Training (SSEAT) framework. Specifically, we introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples across multiple stages. Additionally, to address the issue of model catastrophic forgetting caused by continual learning from ongoing novel attacks, we propose an adversarial data replay module to better select more diverse and key relearning data. Furthermore, we design a consistency regularization strategy to encourage current defense models to learn more from previously trained ones, guiding them to retain more past knowledge and maintain accuracy on clean samples. Extensive experiments have been conducted to verify the efficacy of the proposed SSEAT defense method, which demonstrates superior defense performance and classification accuracy compared to competitors.

Title: PCIM: Learning Pixel Attributions via Pixel-wise Channel Isolation Mixing in High Content Imaging

Authors: Daniel Siegismund, Mario Wieser, Stephan Heyse, Stephan Steigele
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02275
Pdf URL: https://arxiv.org/pdf/2412.02275
Copy Paste: [[2412.02275]] PCIM: Learning Pixel Attributions via Pixel-wise Channel Isolation Mixing in High Content Imaging(https://arxiv.org/abs/2412.02275)
Keywords: generation
Abstract: Deep Neural Networks (DNNs) have shown remarkable success in various computer vision tasks. However, their black-box nature often leads to difficulty in interpreting their decisions, creating an unfilled need for methods to explain the decisions, and ultimately forming a barrier to their wide acceptance, especially in biomedical applications. This work introduces a novel method, Pixel-wise Channel Isolation Mixing (PCIM), to calculate pixel attribution maps, highlighting the image parts most crucial for a classification decision without the need to extract internal network states or gradients. Unlike existing methods, PCIM treats each pixel as a distinct input channel and trains a blending layer to mix these pixels, reflecting specific classifications. This unique approach allows the generation of pixel attribution maps for each image while remaining agnostic to the choice of the underlying classification network. Benchmark testing on three application-relevant, diverse High Content Imaging datasets shows state-of-the-art performance, particularly for model fidelity and localization ability in both fluorescence and bright-field High Content Imaging. PCIM contributes as a unique and effective method for creating pixel-level attribution maps from arbitrary DNNs, enabling interpretability and trust.

Title: GQWformer: A Quantum-based Transformer for Graph Representation Learning

Authors: Lei Yu, Hongyang Chen, Jingsong Lv, Linyao Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02285
Pdf URL: https://arxiv.org/pdf/2412.02285
Copy Paste: [[2412.02285]] GQWformer: A Quantum-based Transformer for Graph Representation Learning(https://arxiv.org/abs/2412.02285)
Keywords: generation
Abstract: Graph Transformers (GTs) have demonstrated significant advantages in graph representation learning through their global attention mechanisms. However, the self-attention mechanism in GTs tends to neglect the inductive biases inherent in graph structures, making it challenging to effectively capture essential structural information. To address this issue, we propose a novel approach that integrates graph inductive bias into self-attention mechanisms by leveraging quantum technology for structural encoding. In this paper, we introduce the Graph Quantum Walk Transformer (GQWformer), a groundbreaking GNN framework that utilizes quantum walks on attributed graphs to generate node quantum states. These quantum states encapsulate rich structural attributes and serve as inductive biases for the transformer, thereby enabling the generation of more meaningful attention scores. By subsequently incorporating a recurrent neural network, our design amplifies the model's ability to focus on both local and global information. We conducted comprehensive experiments across five publicly available datasets to evaluate the effectiveness of our model. These results clearly indicate that GQWformer outperforms existing state-of-the-art graph classification algorithms. These findings highlight the significant potential of integrating quantum computing methodologies with traditional GNNs to advance the field of graph representation learning, providing a promising direction for future research and applications.

Title: Viewpoint Consistency in 3D Generation via Attention and CLIP Guidance

Authors: Qing Zhang, Zehao Chen, Jinguang Tong, Jing Zhang, Jie Hong, Xuesong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02287
Pdf URL: https://arxiv.org/pdf/2412.02287
Copy Paste: [[2412.02287]] Viewpoint Consistency in 3D Generation via Attention and CLIP Guidance(https://arxiv.org/abs/2412.02287)
Keywords: generation
Abstract: Despite recent advances in text-to-3D generation techniques, current methods often suffer from geometric inconsistencies, commonly referred to as the Janus Problem. This paper identifies the root cause of the Janus Problem: viewpoint generation bias in diffusion models, which creates a significant gap between the actual generated viewpoint and the expected one required for optimizing the 3D model. To address this issue, we propose a tuning-free approach called the Attention and CLIP Guidance (ACG) mechanism. ACG enhances desired viewpoints by adaptively controlling cross-attention maps, employs CLIP-based view-text similarities to filter out erroneous viewpoints, and uses a coarse-to-fine optimization strategy with staged prompts to progressively refine 3D generation. Extensive experiments demonstrate that our method significantly reduces the Janus Problem without compromising generation speed, establishing ACG as an efficient, plug-and-play component for existing text-to-3D frameworks.
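The abstract above mentions filtering erroneous viewpoints with CLIP-based view-text similarities. A rough sketch of that idea follows, assuming the image and text embeddings have already been computed by some CLIP-style encoder. The three-dimensional vectors, the view prompts, and the `keep_render` helper are all made up for illustration and are not the ACG implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predicted_view(image_emb, view_text_embs):
    """Return the view prompt whose text embedding best matches the image."""
    return max(view_text_embs, key=lambda name: cosine(image_emb, view_text_embs[name]))


def keep_render(image_emb, view_text_embs, expected_view):
    """Keep a rendered view only if its best-matching prompt is the expected one."""
    return predicted_view(image_emb, view_text_embs) == expected_view


# Hypothetical embeddings; a real pipeline would obtain these from an encoder.
view_text_embs = {
    "front view": [1.0, 0.0, 0.0],
    "side view": [0.0, 1.0, 0.0],
    "back view": [0.0, 0.0, 1.0],
}
render_emb = [0.9, 0.1, 0.2]  # looks like a front view

accepted = keep_render(render_emb, view_text_embs, "front view")
rejected = keep_render(render_emb, view_text_embs, "back view")
```

Here a render that resembles a front view is rejected when the camera was supposed to show the back, which is the kind of mismatch associated with the Janus Problem.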

Title: Enhanced Photovoltaic Power Forecasting: An iTransformer and LSTM-Based Model Integrating Temporal and Covariate Interactions

Authors: Guang Wu, Yun Wang, Qian Zhou, Ziyang Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02302
Pdf URL: https://arxiv.org/pdf/2412.02302
Copy Paste: [[2412.02302]] Enhanced Photovoltaic Power Forecasting: An iTransformer and LSTM-Based Model Integrating Temporal and Covariate Interactions(https://arxiv.org/abs/2412.02302)
Keywords: generation
Abstract: Accurate photovoltaic (PV) power forecasting is critical for integrating renewable energy sources into the grid, optimizing real-time energy management, and ensuring energy reliability amidst increasing demand. However, existing models often struggle with effectively capturing the complex relationships between target variables and covariates, as well as the interactions between temporal dynamics and multivariate data, leading to suboptimal forecasting accuracy. To address these challenges, we propose a novel model architecture that leverages the iTransformer for feature extraction from target variables and employs long short-term memory (LSTM) to extract features from covariates. A cross-attention mechanism is integrated to fuse the outputs of both models, followed by a Kolmogorov-Arnold network (KAN) mapping for enhanced representation. The effectiveness of the proposed model is validated using publicly available datasets from Australia, with experiments conducted across four seasons. Results demonstrate that the proposed model effectively captures seasonal variations in PV power generation and improves forecasting accuracy.

Title: HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset

Authors: Zedong Chu, Feng Xiong, Meiduo Liu, Jinzhi Zhang, Mingqi Shao, Zhaoxu Sun, Di Wang, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02317
Pdf URL: https://arxiv.org/pdf/2412.02317
Copy Paste: [[2412.02317]] HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset(https://arxiv.org/abs/2412.02317)
Keywords: generation
Abstract: With the rapid evolution of 3D generation algorithms, the cost of producing 3D humanoid character models has plummeted, yet the field is impeded by the lack of a comprehensive dataset for automatic rigging, which is a pivotal step in character animation. Addressing this gap, we present HumanRig, the first large-scale dataset specifically designed for 3D humanoid character rigging, encompassing 11,434 meticulously curated T-posed meshes adhering to a uniform skeleton topology. Capitalizing on this dataset, we introduce an innovative, data-driven automatic rigging framework, which overcomes the limitations of GNN-based methods in handling complex AI-generated meshes. Our approach integrates a Prior-Guided Skeleton Estimator (PGSE) module, which uses 2D skeleton joints to provide a preliminary 3D skeleton, and a Mesh-Skeleton Mutual Attention Network (MSMAN) that fuses skeleton features with 3D mesh features extracted by a U-shaped point transformer. This enables a coarse-to-fine 3D skeleton joint regression and a robust skinning estimation, surpassing previous methods in quality and versatility. This work not only remedies the dataset deficiency in rigging research but also propels the animation industry towards more efficient and automated character rigging pipelines.

Title: Controlling the Latent Diffusion Model for Generative Image Shadow Removal via Residual Generation

Authors: Xinjie Li, Yang Zhao, Dong Wang, Yuan Chen, Li Cao, Xiaoping Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02322
Pdf URL: https://arxiv.org/pdf/2412.02322
Copy Paste: [[2412.02322]] Controlling the Latent Diffusion Model for Generative Image Shadow Removal via Residual Generation(https://arxiv.org/abs/2412.02322)
Keywords: generation, generative
Abstract: Large-scale generative models have achieved remarkable advancements in various visual tasks, yet their application to shadow removal in images remains challenging. These models often generate diverse, realistic details without adequate focus on fidelity, failing to meet the crucial requirements of shadow removal, which necessitates precise preservation of image content. In contrast to prior approaches that aimed to regenerate shadow-free images from scratch, this paper utilizes diffusion models to generate and refine image residuals. This strategy fully uses the inherent detailed information within shadowed images, resulting in a more efficient and faithful reconstruction of shadow-free content. Additionally, to prevent the accumulation of errors during the generation process, a cross-timestep self-enhancement training strategy is proposed. This strategy leverages the network itself to augment the training data, not only increasing the volume of data but also enabling the network to dynamically correct its generation trajectory, ensuring a more accurate and robust output. In addition, to address the loss of original details in the process of image encoding and decoding of large generative models, a content-preserved encoder-decoder structure is designed with a control mechanism and multi-scale skip connections to achieve high-fidelity shadow-free image reconstruction. Experimental results demonstrate that the proposed method can reproduce high-quality results based on a large latent diffusion prior and faithfully preserve the original contents in shadow regions.

Title: SimuScope: Realistic Endoscopic Synthetic Dataset Generation through Surgical Simulation and Diffusion Models

Authors: Sabina Martyniak, Joanna Kaleta, Diego Dall'Alba, Michał Naskręt, Szymon Płotka, Przemysław Korzeniowski
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.02332
Pdf URL: https://arxiv.org/pdf/2412.02332
Copy Paste: [[2412.02332]] SimuScope: Realistic Endoscopic Synthetic Dataset Generation through Surgical Simulation and Diffusion Models(https://arxiv.org/abs/2412.02332)
Keywords: generation
Abstract: Computer-assisted surgical (CAS) systems enhance surgical execution and outcomes by providing advanced support to surgeons. These systems often rely on deep learning models trained on complex, challenging-to-annotate data. While synthetic data generation can address these challenges, enhancing the realism of such data is crucial. This work introduces a multi-stage pipeline for generating realistic synthetic data, featuring a fully-fledged surgical simulator that automatically produces all necessary annotations for modern CAS systems. This simulator generates a wide set of annotations that surpass those available in public synthetic datasets. Additionally, it offers a more complex and realistic simulation of surgical interactions, including the dynamics between surgical instruments and deformable anatomical environments, outperforming existing approaches. To further bridge the visual gap between synthetic and real data, we propose a lightweight and flexible image-to-image translation method based on Stable Diffusion (SD) and Low-Rank Adaptation (LoRA). This method leverages a limited amount of annotated data, enables efficient training, and maintains the integrity of annotations generated by our simulator. The proposed pipeline is experimentally validated and can translate synthetic images into images with real-world characteristics, which can generalize to real-world context, thereby improving both training and CAS guidance. The code and the dataset are available at this https URL.

Title: Amodal Depth Anything: Amodal Depth Estimation in the Wild

Authors: Zhenyu Li, Mykola Lavreniuk, Jian Shi, Shariq Farooq Bhat, Peter Wonka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02336
Pdf URL: https://arxiv.org/pdf/2412.02336
Copy Paste: [[2412.02336]] Amodal Depth Anything: Amodal Depth Estimation in the Wild(https://arxiv.org/abs/2412.02336)
Keywords: generative
Abstract: Amodal depth estimation aims to predict the depth of occluded (invisible) parts of objects in a scene. This task addresses the question of whether models can effectively perceive the geometry of occluded regions based on visible cues. Prior methods primarily rely on synthetic datasets and focus on metric depth estimation, limiting their generalization to real-world settings due to domain shifts and scalability challenges. In this paper, we propose a novel formulation of amodal depth estimation in the wild, focusing on relative depth prediction to improve model generalization across diverse natural images. We introduce a new large-scale dataset, Amodal Depth In the Wild (ADIW), created using a scalable pipeline that leverages segmentation datasets and compositing techniques. Depth maps are generated using large pre-trained depth models, and a scale-and-shift alignment strategy is employed to refine and blend depth predictions, ensuring consistency in ground-truth annotations. To tackle the amodal depth task, we present two complementary frameworks: Amodal-DAV2, a deterministic model based on Depth Anything V2, and Amodal-DepthFM, a generative model that integrates conditional flow matching principles. Our proposed frameworks effectively leverage the capabilities of large pre-trained models with minimal modifications to achieve high-quality amodal depth predictions. Experiments validate our design choices, demonstrating the flexibility of our models in generating diverse, plausible depth structures for occluded regions. Our method achieves a 69.5% improvement in accuracy over the previous SoTA on the ADIW dataset.
摘要：非模态深度估计旨在预测场景中物体被遮挡（不可见）部分的深度。此任务解决了模型是否能够根据可见线索有效感知被遮挡区域的几何形状的问题。先前的方法主要依赖于合成数据集并侧重于度量深度估计，由于域转移和可扩展性挑战，它们仅限于现实世界设置。在本文中，我们提出了一种新颖的自然非模态深度估计公式，重点关注相对深度预测，以提高模型在不同自然图像中的泛化能力。我们引入了一个新的大型数据集，即自然非模态深度 (ADIW)，它是使用利用分割数据集和合成技术的可扩展管道创建的。深度图是使用大型预训练深度模型生成的，并采用缩放和移位对齐策略来细化和混合深度预测，确保地面实况注释的一致性。为了解决非模态深度任务，我们提出了两个互补的框架：Amodal-DAV2（基于 Depth Anything V2 的确定性模型）和 Amodal-DepthFM（集成条件流匹配原理的生成模型）。我们提出的框架有效地利用了大型预训练模型的功能，只需进行少量修改即可实现高质量的非模态深度预测。实验验证了我们的设计选择，证明了我们的模型在为遮挡区域生成多样化、合理的深度结构方面的灵活性。我们的方法在 ADIW 数据集上的准确率比之前的 SoTA 提高了 69.5%。
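
Note: the scale-and-shift alignment mentioned in the abstract is commonly done as a least-squares fit of one depth map to another. The following is a minimal sketch under that assumption, not the authors' implementation; the function name `align_scale_shift` and the optional `mask` argument are hypothetical and chosen only for illustration.

```python
import numpy as np

def align_scale_shift(pred, target, mask=None):
    """Fit scalars s, t minimizing ||s * pred + t - target||^2 over masked pixels.

    Hypothetical helper for illustration; the paper's pipeline may differ.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    x = pred[mask]
    y = target[mask]
    # Design matrix [x, 1] so the solution vector is (scale, shift).
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t, s * pred + t

# Usage: recover a known scale and shift from a toy relative-depth map.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = 2.0 * pred + 0.5
s, t, aligned = align_scale_shift(pred, target)
```

Restricting the fit to a mask (for example, pixels visible in both maps) is one plausible way to keep blended regions consistent, but that choice is an assumption here.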

Title: GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing

Authors: Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood, Karthik Nandakumar, Naveed Akhtar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02366
Pdf URL: https://arxiv.org/pdf/2412.02366
Copy Paste: [[2412.02366]] GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing(https://arxiv.org/abs/2412.02366)
Keywords: generative
Abstract: Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, due to their inability to address domain gaps. This paper introduces GenMix, a generalizable prompt-guided generative data augmentation approach that enhances both in-domain and cross-domain image classification. Our technique leverages image editing to generate augmented images based on custom conditional prompts, designed specifically for each problem type. By blending portions of the input image with its edited generative counterpart and incorporating fractal patterns, our approach mitigates unrealistic images and label ambiguity, improving the performance and adversarial robustness of the resulting models. Efficacy of our method is established with extensive experiments on eight public datasets for general and fine-grained classification, in both in-domain and cross-domain settings. Additionally, we demonstrate performance improvements for self-supervised learning, learning with data scarcity, and adversarial robustness. As compared to the existing state-of-the-art methods, our technique achieves stronger performance across the board.
摘要：数据增强被广泛用于增强视觉分类任务的泛化能力。然而，当源域和目标域不同时，传统方法就会遇到困难，例如在域自适应中，因为它们无法解决域差距。本文介绍了 GenMix，这是一种可泛化的提示引导生成数据增强方法，可增强域内和跨域图像分类。我们的技术利用图像编辑来生成基于自定义条件提示的增强图像，这些提示是专门为每种问题类型设计的。通过将输入图像的部分与其编辑后的生成对应部分混合并结合分形图案，我们的方法可以减轻不切实际的图像和标签歧义，从而提高生成模型的性能和对抗鲁棒性。我们的方法的有效性是通过在域内和跨域设置中对八个公共数据集进行大量实验来确定的，这些实验用于一般和细粒度分类。此外，我们还展示了自监督学习、数据稀缺学习和对抗鲁棒性的性能改进。与现有的最先进方法相比，我们的技术全面实现了更强劲的性能。

Title: DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Authors: Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.02467
Pdf URL: https://arxiv.org/pdf/2412.02467
Copy Paste: [[2412.02467]] DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators(https://arxiv.org/abs/2412.02467)
Keywords: generation
Abstract: Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs) -- even those at the scale of GPT-2 -- have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at this https URL.
摘要：在差异隐私 (DP) 保护下生成表格数据可确保理论上的隐私保证，但对训练机器学习模型提出了挑战，这主要是因为需要在嘈杂的监督信号下捕捉复杂结构。最近，预先训练的大型语言模型 (LLM)——即使是 GPT-2 规模的模型——也已展示出在合成表格数据方面的巨大潜力。然而，它们在 DP 约束下的应用仍未得到充分探索。在这项工作中，我们通过将 DP 技术应用于合成表格数据的生成来解决这一差距。我们的研究结果表明，当使用 DP 进行微调时，LLM 难以生成连贯的文本，因为隐私预算分配给表格结构等非隐私元素的效率低下。为了解决这个问题，我们提出了一个用于差异隐私表格数据生成的两阶段微调框架。第一阶段涉及在伪数据集上进行非隐私微调，然后在隐私数据集上进行 DP 微调。我们的实证结果表明，与直接在 DP 环境中微调的 LLM 相比，这种方法可以提高各种设置和指标的性能。我们在此 https URL 上发布了我们的代码和设置。

Title: WEM-GAN: Wavelet transform based facial expression manipulation

Authors: Dongya Sun, Yunfei Hu, Xianzhe Zhang, Yingsong Hu
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.02530
Pdf URL: https://arxiv.org/pdf/2412.02530
Copy Paste: [[2412.02530]] WEM-GAN: Wavelet transform based facial expression manipulation(https://arxiv.org/abs/2412.02530)
Keywords: generation
Abstract: Facial expression manipulation aims to change human facial expressions without affecting face recognition. In order to transform the facial expressions to target expressions, previous methods relied on expression labels to guide the manipulation process. However, these methods failed to preserve the details of facial features, which causes the weakening or the loss of identity information in the output image. In our work, we propose WEM-GAN, short for wavelet-based expression manipulation GAN, which puts more effort into preserving the details of the original image in the editing process. Firstly, we take advantage of the wavelet transform technique and combine it with our generator with a U-net autoencoder backbone, in order to improve the generator's ability to preserve more details of facial features. Secondly, we also implement the high-frequency component discriminator, and use high-frequency domain adversarial loss to further constrain the optimization of our model, providing the generated face image with more abundant details. Additionally, in order to narrow the gap between generated facial expressions and target expressions, we use residual connections between encoder and decoder, while also using relative action units (AUs) several times. Extensive qualitative and quantitative experiments have demonstrated that our model performs better in preserving identity features, editing capability, and image generation quality on the AffectNet dataset. It also shows superior performance in metrics such as Average Content Distance (ACD) and Expression Distance (ED).
摘要：面部表情操纵旨在改变人类的面部表情而不影响人脸识别。为了将面部表情转换为目标表情，以前的方法依靠表情标签来指导操纵过程。然而，这些方法未能保留面部特征的细节，导致输出图像中的身份信息减弱或丢失。在我们的工作中，我们提出了 WEM-GAN，即基于小波的表情操纵 GAN，它在编辑过程中更加注重保留原始图像的细节。首先，我们利用小波变换技术，并将其与带有 U-net 自动编码器主干的生成器相结合，以提高生成器保留更多面部特征细节的能力。其次，我们还实现了高频分量鉴别器，并使用高频域对抗损失进一步约束我们模型的优化，为生成的面部图像提供更丰富的细节。此外，为了缩小生成的面部表情和目标表情之间的差距，我们在编码器和解码器之间使用残差连接，同时还多次使用相对动作单元 (AU)。大量定性和定量实验表明，我们的模型在 AffectNet 数据集上的身份特征保留、编辑能力和图像生成质量方面表现更佳，并且在平均内容距离 (ACD) 和表情距离 (ED) 等指标上也表现出优异的性能。

Title: Unveiling Concept Attribution in Diffusion Models

Authors: Quang H. Nguyen, Hoang Phan, Khoa D. Doan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02542
Pdf URL: https://arxiv.org/pdf/2412.02542
Copy Paste: [[2412.02542]] Unveiling Concept Attribution in Diffusion Models(https://arxiv.org/abs/2412.02542)
Keywords: generative
Abstract: Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains a black box; little do we know about the role of its components in exhibiting a concept such as objects or styles. Recent works employ causal tracing to localize layers storing knowledge in generative models without showing how those layers contribute to the target concept. In this work, we approach the model interpretability problem from a more general perspective and pose a question: "How do model components work jointly to demonstrate knowledge?". We adapt component attribution to decompose diffusion models, unveiling how a component contributes to a concept. Our framework allows effective model editing; in particular, we can erase a concept from diffusion models by removing positive components while retaining knowledge of other concepts. Surprisingly, we also show there exist components that contribute negatively to a concept, which has not been discovered in the knowledge localization approach. Experimental results confirm the role of positive and negative components pinpointed by our framework, depicting a complete view of interpreting generative models. Our code is available at this https URL.
摘要：扩散模型在根据文本提示生成逼真的高质量图像方面表现出了卓越的能力。然而，经过训练的模型仍然是黑箱；我们对其组件在展示对象或样式等概念方面的作用知之甚少。最近的研究采用因果追踪来定位生成模型中存储知识的层，而不展示这些层如何对目标概念做出贡献。在这项工作中，我们从更一般的角度探讨模型可解释性问题，并提出一个问题：“模型组件如何共同展示知识？”。我们采用组件归因来分解扩散模型，揭示组件如何对概念做出贡献。我们的框架允许有效的模型编辑，特别是，我们可以通过删除正组件同时保留其他概念的知识来从扩散模型中删除概念。令人惊讶的是，我们还表明存在对概念产生负面影响的组件，这在知识定位方法中尚未发现。实验结果证实了我们框架所确定的正负成分的作用，描绘了解释生成模型的完整视图。我们的代码可以在此 https URL 上找到。

Title: OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Authors: Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02592
Pdf URL: https://arxiv.org/pdf/2412.02592
Copy Paste: [[2412.02592]] OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation(https://arxiv.org/abs/2412.02592)
Keywords: generation
Abstract: Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various OCR noises. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: this https URL
摘要：检索增强生成 (RAG) 通过整合外部知识来减少幻觉并整合最新信息而无需重新训练，从而增强了大型语言模型 (LLM)。作为 RAG 的重要组成部分，外部知识库通常是通过使用光学字符识别 (OCR) 从非结构化 PDF 文档中提取结构化数据来构建的。然而，鉴于 OCR 预测不完善以及结构化数据固有的非统一表示，知识库不可避免地包含各种 OCR 噪声。在本文中，我们介绍了 OHRBench，这是第一个用于了解 OCR 对 RAG 系统的级联影响的基准。 OHRBench 包含来自六个真实 RAG 应用领域的 350 个精心挑选的非结构化 PDF 文档，以及从文档中的多模态元素派生的问答，对用于 RAG 的现有 OCR 解决方案提出了挑战。为了更好地理解 OCR 对 RAG 系统的影响，我们确定了两种主要类型的 OCR 噪声：语义噪声和格式噪声，并应用扰动来生成一组结构化数据，每种 OCR 噪声的程度各不相同。使用 OHRBench，我们首先对当前的 OCR 解决方案进行全面评估，并发现没有一种解决方案能够胜任为 RAG 系统构建高质量的知识库。然后，我们系统地评估这两种噪声类型的影响，并展示 RAG 系统的脆弱性。此外，我们讨论了在 RAG 系统中采用没有 OCR 的视觉语言模型 (VLM) 的潜力。代码：这个 https URL

Title: Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

Authors: Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.02617
Pdf URL: https://arxiv.org/pdf/2412.02617
Copy Paste: [[2412.02617]] Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback(https://arxiv.org/abs/2412.02617)
Keywords: generation
Abstract: Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
摘要：大型文本转视频模型对各种下游应用具有巨大潜力。然而，这些模型难以准确描述动态对象交互，通常会导致不切实际的运动和频繁违反现实世界的物理规律。大型语言模型启发的一种解决方案是使用外部反馈将生成的输出与期望结果对齐。这使模型能够自主改进其响应，从而消除了大量的手动数据收集。在这项工作中，我们研究了使用反馈来增强文本转视频模型中的对象动态。我们旨在回答一个关键问题：哪些类型的反馈与哪些特定的自我改进算法相结合，可以最有效地改善文本视频对齐和真实的对象交互？我们首先推导出一个统一的概率目标，用于离线 RL 文本转视频模型的微调。这个观点强调了现有算法中的设计元素（如 KL 正则化和策略投影）如何作为统一框架内的特定选择出现。然后，我们使用派生方法来优化一组文本视频对齐指标（例如 CLIP 分数、光流），但注意到它们通常无法与人类对生成质量的感知相一致。为了解决这一限制，我们建议利用视觉语言模型来提供更细致入微的反馈，这些反馈专门针对视频中的物体动态。我们的实验表明，我们的方法可以有效地优化各种奖励，二进制人工智能反馈可以显著提高动态交互的视频质量，这一点得到了人工智能和人类评估的证实。值得注意的是，我们观察到，在使用来自人工智能反馈的奖励信号时，效果显著，特别是在涉及多个物体之间复杂交互和物体坠落的真实描述的场景中。

Title: Continual Learning of Personalized Generative Face Models with Experience Replay

Authors: Annie N. Wang, Luchao Qi, Roni Sengupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02627
Pdf URL: https://arxiv.org/pdf/2412.02627
Copy Paste: [[2412.02627]] Continual Learning of Personalized Generative Face Models with Experience Replay(https://arxiv.org/abs/2412.02627)
Keywords: generative
Abstract: We introduce a novel continual learning problem: how to sequentially update the weights of a personalized 2D and 3D generative face model as new batches of photos in different appearances, styles, poses, and lighting are captured regularly. We observe that naive sequential fine-tuning of the model leads to catastrophic forgetting of past representations of the individual's face. We then demonstrate that a simple random sampling-based experience replay method is effective at mitigating catastrophic forgetting when a relatively large number of images can be stored and replayed. However, for long-term deployment of these models with relatively smaller storage, this simple random sampling-based replay technique also forgets past representations. Thus, we introduce a novel experience replay algorithm that combines random sampling with StyleGAN's latent space to represent the buffer as an optimal convex hull. We observe that our proposed convex hull-based experience replay is more effective in preventing forgetting than a random sampling baseline and the lower bound.
摘要：我们引入了一个新颖的持续学习问题：随着定期拍摄具有不同外观、风格、姿势和光线的新批次照片，如何顺序更新个性化 2D 和 3D 生成人脸模型的权重。我们观察到，对模型进行简单的顺序微调会导致灾难性地遗忘个人脸部过去的表征。然后，我们证明，当可以存储和重放相对大量的图像时，一种简单的基于随机采样的经验重放方法可以有效缓解灾难性遗忘。但是，对于长期部署这些具有相对较小存储空间的模型，这种简单的基于随机采样的重放技术也会忘记过去的表征。因此，我们引入了一种新颖的经验重放算法，该算法将随机采样与 StyleGAN 的潜在空间相结合，将缓冲区表示为最佳凸包。我们观察到，我们提出的基于凸包的经验重放在防止遗忘方面比随机采样基线和下限更有效。
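
Note: the random sampling-based experience replay baseline described in the abstract can be pictured as a fixed-capacity buffer that is re-sampled whenever a new batch of photos arrives. The sketch below assumes exactly that simple scheme; the class name `RandomReplayBuffer` and its methods are hypothetical, and it does not implement the convex hull-based algorithm the paper proposes.

```python
import random

class RandomReplayBuffer:
    """Fixed-capacity buffer that keeps a uniform random subset of past items.

    Hypothetical illustration of a random-sampling replay baseline.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self._rng = random.Random(seed)

    def update(self, new_batch):
        # Pool old and new items, then keep at most `capacity` of them at random.
        pool = self.items + list(new_batch)
        if len(pool) > self.capacity:
            pool = self._rng.sample(pool, self.capacity)
        self.items = pool

    def training_set(self, new_batch):
        # Fine-tune on the new batch plus the replayed items from the buffer.
        return list(new_batch) + list(self.items)

# Usage: item names stand in for stored photos.
buf = RandomReplayBuffer(capacity=3)
buf.update(["a", "b"])
buf.update(["c", "d", "e"])
train = buf.training_set(["f"])
```

With a small `capacity`, older batches are represented by fewer and fewer items after each update, which matches the abstract's observation that this baseline forgets under tight storage.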

Title: Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

Authors: Yiftach Edelstein, Or Patashnik, Dana Cohen-Bar, Lihi Zelnik-Manor
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02631
Pdf URL: https://arxiv.org/pdf/2412.02631
Copy Paste: [[2412.02631]] Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation(https://arxiv.org/abs/2412.02631)
Keywords: generation, generative
Abstract: Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object, and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable method for high-quality 3D content creation. We demonstrate that Sharp-It enables various 3D applications, such as fast synthesis, editing, and controlled generation, while attaining high-quality assets.
摘要：文本到图像扩散模型的进步已导致快速 3D 内容创建方面取得重大进展。一种常见的方法是生成对象的一组多视图图像，然后将其重建为 3D 模型。但是，这种方法绕过了对象的原生 3D 表示的使用，因此容易出现几何伪影，并且可控性和操作能力有限。另一种方法涉及直接生成 3D 表示的原生 3D 生成模型。然而，这些模型的分辨率通常有限，导致 3D 对象质量较低。在这项工作中，我们弥合了直接生成 3D 表示的方法和从多视图图像重建 3D 对象的方法之间的质量差距。我们引入了一种称为 Sharp-It 的多视图到多视图扩散模型，它采用从低质量对象渲染的一组 3D 一致的多视图图像并丰富其几何细节和纹理。扩散模型并行地对多视图集进行操作，因为它在生成的视图之间共享特征。然后可以从丰富的多视图集中重建高质量的 3D 模型。通过利用 2D 和 3D 方法的优势，我们的方法为高质量 3D 内容创建提供了一种高效且可控的方法。我们证明了 Sharp-It 可以实现各种 3D 应用，例如快速合成、编辑和受控生成，同时获得高质量资产。

Title: AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction

Authors: Lingteng Qiu, Shenhao Zhu, Qi Zuo, Xiaodong Gu, Yuan Dong, Junfei Zhang, Chao Xu, Zhe Li, Weihao Yuan, Liefeng Bo, Guanying Chen, Zilong Dong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02684
Pdf URL: https://arxiv.org/pdf/2412.02684
Copy Paste: [[2412.02684]] AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction(https://arxiv.org/abs/2412.02684)
Keywords: generation, generative
Abstract: Generating animatable human avatars from a single image is essential for various digital human modeling applications. Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and computational inefficiencies. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. We then propose a robust method for 3D reconstruction of inconsistent images, enabling real-time rendering during inference. Specifically, we adapt a transformer-based video generation model to generate multi-view canonical pose images and normal maps, pretraining on a large-scale video dataset to improve generalization. To handle view inconsistencies, we recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian Splatting. Experiments demonstrate that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images, showcasing its effectiveness and generalization capability.
摘要：从单个图像生成可动画的人体化身对于各种数字人体建模应用至关重要。现有的 3D 重建方法通常难以捕捉可动画模型中的精细细节，而可控动画的生成方法虽然避免了显式 3D 建模，但却存在极端姿势下的视点不一致和计算效率低下的问题。在本文中，我们利用生成模型的强大功能来生成详细的多视图规范姿势图像，从而解决可动画人体重建中的歧义问题，从而解决这些挑战。然后，我们提出了一种用于不一致图像 3D 重建的稳健方法，可在推理过程中实现实时渲染。具体而言，我们采用基于 Transformer 的视频生成模型来生成多视图规范姿势图像和法线图，并在大型视频数据集上进行预训练以提高泛化能力。为了处理视图不一致问题，我们将重建问题重新定义为 4D 任务，并引入了一种使用 4D Gaussian Splatting 的高效 3D 建模方法。实验表明，我们的方法可以根据自然图像实现逼真的实时 3D 人体动画，展示了其有效性和泛化能力。

Title: SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance

Authors: Viet Nguyen, Anh Aengus Nguyen, Trung Dao, Khoi Nguyen, Cuong Pham, Toan Tran, Anh Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02687
Pdf URL: https://arxiv.org/pdf/2412.02687
Copy Paste: [[2412.02687]] SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance(https://arxiv.org/abs/2412.02687)
Keywords: generation
Abstract: Recent approaches have yielded promising results in distilling multi-step text-to-image diffusion models into one-step ones. The state-of-the-art efficient distillation technique, i.e., SwiftBrushv2 (SBv2), even surpasses the teacher model's performance with limited resources. However, our study reveals its instability when handling different diffusion model backbones due to using a fixed guidance scale within the Variational Score Distillation (VSD) loss. Another weakness of the existing one-step diffusion models is the missing support for negative prompt guidance, which is crucial in practical image generation. This paper presents SNOOPI, a novel framework designed to address these limitations by enhancing the guidance in one-step diffusion models during both training and inference. First, we effectively enhance training stability through Proper Guidance-SwiftBrush (PG-SB), which employs a random-scale classifier-free guidance approach. By varying the guidance scale of both teacher models, we broaden their output distributions, resulting in a more robust VSD loss that enables SB to perform effectively across diverse backbones while maintaining competitive performance. Second, we propose a training-free method called Negative-Away Steer Attention (NASA), which integrates negative prompts into one-step diffusion models via cross-attention to suppress undesired elements in generated images. Our experimental results show that our proposed methods significantly improve baseline models across various metrics. Remarkably, we achieve an HPSv2 score of 31.08, setting a new state-of-the-art benchmark for one-step diffusion models.
摘要：最近的方法在将多步文本到图像扩散模型提炼为一步模型方面取得了令人鼓舞的结果。最先进的高效提炼技术，即 SwiftBrushv2 (SBv2)，甚至在资源有限的情况下超越了教师模型的性能。然而，我们的研究表明，由于在变分得分提炼 (VSD) 损失中使用固定的指导尺度，它在处理不同的扩散模型主干时不稳定。现有一步扩散模型的另一个弱点是缺乏对负面提示指导的支持，这在实际图像生成中至关重要。本文介绍了 SNOOPI，这是一个新颖的框架，旨在通过在训练和推理过程中增强一步扩散模型中的指导来解决这些限制。首先，我们通过 Proper Guidance-SwiftBrush (PG-SB) 有效地提高了训练稳定性，它采用了一种随机尺度的无分类器指导方法。通过改变两个教师模型的指导尺度，我们扩大了它们的输出分布，从而产生了更稳健的 VSD 损失，使 SB 能够在不同的主干上有效运行，同时保持竞争性能。其次，我们提出了一种无需训练的方法，称为负向引导注意力 (NASA)，该方法通过交叉注意力将负向提示集成到一步扩散模型中，以抑制生成图像中的不良元素。我们的实验结果表明，我们提出的方法在各种指标上显著改善了基线模型。值得注意的是，我们获得了 31.08 的 HPSv2 分数，为一步扩散模型设定了新的最先进基准。

Title: FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Authors: Kefan Chen, Chaerin Min, Linguang Zhang, Shreyas Hampali, Cem Keskin, Srinath Sridhar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02690
Pdf URL: https://arxiv.org/pdf/2412.02690
Copy Paste: [[2412.02690]] FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation(https://arxiv.org/abs/2412.02690)
Keywords: generation
Abstract: Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single and dual hand images. To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations. Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Our model exhibits core capabilities that include the ability to repose hands, transfer hand appearance, and even synthesize novel views. This leads to zero-shot capabilities for fixing malformed hands in previously generated images, or synthesizing hand video sequences. We present extensive experiments and evaluations that demonstrate state-of-the-art performance of our method.
摘要：尽管图像生成模型取得了显著进展，但由于手部关节复杂、视点变化多端且经常发生遮挡，生成逼真的手部仍然是一项持续的挑战。我们提出了 FoundHand，这是一种用于合成单手和双手图像的大规模领域特定扩散模型。为了训练我们的模型，我们引入了 FoundHand-10M，这是一个具有 2D 关键点和分割掩码注释的大规模手部数据集。我们的见解是使用 2D 手部关键点作为通用表示，同时编码手部关节和相机视点。FoundHand 从图像对中学习以捕捉物理上合理的手部关节，通过 2D 关键点原生实现精确控制，并支持外观控制。我们的模型展示了核心功能，包括重新摆放手部、转移手部外观甚至合成新视图的能力。这带来了零样本能力，可以修复先前生成的图像中的畸形手部，或合成手部视频序列。我们进行了大量的实验和评估，证明了我们方法的先进性能。

Title: Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Authors: Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, Limin Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02692
Pdf URL: https://arxiv.org/pdf/2412.02692
Copy Paste: [[2412.02692]] Taming Scalable Visual Tokenizer for Autoregressive Image Generation(https://arxiv.org/abs/2412.02692)
Keywords: generation
Abstract: Existing vector quantization (VQ) methods struggle with scalability, largely attributed to the instability of the codebook that undergoes partial updates during training. The codebook is prone to collapse as utilization decreases, due to the progressively widening distribution gap between non-activated codes and visual features. To solve the problem, we propose Index Backpropagation Quantization (IBQ), a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. Applying a straight-through estimator on the one-hot categorical distribution between the encoded feature and codebook, all codes are differentiable and maintain a consistent latent space with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook ($2^{18}$) with high dimension ($256$) and high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, achieving competitive results on both reconstruction ($1.00$ rFID) and autoregressive visual generation ($2.05$ gFID). The code and models are available at this https URL.
摘要：现有的矢量量化 (VQ) 方法在可扩展性方面存在困难，这很大程度上归因于在训练期间经历部分更新的码本的不稳定性。由于非激活代码和视觉特征之间的分布差距逐渐扩大，随着利用率的降低，码本容易崩溃。为了解决这个问题，我们提出了索引反向传播量化 (IBQ)，这是一种用于联合优化所有码本嵌入和视觉编码器的新 VQ 方法。通过在编码特征和码本之间的独热分类分布上应用直通估计器，所有代码都是可微的并与视觉编码器保持一致的潜在空间。IBQ 支持对视觉标记器的可扩展训练，并首次实现了具有高维度 ($256$) 和高利用率的大规模码本 ($2^{18}$)。在标准 ImageNet 基准上进行的实验证明了 IBQ 的可扩展性和优越性，在重建 ($1.00$ rFID) 和自回归视觉生成 ($2.05$ gFID) 方面都取得了有竞争力的结果。代码和模型可在此 https URL 上找到。

Title: Diffusion-based Visual Anagram as Multi-task Learning

Authors: Zhiyuan Xu, Yinhe Chen, Huan-ang Gao, Weiyan Zhao, Guiyu Zhang, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02693
Pdf URL: https://arxiv.org/pdf/2412.02693
Copy Paste: [[2412.02693]] Diffusion-based Visual Anagram as Multi-task Learning(https://arxiv.org/abs/2412.02693)
Keywords: generation
Abstract: Visual anagrams are images that change appearance upon transformation, like flipping or rotation. With the advent of diffusion models, generating such optical illusions can be achieved by averaging noise across multiple views during the reverse denoising process. However, we observe two critical failure modes in this approach: (i) concept segregation, where concepts in different views are independently generated, which cannot be considered a true anagram, and (ii) concept domination, where certain concepts overpower others. In this work, we cast the visual anagram generation problem in a multi-task learning setting, where different viewpoint prompts are analogous to different tasks, and derive denoising trajectories that align well across tasks simultaneously. At the core of our designed framework are two newly introduced techniques: (i) an anti-segregation optimization strategy that promotes overlap in cross-attention maps between different concepts, and (ii) a noise vector balancing method that adaptively adjusts the influence of different tasks. Additionally, we observe that directly averaging noise predictions yields suboptimal performance because statistical properties may not be preserved, prompting us to derive a noise variance rectification method. Extensive qualitative and quantitative experiments demonstrate our method's superior ability to generate visual anagrams spanning diverse concepts.
摘要：视觉字谜是经过翻转或旋转等变换后外观会发生变化的图像。随着扩散模型的出现，可以通过在反向去噪过程中对多个视图中的噪声进行平均来实现此类视觉错觉的生成。然而，我们观察到这种方法有两种关键的失败模式：（i）概念分离，即不同视图中的概念是独立生成的，这不能被视为真正的字谜；（ii）概念支配，即某些概念压倒其他概念。在这项工作中，我们将视觉字谜生成问题置于多任务学习环境中，其中不同的视点提示类似于不同的任务，并同时得出与任务很好地一致的去噪轨迹。我们设计的框架的核心是两种新引入的技术，其中（i）一种反分离优化策略，可促进不同概念之间的交叉注意力图重叠；（ii）一种噪声矢量平衡方法，可自适应地调整不同任务的影响。此外，我们观察到直接平均噪声预测会产生次优性能，因为统计特性可能无法保留，这促使我们推导出噪声方差校正方法。大量定性和定量实验证明了我们的方法能够生成涵盖不同概念的视觉字谜。
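
Note: the abstract's remark that averaging noise predictions may not preserve statistical properties can be illustrated with a basic fact: the mean of N independent unit-variance Gaussian samples has variance 1/N. The sketch below assumes fully independent per-view noise, which real noise predictions generally are not, and the sqrt(N) rescaling is only the textbook correction for that idealized case, not necessarily the paper's rectification method. The names `num_views` and `rescaled` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_views = 4  # illustrative number of views being averaged

# Independent standard-normal "noise predictions", one row per view.
eps = rng.standard_normal((num_views, 200_000))

# Naive averaging across views shrinks the variance to about 1 / num_views.
avg = eps.mean(axis=0)

# Under the independence assumption, scaling by sqrt(num_views) restores unit variance.
rescaled = avg * np.sqrt(num_views)

print(f"variance after averaging: {avg.var():.3f}")
print(f"variance after rescaling: {rescaled.var():.3f}")
```

In this idealized setting the averaged noise has variance close to 0.25 and the rescaled noise close to 1.0; correlated predictions would call for a different correction factor.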

Title: Motion Prompting: Controlling Video Generation with Motion Trajectories

Authors: Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, Deqing Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02700
Pdf URL: https://arxiv.org/pdf/2412.02700
Copy Paste: [[2412.02700]] Motion Prompting: Controlling Video Generation with Motion Trajectories(https://arxiv.org/abs/2412.02700)
Keywords: generation, generative
Abstract: Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as motion prompts. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance. Video results are available on our webpage.