2024-11-26

Title: Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation

Authors: Yucheng Xing, Xiaodong Liu, Xin Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15199
Pdf URL: https://arxiv.org/pdf/2411.15199
Copy Paste: [[2411.15199]] Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation(https://arxiv.org/abs/2411.15199)
Keywords: generation, generative
Abstract: With the development of artificial intelligence, more and more attention has been paid to generative models, which represent creativity, a very important aspect of intelligence. In recent years, diffusion models have been studied and shown to be more reasonable and effective than previous methods. However, common diffusion frameworks suffer from controllability problems. Although some works use extra conditions to guide the diffusion process toward a specific target, these conditions control only the generation result, not the process. In this work, we propose a new adaptive framework, $\textit{Adaptively Controllable Diffusion (AC-Diff) Model}$, to automatically and fully control the generation process, including not only the type of generation result but also the length and parameters of the generation process. Both inputs and conditions are first fed into a $\textit{Conditional Time-Step (CTS) Module}$ to determine the number of steps needed for a generation. Then, according to the length of the process, the diffusion rate parameters are estimated through our $\textit{Adaptive Hybrid Noise Schedule (AHNS) Module}$. We further train the network with the corresponding adaptive sampling mechanism so that it learns to adjust itself according to the conditions, improving overall performance. To enable practical applications, AC-Diff is expected to greatly reduce the average number of generation steps and the execution time while maintaining the same performance as existing diffusion models in the literature.
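As a rough illustration of how diffusion rate parameters can be derived from the length of the process, the sketch below builds a noise schedule for a given number of steps by blending a linear and a cosine schedule. The function names, the blending weight `mix`, and the β ranges are assumptions made for illustration; this is a generic schedule blend, not the paper's AHNS module.

```python
import math

def linear_betas(num_steps, max_beta=0.999):
    """Linear schedule whose range is rescaled by 1000 / num_steps (assumed convention)."""
    scale = 1000.0 / num_steps
    start, end = scale * 1e-4, scale * 0.02
    if num_steps == 1:
        return [min(end, max_beta)]
    step = (end - start) / (num_steps - 1)
    return [min(start + i * step, max_beta) for i in range(num_steps)]

def cosine_betas(num_steps, s=0.008, max_beta=0.999):
    """Betas derived from a cosine-shaped cumulative signal curve."""
    def alpha_bar(t):
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1.0 - alpha_bar(i + 1) / alpha_bar(i), max_beta)
            for i in range(num_steps)]

def hybrid_betas(num_steps, mix=0.5):
    """Hypothetical blend: mix * linear + (1 - mix) * cosine."""
    lin = linear_betas(num_steps)
    cos = cosine_betas(num_steps)
    return [mix * a + (1.0 - mix) * b for a, b in zip(lin, cos)]

def cumulative_alphas(betas):
    """Running product of (1 - beta): the remaining signal fraction per step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

# A shorter process gets larger per-step noise so that it still ends near pure noise.
short = hybrid_betas(20)
long = hybrid_betas(200)
```

In this sketch the step count is passed in by hand; in the framework described above it would come from the CTS module.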

Title: Multimodal large language model for wheat breeding: a new exploration of smart breeding

Authors: Guofeng Yang, Yu Li, Yong He, Zhenjiang Zhou, Lingzhen Ye, Hui Fang, Yiqi Luo, Xuping Feng
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.15203
Pdf URL: https://arxiv.org/pdf/2411.15203
Copy Paste: [[2411.15203]] Multimodal large language model for wheat breeding: a new exploration of smart breeding(https://arxiv.org/abs/2411.15203)
Keywords: generation
Abstract: UAV remote sensing technology has become a key technology in crop breeding, which can achieve high-throughput and non-destructive collection of crop phenotyping data. However, the multidisciplinary nature of breeding has brought technical barriers and efficiency challenges to knowledge mining. Therefore, it is important to develop a smart breeding goal tool to mine cross-domain multimodal data. Based on different pre-trained open-source multimodal large language models (MLLMs) (e.g., Qwen-VL, InternVL, Deepseek-VL), this study used supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF) technologies to inject cross-domain knowledge into MLLMs, thereby constructing multiple multimodal large language models for wheat breeding (WBLMs). The above WBLMs were evaluated using the newly created evaluation benchmark in this study. The results showed that the WBLM constructed using SFT, RAG and RLHF technologies and InternVL2-8B has leading performance. Then, subsequent experiments were conducted using the WBLM. Ablation experiments indicated that the combination of SFT, RAG, and RLHF technologies can improve the overall generation performance, enhance the generated quality, balance the timeliness and adaptability of the generated answer, and reduce hallucinations and biases. The WBLM performed best in wheat yield prediction using cross-domain data (remote sensing, phenotyping, weather, germplasm) simultaneously, with R2 and RMSE of 0.821 and 489.254 kg/ha, respectively. Furthermore, the WBLM can generate professional decision support answers for phenotyping estimation, environmental stress assessment, target germplasm screening, cultivation technique recommendation, and seed price query tasks.
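For readers unfamiliar with the two metrics quoted for yield prediction, the following minimal sketch shows how R2 (coefficient of determination) and RMSE are computed. The yield values are invented purely for illustration and are not data from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (e.g. kg/ha)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted yields in kg/ha (illustrative only).
observed = [5200.0, 6100.0, 4800.0, 7000.0]
predicted = [5000.0, 6300.0, 5000.0, 6800.0]

print(rmse(observed, predicted))
print(r2_score(observed, predicted))
```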

Title: DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh

Authors: Jingyu Zhuang, Di Kang, Linchao Bao, Liang Lin, Guanbin Li
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.15205
Pdf URL: https://arxiv.org/pdf/2411.15205
Copy Paste: [[2411.15205]] DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh(https://arxiv.org/abs/2411.15205)
Keywords: generation
Abstract: Text-driven avatar generation has gained significant attention owing to its convenience. However, existing methods typically model the human body with all garments as a single 3D model, limiting its usability, such as clothing replacement, and reducing user control over the generation process. To overcome the limitations above, we propose DAGSM, a novel pipeline that generates disentangled human bodies and garments from the given text prompts. Specifically, we model each part (e.g., body, upper/lower clothes) of the clothed human as one GS-enhanced mesh (GSM), which is a traditional mesh attached with 2D Gaussians to better handle complicated textures (e.g., woolen, translucent clothes) and produce realistic cloth animations. During the generation, we first create the unclothed body, followed by a sequence of individual cloth generation based on the body, where we introduce a semantic-based algorithm to achieve better human-cloth and garment-garment separation. To improve texture quality, we propose a view-consistent texture refinement module, including a cross-view attention mechanism for texture style consistency and an incident-angle-weighted denoising (IAW-DE) strategy to update the appearance. Extensive experiments have demonstrated that DAGSM generates high-quality disentangled avatars, supports clothing replacement and realistic animation, and outperforms the baselines in visual quality.

Title: S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Authors: Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2411.15215
Pdf URL: https://arxiv.org/pdf/2411.15215
Copy Paste: [[2411.15215]] S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning(https://arxiv.org/abs/2411.15215)
Keywords: generation
Abstract: Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown great potential for interpreting complex biological structures and functions. However, existing antibody-specific models have a notable limitation: they lack explicit consideration of antibody structural information, despite the fact that both 1D sequence and 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporated with two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S$^2$ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S$^2$ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying crucial antibody binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S$^2$ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody-specific understanding and generation tasks. S$^2$ALM's ability to model comprehensive and generalized representations further positions it to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.

Title: Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

Authors: Yoel Zimmermann, Adib Bazgir, Zartashia Afzal, Fariha Agbere, Qianxiang Ai, Nawaf Alampara, Alexander Al-Feghali, Mehrad Ansari, Dmytro Antypov, Amro Aswad, Jiaru Bai, Viktoriia Baibakova, Devi Dutta Biswajeet, Erik Bitzek, Joshua D. Bocarsly, Anna Borisova, Andres M Bran, L. Catherine Brinson, Marcel Moran Calderon, Alessandro Canalicchio, Victor Chen, Yuan Chiang, Defne Circi, Benjamin Charmes, Vikrant Chaudhary, Zizhang Chen, Min-Hsueh Chiu, Judith Clymo, Kedar Dabhadkar, Nathan Daelman, Archit Datar, Matthew L. Evans, Maryam Ghazizade Fard, Giuseppe Fisicaro, Abhijeet Sadashiv Gangan, Janine George, Jose D. Cojal Gonzalez, Michael Götte, Ankur K. Gupta, Hassan Harb, Pengyu Hong, Abdelrahman Ibrahim, Ahmed Ilyas, Alishba Imran, Kevin Ishimwe, Ramsey Issa, Kevin Maik Jablonka, Colin Jones, Tyler R. Josephson, Greg Juhasz, Sarthak Kapoor, Rongda Kang, Ghazal Khalighinejad, Sartaaj Khan, Sascha Klawohn, Suneel Kuman, Alvin Noe Ladines, Sarom Leang, Magdalena Lederbauer, Sheng-Lun Mark Liao, Hao Liu, Xuefeng Liu, Stanley Lo, Sandeep Madireddy, Piyush Ranjan Maharana, Shagun Maheshwari, Soroush Mahjoubi, José A. Márquez, Rob Mills, Trupti Mohanty, Bernadette Mohr, Seyed Mohamad Moosavi, Alexander Moßhammer, Amirhossein D. Naghdi, Aakash Naik, Oleksandr Narykov, Hampus Näsström, Xuan Vu Nguyen, Xinyi Ni, Dana O'Connor, Teslim Olayiwola, Federico Ottomano, Aleyna Beste Ozhan, Sebastian Pagel, Chiku Parida, Jaehee Park, Vraj Patel, Elena Patyukova, Martin Hoffmann Petersen, Luis Pinto, José M. Pizarro, Dieter Plessers, Tapashree Pradhan, Utkarsh Pratiush, Charishma Puli, Andrew Qin, Mahyar Rajabi, Francesco Ricci, Elliot Risch, Martiño Ríos-García
Subjects: cs.LG, cond-mat.mtrl-sci, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2411.15221
Pdf URL: https://arxiv.org/pdf/2411.15221
Copy Paste: [[2411.15221]] Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry(https://arxiv.org/abs/2411.15221)
Keywords: generation
Abstract: Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as brief papers in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year's hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and platforms for rapid prototyping custom applications in scientific research.

Title: Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps

Authors: Jeeyung Kim, Erfan Esmaeili, Qiang Qiu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15236
Pdf URL: https://arxiv.org/pdf/2411.15236
Copy Paste: [[2411.15236]] Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps(https://arxiv.org/abs/2411.15236)
Keywords: generation
Abstract: In text-to-image diffusion models, the cross-attention map of each text token indicates the specific image regions attended. Comparing these maps of syntactically related tokens provides insights into how well the generated image reflects the text prompt. For example, in the prompt "a black car and a white clock", the cross-attention maps for "black" and "car" should focus on overlapping regions to depict a black car, while "car" and "clock" should not. Incorrect overlapping in the maps generally produces generation flaws such as missing objects and incorrect attribute binding. Our study makes two key observations on this issue in existing text-to-image models: (1) the similarity in text embeddings between different tokens, which are used as conditioning inputs, can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to faithfully capture syntactic relations already within text attention maps. As a result, such syntactic relationships can be overlooked in the cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization. Our approach leverages this inherent yet unexploited information within text attention maps to enhance image-text semantic alignment across diverse prompts, without relying on external guidance.
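The "black car / white clock" example can be made concrete with a small overlap score between cross-attention maps. The sketch below uses cosine similarity on toy 2x2 maps; the maps, the function name, and the choice of cosine similarity are assumptions for illustration and are not taken from the paper.

```python
import math

def cosine_overlap(map_a, map_b):
    """Cosine similarity between two attention maps given as nested lists."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(a * a for a in flat_a))
    norm_b = math.sqrt(sum(b * b for b in flat_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero map overlaps with nothing
    return dot / (norm_a * norm_b)

# Toy maps: "black" and "car" attend to the left column, "clock" to the right.
black = [[0.9, 0.1], [0.8, 0.0]]
car = [[0.8, 0.0], [0.9, 0.1]]
clock = [[0.0, 0.9], [0.1, 0.8]]

print(cosine_overlap(black, car))   # high: attribute and object share a region
print(cosine_overlap(car, clock))   # low: two distinct objects stay separate
```

A score like this only measures overlap; how the syntactic relations are transferred into the cross-attention module is the subject of the paper itself.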

Title: AnyText2: Visual Text Generation and Editing With Customizable Attributes

Authors: Yuxiang Tuo, Yifeng Geng, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15245
Pdf URL: https://arxiv.org/pdf/2411.15245
Copy Paste: [[2411.15245]] AnyText2: Visual Text Generation and Editing With Customizable Attributes(https://arxiv.org/abs/2411.15245)
Keywords: generation
Abstract: As the text-to-image (T2I) domain progresses, generating text that seamlessly integrates with visual content has garnered significant attention. However, even with accurate text generation, the inability to control font and color can greatly limit certain applications, and this issue remains insufficiently addressed. This paper introduces AnyText2, a novel method that enables precise control over multilingual text attributes in natural scene image generation and editing. Our approach consists of two main components. First, we propose a WriteNet+AttnX architecture that injects text rendering capabilities into a pre-trained T2I model. Compared to its predecessor, AnyText, our new approach not only enhances image realism but also achieves a 19.8% increase in inference speed. Second, we explore techniques for extracting fonts and colors from scene images and develop a Text Embedding Module that encodes these text attributes separately as conditions. As an extension of AnyText, this method allows for customization of attributes for each line of text, leading to improvements of 3.3% and 9.3% in text accuracy for Chinese and English, respectively. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method. The code and model will be made open-source in this https URL.

Title: Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

Authors: Zhiwei Jia, Yuesong Nan, Huixi Zhao, Gengdai Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15247
Pdf URL: https://arxiv.org/pdf/2411.15247
Copy Paste: [[2411.15247]] Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward(https://arxiv.org/abs/2411.15247)
Keywords: generation
Abstract: Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to timestep-distilled DMs is challenging for ultra-fast ($\le2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets image generation $\le2$ steps for reward optimization, enhancing generalizability and efficiency. LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including PPO and DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights. See our webpage at this https URL.

Title: LocRef-Diffusion:Tuning-Free Layout and Appearance-Guided Generation

Authors: Fan Deng, Yaguang Wu, Xinyang Yu, Xiangjun Huang, Jian Yang, Guangyu Yan, Qiang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15252
Pdf URL: https://arxiv.org/pdf/2411.15252
Copy Paste: [[2411.15252]] LocRef-Diffusion:Tuning-Free Layout and Appearance-Guided Generation(https://arxiv.org/abs/2411.15252)
Keywords: generation
Abstract: Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances' appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve the appearance fidelity to reference images, we employ an appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout and appearance guided generation.

Title: MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

Authors: Weijia Wu, Mingyu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15262
Pdf URL: https://arxiv.org/pdf/2411.15262
Copy Paste: [[2411.15262]] MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation(https://arxiv.org/abs/2411.15262)
Keywords: generation
Abstract: Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges through three contributions: (1) movie-length videos featuring rich, coherent storylines and multi-scene narratives, (2) consistency of character appearance and audio across scenes, and (3) a hierarchical data structure that contains high-level movie information and detailed shot-level descriptions. Experiments demonstrate that MovieBench brings some new insights and challenges, such as maintaining character ID consistency across multiple scenes for various characters. The dataset will be public and continuously maintained, aiming to advance the field of long video generation. Data can be found at: this https URL.
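One way to picture a hierarchical movie-level annotation is as nested records. The sketch below is hypothetical: the class names, the intermediate scene level, and every field name are assumptions for illustration and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    """Lowest level: a detailed description of a single shot (hypothetical fields)."""
    description: str
    characters: List[str] = field(default_factory=list)

@dataclass
class Scene:
    """Assumed middle level grouping consecutive shots."""
    summary: str
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Movie:
    """Top level: high-level movie information (hypothetical fields)."""
    title: str
    synopsis: str
    scenes: List[Scene] = field(default_factory=list)

    def characters(self) -> List[str]:
        """Collect character names across all scenes, in order of first appearance."""
        seen: List[str] = []
        for scene in self.scenes:
            for shot in scene.shots:
                for name in shot.characters:
                    if name not in seen:
                        seen.append(name)
        return seen

movie = Movie(
    title="Example Movie",
    synopsis="Two friends cross a city at night.",
    scenes=[
        Scene("Rooftop meeting", [Shot("Wide shot of the skyline", ["Ana"]),
                                  Shot("Close-up as Ben arrives", ["Ana", "Ben"])]),
        Scene("Subway ride", [Shot("Ben reads a map", ["Ben"])]),
    ],
)
print(movie.characters())
```

Keeping character names at the shot level is what makes a cross-scene consistency check, such as the character ID consistency mentioned above, straightforward to express.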

Title: Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI

Authors: Won Jun Kim, Hyungjin Chung, Jaemin Kim, Sangmin Lee, Byeongsu Sim, Jong Chul Ye
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15265
Pdf URL: https://arxiv.org/pdf/2411.15265
Copy Paste: [[2411.15265]] Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI(https://arxiv.org/abs/2411.15265)
Keywords: generation
Abstract: Gradient-based methods are a prototypical family of explainability techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrained Gradients (FreeMCG), a novel method that serves as a better basis for explaining a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model's gradient projected onto the data manifold, requiring access only to the model's outputs. We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks, counterfactual explanation and feature attribution, we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.
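The idea of estimating a gradient from model outputs alone can be illustrated with a generic ensemble estimate: perturb the input with Gaussian noise, evaluate the black-box function on each particle, and divide the input-output cross-covariance by the perturbation variance. This is a simplified sketch under stated assumptions (isotropic noise, a scalar output, hypothetical function and parameter names). It is not the FreeMCG algorithm and omits the diffusion-based projection onto the data manifold.

```python
import random

def ensemble_gradient(f, x, num_particles=20000, sigma=0.1, seed=0):
    """Estimate grad f(x) using only evaluations of f (no derivatives)."""
    rng = random.Random(seed)
    dim = len(x)
    # Ensemble of inputs perturbed with isotropic Gaussian noise.
    particles = [[xi + rng.gauss(0.0, sigma) for xi in x]
                 for _ in range(num_particles)]
    values = [f(p) for p in particles]  # black-box calls only
    mean_x = [sum(p[j] for p in particles) / num_particles for j in range(dim)]
    mean_f = sum(values) / num_particles
    # Cross-covariance between inputs and outputs, accumulated per dimension.
    cross = [0.0] * dim
    for p, v in zip(particles, values):
        for j in range(dim):
            cross[j] += (p[j] - mean_x[j]) * (v - mean_f)
    # Dividing by the known perturbation variance gives the gradient estimate.
    return [c / (num_particles * sigma ** 2) for c in cross]

# Toy black-box "model" whose true gradient is (2, -3) everywhere.
def model(p):
    return 2.0 * p[0] - 3.0 * p[1]

print(ensemble_gradient(model, [1.0, -1.0]))
```

The estimate is noisy and improves with the number of particles, which is the usual trade-off for derivative-free approaches.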

Title: EADReg: Probabilistic Correspondence Generation with Efficient Autoregressive Diffusion Model for Outdoor Point Cloud Registration

Authors: Linrui Gong, Jiuming Liu, Junyi Ma, Lihao Liu, Yaonan Wang, Hesheng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15271
Pdf URL: https://arxiv.org/pdf/2411.15271
Copy Paste: [[2411.15271]] EADReg: Probabilistic Correspondence Generation with Efficient Autoregressive Diffusion Model for Outdoor Point Cloud Registration(https://arxiv.org/abs/2411.15271)
Keywords: generation
Abstract: Diffusion models have shown great potential in the point cloud registration (PCR) task, especially for enhancing robustness in challenging cases. However, existing diffusion-based PCR methods primarily focus on instance-level scenarios and struggle with outdoor LiDAR points, where the sparsity, irregularity, and huge point scale inherent in LiDAR points pose challenges to establishing dense global point-to-point correspondences. To address this issue, we propose a novel framework named EADReg for efficient and robust registration of LiDAR point clouds based on autoregressive diffusion models. EADReg follows a coarse-to-fine registration paradigm. In the coarse stage, we employ a Bi-directional Gaussian Mixture Model (BGMM) to reject outlier points and obtain purified point cloud pairs. BGMM establishes correspondences between the Gaussian Mixture Models (GMMs) from the source and target frames, enabling reliable coarse registration based on filtered features and geometric information. In the fine stage, we treat diffusion-based PCR as an autoregressive process to generate robust point correspondences, which are then iteratively refined on upper layers. Despite common criticisms of diffusion-based methods regarding inference speed, EADReg achieves runtime comparable to convolution-based methods. Extensive experiments on the KITTI and NuScenes benchmark datasets highlight the state-of-the-art performance of our proposed method. Code will be released upon publication.

Title: Foundation Cures Personalization: Recovering Facial Personalized Models' Prompt Consistency

Authors: Yiyang Cai, Zhengkai Jiang, Yulong Liu, Chunyang Jiang, Wei Xue, Wenhan Luo, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15277
Pdf URL: https://arxiv.org/pdf/2411.15277
Copy Paste: [[2411.15277]] Foundation Cures Personalization: Recovering Facial Personalized Models' Prompt Consistency(https://arxiv.org/abs/2411.15277)
Keywords: generation
Abstract: Facial personalization represents a crucial downstream task in the domain of text-to-image generation. To preserve identity fidelity while ensuring alignment with user-defined prompts, current mainstream frameworks for facial personalization predominantly employ identity embedding mechanisms to associate identity information with textual embeddings. However, our experiments show that identity embeddings compromise the effectiveness of other tokens within the prompt, thereby hindering high prompt consistency, particularly when prompts involve multiple facial attributes. Moreover, previous works overlook the fact that their corresponding foundation models hold great potential to generate faces aligning to prompts well and can be easily leveraged to cure these ill-aligned attributes in personalized models. Building upon these insights, we propose FreeCure, a training-free framework that harnesses the intrinsic knowledge from the foundation models themselves to improve the prompt consistency of personalization models. First, by extracting cross-attention and semantic maps from the denoising process of foundation models, we identify easily localized attributes (e.g., hair, accessories, etc). Second, we enhance multiple attributes in the outputs of personalization models through a novel noise-blending strategy coupled with an inversion-based process. Our approach offers several advantages: it eliminates the need for training; it effectively facilitates the enhancement for a wide array of facial attributes in a non-intrusive manner; and it can be seamlessly integrated into existing popular personalization models. FreeCure has demonstrated significant improvements in prompt consistency across a diverse set of state-of-the-art facial personalization models while maintaining the integrity of original identity fidelity.

Title: Don't Mesh with Me: Generating Constructive Solid Geometry Instead of Meshes by Fine-Tuning a Code-Generation LLM

Authors: Maximilian Mews, Ansar Aynetdinov, Vivian Schiller, Peter Eisert, Alan Akbik
Subjects: cs.LG, cs.GR
Abstract URL: https://arxiv.org/abs/2411.15279
Pdf URL: https://arxiv.org/pdf/2411.15279
Copy Paste: [[2411.15279]] Don't Mesh with Me: Generating Constructive Solid Geometry Instead of Meshes by Fine-Tuning a Code-Generation LLM(https://arxiv.org/abs/2411.15279)
Keywords: generation
Abstract: While recent advancements in machine learning, such as LLMs, are revolutionizing software development and creative industries, they have had minimal impact on engineers designing mechanical parts, which remains largely a manual process. Existing approaches to generate 3D geometry most commonly use meshes as a 3D representation. While meshes are suitable for assets in video games or animations, they lack sufficient precision and adaptability for mechanical engineering purposes. This paper introduces a novel approach for the generation of 3D geometry that generates surface-based Constructive Solid Geometry (CSG) by leveraging a code-generation LLM. First, we create a dataset of 3D mechanical parts represented as code scripts by converting Boundary Representation geometry (BREP) into CSG-based Python scripts. Second, we create annotations in natural language using GPT-4. The resulting dataset is used to fine-tune a code-generation LLM. The fine-tuned LLM can complete geometries based on positional input and natural language in a plausible way, demonstrating geometric understanding.
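
As a rough illustration of what a CSG-style Python script can look like, the sketch below builds a part from primitives and boolean operations using simple point-membership tests. The `Solid` class and the `box` and `cylinder_z` helpers are hypothetical names chosen for this example; they are not taken from the paper's dataset or from any CAD library, and the paper's surface-based representation may differ.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Solid:
    """Hypothetical solid described by a point-membership test."""
    contains: Callable[[Point], bool]

    def union(self, other: "Solid") -> "Solid":
        return Solid(lambda p: self.contains(p) or other.contains(p))

    def intersect(self, other: "Solid") -> "Solid":
        return Solid(lambda p: self.contains(p) and other.contains(p))

    def subtract(self, other: "Solid") -> "Solid":
        return Solid(lambda p: self.contains(p) and not other.contains(p))


def box(lo: Point, hi: Point) -> Solid:
    """Axis-aligned box between corners lo and hi (inclusive)."""
    return Solid(lambda p: all(l <= c <= h for l, c, h in zip(lo, p, hi)))


def cylinder_z(cx: float, cy: float, radius: float, z0: float, z1: float) -> Solid:
    """Cylinder along the z axis, centred at (cx, cy), spanning z0..z1."""
    return Solid(
        lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= radius ** 2
        and z0 <= p[2] <= z1
    )


# A 10 x 10 x 1 plate with a hole of radius 2 drilled through its centre.
plate = box((0.0, 0.0, 0.0), (10.0, 10.0, 1.0))
hole = cylinder_z(5.0, 5.0, 2.0, -1.0, 2.0)
part = plate.subtract(hole)
```

Because the part is a short, exact expression rather than a triangle soup, a code-generation model can in principle emit or complete it token by token, which is the kind of precision the abstract argues meshes lack.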

Title: There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks

Authors: Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15288
Pdf URL: https://arxiv.org/pdf/2411.15288
Copy Paste: [[2411.15288]] There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks(https://arxiv.org/abs/2411.15288)
Keywords: generation
Abstract: The Segment Anything Model (SAM) was originally designed for label-agnostic mask generation. Does this model also possess inherent semantic understanding, of value to broader visual tasks? In this work we follow a multi-staged approach towards exploring this question. We firstly quantify SAM's semantic capabilities by comparing base image encoder efficacy under classification tasks, in comparison with established models (CLIP and DINOv2). Our findings reveal a significant lack of semantic discriminability in SAM feature representations, limiting potential for tasks that require class differentiation. This initial result motivates our exploratory study that attempts to enable semantic information via in-context learning with lightweight fine-tuning where we observe that generalisability to unseen classes remains limited. Our observations culminate in the proposal of a training-free approach that leverages DINOv2 features, towards better endowing SAM with semantic understanding and achieving instance-level class differentiation through feature-based similarity. Our study suggests that incorporation of external semantic sources provides a promising direction for the enhancement of SAM's utility with respect to complex visual tasks that require semantic understanding.

Title: Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage

Authors: Soumil Datta, Shih-Chieh Dai, Leo Yu, Guanhong Tao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15367
Pdf URL: https://arxiv.org/pdf/2411.15367
Copy Paste: [[2411.15367]] Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage(https://arxiv.org/abs/2411.15367)
Keywords: generation, generative
Abstract: Text-to-image diffusion models, such as Stable Diffusion, have shown exceptional potential in generating high-quality images. However, recent studies highlight concerns over the use of unauthorized data in training these models, which may lead to intellectual property infringement or privacy violations. A promising approach to mitigate these issues is to apply a watermark to images and subsequently check if generative models reproduce similar watermark features. In this paper, we examine the robustness of various watermark-based protection methods applied to text-to-image models. We observe that common image transformations are ineffective at removing the watermark effect. Therefore, we propose RATTAN, which leverages the diffusion process to conduct controlled image generation on the protected input, preserving the high-level features of the input while ignoring the low-level details utilized by watermarks. A small number of generated images are then used to fine-tune protected models. Our experiments on three datasets and 140 text-to-image diffusion models reveal that existing state-of-the-art protections are not robust against RATTAN.

Title: Gradient-Free Classifier Guidance for Diffusion Model Sampling

Authors: Rahul Shenoy, Zhihong Pan, Kaushik Balakrishnan, Qisen Cheng, Yongmoon Jeon, Heejune Yang, Jaewon Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15393
Pdf URL: https://arxiv.org/pdf/2411.15393
Copy Paste: [[2411.15393]] Gradient-Free Classifier Guidance for Diffusion Model Sampling(https://arxiv.org/abs/2411.15393)
Keywords: generation
Abstract: Diffusion models for image generation have demonstrated outstanding learning capabilities, effectively capturing the full distribution of the training dataset. They are known to generate wide variations in sampled images, albeit with a trade-off in image fidelity. Guided sampling methods, such as classifier guidance (CG) and classifier-free guidance (CFG), focus sampling in well-learned high-probability regions to generate images of high fidelity, but each has its limitations. CG is computationally expensive due to the use of back-propagation for classifier gradient descent, while CFG, being gradient-free, is more efficient but compromises class label alignment compared to CG. In this work, we propose an efficient guidance method that fully utilizes a pre-trained classifier without using gradient descent. By using the classifier solely in inference mode, a time-adaptive reference class label and corresponding guidance scale are determined at each time step for guided sampling. Experiments on both class-conditioned and text-to-image generation diffusion models demonstrate that the proposed Gradient-free Classifier Guidance (GFCG) method consistently improves class prediction accuracy. We also show GFCG to be complementary to other guided sampling methods like CFG. When combined with the state-of-the-art Autoguidance (ATG), without additional computational overhead, it enhances image fidelity while preserving diversity. For ImageNet 512$\times$512, we achieve a record $\text{FD}_{\text{DINOv2}}$ of 23.09, while simultaneously attaining a higher classification Precision (94.3%) compared to ATG (90.2%).
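
For readers unfamiliar with the baseline mentioned above, the sketch below shows the standard classifier-free guidance (CFG) combination of unconditional and conditional noise predictions. It is not the GFCG method itself: the abstract only states that GFCG picks a reference class label and guidance scale at each time step using a classifier in inference mode, and those details are not reproduced here. The function name `cfg_combine` is an assumption made for this example.

```python
from typing import List, Sequence


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Standard CFG: eps_uncond + scale * (eps_cond - eps_uncond), element-wise.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates further towards the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

In a time-adaptive scheme, `scale` would simply be recomputed at every sampling step before this combination is applied.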

Title: FG-CXR: A Radiologist-Aligned Gaze Dataset for Enhancing Interpretability in Chest X-Ray Report Generation

Authors: Trong Thang Pham, Ngoc-Vuong Ho, Nhat-Tan Bui, Thinh Phan, Patel Brijesh, Donald Adjeroh, Gianfranco Doretto, Anh Nguyen, Carol C. Wu, Hien Nguyen, Ngan Le
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15413
Pdf URL: https://arxiv.org/pdf/2411.15413
Copy Paste: [[2411.15413]] FG-CXR: A Radiologist-Aligned Gaze Dataset for Enhancing Interpretability in Chest X-Ray Report Generation(https://arxiv.org/abs/2411.15413)
Keywords: generation
Abstract: Developing an interpretable system for generating reports in chest X-ray (CXR) analysis is becoming increasingly crucial in Computer-aided Diagnosis (CAD) systems, enabling radiologists to comprehend the decisions made by these systems. Despite the growth of diverse datasets and methods focusing on report generation, there remains a notable gap in how closely these models' generated reports align with the interpretations of real radiologists. In this study, we tackle this challenge by first introducing the Fine-Grained CXR (FG-CXR) dataset, which provides fine-grained paired information between the captions generated by radiologists and the corresponding gaze attention heatmaps for each anatomy. Unlike existing datasets that include a raw sequence of gaze alongside a report, with significant misalignment between gaze location and report content, our FG-CXR dataset offers a finer-grained alignment between gaze attention and diagnosis transcript. Furthermore, our analysis reveals that simply applying black-box image captioning methods to generate reports cannot adequately explain which information in the CXR is utilized and how long it needs to be attended to in order to generate reports accurately. Consequently, we propose a novel explainable radiologist's attention generator network (Gen-XAI) that mimics the diagnosis process of radiologists, explicitly constraining its output to closely align with both the radiologist's gaze attention and transcript. Finally, we perform extensive experiments to illustrate the effectiveness of our method. Our datasets and checkpoint are available at this https URL.

Title: Semi-supervised Single-view 3D Reconstruction via Multi Shape Prior Fusion Strategy and Self-Attention

Authors: Wei Zhou, Xinzhe Shi, Yunfeng She, Kunlong Liu, Yongqin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15420
Pdf URL: https://arxiv.org/pdf/2411.15420
Copy Paste: [[2411.15420]] Semi-supervised Single-view 3D Reconstruction via Multi Shape Prior Fusion Strategy and Self-Attention(https://arxiv.org/abs/2411.15420)
Keywords: generation
Abstract: In the domain of single-view 3D reconstruction, traditional techniques have frequently relied on expensive and time-intensive 3D annotation data. Facing the challenge of annotation acquisition, semi-supervised learning strategies offer an innovative approach to reduce the dependence on labeled data. Despite these developments, the utilization of this learning paradigm in 3D reconstruction tasks remains relatively constrained. In this research, we created an innovative semi-supervised framework for 3D reconstruction that distinctively introduces a multi shape prior fusion strategy, intending to guide the creation of more realistic object structures. Additionally, to improve the quality of shape generation, we integrated a self-attention module into the traditional decoder. In benchmark tests on the ShapeNet dataset, our method substantially outperformed existing supervised learning methods at diverse labeled ratios of 1%, 10%, and 20%. Moreover, it showcased excellent performance on the real-world Pix3D dataset. Through comprehensive experiments on ShapeNet, our framework demonstrated a 3.3% performance improvement over the baseline. Moreover, stringent ablation studies further confirmed the notable effectiveness of our approach. Our code has been released on this https URL.

Title: Learning a local trading strategy: deep reinforcement learning for grid-scale renewable energy integration

Authors: Caleb Ju, Constance Crozier
Subjects: cs.LG, cs.AI, eess.SY, math.OC
Abstract URL: https://arxiv.org/abs/2411.15422
Pdf URL: https://arxiv.org/pdf/2411.15422
Copy Paste: [[2411.15422]] Learning a local trading strategy: deep reinforcement learning for grid-scale renewable energy integration(https://arxiv.org/abs/2411.15422)
Keywords: generation
Abstract: Variable renewable generation increases the challenge of balancing power supply and demand. Grid-scale batteries co-located with generation can help mitigate this misalignment. This paper explores the use of reinforcement learning (RL) for operating grid-scale batteries co-located with solar power. Our results show RL achieves an average of 61% (and up to 96%) of the approximate theoretical optimal (non-causal) operation, outperforming advanced control methods on average. Our findings suggest RL may be preferred when future signals are hard to predict. Moreover, RL has two significant advantages compared to simpler rules-based control: (1) solar energy is more effectively shifted towards high-demand periods, and (2) battery dispatch is more diverse across different locations, reducing potential ramping issues caused by the superposition of many similar actions.

Title: What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation

Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15435
Pdf URL: https://arxiv.org/pdf/2411.15435
Copy Paste: [[2411.15435]] What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation(https://arxiv.org/abs/2411.15435)
Keywords: generation
Abstract: While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.
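
To make the idea of checking an image against a scene graph concrete, the sketch below represents a scene graph as (subject, predicate, object) triples and computes a naive coverage score over objects and relations. This is only an illustration: it is not SGScore, which the abstract says relies on chain-of-thought reasoning by a multimodal LLM, and the function name `graph_coverage` and the idea of pre-detected object and relation sets are assumptions made for this example.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def graph_coverage(
    target: Set[Triple],
    detected_objects: Set[str],
    detected_relations: Set[Triple],
) -> float:
    """Fraction of target objects and relations that were found in the image."""
    objects = {s for s, _, _ in target} | {o for _, _, o in target}
    total = len(objects) + len(target)
    if total == 0:
        return 1.0  # nothing was requested, so nothing is missing
    hits = len(objects & detected_objects) + len(target & detected_relations)
    return hits / total


target_graph = {("cat", "on", "sofa"), ("lamp", "next to", "sofa")}
score = graph_coverage(
    target_graph,
    detected_objects={"cat", "sofa"},
    detected_relations={("cat", "on", "sofa")},
)
```

A feedback loop of the kind the abstract describes could use the missing objects and relations (here the lamp and its relation to the sofa) as the discrepancies to correct in the next refinement round.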

Title: ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance

Authors: Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15436
Pdf URL: https://arxiv.org/pdf/2411.15436
Copy Paste: [[2411.15436]] ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance(https://arxiv.org/abs/2411.15436)
Keywords: generation
Abstract: Diffusion models have shown impressive potential for talking head generation. While plausible appearance and talking effects are achieved, these methods still suffer from temporal, 3D or expression inconsistency due to error accumulation and the inherent limitation of single-image generation ability. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly applying multi-modal conditions to the diffusion process, our method learns to first model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency features and contours that vary significantly along the time axis. Using a temporally consistent diffusion module, we learn to align the TSD of the initial result to that of the video frame ground truth. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, rough head normal, and emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate a temporally stable talking head. Further, its reliable guidance compensates for the inaccuracy of other conditions, suppressing the accumulated error while improving consistency in various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms the state-of-the-art methods on the generated appearance, 3D, expression and temporal consistency. Project page: this https URL

Title: Twin Trigger Generative Networks for Backdoor Attacks against Object Detection

Authors: Zhiying Li, Zhi Liu, Guanggang Geng, Shreyank N Gowda, Shuyuan Lin, Jian Weng, Xiaobo Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15439
Pdf URL: https://arxiv.org/pdf/2411.15439
Copy Paste: [[2411.15439]] Twin Trigger Generative Networks for Backdoor Attacks against Object Detection(https://arxiv.org/abs/2411.15439)
Keywords: generative
Abstract: Object detectors, which are widely used in real-world applications, are vulnerable to backdoor attacks. This vulnerability arises because many users rely on datasets or pre-trained models provided by third parties due to constraints on data and resources. However, most research on backdoor attacks has focused on image classification, with limited investigation into object detection. Furthermore, the triggers for most existing backdoor attacks on object detection are manually generated, requiring prior knowledge and consistent patterns between the training and inference stages. This approach makes the attacks either easy to detect or difficult to adapt to various scenarios. To address these limitations, we propose novel twin trigger generative networks in the frequency domain to generate invisible triggers for implanting stealthy backdoors into models during training, and visible triggers for steady activation during inference, making the attack process difficult to trace. Specifically, for the invisible trigger generative network, we deploy a Gaussian smoothing layer and a high-frequency artifact classifier to enhance the stealthiness of backdoor implantation in object detectors. For the visible trigger generative network, we design a novel alignment loss to optimize the visible triggers so that they differ from the original patterns but still align with the malicious activation behavior of the invisible triggers. Extensive experimental results and analyses prove the possibility of using different triggers in the training stage and the inference stage, and demonstrate the attack effectiveness of our proposed visible trigger and invisible trigger generative networks, which reduce the mAP_0.5 of object detectors, including YOLOv5 and YOLOv7 with different settings, by 70.0% and 84.5%, respectively.
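
The mAP_0.5 figure quoted above counts a predicted box as a match when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below shows that IoU test for axis-aligned boxes given as (x1, y1, x2, y2); the function names are chosen for this example, and the full mAP computation (ranking by confidence, precision-recall integration, averaging over classes) is not shown.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 <= x2, y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """True when the prediction overlaps the ground truth enough to count."""
    return iou(pred, truth) >= threshold
```

A backdoor that suppresses or shifts detections lowers the number of such matches, which is how an attack shows up as a drop in mAP_0.5.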

Title: Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Authors: Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15453
Pdf URL: https://arxiv.org/pdf/2411.15453
Copy Paste: [[2411.15453]] Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy(https://arxiv.org/abs/2411.15453)
Keywords: generation
Abstract: Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. Multimodal Large Language Models (MLLMs), by contrast, show a significant gap in instruction-following capability relative to LLMs. In this study, we conduct a pilot experiment, which demonstrates that spatially down-sampling visual tokens significantly enhances the instruction-following capability of MLLMs. This is attributed to the substantial redundancy in the visual modality. However, this intuitive method severely impairs the MLLM's multimodal understanding capability. In this paper, we propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap between MLLMs and LLMs by inhibiting the influence of irrelevant visual tokens during content generation, increasing the instruction-following ability of the MLLMs while retaining their multimodal understanding capacity. In the VMTC module, the primary tokens are retained and the redundant tokens are condensed by token clustering and merging. In the CMAI process, we aggregate text-to-image attentions by text-to-text attentions to obtain a text-to-image focus score. Attention inhibition is performed on the text-image token pairs with low scores. Our comprehensive experiments on instruction-following capabilities and on five benchmarks (VQA-V2, GQA, TextVQA, MME and MMBench) demonstrate that the proposed strategy significantly enhances the instruction-following capability of MLLMs while preserving the ability to understand and process multimodal inputs.

Title: Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

Authors: Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15466
Pdf URL: https://arxiv.org/pdf/2411.15466
Copy Paste: [[2411.15466]] Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator(https://arxiv.org/abs/2411.15466)
Keywords: generation
Abstract: Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets the task as inpainting with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: this https URL

Title: SplatSDF: Boosting Neural Implicit SDF via Gaussian Splatting Fusion

Authors: Runfa Blark Li, Keito Suzuki, Bang Du, Ki Myung Brian Le, Nikolay Atanasov, Truong Nguyen
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2411.15468
Pdf URL: https://arxiv.org/pdf/2411.15468
Copy Paste: [[2411.15468]] SplatSDF: Boosting Neural Implicit SDF via Gaussian Splatting Fusion(https://arxiv.org/abs/2411.15468)
Keywords: generation
Abstract: A signed distance function (SDF) is a useful representation for continuous-space geometry and many related operations, including rendering, collision checking, and mesh generation. Hence, reconstructing SDF from image observations accurately and efficiently is a fundamental problem. Recently, neural implicit SDF (SDF-NeRF) techniques, trained using volumetric rendering, have gained a lot of attention. Compared to earlier truncated SDF (TSDF) fusion algorithms that rely on depth maps and voxelize continuous space, SDF-NeRF enables continuous-space SDF reconstruction with better geometric and photometric accuracy. However, the accuracy and convergence speed of scene-level SDF reconstruction require further improvements for many applications. With the advent of 3D Gaussian Splatting (3DGS) as an explicit representation with excellent rendering quality and speed, several works have focused on improving SDF-NeRF by introducing consistency losses on depth and surface normals between 3DGS and SDF-NeRF. However, loss-level connections alone lead to incremental improvements. We propose a novel neural implicit SDF called "SplatSDF" to fuse 3DGS and SDF-NeRF at an architecture level with significant boosts to geometric and photometric accuracy and convergence speed. Our SplatSDF relies on 3DGS as input only during training, and keeps the same complexity and efficiency as the original SDF-NeRF during inference. As of the time of submission, our method outperforms state-of-the-art SDF-NeRF models on geometric and photometric evaluation.
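
As a small reminder of what an SDF is, the sketch below gives the closed-form signed distance to a sphere and uses it for a point collision check, one of the operations the abstract lists. It assumes the common sign convention (negative inside, zero on the surface, positive outside); a neural implicit SDF replaces this closed form with a learned network, which is not shown here. The function names are chosen for this example.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def sphere_sdf(p: Point, center: Point, radius: float) -> float:
    """Signed distance from p to a sphere: ||p - center|| - radius."""
    return math.dist(p, center) - radius


def collides(p: Point, center: Point, radius: float) -> bool:
    """A point is in collision when it lies strictly inside the surface."""
    return sphere_sdf(p, center, radius) < 0.0
```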

Title: KinMo: Kinematic-aware Human Motion Understanding and Generation

Authors: Pengfei Zhang, Pinxin Liu, Hyeongwoo Kim, Pablo Garrido, Bindita Chaudhuri
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2411.15472
Pdf URL: https://arxiv.org/pdf/2411.15472
Copy Paste: [[2411.15472]] KinMo: Kinematic-aware Human Motion Understanding and Generation(https://arxiv.org/abs/2411.15472)
Keywords: generation
Abstract: Controlling human motion based on text presents an important challenge in computer vision. Traditional approaches often rely on holistic action descriptions for motion synthesis, which struggle to capture subtle movements of local body parts. This limitation restricts the ability to isolate and manipulate specific movements. To address this, we propose a novel motion representation that decomposes motion into distinct body joint group movements and interactions from a kinematic perspective. We design an automatic dataset collection pipeline that enhances the existing text-motion benchmark by incorporating fine-grained local joint-group motion and interaction descriptions. To bridge the gap between text and motion domains, we introduce a hierarchical motion semantics approach that progressively fuses joint-level interaction information into the global action-level semantics for modality alignment. With this hierarchy, we introduce a coarse-to-fine motion synthesis procedure for various generation and editing downstream applications. Our quantitative and qualitative experiments demonstrate that the proposed formulation enhances text-motion retrieval by improving joint-spatial understanding, and enables more precise joint-motion generation and control. Project Page: this https URL

Title: Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation

Authors: Junhyeok Lee, Yujin Oh, Dahyoun Lee, Hyon Keun Joh, Chul-Ho Sohn, Sung Hyun Baik, Cheol Kyu Jung, Jung Hyun Park, Kyu Sung Choi, Byung-Hoon Kim, Jong Chul Ye
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15490
Pdf URL: https://arxiv.org/pdf/2411.15490
Copy Paste: [[2411.15490]] Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation(https://arxiv.org/abs/2411.15490)
Keywords: generation
Abstract: Acute ischemic stroke (AIS) requires time-critical management, with hours of delayed intervention leading to an irreversible disability of the patient. Since diffusion weighted imaging (DWI) using the magnetic resonance image (MRI) plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods. Here, we propose paired image-domain retrieval and text-domain augmentation (PIRTA), a cross-modal retrieval-augmented generation (RAG) framework for providing clinician-interpretative AIS radiology reports with improved factuality. PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports. By exploiting the retrieved radiology reports to augment the report generation process of the query image, we show by experiments with extensive in-house and public datasets that PIRTA can accurately retrieve relevant reports from 3D DWI images. This approach enables the generation of radiology reports with significantly higher accuracy compared to direct image-to-text generation using state-of-the-art multimodal language models.
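The retrieval step described in the PIRTA abstract (finding similar DWI images and reusing their paired reports) can be sketched as a nearest-neighbour lookup. This is a minimal illustration, not the authors' implementation: the embeddings, the report strings, and the function name `retrieve_reports` are all hypothetical, and cosine similarity is an assumed choice of metric.

```python
import numpy as np

def retrieve_reports(query_emb, bank_embs, bank_reports, k=2):
    """Return the k reports whose image embeddings are closest to the query.

    Cosine similarity is an assumption; the abstract does not name a metric.
    """
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q                        # cosine similarity to every bank image
    top = np.argsort(-sims)[:k]         # indices of the k most similar images
    return [(bank_reports[i], float(sims[i])) for i in top]

# Toy bank: three image embeddings with their paired reports (placeholders).
bank_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
bank_reports = ["report A", "report B", "report C"]
query_emb = np.array([1.0, 0.0])

retrieved = retrieve_reports(query_emb, bank_embs, bank_reports, k=2)
```

In a retrieval-augmented setup, the retrieved reports would then be passed to the text generator as context, so the model never has to learn the image-to-text mapping directly.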

Title: AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

Authors: Datao Tang, Xiangyong Cao, Xuan Wu, Jialin Li, Jing Yao, Xueru Bai, Dongsheng Jiang, Yin Li, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15497
Pdf URL: https://arxiv.org/pdf/2411.15497
Copy Paste: [[2411.15497]] AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation(https://arxiv.org/abs/2411.15497)
Keywords: generation, generative
Abstract: Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e., AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on DIOR, DIOR-R, and HRSC datasets are improved by 3.7%, 4.3%, and 2.43%, respectively. The code is available at this https URL.

Title: Interactive Visual Assessment for Text-to-Image Generation Models

Authors: Xiaoyue Mi, Fan Tang, Juan Cao, Qiang Sheng, Ziyao Huang, Peng Li, Yang Liu, Tong-Yee Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15509
Pdf URL: https://arxiv.org/pdf/2411.15509
Copy Paste: [[2411.15509]] Interactive Visual Assessment for Text-to-Image Generation Models(https://arxiv.org/abs/2411.15509)
Keywords: generation, generative
Abstract: Visual generation models have achieved remarkable progress in computer graphics applications but still face significant challenges in real-world deployment. Current assessment approaches for visual generation tasks typically follow an isolated three-phase framework: test input collection, model output generation, and user assessment. These approaches suffer from fixed coverage, evolving difficulty, and data leakage risks, limiting their effectiveness in comprehensively evaluating increasingly complex generation models. To address these limitations, we propose DyEval, an LLM-powered dynamic interactive visual assessment framework that facilitates collaborative evaluation between humans and generative models for text-to-image systems. DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors, while adaptively generating hierarchical, fine-grained, and diverse textual inputs to continuously probe the capability boundaries of the models based on their feedback. Additionally, to provide interpretable analysis that helps users further improve the tested models, we develop a contextual reflection module that mines failure triggers from test inputs and reflects the model's potential failure patterns, supporting in-depth analysis through the logical reasoning ability of the LLM. Qualitative and quantitative experiments demonstrate that DyEval can effectively help users identify up to 2.56 times as many generation failures as conventional methods, and uncover complex and rare failure patterns, such as issues with pronoun generation and specific cultural context generation. Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems across various domains.

Title: MUNBa: Machine Unlearning via Nash Bargaining

Authors: Jing Wu, Mehrtash Harandi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15537
Pdf URL: https://arxiv.org/pdf/2411.15537
Copy Paste: [[2411.15537]] MUNBa: Machine Unlearning via Nash Bargaining(https://arxiv.org/abs/2411.15537)
Keywords: generation
Abstract: Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto front, effectively avoiding the gradient conflicts. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving superior performance on several benchmarks. For example, in the challenging scenario of sample-wise forgetting, our algorithm approaches the gold standard retrain baseline. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.
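The gradient-conflict issue and the bargaining idea in the MUNBa abstract can be illustrated with a two-gradient sketch. It assumes the Nash bargaining condition commonly used in multi-task learning, where the weights alpha satisfy G^T G alpha = 1/alpha; this condition is an assumption here and may differ from the closed form derived in the paper. The function name and the toy gradients are hypothetical.

```python
import numpy as np

def nash_bargain_two(g_forget, g_preserve):
    """Return weights (a1, a2) and the direction d = a1*g_forget + a2*g_preserve.

    Assumed condition (not taken from the paper): G^T G alpha = 1 / alpha,
    solved in closed form for two players. Both gradients must be nonzero.
    """
    a = float(g_forget @ g_forget)      # ||g1||^2
    b = float(g_preserve @ g_preserve)  # ||g2||^2
    c = float(g_forget @ g_preserve)    # g1 . g2 (negative means conflict)
    ratio = np.sqrt(a / b)              # the condition implies a2 = a1 * sqrt(a / b)
    denom = a + c * ratio               # positive unless gradients are exactly opposite
    if denom <= 0:
        raise ValueError("gradients are exactly opposite; no bargaining solution")
    a1 = 1.0 / np.sqrt(denom)
    a2 = a1 * ratio
    return a1, a2, a1 * g_forget + a2 * g_preserve

# Toy conflicting gradients: their inner product is negative.
g1 = np.array([1.0, 0.0])
g2 = np.array([-0.5, 1.0])
a1, a2, d = nash_bargain_two(g1, g2)
```

Under this condition the combined direction has a positive inner product with both gradients, so to first order neither objective is made worse by the update.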

Title: Large Language Model with Region-guided Referring and Grounding for CT Report Generation

Authors: Zhixuan Chen, Yequan Bie, Haibo Jin, Hao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15539
Pdf URL: https://arxiv.org/pdf/2411.15539
Copy Paste: [[2411.15539]] Large Language Model with Region-guided Referring and Grounding for CT Report Generation(https://arxiv.org/abs/2411.15539)
Keywords: generation
Abstract: Computed tomography (CT) report generation is crucial to assist radiologists in interpreting CT volumes, which can be time-consuming and labor-intensive. Existing methods primarily consider only the global features of the entire volume, which makes it hard for them to focus on specific regions and can cause them to miss abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve the local high-resolution details with little computational overhead. Then the local features are integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model's referring and grounding capabilities while also improving the report's interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code will be made publicly available.

Title: Optical-Flow Guided Prompt Optimization for Coherent Video Generation

Authors: Hyelin Nam, Jaemin Kim, Dohun Lee, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15540
Pdf URL: https://arxiv.org/pdf/2411.15540
Copy Paste: [[2411.15540]] Optical-Flow Guided Prompt Optimization for Coherent Video Generation(https://arxiv.org/abs/2411.15540)
Keywords: generation
Abstract: While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish optical flow between random pairs of frames from real videos and generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.

Title: Improving Transferable Targeted Attacks with Feature Tuning Mixup

Authors: Kaisheng Liang, Xuelong Dai, Yanjie Li, Dong Wang, Bin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15553
Pdf URL: https://arxiv.org/pdf/2411.15553
Copy Paste: [[2411.15553]] Improving Transferable Targeted Attacks with Feature Tuning Mixup(https://arxiv.org/abs/2411.15553)
Keywords: generation
Abstract: Deep neural networks exhibit vulnerability to adversarial examples that can transfer across different models. A particularly challenging problem is developing transferable targeted attacks that can mislead models into predicting specific target classes. While various methods have been proposed to enhance attack transferability, they often incur substantial computational costs while yielding limited improvements. Recent clean feature mixup methods use random clean features to perturb the feature space but lack optimization for disrupting adversarial examples, overlooking the advantages of attack-specific perturbations. In this paper, we propose Feature Tuning Mixup (FTM), a novel method that enhances targeted attack transferability by combining both random and optimized noises in the feature space. FTM introduces learnable feature perturbations and employs an efficient stochastic update strategy for optimization. These learnable perturbations facilitate the generation of more robust adversarial examples with improved transferability. We further demonstrate that attack performance can be enhanced through an ensemble of multiple FTM-perturbed surrogate models. Extensive experiments on the ImageNet-compatible dataset across various models demonstrate that our method achieves significant improvements over state-of-the-art methods while maintaining low computational cost.

Title: TKG-DM: Training-free Chroma Key Content Generation Diffusion Model

Authors: Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Takahiro Shirakawa, Ko Watanabe, Andreas Dengel, Jinjia Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15580
Pdf URL: https://arxiv.org/pdf/2411.15580
Copy Paste: [[2411.15580]] TKG-DM: Training-free Chroma Key Content Generation Diffusion Model(https://arxiv.org/abs/2411.15580)
Keywords: generation, generative
Abstract: Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.

Title: FLD+: Data-efficient Evaluation Metric for Generative Models

Authors: Pranav Jeevan, Neeraj Nixon, Amit Sethi
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15584
Pdf URL: https://arxiv.org/pdf/2411.15584
Copy Paste: [[2411.15584]] FLD+: Data-efficient Evaluation Metric for Generative Models(https://arxiv.org/abs/2411.15584)
Keywords: generative
Abstract: We introduce a new metric to assess the quality of generated images that is more reliable, data-efficient, compute-efficient, and adaptable to new domains than the previous metrics, such as Fréchet Inception Distance (FID). The proposed metric is based on normalizing flows, which allows for the computation of density (exact log-likelihood) of images from any domain. Thus, unlike FID, the proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradations, including noise, occlusion, diffusion steps, and generative model size. Additionally, because normalizing flow can be trained stably and efficiently, FLD+ achieves stable results with two orders of magnitude fewer images than FID (which requires more images to reliably compute Fréchet distance between features of large samples of real and generated images). We made FLD+ computationally even more efficient by applying normalizing flows to features extracted in a lower-dimensional latent space instead of using a pre-trained network. We also show that FLD+ can easily be retrained on new domains, such as medical images, unlike the networks behind previous metrics -- such as InceptionNetV3 pre-trained on ImageNet.

Title: Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

Authors: Yadong Qu, Yuxin Wang, Bangbang Zhou, Zixiao Wang, Hongtao Xie, Yongdong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15585
Pdf URL: https://arxiv.org/pdf/2411.15585
Copy Paste: [[2411.15585]] Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing(https://arxiv.org/abs/2411.15585)
Keywords: generation
Abstract: Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters. The limitation lies in the insufficient exploration of character morphologies, including the monotonousness of widely used synthetic training data and the sensitivity of the model to character morphologies. To address these issues, inspired by the human learning process of viewing and summarizing, we facilitate the contrastive learning-based STR framework in a self-motivated manner by leveraging synthetic and real unlabeled data without any human cost. In the viewing process, to compensate for the simplicity of synthetic data and enrich character morphology diversity, we propose an Online Generation Strategy to generate background-free samples with diverse character styles. By excluding background noise distractions, the model is encouraged to focus on character morphology and generalize the ability to recognize complex samples when trained with only simple synthetic data. To boost the summarizing process, we theoretically demonstrate the derivation error in the previous character contrastive loss, which mistakenly causes the sparsity in the intra-class distribution and exacerbates ambiguity on challenging samples. Therefore, a new Character Unidirectional Alignment Loss is proposed to correct this error and unify the representation of the same characters in all samples by aligning the character features in the student model with the reference features in the teacher model. Extensive experiment results show that our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark). Code will be available at this https URL.

Title: Fixing the Perspective: A Critical Examination of Zero-1-to-3

Authors: Jack Yu, Xueying Jia, Charlie Sun, Prince Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15706
Pdf URL: https://arxiv.org/pdf/2411.15706
Copy Paste: [[2411.15706]] Fixing the Perspective: A Critical Examination of Zero-1-to-3(https://arxiv.org/abs/2411.15706)
Keywords: generation
Abstract: Novel view synthesis is a fundamental challenge in image-to-3D generation, requiring the generation of target view images from a set of conditioning images and their relative poses. While recent approaches like Zero-1-to-3 have demonstrated promising results using conditional latent diffusion models, they face significant challenges in generating consistent and accurate novel views, particularly when handling multiple conditioning images. In this work, we conduct a thorough investigation of Zero-1-to-3's cross-attention mechanism within the Spatial Transformer of the diffusion 2D-conditional UNet. Our analysis reveals a critical discrepancy between Zero-1-to-3's theoretical framework and its implementation, specifically in the processing of image-conditional context. We propose two significant improvements: (1) a corrected implementation that enables effective utilization of the cross-attention mechanism, and (2) an enhanced architecture that can leverage multiple conditional views simultaneously. Our theoretical analysis and preliminary results suggest potential improvements in novel view synthesis consistency and accuracy.

Title: ROOT: VLM based System for Indoor Scene Understanding and Beyond

Authors: Yonghui Wang, Shi-Yong Chen, Zhenxing Zhou, Siyi Li, Haoran Li, Wengang Zhou, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15714
Pdf URL: https://arxiv.org/pdf/2411.15714
Copy Paste: [[2411.15714]] ROOT: VLM based System for Indoor Scene Understanding and Beyond(https://arxiv.org/abs/2411.15714)
Keywords: generation
Abstract: Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. Using this enriched data, we apply various training recipes to obtain the final SceneVLM. Our experiments demonstrate that ROOT facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at this https URL.

Title: Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Authors: Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, Kani Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15720
Pdf URL: https://arxiv.org/pdf/2411.15720
Copy Paste: [[2411.15720]] Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks(https://arxiv.org/abs/2411.15720)
Keywords: generation
Abstract: Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on the multi-modal semantic update using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario demonstrate that our attacking strategy can effectively mislead models to generate targeted responses using only black-box attacks without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model developments.

Title: LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration

Authors: Gaojing Zhang, Jinglun Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15740
Pdf URL: https://arxiv.org/pdf/2411.15740
Copy Paste: [[2411.15740]] LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration(https://arxiv.org/abs/2411.15740)
Keywords: restoration
Abstract: We introduce LTCF-Net, a novel network architecture designed for enhancing low-light images. Unlike Retinex-based methods, our approach utilizes two color spaces - LAB and YUV - to efficiently separate and process color information, by leveraging the separation of luminance from chromatic components in color images. In addition, our model incorporates the Transformer architecture to comprehensively understand image content while maintaining computational efficiency. To dynamically balance the brightness in output images, we also introduce a Fourier transform module that adjusts the luminance channel in the frequency domain. This mechanism can uniformly balance brightness across different regions while eliminating background noise, thereby enhancing visual quality. By combining these innovative components, LTCF-Net effectively improves low-light image quality while keeping the model lightweight. Experimental results demonstrate that our method outperforms current state-of-the-art approaches across multiple evaluation metrics and datasets, achieving more natural color restoration and a balanced brightness distribution.
摘要：我们引入了 LTCF-Net，这是一种专为增强低光图像而设计的新型网络架构。与基于 Retinex 的方法不同，我们的方法利用两个颜色空间（LAB 和 YUV）来有效地分离和处理颜色信息，利用彩色图像中亮度与色度成分的分离。此外，我们的模型结合了 Transformer 架构，可以全面理解图像内容，同时保持计算效率。为了动态平衡输出图像中的亮度，我们还引入了一个傅里叶变换模块，用于调整频域中的亮度通道。该机制可以均匀地平衡不同区域的亮度，同时消除背景噪音，从而提高视觉质量。通过结合这些创新组件，LTCF-Net 可以有效提高低光图像质量，同时保持模型轻量级。实验结果表明，我们的方法在多个评估指标和数据集上均优于当前最先进的方法，实现了更自然的色彩恢复和均衡的亮度分布。
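
As a rough illustration of adjusting luminance in the frequency domain (this is not the authors' module), the sketch below converts RGB to YUV with the standard BT.601 analog weights and scales only the DC coefficient of a luminance row's discrete Fourier transform. That shifts overall brightness while leaving pixel-to-pixel differences unchanged. The function names and the `dc_gain` parameter are hypothetical.

```python
import cmath

def rgb_to_yuv(r, g, b):
    # BT.601 analog YUV: luminance plus two chroma differences.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, 0.492 * (b - y), 0.877 * (r - y)

def dft(x):
    # Naive O(n^2) discrete Fourier transform of a real sequence.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    # Inverse transform; the input here is conjugate-symmetric, so keep the real part.
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def brighten_row(luma_row, dc_gain):
    # Hypothetical adjustment: scale only the zero-frequency (mean) term.
    spectrum = dft(luma_row)
    spectrum[0] *= dc_gain
    return idft(spectrum)

row = [0.1, 0.2, 0.1, 0.4]
brighter = brighten_row(row, dc_gain=2.0)
```

A learned module would presumably modify many frequency components; scaling the DC term alone is simply the smallest example of a frequency-domain brightness change.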

Title: Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting

Authors: Liran Nochumsohn, Michal Moshkovitz, Orly Avner, Dotan Di Castro, Omri Azencot
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15743
Pdf URL: https://arxiv.org/pdf/2411.15743
Copy Paste: [[2411.15743]] Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting(https://arxiv.org/abs/2411.15743)
Keywords: generation
Abstract: Time series forecasting is critical in numerous real-world applications, requiring accurate predictions of future values based on observed patterns. While traditional forecasting techniques work well in in-domain scenarios with ample data, they struggle when data is scarce or not available at all, motivating the emergence of zero-shot and few-shot learning settings. Recent advancements often leverage large-scale foundation models for such tasks, but these methods require extensive data and compute resources, and their performance may be hindered by ineffective learning from the available training set. This raises a fundamental question: What factors influence effective learning from data in time series forecasting? Toward addressing this, we propose using Fourier analysis to investigate how models learn from synthetic and real-world time series data. Our findings reveal that forecasters commonly suffer from poor learning from data with multiple frequencies and poor generalization to unseen frequencies, which impedes their predictive performance. To alleviate these issues, we present a novel synthetic data generation framework, designed to enhance real data or replace it completely by creating task-specific frequency information, requiring only the sampling rate of the target data. Our approach, Freq-Synth, improves the robustness of both foundation and non-foundation forecast models in zero-shot and few-shot settings, facilitating more reliable time series forecasting under limited data scenarios.
摘要：时间序列预测在许多实际应用中都至关重要，需要根据观察到的模式准确预测未来值。虽然传统的预测技术在具有充足数据的领域场景中效果很好，但在数据稀缺或根本没有数据时，它们会遇到困难，这促使零样本和少样本学习环境的出现。最近的进展通常利用大规模基础模型来完成此类任务，但这些方法需要大量数据和计算资源，并且它们的性能可能会受到现有训练集中无效学习的阻碍。这提出了一个基本问题：哪些因素会影响时间序列预测中从数据中进行的有效学习？为了解决这个问题，我们建议使用傅里叶分析来研究模型如何从合成和现实世界的时间序列数据中学习。我们的研究结果表明，预测者通常无法从具有多个频率的数据中学习，并且无法很好地泛化到看不见的频率，这阻碍了他们的预测性能。为了缓解这些问题，我们提出了一个新颖的合成数据生成框架，旨在通过创建特定于任务的频率信息来增强真实数据或完全替换它，只需要目标数据的采样率。我们的方法 Freq-Synth 提高了零样本和少样本设置中基础和非基础预测模型的稳健性，从而有助于在有限数据场景下实现更可靠的时间序列预测。
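
A minimal sketch of frequency-driven synthetic data, assuming (this is not stated in the abstract) that the sampling rate suggests a set of plausible periods and that each series is a sum of sinusoids with random amplitude and phase. `synth_series` and its parameters are hypothetical and are not the Freq-Synth implementation.

```python
import math
import random

def synth_series(length, periods, rng, noise_std=0.0):
    # One sinusoid per period (in samples), each with random amplitude and phase.
    amps = [rng.uniform(0.5, 1.5) for _ in periods]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in periods]
    series = []
    for t in range(length):
        value = sum(a * math.sin(2.0 * math.pi * t / p + ph)
                    for a, p, ph in zip(amps, periods, phases))
        series.append(value + rng.gauss(0.0, noise_std))
    return series

# Hourly data might plausibly carry daily (24) and half-daily (12) cycles.
series = synth_series(96, periods=[24, 12], rng=random.Random(0))
```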

Title: Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing

Authors: Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Charles Ling, Boyu Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15843
Pdf URL: https://arxiv.org/pdf/2411.15843
Copy Paste: [[2411.15843]] Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing(https://arxiv.org/abs/2411.15843)
Keywords: generative
Abstract: Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the inversion and invariance control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which pivots closely to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, facial expression, etc. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.
摘要：利用流变换器的大量生成先验进行免调优图像编辑需要真实的反演来将图像投影到模型域中，并需要灵活的不变性控制机制来保留非目标内容。然而，现行的扩散反演在基于流的模型中表现不佳，不变性控制无法协调各种刚性和非刚性编辑任务。为了解决这些问题，我们系统地分析了基于流变换器的 \textbf{反演和不变性} 控制。具体而言，我们发现欧拉反演与 DDIM 具有相似的结构，但更容易受到近似误差的影响。因此，我们提出了一个两阶段反演，首先改进速度估计，然后补偿剩余误差，这与模型先验密切相关并有利于编辑。同时，我们提出了不变性控制，它可以在自适应层规范化中操纵文本特征，将文本提示的变化与图像语义联系起来。该机制可以同时保留非目标内容，同时允许刚性和非刚性操作，从而实现多种编辑类型，如视觉文本、数量、面部表情等。在多种场景下的实验验证了我们的框架实现了灵活、准确的编辑，释放了流变换器在多种图像编辑方面的潜力。
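
To make the approximation error of Euler inversion concrete, here is a toy sketch with a hand-written velocity field `velocity(x, t) = -x` standing in for a learned flow model (an assumption for illustration only). The inverse step evaluates the velocity at the point it already has rather than at the unknown earlier point, so a forward pass followed by inversion does not return exactly to the start.

```python
def velocity(x, t):
    # Toy velocity field; a real flow transformer would predict this.
    return -x

def euler_forward(x, steps):
    # Integrate dx/dt = velocity(x, t) from t = 0 to t = 1.
    h = 1.0 / steps
    for k in range(steps):
        x = x + h * velocity(x, k * h)
    return x

def euler_invert(x, steps):
    # Walk back from t = 1 to t = 0, reusing the velocity at the current point.
    h = 1.0 / steps
    for k in range(steps, 0, -1):
        x = x - h * velocity(x, k * h)
    return x

def round_trip_error(x0, steps):
    return abs(euler_invert(euler_forward(x0, steps), steps) - x0)

coarse = round_trip_error(1.0, 10)
fine = round_trip_error(1.0, 100)
```

For this toy field the round trip multiplies the start value by (1 - h^2)^N, so the error shrinks as the step count grows but never vanishes. The paper's two-stage inversion is not reproduced here.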

Title: PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Authors: Teng Zhou, Xiaoyu Zhang, Yongchuan Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15867
Pdf URL: https://arxiv.org/pdf/2411.15867
Copy Paste: [[2411.15867]] PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs(https://arxiv.org/abs/2411.15867)
Keywords: generation
Abstract: Panoramic Image Generation has emerged as an important task in image generation, driven by growing demands for large-scale visuals in creative and technical applications. While diffusion models have dominated this field, they face inherent limitations, including the multilevel-coherence challenge and implementation complexity, leading to suboptimal outcomes. In this paper, we introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task. Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations. This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability. PanoLlama demonstrates its effectiveness and versatility in our experiments, achieving the best overall performance while offering flexibility for multi-scale, multi-layout, and multi-guidance generation. It overcomes the challenges that diffusion-based methods fail to address, setting a new paradigm for panoramic image generation tasks. Code is available at this https URL.
摘要：全景图像生成已成为图像生成中一项重要的任务，这得益于创意和技术应用中对大规模视觉效果日益增长的需求。虽然扩散模型主导了这一领域，但它们面临着固有的局限性，包括多级一致性挑战和实施复杂性，导致结果不理想。在本文中，我们介绍了 PanoLlama，这是一个新颖的框架，它将全景图像生成重新定义为下一个标记预测任务。基于预先训练的 LlamaGen 架构，我们以自回归的方式生成图像，并开发了一种扩展策略来处理尺寸限制。该方法以裁剪和无训练的方式与图像标记结构对齐，从而产生具有最小接缝和最大可扩展性的高质量全景图。PanoLlama 在我们的实验中展示了其有效性和多功能性，实现了最佳的整体性能，同时为多尺度、多布局和多引导生成提供了灵活性。它克服了基于扩散的方法无法解决的挑战，为全景图像生成任务树立了新典范。代码可在此 https URL 上获取。

Title: Making Images from Images: Interleaving Denoising and Transformation

Authors: Shumeet Baluja, David Marwood, Ashwin Baluja
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2411.15925
Pdf URL: https://arxiv.org/pdf/2411.15925
Copy Paste: [[2411.15925]] Making Images from Images: Interleaving Denoising and Transformation(https://arxiv.org/abs/2411.15925)
Keywords: generation
Abstract: Simply by rearranging the regions of an image, we can create a new image of any subject matter. The definition of regions is user definable, ranging from regularly and irregularly shaped blocks to concentric rings or even individual pixels. Our method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other. By learning the image transforms, we allow any source image to be pre-specified; any existing image (e.g. the Mona Lisa) can be transformed to a novel subject. We formulate this process as a constrained optimization problem and address it through interleaving the steps of image diffusion with an energy minimization step. Unlike previous methods, increasing the number of regions actually makes the problem easier and improves results. We demonstrate our approach in both pixel and latent spaces. Creative extensions, such as using infinite copies of the source image and employing multiple source images, are also given.
摘要：只需重新排列图像区域，我们就可以创建任何主题的新图像。区域的定义是用户可定义的，范围从规则和不规则形状的块、同心环，甚至单个像素。我们的方法扩展并改进了最近在生成视觉错觉方面的工作，通过同时学习图像的内容以及将所需图像相互转换所需的参数化转换。通过学习图像变换，我们可以预先指定任何源图像；任何现有图像（例如蒙娜丽莎）都可以转换为新主题。我们将此过程表述为约束优化问题，并通过将图像扩散步骤与能量最小化步骤交错来解决它。与以前的方法不同，增加区域数量实际上使问题更容易并改善结果。我们在像素和潜在空间中展示了我们的方法。还给出了创造性的扩展，例如使用源图像的无限副本和使用多个源图像。
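
The rearrangement itself can be sketched as a permutation of fixed-size square blocks. This shows only how a given permutation is applied; learning the permutation jointly with diffusion, as the abstract describes, is not attempted. All names are hypothetical, and the image dimensions are assumed to be multiples of the block size.

```python
def split_blocks(img, b):
    # Cut an H x W image (list of rows) into b x b blocks in row-major order.
    h, w = len(img), len(img[0])
    return [[row[c:c + b] for row in img[r:r + b]]
            for r in range(0, h, b) for c in range(0, w, b)]

def join_blocks(blocks, h, w, b):
    # Paste blocks back into an H x W image in row-major order.
    img = [[None] * w for _ in range(h)]
    per_row = w // b
    for i, block in enumerate(blocks):
        r0, c0 = (i // per_row) * b, (i % per_row) * b
        for dr, row in enumerate(block):
            for dc, val in enumerate(row):
                img[r0 + dr][c0 + dc] = val
    return img

def rearrange(img, b, perm):
    # Output block i is taken from source block perm[i].
    blocks = split_blocks(img, b)
    return join_blocks([blocks[j] for j in perm], len(img), len(img[0]), b)

source = [[4 * r + c for c in range(4)] for r in range(4)]
shuffled = rearrange(source, 2, [3, 2, 1, 0])
```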

Title: Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors

Authors: Soumava Paul, Prakhar Kaushik, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15966
Pdf URL: https://arxiv.org/pdf/2411.15966
Copy Paste: [[2411.15966]] Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors(https://arxiv.org/abs/2411.15966)
Keywords: generative
Abstract: In this work, we introduce a generative approach for pose-free reconstruction of $360^{\circ}$ scenes from a limited number of uncalibrated 2D images. Pose-free scene reconstruction from incomplete, unposed observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of unbounded scenes with known camera poses using diffusion priors, these methods rely on explicit camera embeddings for extrapolating unobserved regions. This reliance limits their application in pose-free settings, where view-specific data is only implicitly available. To address this, we propose an instruction-following RGBD diffusion model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We also propose a novel confidence measure for Gaussian representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent Gaussian representation. Evaluations on the MipNeRF360 dataset demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed reconstruction methods in complex $360^{\circ}$ scenes.
摘要：在这项工作中，我们介绍了一种从有限数量的未校准二维图像中无姿势重建 $360^{\circ}$ 场景的生成方法。从不完整、未摆姿势的观测中进行无姿势场景重建通常使用深度估计或 3D 基础先验进行正则化。虽然最近的进展已经能够使用扩散先验对具有已知相机姿势的无界场景进行稀疏视图重建，但这些方法依赖于显式相机嵌入来推断未观察到的区域。这种依赖限制了它们在无姿势设置中的应用，其中视图特定数据仅隐式可用。为了解决这个问题，我们提出了一种遵循指令的 RGBD 扩散模型，旨在修复缺失的细节并去除 3D 场景的新视图渲染和深度图中的伪影。我们还提出了一种新的高斯表示置信度测量，以便更好地检测这些伪影。通过在高斯 SLAM 启发过程中逐步整合这些新视图，我们实现了多视图一致的高斯表示。在 MipNeRF360 数据集上的评估表明，我们的方法超越了现有的无姿势技术，并且在复杂的 $360^{\circ}$ 场景中的表现可与最先进的姿势重建方法相媲美。

Title: CNNs for Style Transfer of Digital to Film Photography

Authors: Pierre Mackenzie, Mika Senghaas, Raphael Achddou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15967
Pdf URL: https://arxiv.org/pdf/2411.15967
Copy Paste: [[2411.15967]] CNNs for Style Transfer of Digital to Film Photography(https://arxiv.org/abs/2411.15967)
Keywords: generation
Abstract: Deep learning has seen increasing use in stylistic effect generation over recent years. In this work, we use simple convolutional neural networks to model Cinestill800T film given a digital input. We test the effect of different loss functions, the addition of an input noise channel and the use of random scales of patches during training. We find that a combination of MSE/VGG loss gives the best colour production and that some grain can be produced, but it is not of a high quality, and no halation is produced. We contribute our dataset of aligned paired images taken with a film and digital camera for further work.
摘要：近年来，深度学习在风格效果生成中的应用越来越广泛。在这项工作中，我们使用简单的卷积神经网络对给定数字输入的 Cinestill800T 胶片进行建模。我们在训练期间测试了不同损失函数、添加输入噪声通道和使用随机比例的补丁的效果。我们发现 MSE/VGG 损失的组合可以产生最佳的色彩效果，虽然会产生一些颗粒，但质量不高，也不会产生光晕。我们将用胶片和数码相机拍摄的对齐配对图像数据集贡献出来，以供进一步研究。

Title: From Dashcam Videos to Driving Simulations: Stress Testing Automated Vehicles against Rare Events

Authors: Yan Miao, Georgios Fainekos, Bardh Hoxha, Hideki Okamoto, Danil Prokhorov, Sayan Mitra
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16027
Pdf URL: https://arxiv.org/pdf/2411.16027
Copy Paste: [[2411.16027]] From Dashcam Videos to Driving Simulations: Stress Testing Automated Vehicles against Rare Events(https://arxiv.org/abs/2411.16027)
Keywords: generation
Abstract: Testing Automated Driving Systems (ADS) in simulation with realistic driving scenarios is important for verifying their performance. However, converting real-world driving videos into simulation scenarios is a significant challenge due to the complexity of interpreting high-dimensional video data and the time-consuming nature of precise manual scenario reconstruction. In this work, we propose a novel framework that automates the conversion of real-world car crash videos into detailed simulation scenarios for ADS testing. Our approach leverages prompt-engineered Video Language Models (VLMs) to transform dashcam footage into SCENIC scripts, which define the environment and driving behaviors in the CARLA simulator, enabling the generation of realistic simulation scenarios. Importantly, rather than solely aiming for one-to-one scenario reconstruction, our framework focuses on capturing the essential driving behaviors from the original video while offering flexibility in parameters such as weather or road conditions to facilitate search-based testing. Additionally, we introduce a similarity metric that helps iteratively refine the generated scenario through feedback by comparing key features of driving behaviors between the real and simulated videos. Our preliminary results demonstrate substantial time efficiency, finishing the real-to-sim conversion in minutes with full automation and no human intervention, while maintaining high fidelity to the original driving events.
摘要：使用真实驾驶场景在模拟中测试自动驾驶系统 (ADS) 对于验证其性能非常重要。然而，将真实世界的驾驶视频转换为模拟场景是一项重大挑战，因为解释高维视频数据的复杂性以及精确的手动场景重建耗时性。在这项工作中，我们提出了一个新颖的框架，可以自动将真实世界的车祸视频转换为详细的模拟场景以进行 ADS 测试。我们的方法利用快速设计的视频语言模型 (VLM) 将行车记录仪镜头转换为 SCENIC 脚本，这些脚本定义了 CARLA 模拟器中的环境和驾驶行为，从而能够生成真实的模拟场景。重要的是，我们的框架不仅仅针对一对一的场景重建，而是专注于从原始视频中捕捉基本的驾驶行为，同时在天气或道路状况等参数方面提供灵活性，以促进基于搜索的测试。此外，我们引入了一个相似性度量，通过比较真实视频和模拟视频之间驾驶行为的关键特征，通过反馈帮助迭代地改进生成的场景。我们的初步结果证明了显著的时间效率，在几分钟内完成真实到模拟的转换，完全自动化，无需人工干预，同时保持了对原始驾驶事件的高保真度。

Title: Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models

Authors: Donggeun Ko, Dongjun Lee, Namjun Park, Wonkyeong Shim, Jaekwang Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16079
Pdf URL: https://arxiv.org/pdf/2411.16079
Copy Paste: [[2411.16079]] Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models(https://arxiv.org/abs/2411.16079)
Keywords: generation, generative
Abstract: Neural networks struggle with image classification when they learn biases and misleading correlations, which affects their generalization and performance. Previous methods require attribute labels (e.g. background, color) or utilize Generative Adversarial Networks (GANs) to mitigate biases. We introduce DiffuBias, a novel pipeline for text-to-image generation that enhances classifier robustness by generating bias-conflict samples, without requiring training during the generation phase. Utilizing pretrained diffusion and image captioning models, DiffuBias generates images that challenge the biases of classifiers, using the top-$K$ losses from a biased classifier ($f_B$) to create more representative data samples. This method not only debiases effectively but also boosts classifier generalization capabilities. To the best of our knowledge, DiffuBias is the first approach leveraging a stable diffusion model to generate bias-conflict samples in debiasing tasks. Our comprehensive experimental evaluations demonstrate that DiffuBias achieves state-of-the-art performance on benchmark datasets. We also conduct a comparative analysis of various generative models in terms of carbon emissions and energy consumption to highlight the significance of computational efficiency.
摘要：当神经网络学习到偏差并误导相关性时，神经网络很难进行图像分类，从而影响其泛化和性能。以前的方法需要属性标签（例如背景、颜色）或利用生成对抗网络 (GAN) 来减轻偏差。我们引入了 DiffuBias，这是一种用于文本到图像生成的新型管道，它通过生成偏差冲突样本来增强分类器的鲁棒性，而无需在生成阶段进行训练。利用预先训练的扩散和图像字幕模型，DiffuBias 生成挑战分类器偏差的图像，使用有偏差分类器 ($f_B$) 的 top-$K$ 损失来创建更具代表性的数据样本。这种方法不仅可以有效地消除偏差，还可以提高分类器的泛化能力。据我们所知，DiffuBias 是第一种利用稳定扩散模型在消除偏差任务中生成偏差冲突样本的方法。我们全面的实验评估表明，DiffuBias 在基准数据集上实现了最先进的性能。我们还从碳排放和能源消耗的角度对各种生成模型进行了比较分析，以强调计算效率的重要性。
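
The top-$K$ loss selection step can be sketched as follows, assuming the biased classifier $f_B$ outputs class probabilities and the loss is cross-entropy (the abstract does not specify either). `top_k_loss_indices` is a hypothetical helper; the captioning and diffusion stages are not shown.

```python
import heapq
import math

def cross_entropy(probs, label):
    # Negative log-likelihood of the true label; clamp to avoid log(0).
    return -math.log(max(probs[label], 1e-12))

def top_k_loss_indices(all_probs, labels, k):
    # Indices of the k samples the classifier handles worst, highest loss first.
    losses = [cross_entropy(p, y) for p, y in zip(all_probs, labels)]
    return heapq.nlargest(k, range(len(losses)), key=losses.__getitem__)

probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.05, 0.95]]
labels = [0, 0, 1, 0]
hardest = top_k_loss_indices(probs, labels, k=2)
```

High-loss samples are where a biased classifier's shortcut fails, which is why they are natural candidates for bias-conflict examples.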

Title: Boosting 3D Object Generation through PBR Materials

Authors: Yitong Wang, Xudong Xu, Li Ma, Haoran Wang, Bo Dai
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16080
Pdf URL: https://arxiv.org/pdf/2411.16080
Copy Paste: [[2411.16080]] Boosting 3D Object Generation through PBR Materials(https://arxiv.org/abs/2411.16080)
Keywords: generation
Abstract: Automatic 3D content creation has gained increasing attention recently, due to its potential in various applications such as video games, the film industry, and AR/VR. Recent advancements in diffusion models and multimodal models have notably improved the quality and efficiency of 3D object generation given a single RGB image. However, 3D objects generated even by state-of-the-art methods are still unsatisfactory compared to human-created assets. Considering only textures instead of materials makes these methods encounter challenges in photo-realistic rendering, relighting, and flexible appearance editing. They also suffer from severe misalignment between geometry and high-frequency texture details. In this work, we propose a novel approach to boost the quality of generated 3D objects from the perspective of Physics-Based Rendering (PBR) materials. By analyzing the components of PBR materials, we choose to consider albedo, roughness, metalness, and bump maps. For albedo and bump maps, we leverage Stable Diffusion fine-tuned on synthetic data to extract these values, with novel usages of these fine-tuned models to obtain 3D consistent albedo UV and bump UV for generated objects. In terms of roughness and metalness maps, we adopt a semi-automatic process to provide room for interactive adjustment, which we believe is more practical. Extensive experiments demonstrate that our model is generally beneficial for various state-of-the-art generation methods, significantly boosting the quality and realism of their generated 3D objects, with natural relighting effects and substantially improved geometry.
摘要：自动 3D 内容创建最近受到越来越多的关注，因为它在视频游戏、电影行业和 AR/VR 等各种应用领域中都具有潜力。扩散模型和多模态模型的最新进展显着提高了给定单个 RGB 图像的 3D 对象生成的质量和效率。然而，即使是通过最先进的方法生成的 3D 对象与人造资产相比仍然不令人满意。只考虑纹理而不是材质使得这些方法在照片级真实感渲染、重新照明和灵活的外观编辑方面遇到挑战。而且它们还存在几何形状和高频纹理细节之间严重错位的问题。在这项工作中，我们提出了一种从基于物理的渲染 (PBR) 材料的角度提高生成的 3D 对象质量的新方法。通过分析 PBR 材料的成分，我们选择考虑反照率、粗糙度、金属性和凹凸贴图。对于反照率和凹凸贴图，我们利用在合成数据上微调的稳定扩散来提取这些值，并采用这些微调模型的新颖用法来为生成的对象获取 3D 一致的反照率 UV 和凹凸 UV。在粗糙度和金属度贴图方面，我们采用半自动化流程来提供交互式调整的空间，我们认为这更为实用。大量实验表明，我们的模型通常对各种最先进的生成方法都有好处，可以显著提高其生成的 3D 对象的质量和真实感，具有自然的重新照明效果和显著改善的几何形状。

Title: AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity

Authors: Jili Xia, Lihuo He, Fei Gao, Kaifan Zhang, Leida Li, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16087
Pdf URL: https://arxiv.org/pdf/2411.16087
Copy Paste: [[2411.16087]] AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity(https://arxiv.org/abs/2411.16087)
Keywords: generative, quality assessment
Abstract: Recently, AI-generated images (AIGIs) created by given prompts (initial prompts) have garnered widespread attention. Nevertheless, due to technical limitations, they often suffer from poor perception quality and Text-to-Image misalignment. Therefore, assessing the perception quality and alignment quality of AIGIs is crucial to improving the generative model's performance. Existing assessment methods overly rely on the initial prompts in the task prompt design and use the same prompts to guide both perceptual and alignment quality evaluation, overlooking the distinctions between the two tasks. To address this limitation, we propose a novel quality assessment method for AIGIs named TSP-MGS, which designs task-specific prompts and measures multi-granularity similarity between AIGIs and the prompts. Specifically, task-specific prompts are first constructed to describe perception and alignment quality degrees separately, and the initial prompt is introduced for detailed quality perception. Then, the coarse-grained similarity between AIGIs and task-specific prompts is calculated, which facilitates holistic quality awareness. In addition, to improve the understanding of AIGI details, the fine-grained similarity between the image and the initial prompt is measured. Finally, precise quality prediction is acquired by integrating the multi-granularity similarities. Experiments on the commonly used AGIQA-1K and AGIQA-3K benchmarks demonstrate the superiority of the proposed TSP-MGS.
摘要：最近，通过给定的提示（初始提示）创建的人工智能生成图像 (AIGI) 引起了广泛关注。然而，由于技术不熟练，它们通常存在感知质量差和文本到图像错位的问题。因此，评估 AIGI 的感知质量和对齐质量对于提高生成模型的性能至关重要。现有的评估方法在任务提示设计中过度依赖初始提示，并使用相同的提示来指导感知和对齐质量评估，忽略了这两个任务之间的区别。为了解决这一限制，我们提出了一种名为 TSP-MGS 的新型 AIGI 质量评估方法，该方法设计特定于任务的提示并测量 AIGI 和提示之间的多粒度相似性。具体而言，首先构建特定于任务的提示来分别描述感知和对齐质量程度，并引入初始提示以进行详细的质量感知。然后，计算 AIGI 与特定于任务的提示之间的粗粒度相似性，这有助于整体质量意识。此外，为了提高对AIGI细节的理解，测量了图像与初始提示之间的细粒度相似性。最后，通过整合多粒度相似性获得精确的质量预测。在常用的AGIQA-1K和AGIQA-3K基准上的实验证明了所提出的TSP-MGS的优越性。

Title: Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain

Authors: Hangyul Yoon, Doohyuk Jang, Jungeun Kim, Eunho Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16123
Pdf URL: https://arxiv.org/pdf/2411.16123
Copy Paste: [[2411.16123]] Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain(https://arxiv.org/abs/2411.16123)
Keywords: generation
Abstract: Leveraging pre-trained models with tailored prompts for in-context learning has proven highly effective in NLP tasks. Building on this success, recent studies have applied a similar approach to the Segment Anything Model (SAM) within a "one-shot" framework, where only a single reference image and its label are employed. However, these methods face limitations in the medical domain, primarily due to SAM's essential requirement for visual prompts and the over-reliance on pixel similarity for generating them. This dependency may lead to (1) inaccurate prompt generation and (2) clustering of point prompts, resulting in suboptimal outcomes. To address these challenges, we introduce Med-PerSAM, a novel and straightforward one-shot framework designed for the medical domain. Med-PerSAM uses only visual prompt engineering and eliminates the need for additional training of the pretrained SAM or human intervention, owing to our novel automated prompt generation process. By integrating our lightweight warping-based prompt tuning model with SAM, we enable the extraction and iterative refinement of visual prompts, enhancing the performance of the pre-trained SAM. This advancement is particularly meaningful in the medical domain, where creating visual prompts poses notable challenges for individuals lacking medical expertise. Our model outperforms various foundational models and previous SAM-based approaches across diverse 2D medical imaging datasets.
摘要：利用预先训练的模型和定制的提示进行情境学习已被证明在 NLP 任务中非常有效。基于这一成功，最近的研究在“一次性”框架内将类似的方法应用于 Segment Anything Model (SAM)，其中仅使用单个参考图像及其标签。然而，这些方法在医学领域面临限制，主要是因为 SAM 对视觉提示的基本要求以及过度依赖像素相似性来生成它们。这种依赖性可能导致 (1) 提示生成不准确和 (2) 点提示聚类，从而导致结果不理想。为了应对这些挑战，我们引入了 \textbf{Med-PerSAM}，这是一种专为医学领域设计的新颖而直接的一次性框架。由于我们新颖的自动提示生成过程，Med-PerSAM 仅使用视觉提示工程，并且无需对预训练的 SAM 进行额外训练或人工干预。通过将我们基于变形的轻量级提示调整模型与 SAM 集成，我们可以提取和迭代细化视觉提示，从而提高预训练 SAM 的性能。这一进步在医学领域尤其有意义，因为对于缺乏医学专业知识的人来说，创建视觉提示是一项重大挑战。我们的模型在各种 2D 医学成像数据集上的表现优于各种基础模型和之前基于 SAM 的方法。

Title: TreeFormer: Single-view Plant Skeleton Estimation via Tree-constrained Graph Generation

Authors: Xinpeng Liu, Hiroaki Santo, Yosuke Toda, Fumio Okura
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16132
Pdf URL: https://arxiv.org/pdf/2411.16132
Copy Paste: [[2411.16132]] TreeFormer: Single-view Plant Skeleton Estimation via Tree-constrained Graph Generation(https://arxiv.org/abs/2411.16132)
Keywords: generation
Abstract: Accurate estimation of plant skeletal structure (e.g., branching structure) from images is essential for smart agriculture and plant science. Unlike human skeletons with fixed topology, plant skeleton estimation presents a unique challenge, i.e., estimating arbitrary tree graphs from images. While recent graph generation methods successfully infer thin structures from images, it is challenging to constrain the output graph strictly to a tree structure. To address this problem, we present TreeFormer, a plant skeleton estimator via tree-constrained graph generation. Our approach combines learning-based graph generation with traditional graph algorithms to impose the constraints during the training loop. Specifically, our method projects an unconstrained graph onto a minimum spanning tree (MST) during the training loop and incorporates this prior knowledge into the gradient descent optimization by suppressing unwanted feature values. Experiments show that our method accurately estimates target plant skeletal structures for multiple domains: synthetic tree patterns, real botanical roots, and grapevine branches. Our implementations are available at this https URL.
摘要：从图像中准确估计植物骨架结构（例如，分支结构）对于智能农业和植物科学至关重要。与具有固定拓扑结构的人类骨骼不同，植物骨架估计提出了一个独特的挑战，即从图像中估计任意树形图。虽然最近的图形生成方法成功地从图像中推断出细结构，但将输出图严格限制为树结构具有挑战性。针对这个问题，我们提出了 TreeFormer，一种通过树约束图形生成的植物骨架估计器。我们的方法将基于学习的图形生成与传统图形算法相结合，以在训练循环期间施加约束。具体而言，我们的方法在训练循环期间将不受约束的图形投影到最小生成树 (MST) 上，并通过抑制不需要的特征值将这些先验知识纳入梯度下降优化中。实验表明，我们的方法可以准确估计多个领域的目标植物骨架结构：合成树形图案、真实植物根和葡萄藤树枝。我们的实现可在此 https URL 上找到。
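
Projecting an unconstrained graph onto a minimum spanning tree can be sketched with Kruskal's algorithm. The sketch assumes each predicted edge carries a cost (for example, one minus its predicted probability); that cost definition and the function name are assumptions, and the feature-suppression step in the training loop is not shown.

```python
def project_to_tree(num_nodes, edges):
    # edges: iterable of (cost, u, v). Returns the edges kept by Kruskal's algorithm.
    parent = list(range(num_nodes))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for cost, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # keep the edge only if it does not close a cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree

predicted = [(0.1, 0, 1), (0.2, 1, 2), (0.9, 0, 2), (0.3, 2, 3), (0.8, 1, 3)]
skeleton = project_to_tree(4, predicted)
```

If the predicted graph is disconnected, the result is a spanning forest rather than a single tree.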

Title: Context Awareness Gate For Retrieval Augmented Generation

Authors: Mohammad Hassan Heydari, Arshia Hemmat, Erfan Naman, Afsaneh Fatemi
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2411.16133
Pdf URL: https://arxiv.org/pdf/2411.16133
Copy Paste: [[2411.16133]] Context Awareness Gate For Retrieval Augmented Generation(https://arxiv.org/abs/2411.16133)
Keywords: generation
Abstract: Retrieval Augmented Generation (RAG) has emerged as a widely adopted approach to mitigate the limitations of large language models (LLMs) in answering domain-specific questions. Previous research has predominantly focused on improving the accuracy and quality of retrieved data chunks to enhance the overall performance of the generation pipeline. However, despite ongoing advancements, the critical issue of retrieving irrelevant information -- which can impair the ability of the model to utilize its internal knowledge effectively -- has received minimal attention. In this work, we investigate the impact of retrieving irrelevant information in open-domain question answering, highlighting its significant detrimental effect on the quality of LLM outputs. To address this challenge, we propose the Context Awareness Gate (CAG) architecture, a novel mechanism that dynamically adjusts the LLMs' input prompt based on whether the user query necessitates external context retrieval. Additionally, we introduce the Vector Candidates method, a core mathematical component of CAG that is statistical, LLM-independent, and highly scalable. We further examine the distributions of relationships between contexts and questions, presenting a statistical analysis of these distributions. This analysis can be leveraged to enhance the context retrieval process in Retrieval Augmented Generation (RAG) systems.
摘要：检索增强生成 (RAG) 已成为一种广泛采用的方法，用于缓解大型语言模型 (LLM) 在回答特定领域问题方面的局限性。先前的研究主要集中于提高检索数据块的准确性和质量，以提高生成管道的整体性能。然而，尽管不断取得进展，但检索不相关信息这一关键问题（这可能会损害模型有效利用其内部知识的能力）却很少受到关注。在这项工作中，我们调查了检索不相关信息在开放域问答中的影响，强调了其对 LLM 输出质量的重大不利影响。为了应对这一挑战，我们提出了上下文感知门 (CAG) 架构，这是一种新颖的机制，可根据用户查询是否需要外部上下文检索来动态调整 LLM 的输入提示。此外，我们还引入了向量候选方法，这是 CAG 的一个核心数学组件，具有统计性、独立于 LLM 且高度可扩展。我们进一步研究了上下文和问题之间的关系分布，并对这些分布进行了统计分析。可以利用此分析来增强检索增强生成 (RAG) 系统中的上下文检索过程。
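
A heavily simplified sketch of a retrieval gate, assuming the decision is made by thresholding the best cosine similarity between a query embedding and candidate embeddings. The abstract does not describe the Vector Candidates method in detail, so the threshold rule, the function names, and the prompt format are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_prompt(question, query_vec, candidate_vecs, contexts, threshold=0.5):
    # Attach context only when some candidate is similar enough to the query.
    sims = [cosine(query_vec, c) for c in candidate_vecs]
    if sims and max(sims) >= threshold:
        best = contexts[sims.index(max(sims))]
        return f"Context: {best}\nQuestion: {question}"
    return f"Question: {question}"

candidates = [[1.0, 0.0], [0.0, 1.0]]
contexts = ["doc A", "doc B"]
with_context = build_prompt("q1", [0.9, 0.1], candidates, contexts)
without_context = build_prompt("q2", [-1.0, -1.0], candidates, contexts)
```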

Title: MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Authors: Chenjie Cao, Chaohui Yu, Shang Liu, Fan Wang, Xiangyang Xue, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16157
Pdf URL: https://arxiv.org/pdf/2411.16157
Copy Paste: [[2411.16157]] MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model(https://arxiv.org/abs/2411.16157)
Keywords: generation
Abstract: We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on arbitrary reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset comprising up to 1.2 million scenes, equipped with well-aligned metric depth. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released at this https URL.
摘要：我们推出了 MVGenMaster，这是一种通过 3D 先验增强的多视图扩散模型，可解决多功能新视图合成 (NVS) 任务。MVGenMaster 利用使用度量深度和相机姿势扭曲的 3D 先验，显着增强了 NVS 中的泛化和 3D 一致性。我们的模型具有简单而有效的管道，可以通过单个前向过程生成多达 100 个以任意参考视图和相机姿势为条件的新视图。此外，我们还开发了一个全面的大规模多视图图像数据集，其中包含多达 120 万个场景，并配备了高度对齐的度量深度。此外，我们提出了几种训练和模型修改，以使用扩大的数据集来增强模型。在域内和域外基准测试中的广泛评估证明了我们提出的方法和数据公式的有效性。模型和代码将在此 https URL 上发布。
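
Warping with metric depth and camera poses rests on standard pinhole geometry: unproject a pixel to 3D using its depth, move it into the target camera frame, and reproject. The sketch below assumes a translation-only pose change and shared intrinsics to stay short; a full implementation would also apply a rotation. The function name and argument layout are hypothetical.

```python
def warp_pixel(u, v, depth, fx, fy, cx, cy, translation):
    # Unproject pixel (u, v) with metric depth into the source camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Move the point into the target camera frame (rotation omitted by assumption).
    tx, ty, tz = translation
    x, y, z = x + tx, y + ty, z + tz
    # Project with the same pinhole intrinsics.
    return fx * x / z + cx, fy * y / z + cy

same = warp_pixel(320.0, 240.0, 2.0, 100.0, 100.0, 320.0, 240.0, (0.0, 0.0, 0.0))
shifted = warp_pixel(320.0, 240.0, 2.0, 100.0, 100.0, 320.0, 240.0, (0.5, 0.0, 0.0))
```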

Title: Text-to-Image Synthesis: A Decade Survey

Authors: Nonghai Zhang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16164
Pdf URL: https://arxiv.org/pdf/2411.16164
Copy Paste: [[2411.16164]] Text-to-Image Synthesis: A Decade Survey(https://arxiv.org/abs/2411.16164)
Keywords: generation, generative
Abstract: When humans read a specific text, they often visualize the corresponding images, and we hope that computers can do the same. Text-to-image synthesis (T2I), which focuses on generating high-quality images from textual descriptions, has become a significant aspect of Artificial Intelligence Generated Content (AIGC) and a transformative direction in artificial intelligence research. Foundation models play a crucial role in T2I. In this survey, we review over 440 recent works on T2I. We start by briefly introducing how GANs, autoregressive models, and diffusion models have been used for image generation. Building on this foundation, we discuss the development of these models for T2I, focusing on their generative capabilities and diversity when conditioned on text. We also explore cutting-edge research on various aspects of T2I, including performance, controllability, personalized generation, safety concerns, and consistency in content and spatial relationships. Furthermore, we summarize the datasets and evaluation metrics commonly used in T2I research. Finally, we discuss the potential applications of T2I within AIGC, along with the challenges and future research opportunities in this field.
摘要：人类在阅读特定文本时，往往会将相应的图像可视化，我们希望计算机也能做到这一点。文本到图像合成（T2I）专注于从文本描述生成高质量的图像，已成为人工智能生成内容（AIGC）的重要方面和人工智能研究的变革方向。基础模型在 T2I 中起着至关重要的作用。在本综述中，我们回顾了 440 多个关于 T2I 的最新研究。我们首先简要介绍 GAN、自回归模型和扩散模型如何用于图像生成。在此基础上，我们讨论了这些模型在 T2I 中的开发，重点关注它们在以文本为条件时的生成能力和多样性。我们还探讨了 T2I 各个方面的前沿研究，包括性能、可控性、个性化生成、安全问题以及内容和空间关系的一致性。此外，我们总结了 T2I 研究中常用的数据集和评估指标。最后，我们讨论了 T2I 在 AIGC 中的潜在应用，以及该领域的挑战和未来研究机会。

Title: BadSFL: Backdoor Attack against Scaffold Federated Learning

Authors: Xingshuo Han, Xiang Lan, Haozhao Wang, Shengmin Xu, Shen Ren, Jason Zeng, Ming Wu, Michael Heinrich, Tianwei Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16167
Pdf URL: https://arxiv.org/pdf/2411.16167
Copy Paste: [[2411.16167]] BadSFL: Backdoor Attack against Scaffold Federated Learning(https://arxiv.org/abs/2411.16167)
Keywords: generative
Abstract: Federated learning (FL) enables the training of deep learning models on distributed clients to preserve data privacy. However, this learning paradigm is vulnerable to backdoor attacks, where malicious clients can upload poisoned local models to embed backdoors into the global model, leading to attacker-desired predictions. Existing backdoor attacks mainly focus on FL with independently and identically distributed (IID) scenarios, while real-world FL training data are typically non-IID. Current strategies for non-IID backdoor attacks suffer from limitations in maintaining effectiveness and durability. To address these challenges, we propose a novel backdoor attack method, BadSFL, specifically designed for the FL framework using the Scaffold aggregation algorithm in non-IID settings. BadSFL leverages a Generative Adversarial Network (GAN) based on the global model to complement the training set, achieving high accuracy on both backdoor and benign samples. It utilizes a specific feature as the backdoor trigger to ensure stealthiness, and exploits Scaffold's control variate to predict the global model's convergence direction, ensuring the backdoor's persistence. Extensive experiments on three benchmark datasets demonstrate the high effectiveness, stealthiness, and durability of BadSFL. Notably, our attack remains effective over 60 rounds in the global model and up to 3 times longer than existing baseline attacks after stopping the injection of malicious updates.

Title: Image Generation Diversity Issues and How to Tame Them

Authors: Mischa Dombrowski, Weitong Zhang, Sarah Cechnicka, Hadrien Reynaud, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16171
Pdf URL: https://arxiv.org/pdf/2411.16171
Copy Paste: [[2411.16171]] Image Generation Diversity Issues and How to Tame Them(https://arxiv.org/abs/2411.16171)
Keywords: generation, generative
Abstract: Generative methods now produce outputs nearly indistinguishable from real data but often fail to fully capture the data distribution. Unlike quality issues, diversity limitations in generative models are hard to detect visually, requiring specific metrics for assessment. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model's output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art models surpassing 77% of the diversity of the training data. To address this limitation, we introduce Diversity-Aware Diffusion Models (DiADM), a novel approach that improves diversity of unconditional diffusion models without loss of image quality. We do this by disentangling diversity from image quality, using a diversity-aware module that takes pseudo-unconditional features as input. We provide a Python package offering unified feature extraction and metric computation to further facilitate the evaluation of generative models at this https URL.
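The retrieval framing in the abstract above can be illustrated with a toy example. This is a minimal sketch, not the paper's IRS definition: it assumes features are plain coordinate tuples, uses Euclidean nearest-neighbour retrieval, and reports the fraction of real items retrieved by at least one synthetic query. The function name `retrieval_fraction` and the sample data are hypothetical.

```python
import math

def retrieval_fraction(real_feats, synth_feats):
    """Fraction of real items that are the nearest neighbour of some synthetic query."""
    retrieved = set()
    for query in synth_feats:
        # Index of the real item closest to this synthetic query.
        nearest = min(range(len(real_feats)),
                      key=lambda i: math.dist(query, real_feats[i]))
        retrieved.add(nearest)
    return len(retrieved) / len(real_feats)

# Four real items at the corners of a unit square (illustrative data).
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# A "collapsed" generator only produces samples near one corner.
collapsed = [(0.1, 0.0), (0.0, 0.1), (0.05, 0.05)]
# A "diverse" generator produces samples near every corner.
diverse = [(0.1, 0.0), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]

print(retrieval_fraction(real, collapsed))
print(retrieval_fraction(real, diverse))
```

A generator whose samples all land near one real item retrieves only a small fraction of the real set, even if each individual sample looks realistic, which is the failure mode the abstract describes.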

Title: U2NeRF: Unsupervised Underwater Image Restoration and Neural Radiance Fields

Authors: Vinayak Gupta, Manoj S, Mukund Varma T, Kaushik Mitra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16172
Pdf URL: https://arxiv.org/pdf/2411.16172
Copy Paste: [[2411.16172]] U2NeRF: Unsupervised Underwater Image Restoration and Neural Radiance Fields(https://arxiv.org/abs/2411.16172)
Keywords: restoration
Abstract: Underwater images suffer from colour shifts, low contrast, and haziness due to light absorption, refraction, and scattering, and restoring these images has warranted much attention. In this work, we present Unsupervised Underwater Neural Radiance Field (U2NeRF), a transformer-based architecture that learns to render and restore novel views conditioned on multi-view geometry simultaneously. Due to the absence of supervision, we attempt to implicitly bake restoring capabilities onto the NeRF pipeline and disentangle the predicted color into several components (scene radiance, direct transmission map, backscatter transmission map, and global background light), which, when combined, reconstruct the underwater image in a self-supervised manner. In addition, we release an Underwater View Synthesis (UVS) dataset consisting of 12 underwater scenes, containing both synthetically-generated and real-world data. Our experiments demonstrate that when optimized on a single scene, U2NeRF outperforms several baselines by as much as 11% in LPIPS, 5% in UIQM, and 4% in UCIQE (on average) and showcases improved rendering and restoration capabilities. Code will be made available upon acceptance.
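The abstract above names four components but does not state how they are combined. A minimal per-pixel sketch, assuming the widely used revised underwater image formation model (observed colour = radiance × direct transmission + background light × (1 − backscatter transmission)); the function and argument names are hypothetical and this is not the authors' code.

```python
def compose_pixel(radiance, t_direct, t_backscatter, background):
    """Recombine the four components into one observed channel value.

    Assumed model: I = J * T_D + B_inf * (1 - T_B), all values in [0, 1].
    """
    return radiance * t_direct + background * (1.0 - t_backscatter)

# Clear water: full transmission, so the observation equals the scene radiance.
clear = compose_pixel(radiance=0.8, t_direct=1.0, t_backscatter=1.0, background=0.3)
# Very turbid water: no transmission, so only the background light is seen.
turbid = compose_pixel(radiance=0.8, t_direct=0.0, t_backscatter=0.0, background=0.3)
# Intermediate case: a blend of attenuated radiance and backscatter.
mixed = compose_pixel(radiance=0.8, t_direct=0.5, t_backscatter=0.5, background=0.3)
print(clear, turbid, mixed)
```

Under this assumption, a self-supervised loss can compare the recomposed value with the captured underwater pixel while the radiance term serves as the restored image.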

Title: Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation

Authors: Qiao Yu, Xianzhi Li, Yuan Tang, Xu Han, Long Hu, Yixue Hao, Min Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16185
Pdf URL: https://arxiv.org/pdf/2411.16185
Copy Paste: [[2411.16185]] Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation(https://arxiv.org/abs/2411.16185)
Keywords: generation
Abstract: Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM's generated mesh ensures high clarity, discarding LRM's predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123's SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods.

Title: VIRES: Video Instance Repainting with Sketch and Text Guidance

Authors: Shuchen Weng, Haojie Zheng, Peixuan Zhan, Yuchen Hong, Han Jiang, Si Li, Boxin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16199
Pdf URL: https://arxiv.org/pdf/2411.16199
Copy Paste: [[2411.16199]] VIRES: Video Instance Repainting with Sketch and Text Guidance(https://arxiv.org/abs/2411.16199)
Keywords: generation, generative
Abstract: We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings.

Title: Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models

Authors: Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong Liu
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16201
Pdf URL: https://arxiv.org/pdf/2411.16201
Copy Paste: [[2411.16201]] Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models(https://arxiv.org/abs/2411.16201)
Keywords: generation
Abstract: High-quality video-text preference data is crucial for Multimodal Large Language Model (MLLM) alignment. However, existing preference data is very scarce. Obtaining VQA preference data for preference training is costly, and manually annotating responses is highly unreliable, which could result in low-quality pairs. Meanwhile, AI-generated responses controlled by temperature adjustment lack diversity. To address these issues, we propose a high-quality VQA preference dataset, called Multiple Multimodal Artificial Intelligence Preference Datasets in VQA (MMAIP-V), which is constructed by sampling from the response distribution set and using an external scoring function for response evaluation. Furthermore, to fully leverage the preference knowledge in MMAIP-V and ensure sufficient optimization, we propose Iterative Weak-to-Strong Reinforcement Learning from AI Feedback for video MLLMs (Iter-W2S-RLAIF), a framework that gradually enhances MLLMs' alignment capabilities by iteratively updating the reference model and performing parameter extrapolation. Finally, we propose an unbiased and information-complete evaluation scheme in VQA evaluation. Experiments demonstrate that MMAIP-V is beneficial for MLLMs in preference learning and Iter-W2S-RLAIF fully exploits the alignment information in MMAIP-V. We believe that the proposed automatic VQA preference data generation pipeline based on AI feedback can greatly promote future work in MLLM alignment. Code and dataset are available at this https URL (MMAIP-V_Iter-W2S-RLAIF-702F).

Title: SMGDiff: Soccer Motion Generation using diffusion probabilistic models

Authors: Hongdi Yang, Chengyang Li, Zhenxuan Wu, Gaozheng Li, Jingya Wang, Jingyi Yu, Zhuo Su, Lan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16216
Pdf URL: https://arxiv.org/pdf/2411.16216
Copy Paste: [[2411.16216]] SMGDiff: Soccer Motion Generation using diffusion probabilistic models(https://arxiv.org/abs/2411.16216)
Keywords: generation, generative
Abstract: Soccer is a globally renowned sport with significant applications in video games and VR/AR. However, generating realistic soccer motions remains challenging due to the intricate interactions between the human player and the ball. In this paper, we introduce SMGDiff, a novel two-stage framework for generating real-time and user-controllable soccer motions. Our key idea is to integrate real-time character control with a powerful diffusion-based generative model, ensuring high-quality and diverse output motion. In the first stage, we instantly transform coarse user controls into diverse global trajectories of the character. In the second stage, we employ a transformer-based autoregressive diffusion model to generate soccer motions based on trajectory conditioning. We further incorporate a contact guidance module during inference to optimize the contact details for realistic ball-foot interactions. Moreover, we contribute a large-scale soccer motion dataset consisting of over 1.08 million frames of diverse soccer motions. Extensive experiments demonstrate that our SMGDiff significantly outperforms existing methods in terms of motion quality and condition alignment.

Title: Mixed Degradation Image Restoration via Local Dynamic Optimization and Conditional Embedding

Authors: Yubin Gu, Yuan Meng, Xiaoshuai Sun, Jiayi Ji, Weijian Ruan, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16217
Pdf URL: https://arxiv.org/pdf/2411.16217
Copy Paste: [[2411.16217]] Mixed Degradation Image Restoration via Local Dynamic Optimization and Conditional Embedding(https://arxiv.org/abs/2411.16217)
Keywords: restoration
Abstract: Multiple-in-one image restoration (IR) has made significant progress, aiming to handle all types of single degraded image restoration with a single model. However, in real-world scenarios, images often suffer from combinations of multiple degradation factors. Existing multiple-in-one IR models encounter challenges related to degradation diversity and prompt singularity when addressing this issue. In this paper, we propose a novel multiple-in-one IR model that can effectively restore images with both single and mixed degradations. To address degradation diversity, we design a Local Dynamic Optimization (LDO) module which dynamically processes degraded areas of varying types and granularities. To tackle the prompt singularity issue, we develop an efficient Conditional Feature Embedding (CFE) module that guides the decoder in leveraging degradation-type-related features, significantly improving the model's performance in mixed degradation restoration scenarios. To validate the effectiveness of our model, we introduce a new dataset containing both single and mixed degradation elements. Experimental results demonstrate that our proposed model achieves state-of-the-art (SOTA) performance not only on mixed degradation tasks but also on classic single-task restoration benchmarks.

Title: Weakly supervised image segmentation for defect-based grading of fresh produce

Authors: Manuel Knott, Divinefavour Odion, Sameer Sontakke, Anup Karwa, Thijs Defraeye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16219
Pdf URL: https://arxiv.org/pdf/2411.16219
Copy Paste: [[2411.16219]] Weakly supervised image segmentation for defect-based grading of fresh produce(https://arxiv.org/abs/2411.16219)
Keywords: quality assessment
Abstract: Implementing image-based machine learning in agriculture is often limited by scarce data and annotations, making it hard to achieve high-quality model predictions. This study tackles the issue of postharvest quality assessment of bananas in decentralized supply chains. We propose a method to detect and segment surface defects in banana images using panoptic segmentation to quantify defect size and number. Instead of time-consuming pixel-level annotations, we use weak supervision with coarse labels. A dataset of 476 smartphone images of bananas was collected under real-world field conditions and annotated for bruises and scars. Using the Segment Anything Model (SAM), a recently published foundation model for image segmentation, we generated dense annotations from coarse bounding boxes to train a segmentation model, significantly reducing manual effort while achieving a panoptic quality score of 77.6%. This demonstrates SAM's potential for low-effort, accurate segmentation in agricultural settings with limited data.
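For readers unfamiliar with the panoptic quality score reported above, the standard definition divides the summed IoU of matched segments by the number of true positives plus half the false positives and half the false negatives. A minimal sketch of that formula follows; the function name and the example numbers are illustrative and are not taken from the paper.

```python
def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """PQ = sum(IoU over matched segments) / (TP + 0.5 * FP + 0.5 * FN).

    matched_ious holds one IoU per predicted/ground-truth pair that counts
    as a match (by convention, pairs with IoU above 0.5).
    """
    true_pos = len(matched_ious)
    denom = true_pos + 0.5 * num_false_pos + 0.5 * num_false_neg
    if denom == 0:
        return 0.0  # no segments at all; the score is defined here as 0
    return sum(matched_ious) / denom

# Two matched defects, one spurious prediction, one missed defect.
print(panoptic_quality([0.8, 0.6], num_false_pos=1, num_false_neg=1))
```

The score therefore penalises both poor mask overlap (low IoU) and detection errors (unmatched segments) in a single number.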

Title: Diagnosis of diabetic retinopathy using machine learning & deep learning technique

Authors: Eric Shah, Jay Patel, Mr.Vishal Katheriya, Parth Pataliya
Subjects: cs.CV, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2411.16250
Pdf URL: https://arxiv.org/pdf/2411.16250
Copy Paste: [[2411.16250]] Diagnosis of diabetic retinopathy using machine learning & deep learning technique(https://arxiv.org/abs/2411.16250)
Keywords: generation
Abstract: Fundus images are widely used for diagnosing various eye diseases, such as diabetic retinopathy, glaucoma, and age-related macular degeneration. However, manual analysis of fundus images is time-consuming and prone to errors. In this report, we propose a novel method for fundus detection using object detection and machine learning classification techniques. We use YOLO_V8 to perform object detection on fundus images and locate regions of interest (ROIs) such as the optic disc, optic cup, and lesions. We then use SVM classification to assign the ROIs to different DR stages based on the presence or absence of pathological signs such as exudates, microaneurysms, and haemorrhages. Our method achieves 84% accuracy and efficiency for fundus detection and can be applied to retinal fundus disease triage, especially in remote areas around the world.

Title: DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation

Authors: Yuxuan Yang, Jingyao Wang, Tao Geng, Wenwen Qiang, Changwen Zheng, Fuchun Sun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16301
Pdf URL: https://arxiv.org/pdf/2411.16301
Copy Paste: [[2411.16301]] DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation(https://arxiv.org/abs/2411.16301)
Keywords: generation, generative
Abstract: Interior design is a complex and creative discipline involving aesthetics, functionality, ergonomics, and materials science. Effective solutions must meet diverse requirements, typically producing multiple deliverables such as renderings and design drawings from various perspectives. Consequently, interior design processes are often inefficient and demand significant creativity. With advances in machine learning, generative models have emerged as a promising means of improving efficiency by creating designs from text descriptions or sketches. However, few generative works focus on interior design, leading to substantial discrepancies between outputs and practical needs, such as differences in size, spatial scope, and the lack of controllable generation quality. To address these challenges, we propose DiffDesign, a controllable diffusion model with meta priors for efficient interior design generation. Specifically, we utilize the generative priors of a 2D diffusion model pre-trained on a large image dataset as our rendering backbone. We further guide the denoising process by disentangling cross-attention control over design attributes, such as appearance, pose, and size, and introduce an optimal transfer-based alignment module to enforce view consistency. Simultaneously, we construct an interior design-specific dataset, DesignHelper, consisting of over 400 solutions across more than 15 spatial types and 15 design styles. This dataset helps fine-tune DiffDesign. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness and robustness of DiffDesign.

Title: EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training

Authors: Yiying Wei, Hadi Amirpour, Jong Hwan Ko, Christian Timmerer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16312
Pdf URL: https://arxiv.org/pdf/2411.16312
Copy Paste: [[2411.16312]] EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training(https://arxiv.org/abs/2411.16312)
Keywords: super-resolution
Abstract: Leveraging the overfitting property of deep neural networks (DNNs) is trending in video delivery systems to enhance quality within bandwidth limits. Existing approaches transmit overfitted super-resolution (SR) model streams for low-resolution (LR) bitstreams, which are used to reconstruct high-resolution (HR) videos at the decoder. Although these approaches show promising results, the huge computational costs of training on a large number of video frames limit their practical applications. To overcome this challenge, we propose an efficient patch sampling method named EPS for video SR network overfitting, which identifies the most valuable training patches from video frames. To this end, we first present two low-complexity Discrete Cosine Transform (DCT)-based spatial-temporal features to measure the complexity score of each patch directly. By analyzing the histogram distribution of these features, we then categorize all possible patches into different clusters and select training patches from the cluster with the highest spatial-temporal information. The number of sampled patches is adaptive based on the video content, addressing the trade-off between training complexity and efficiency. Our method reduces the number of training patches to between 4% and 25%, depending on the resolution and number of clusters, while maintaining high video quality and significantly enhancing training efficiency. Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
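The idea of scoring patches with a DCT-based complexity feature can be sketched as follows. This is a simplified stand-in, not the paper's method: it uses a single spatial score (the sum of absolute AC coefficients of an unnormalised 2D DCT-II) and a plain top-k selection in place of the two spatial-temporal features and histogram clustering described in the abstract. All names and sample patches are hypothetical.

```python
import math

def dct2(patch):
    """Naive unnormalised 2D DCT-II of a square patch given as a list of rows."""
    n = len(patch)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (patch[x][y]
                              * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                              * math.cos(math.pi * (2 * y + 1) * v / (2 * n)))
            out[u][v] = total
    return out

def ac_energy(patch):
    """Complexity score: sum of absolute DCT coefficients, excluding the DC term."""
    coeffs = dct2(patch)
    return sum(abs(c) for u, row in enumerate(coeffs)
               for v, c in enumerate(row) if (u, v) != (0, 0))

def select_patches(patches, k):
    """Indices of the k patches with the highest complexity score."""
    scores = [ac_energy(p) for p in patches]
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

flat = [[5.0] * 4 for _ in range(4)]                                   # no texture
checker = [[float((x + y) % 2) for y in range(4)] for x in range(4)]   # fine texture
ramp = [[float(y) for y in range(4)] for x in range(4)]                # smooth gradient
print(select_patches([flat, checker, ramp], k=2))
```

The flat patch has essentially zero AC energy and is never selected, which mirrors the intuition that low-information patches contribute little to overfitting a super-resolution model.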

Title: One Diffusion to Generate Them All

Authors: Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16318
Pdf URL: https://arxiv.org/pdf/2411.16318
Copy Paste: [[2411.16318]] One Diffusion to Generate Them All(https://arxiv.org/abs/2411.16318)
Keywords: generation
Abstract: We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction, such as text-to-image, multiview generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset. Our code and checkpoint are freely available at this https URL.

Title: Luminance Component Analysis for Exposure Correction

Authors: Jingchao Peng, Thomas Bashford-Rogers, Jingkun Chen, Haitao Zhao, Zhengwei Hu, Kurt Debattista
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.16325
Pdf URL: https://arxiv.org/pdf/2411.16325
Copy Paste: [[2411.16325]] Luminance Component Analysis for Exposure Correction(https://arxiv.org/abs/2411.16325)
Keywords: restoration
Abstract: Exposure correction methods aim to adjust the luminance while maintaining other luminance-unrelated information. However, current exposure correction methods have difficulty in fully separating luminance-related and luminance-unrelated components, leading to distortions in color, loss of detail, and a need for extra restoration procedures. Inspired by principal component analysis (PCA), this paper proposes an exposure correction method called luminance component analysis (LCA). LCA applies the orthogonal constraint to a U-Net structure to decouple luminance-related and luminance-unrelated features. With decoupled luminance-related features, LCA adjusts only the luminance-related components while keeping the luminance-unrelated components unchanged. To optimize the orthogonal constraint problem, LCA employs a geometric optimization algorithm, which converts the constrained problem in Euclidean space to an unconstrained problem in orthogonal Stiefel manifolds. Extensive experiments show that LCA can decouple the luminance feature from the RGB color space. Moreover, LCA achieves the best PSNR (21.33) and SSIM (0.88) on the exposure correction dataset at 28.72 FPS.
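To make the orthogonal constraint concrete: a point on a Stiefel manifold is a matrix whose columns are orthonormal. One generic way to keep weights on the manifold is to take an ordinary gradient step and then re-orthonormalise (a QR-style retraction). The sketch below shows that generic technique with modified Gram-Schmidt; the abstract does not say which geometric optimizer LCA uses, so this is an assumption for illustration, and every name and number is hypothetical.

```python
import math

def orthonormalize(columns):
    """Modified Gram-Schmidt: return orthonormal vectors spanning the same space."""
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return basis

def retract_step(columns, grads, lr):
    """Euclidean gradient step followed by re-orthonormalisation."""
    moved = [[w - lr * g for w, g in zip(col, grad)]
             for col, grad in zip(columns, grads)]
    return orthonormalize(moved)

def gram(columns):
    """Matrix of pairwise dot products; the identity means orthonormal columns."""
    return [[sum(x * y for x, y in zip(a, b)) for b in columns] for a in columns]

# Two orthonormal columns in R^3 and an arbitrary (made-up) gradient.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
G = [[0.2, -0.5, 0.3], [0.4, 0.1, -0.7]]
W_next = retract_step(W, G, lr=0.1)
print(gram(W_next))
```

After the step, the Gram matrix of the updated columns is the identity up to rounding, so the constraint holds without adding a penalty term to the loss.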

Title: CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain

Authors: Jingchao Peng, Thomas Bashford-Rogers, Zhuang Shao, Haitao Zhao, Aru Ranjan Singh, Abhishek Goswami, Kurt Debattista
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16327
Pdf URL: https://arxiv.org/pdf/2411.16327
Copy Paste: [[2411.16327]] CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain(https://arxiv.org/abs/2411.16327)
Keywords: generation
Abstract: Infrared (IR) imaging offers advantages in several fields due to its unique ability to capture content in extreme light conditions. However, the demanding hardware requirements of high-resolution IR sensors limit its widespread application. As an alternative, visible light can be used to synthesize IR images, but this causes a loss of fidelity in image details and introduces inconsistencies due to a lack of contextual awareness of the scene. The loss of fidelity stems from using visible light with a standard dynamic range, especially under extreme lighting, while the lack of contextual awareness can result in pseudo-thermal-crossover artifacts. These artifacts occur when multiple objects with similar temperatures appear indistinguishable in the training data, further exacerbating the loss of fidelity. To solve this challenge, this paper proposes CapHDR2IR, a novel framework incorporating vision-language models using high dynamic range (HDR) images as inputs to generate IR images. HDR images capture a wider range of luminance variations, ensuring reliable IR image generation in different light conditions. Additionally, a dense caption branch integrates semantic understanding, resulting in more meaningful and discernible IR outputs. Extensive experiments on the HDRT dataset show that the proposed CapHDR2IR achieves state-of-the-art performance compared with existing general domain transfer methods and those tailored for visible-to-infrared image translation.
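Summary: Infrared (IR) imaging has the unique ability to capture content under extreme lighting conditions, which gives it advantages in many fields. However, the demanding hardware requirements of high-resolution IR sensors limit its widespread use. As an alternative, visible light can be used to synthesize IR images, but this reduces the fidelity of image details and introduces inconsistencies because contextual awareness of the scene is lacking. The cause is the combination of visible light with a standard dynamic range, especially under extreme lighting, while the lack of contextual awareness leads to pseudo-thermal-crossover artifacts. These arise when multiple objects with similar temperatures are hard to distinguish in the training data, which further aggravates the loss of fidelity. To address this challenge, the paper proposes CapHDR2IR, a novel framework that incorporates vision-language models and uses high dynamic range (HDR) images as input to generate IR images. HDR images capture a wider range of luminance variation, ensuring reliable IR image generation under different lighting conditions. In addition, a dense caption branch integrates semantic understanding, producing more meaningful and more discernible IR outputs. Extensive experiments on the HDRT dataset show that CapHDR2IR achieves state-of-the-art performance compared with existing general domain transfer methods and with methods tailored to visible-to-infrared image translation.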

Title: Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

Authors: Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16375
Pdf URL: https://arxiv.org/pdf/2411.16375
Copy Paste: [[2411.16375]] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing(https://arxiv.org/abs/2411.16375)
Keywords: generation
Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at this https URL
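Summary: With the progress of diffusion models, the quality of today's video generation is impressive. To extend the generation length and ease real-world use, most video diffusion models (VDMs) generate videos autoregressively, that is, they generate each subsequent clip conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: the model must recompute all the conditional frames that overlap between adjacent clips. The problem becomes worse when the conditional frames are extended autoregressively to give the model long-term context. In that case the computational demand grows significantly (quadratic complexity with respect to the autoregression step). This paper proposes Ca2-VDM, an efficient autoregressive VDM with causal generation and cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in earlier autoregression steps and reused in every later step, eliminating redundant computation. For cache sharing, it shares the cache across all denoising steps to avoid a huge cache storage cost. Extensive experiments show that Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves generation speed. Code is available at this https URL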
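
The quadratic-versus-linear cost argument in this abstract can be made concrete with a toy count. The sketch below is not the Ca2-VDM implementation: it assumes a simplified, hypothetical cost model in which each autoregression step adds `clip_len` new frames, the context is every previously generated frame, and cost is measured as the number of frame encodings. Denoising steps and attention over cached features are deliberately ignored.

```python
def frames_encoded_without_cache(num_steps: int, clip_len: int) -> int:
    """Total frame encodings when every step re-encodes its full context."""
    total = 0
    for step in range(num_steps):
        context_frames = step * clip_len    # all previously generated frames
        total += context_frames + clip_len  # re-encode context, then encode the new clip
    return total


def frames_encoded_with_cache(num_steps: int, clip_len: int) -> int:
    """Total frame encodings when context features are cached and reused."""
    return num_steps * clip_len             # each frame is encoded exactly once


# 10 autoregression steps of 8 frames each (illustrative numbers).
print(frames_encoded_without_cache(10, 8))  # 440
print(frames_encoded_with_cache(10, 8))     # 80
```

Under these assumptions the uncached total is `clip_len * n * (n + 1) / 2`, which grows quadratically in the number of steps `n`, while the cached total grows linearly.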

Title: Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN

Authors: Elona Shatri, Kalikidhar Palavala, George Fazekas
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2411.16405
Pdf URL: https://arxiv.org/pdf/2411.16405
Copy Paste: [[2411.16405]] Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN(https://arxiv.org/abs/2411.16405)
Keywords: generation, generative
Abstract: The generation of handwritten music sheets is a crucial step toward enhancing Optical Music Recognition (OMR) systems, which rely on large and diverse datasets for optimal performance. However, handwritten music sheets, often found in archives, present challenges for digitisation due to their fragility, varied handwriting styles, and image quality. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. We provide a comprehensive evaluation of three GAN models - DCGAN, ProGAN, and CycleWGAN - comparing their ability to generate diverse and high-quality handwritten music images. The proposed CycleWGAN model, which enhances style transfer and training stability, significantly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.
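Summary: Generating handwritten music sheets is a key step toward improving Optical Music Recognition (OMR) systems, which depend on large and diverse datasets for best performance. However, handwritten music sheets are usually found in archives, and their fragility, varied handwriting styles, and image quality make digitisation challenging. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. It provides a comprehensive evaluation of three GAN models (DCGAN, ProGAN, and CycleWGAN), comparing their ability to generate diverse, high-quality handwritten music images. The proposed CycleWGAN model, which improves style transfer and training stability, clearly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.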

Title: TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

Authors: Linqing Zhong, Chen Gao, Zihan Ding, Yue Liao, Si Liu
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2411.16425
Pdf URL: https://arxiv.org/pdf/2411.16425
Copy Paste: [[2411.16425]] TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation(https://arxiv.org/abs/2411.16425)
Keywords: generation
Abstract: The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such a goal-oriented exploration heavily relies on the ability to perceive, understand, and reason based on the spatial information of the environment. However, current LLM-based approaches convert visual observations to language descriptions and reason in the linguistic space, leading to the loss of spatial information. In this paper, we introduce TopV-Nav, a MLLM-based method that directly reasons on the top-view map with complete spatial information. To fully unlock the MLLM's spatial reasoning potential in top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct semantically-rich top-view map. It enables the agent to directly utilize spatial information contained in the top-view map to conduct thorough reasoning. Besides, we design a Dynamic Map Scaling (DMS) mechanism to dynamically zoom top-view map at preferred scales, enhancing local fine-grained reasoning. Additionally, we devise a Target-Guided Navigation (TGN) mechanism to predict and to utilize target locations, facilitating global and human-like exploration. Experiments on MP3D and HM3D benchmarks demonstrate the superiority of our TopV-Nav, e.g., $+3.9\%$ SR and $+2.0\%$ SPL absolute improvements on HM3D.
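Summary: The Zero-Shot Object Navigation (ZSON) task requires an embodied agent to find a previously unseen object by navigating an unfamiliar environment. Such goal-oriented exploration relies heavily on the ability to perceive, understand, and reason about the spatial information of the environment. However, current LLM-based approaches convert visual observations into language descriptions and reason in the linguistic space, which loses spatial information. This paper introduces TopV-Nav, an MLLM-based method that reasons directly on a top-view map containing complete spatial information. To fully unlock the MLLM's spatial reasoning potential in the top-view perspective, the authors propose the Adaptive Visual Prompt Generation (AVPG) method, which adaptively constructs a semantically rich top-view map. It lets the agent directly use the spatial information in the top-view map for thorough reasoning. They also design a Dynamic Map Scaling (DMS) mechanism that dynamically zooms the top-view map to preferred scales, enhancing local fine-grained reasoning. In addition, they devise a Target-Guided Navigation (TGN) mechanism to predict and use target locations, facilitating global and human-like exploration. Experiments on the MP3D and HM3D benchmarks demonstrate the superiority of TopV-Nav, for example absolute improvements of $+3.9\%$ SR and $+2.0\%$ SPL on HM3D.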

Title: Unsupervised Event Outlier Detection in Continuous Time

Authors: Somjit Nath, Yik Chau Lui, Siqi Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16427
Pdf URL: https://arxiv.org/pdf/2411.16427
Copy Paste: [[2411.16427]] Unsupervised Event Outlier Detection in Continuous Time(https://arxiv.org/abs/2411.16427)
Keywords: generative
Abstract: Event sequence data record the occurrences of events in continuous time. Event sequence forecasting based on temporal point processes (TPPs) has been extensively studied, but outlier or anomaly detection, especially without any supervision from humans, is still underexplored. In this work, we develop, to the best of our knowledge, the first unsupervised outlier detection approach to detecting abnormal events. Our novel unsupervised outlier detection framework is based on ideas from generative adversarial networks (GANs) and reinforcement learning (RL). We train a 'generator' that corrects outliers in the data with a 'discriminator' that learns to discriminate the corrected data from the real data, which may contain outliers. A key insight is that if the generator made a mistake in the correction, it would generate anomalies that are different from the anomalies in the real data, so it serves as data augmentation for the discriminator learning. Different from typical GAN-based outlier detection approaches, our method employs the generator to detect outliers in an online manner. The experimental results show that our method can detect event outliers more accurately than the state-of-the-art approaches.
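Summary: Event sequence data record the occurrence of events in continuous time. Event sequence forecasting based on temporal point processes (TPPs) has been studied extensively, but outlier or anomaly detection, especially without any human supervision, remains underexplored. In this work the authors develop what is, to the best of their knowledge, the first unsupervised outlier detection approach for detecting abnormal events. The framework builds on ideas from generative adversarial networks (GANs) and reinforcement learning (RL). A "generator" is trained to correct outliers in the data, together with a "discriminator" that learns to distinguish the corrected data from the real data, which may contain outliers. A key insight is that if the generator makes a mistake during correction, it produces anomalies that differ from those in the real data, so it acts as data augmentation for discriminator learning. Unlike typical GAN-based outlier detection approaches, the method uses the generator to detect outliers in an online manner. Experimental results show that the method detects event outliers more accurately than state-of-the-art approaches.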

Title: SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

Authors: Hyojun Go, Byeongjun Park, Jiho Jang, Jin-Young Kim, Soonwoo Kwon, Changick Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16443
Pdf URL: https://arxiv.org/pdf/2411.16443
Copy Paste: [[2411.16443]] SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis(https://arxiv.org/abs/2411.16443)
Keywords: generation
Abstract: Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks-including object editing, novel view synthesis, and camera pose estimation-within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
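Summary: Text-based generation and editing of 3D scenes has great potential to streamline content creation through intuitive user interaction. Although recent advances use 3D Gaussian Splatting (3DGS) for high-fidelity, real-time rendering, existing methods are usually specialized and task-focused and lack a unified framework for both generation and editing. This paper introduces SplatFlow, a comprehensive framework that closes this gap by supporting direct 3DGS generation and editing. SplatFlow has two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space and simultaneously generates multi-view images, depths, and camera poses conditioned on text prompts, addressing challenges such as diverse scene scales and complex camera trajectories in real-world settings. The GSDecoder then efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Using training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks (including object editing, novel view synthesis, and camera pose estimation) within a unified framework, without additional complex pipelines. SplatFlow is validated on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness across various 3D generation, editing, and inpainting-based tasks.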

Title: VQ-SGen: A Vector Quantized Stroke Representation for Sketch Generation

Authors: Jiawei Wang, Zhiming Cui, Changjian Li
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16446
Pdf URL: https://arxiv.org/pdf/2411.16446
Copy Paste: [[2411.16446]] VQ-SGen: A Vector Quantized Stroke Representation for Sketch Generation(https://arxiv.org/abs/2411.16446)
Keywords: generation
Abstract: This paper presents VQ-SGen, a novel algorithm for high-quality sketch generation. Recent approaches have often framed the task as pixel-based generation either as a whole or part-by-part, neglecting the intrinsic and contextual relationships among individual strokes, such as the shape and spatial positioning of both proximal and distant strokes. To overcome these limitations, we propose treating each stroke within a sketch as an entity and introducing a vector-quantized (VQ) stroke representation for fine-grained sketch generation. Our method follows a two-stage framework - in the first stage, we decouple each stroke's shape and location information to ensure the VQ representation prioritizes stroke shape learning. In the second stage, we feed the precise and compact representation into an auto-decoding Transformer to incorporate stroke semantics, positions, and shapes into the generation process. By utilizing tokenized stroke representation, our approach generates strokes with high fidelity and facilitates novel applications, such as conditional generation and semantic-aware stroke editing. Comprehensive experiments demonstrate our method surpasses existing state-of-the-art techniques, underscoring its effectiveness. The code and model will be made publicly available upon publication.
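Summary: This paper presents VQ-SGen, a novel algorithm for high-quality sketch generation. Recent approaches usually frame the task as pixel-based generation, either as a whole or part by part, and neglect the intrinsic and contextual relationships among individual strokes, such as the shape and spatial positioning of both nearby and distant strokes. To overcome these limitations, the authors propose treating each stroke in a sketch as an entity and introduce a vector-quantized (VQ) stroke representation for fine-grained sketch generation. The method follows a two-stage framework. In the first stage, each stroke's shape and location information are decoupled so that the VQ representation prioritizes learning stroke shape. In the second stage, the precise and compact representation is fed into an auto-decoding Transformer to incorporate stroke semantics, positions, and shapes into the generation process. By using a tokenized stroke representation, the approach generates strokes with high fidelity and enables novel applications such as conditional generation and semantic-aware stroke editing. Comprehensive experiments show that the method surpasses existing state-of-the-art techniques, underscoring its effectiveness. The code and model will be made publicly available upon publication.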

Title: Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

Authors: Yutong Wang, Jiajie Teng, Jiajiong Cao, Yuming Li, Chenguang Ma, Hongteng Xu, Dixin Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16468
Pdf URL: https://arxiv.org/pdf/2411.16468
Copy Paste: [[2411.16468]] Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency(https://arxiv.org/abs/2411.16468)
Keywords: restoration
Abstract: As a very common type of video, face videos often appear in movies, talk shows, live broadcasts, and other scenes. Real-world online videos are often plagued by degradations such as blurring and quantization noise, due to the high compression ratio caused by high communication costs and limited transmission bandwidth. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite the significant advancement in video face enhancement, current methods still suffer from (i) long processing time and (ii) inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome the above two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method develops upon a 3D-VQGAN backbone associated with spatial-temporal codebooks recording high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage I, we learn the model with a regularizer mitigating the codebook collapse problem. In Stage II, we learn two transformers to look up code from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method surpasses the current state-of-the-art blind face video restoration and de-flickering methods on both efficiency and effectiveness. Code is available at this https URL.
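Summary: Face videos are a very common type of video and often appear in movies, talk shows, live broadcasts, and similar scenes. Because high communication costs and limited transmission bandwidth lead to high compression ratios, real-world online videos often suffer from degradations such as blurring and quantization noise. These degradations affect face videos especially severely because the human visual system is highly sensitive to facial details. Despite significant progress in video face enhancement, current methods still suffer from (i) long processing time and (ii) inconsistent spatial-temporal visual effects (for example, flickering). This study proposes a novel and efficient blind video face enhancement method that overcomes both challenges, restoring high-quality videos from compressed low-quality versions with an effective de-flickering mechanism. Specifically, the method builds on a 3D-VQGAN backbone associated with spatial-temporal codebooks that record high-quality portrait features and residual-based temporal information. The authors develop a two-stage learning framework for the model. In Stage I, the model is learned with a regularizer that mitigates the codebook collapse problem. In Stage II, two transformers are learned to look up code from the codebooks and to further update the encoder of low-quality videos. Experiments on the VFHQ-Test dataset show that the method surpasses current state-of-the-art blind face video restoration and de-flickering methods in both efficiency and effectiveness. Code is available at this https URL.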

Title: Multi-Resolution Generative Modeling of Human Motion from Limited Data

Authors: David Eduardo Moreno-Villamarín, Anna Hilsmann, Peter Eisert
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16498
Pdf URL: https://arxiv.org/pdf/2411.16498
Copy Paste: [[2411.16498]] Multi-Resolution Generative Modeling of Human Motion from Limited Data(https://arxiv.org/abs/2411.16498)
Keywords: generation, generative
Abstract: We present a generative model that learns to synthesize human motion from limited training sequences. Our framework provides conditional generation and blending across multiple temporal resolutions. The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture. Our model contains a set of generative and adversarial networks, along with embedding modules, each tailored for generating motions at specific frame rates while exerting control over their content and details. Notably, our approach also extends to the synthesis of co-speech gestures, demonstrating its ability to generate synchronized gestures from speech inputs, even with limited paired data. Through direct synthesis of SMPL pose parameters, our approach avoids test-time adjustments to fit human body meshes. Experimental results showcase our model's ability to achieve extensive coverage of training examples, while generating diverse motions, as indicated by local and global diversity metrics.
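Summary: The authors present a generative model that learns to synthesize human motion from limited training sequences. The framework provides conditional generation and blending across multiple temporal resolutions. The model captures human motion patterns by integrating skeletal convolution layers with a multi-scale architecture. It contains a set of generative and adversarial networks, along with embedding modules, each tailored to generate motion at a specific frame rate while controlling its content and details. Notably, the approach also extends to the synthesis of co-speech gestures, showing that it can generate synchronized gestures from speech input even with limited paired data. By directly synthesizing SMPL pose parameters, the approach avoids test-time adjustments to fit human body meshes. Experimental results show that the model achieves extensive coverage of the training examples while generating diverse motions, as indicated by local and global diversity metrics.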

Title: LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation

Authors: Steven Song, Anirudh Subramanyam, Irene Madejski, Robert L. Grossman
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16523
Pdf URL: https://arxiv.org/pdf/2411.16523
Copy Paste: [[2411.16523]] LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation(https://arxiv.org/abs/2411.16523)
Keywords: generation, generative
Abstract: In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that these latent features ought to be high-dimensional vectors which require model fine tuning to handle. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that leverages image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG), where the task is to generate a clinician's report detailing their observations from a set of radiological images, such as X-rays. We argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text-space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image feature encoder models, and without ever directly "showing" the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics compared with other retrieval-based RRG methods, while attaining competitive results compared to other fine-tuned vision-language RRG models. We further present results of our experiments with various components of LaB-RAG to better understand our method. Finally, we critique the use of a popular RRG metric, arguing it is possible to artificially inflate its results without true data-leakage.
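Summary: In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. The authors challenge the assumption that these latent features should be high-dimensional vectors that require model fine-tuning to handle. They propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that uses image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). The method is studied in the context of radiology report generation (RRG), where the task is to generate a clinician's report detailing their observations from a set of radiological images such as X-rays. The authors argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text space as radiology-specific labels. Combined with standard RAG, these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training the generative language model or the image feature encoder models, and without ever directly "showing" the LLM an X-ray, LaB-RAG achieves better results on natural language and radiology language metrics than other retrieval-based RRG methods, and competitive results compared with fine-tuned vision-language RRG models. The authors further present experiments on the individual components of LaB-RAG to better understand the method. Finally, they critique the use of a popular RRG metric, arguing that its results can be artificially inflated without true data leakage.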

Title: Representation Collapsing Problems in Vector Quantization

Authors: Wenhao Zhao, Qiran Zou, Rushi Shah, Dianbo Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16550
Pdf URL: https://arxiv.org/pdf/2411.16550
Copy Paste: [[2411.16550]] Representation Collapsing Problems in Vector Quantization(https://arxiv.org/abs/2411.16550)
Keywords: generative
Abstract: Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we investigate representation collapse in vector quantization - a critical degradation where codebook tokens or latent embeddings lose their discriminative power by converging to a limited subset of values. This collapse fundamentally compromises the model's ability to capture diverse data patterns. By leveraging both synthetic and real datasets, we identify the severity of each type of collapses and triggering conditions. Our analysis reveals that restricted initialization and limited encoder capacity result in tokens collapse and embeddings collapse. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapsing problems in vector quantization.
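Summary: Vector quantization is a machine learning technique that discretizes continuous representations into a set of discrete vectors. It is widely used to tokenize data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behavior of vector quantization in generative models remain largely underexplored. This study investigates representation collapse in vector quantization, a critical degradation in which codebook tokens or latent embeddings lose their discriminative power by converging to a limited subset of values. This collapse fundamentally compromises the model's ability to capture diverse data patterns. Using both synthetic and real datasets, the authors identify the severity of each type of collapse and its triggering conditions. The analysis shows that restricted initialization and limited encoder capacity lead to token collapse and embedding collapse. Building on these findings, they propose potential solutions aimed at mitigating each kind of collapse. To the best of the authors' knowledge, this is the first comprehensive study of representation collapsing problems in vector quantization.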
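
To make the notion of token collapse concrete, the following minimal sketch quantizes vectors to their nearest codebook entry and reports how many codes are effectively in use. It is only an illustration: the helper names `quantize` and `usage_perplexity`, the toy codebook, and the choice of usage perplexity as a diagnostic are assumptions, not the measurement protocol of the paper.

```python
import math
from collections import Counter


def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in vectors]


def usage_perplexity(indices):
    """exp(entropy) of code usage; 1.0 means one code absorbs every input."""
    counts = Counter(indices)
    n = len(indices)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)


# Toy data: four codes exist, but the inputs only ever land near two of them.
codebook = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0), (11.0, 11.0)]
vectors = [(0.1, 0.0), (0.9, 1.1), (0.2, -0.1), (1.0, 0.8)]
indices = quantize(vectors, codebook)
print(indices)                    # [0, 1, 0, 1]
print(usage_perplexity(indices))  # close to 2.0, although the codebook has 4 entries
```

A usage perplexity far below the codebook size is one simple sign that part of the codebook is never selected, for example because it was initialized far from the encoder outputs.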

Title: Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches

Authors: Yinqiu Feng, Aoran Shen, Jiacheng Hu, Yingbin Liang, Shiru Wang, Junliang Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16567
Pdf URL: https://arxiv.org/pdf/2411.16567
Copy Paste: [[2411.16567]] Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches(https://arxiv.org/abs/2411.16567)
Keywords: generative
Abstract: This paper presents an innovative approach to enhancing few-shot learning by integrating data augmentation with model fine-tuning in a framework designed to tackle the challenges posed by small-sample data. Recognizing the critical limitations of traditional machine learning models that require large datasets-especially in fields such as drug discovery, target recognition, and malicious traffic detection-this study proposes a novel strategy that leverages Generative Adversarial Networks (GANs) and advanced optimization techniques to improve model performance with limited data. Specifically, the paper addresses the noise and bias issues introduced by data augmentation methods, contrasting them with model-based approaches, such as fine-tuning and metric learning, which rely heavily on related datasets. By combining Markov Chain Monte Carlo (MCMC) sampling and discriminative model ensemble strategies within a GAN framework, the proposed model adjusts generative and discriminative distributions to simulate a broader range of relevant data. Furthermore, it employs MHLoss and a reparameterized GAN ensemble to enhance stability and accelerate convergence, ultimately leading to improved classification performance on small-sample images and structured datasets. Results confirm that the MhERGAN algorithm developed in this research is highly effective for few-shot learning, offering a practical solution that bridges data scarcity with high-performing model adaptability and generalization.
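Summary: This paper presents an innovative approach to enhancing few-shot learning by combining data augmentation with model fine-tuning in a framework designed for the challenges of small-sample data. Recognizing the critical limitations of traditional machine learning models that require large datasets, especially in fields such as drug discovery, target recognition, and malicious traffic detection, the study proposes a new strategy that uses Generative Adversarial Networks (GANs) and advanced optimization techniques to improve model performance with limited data. Specifically, the paper addresses the noise and bias introduced by data augmentation methods and contrasts them with model-based approaches such as fine-tuning and metric learning, which rely heavily on related datasets. By combining Markov Chain Monte Carlo (MCMC) sampling and discriminative model ensemble strategies within a GAN framework, the proposed model adjusts the generative and discriminative distributions to simulate a broader range of relevant data. It also employs MHLoss and a reparameterized GAN ensemble to enhance stability and accelerate convergence, ultimately improving classification performance on small-sample images and structured datasets. The results confirm that the MhERGAN algorithm developed in this research is highly effective for few-shot learning, offering a practical solution that bridges data scarcity with high-performing model adaptability and generalization.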

Title: Rethinking Diffusion for Text-Driven Human Motion Generation

Authors: Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, Huaizu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16575
Pdf URL: https://arxiv.org/pdf/2411.16575
Copy Paste: [[2411.16575]] Rethinking Diffusion for Text-Driven Human Motion Generation(https://arxiv.org/abs/2411.16575)
Keywords: generation
Abstract: Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly dominated human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as limited discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous space generation nature of diffusion-based methods makes them well-suited to address these limitations and with even potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model enabled to perform bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we also propose more robust evaluation methods to fairly assess different-based methods. Extensive experiments on benchmark human motion generation datasets demonstrate that our method excels previous methods and achieves state-of-the-art performances.
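Summary: Since 2023, discrete generation methods based on Vector Quantization (VQ) have rapidly come to dominate human motion generation, mainly by surpassing diffusion-based continuous generation methods on standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as a limited set of discrete tokens causes unavoidable information loss, reduces the diversity of generated motions, and restricts their ability to serve effectively as motion priors or generation guidance. In contrast, the continuous-space generation of diffusion-based methods makes them well suited to address these limitations, with potential for model scalability as well. This work systematically investigates why current VQ-based methods perform well and explores the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, the authors preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. The approach introduces a human motion diffusion model capable of bidirectional masked autoregression, optimized with a reformed data representation and distribution. The authors also propose more robust evaluation methods to compare the different families of methods fairly. Extensive experiments on benchmark human motion generation datasets show that the method outperforms previous methods and achieves state-of-the-art performance.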

Title: Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models

Authors: Ronghuan Wu, Wanchao Su, Jing Liao
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16602
Pdf URL: https://arxiv.org/pdf/2411.16602
Copy Paste: [[2411.16602]] Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models(https://arxiv.org/abs/2411.16602)
Keywords: generation
Abstract: Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.
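Summary: Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite these advantages, creating high-quality SVG content remains challenging because it demands technical expertise with professional editing software and considerable time to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they are still limited in shape regularity, generalization ability, and expressiveness. To address these challenges, the authors introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. The approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline then refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. In addition, the system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.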

Title: Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

Authors: Zhichao Zhang, Wei Sun, Xinyue Li, Yunhao Li, Qihang Ge, Jun Jia, Zicheng Zhang, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16619
Pdf URL: https://arxiv.org/pdf/2411.16619
Copy Paste: [[2411.16619]] Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric(https://arxiv.org/abs/2411.16619)
Keywords: generation, quality assessment
Abstract: AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often exhibit substantial visual and semantic distortions, hindering the practical application of video generation technologies in real-world scenarios. To address this challenge, we conduct a pioneering study on human activity AGV quality assessment, focusing on visual quality evaluation and the identification of semantic distortions. First, we construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 3,200 AGVs derived from 8 popular text-to-video (T2V) models using 400 text prompts that describe diverse human activities. We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. Based on Human-AGVQA, we benchmark the performance of T2V models and analyze their strengths and weaknesses in generating different categories of human activities. Second, we develop an objective evaluation metric, named AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human activity AGVs. The extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its efficacy in assessing the quality of human activity AGVs. The Human-AGVQA dataset and GHVQ metric will be released in public at this https URL
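Summary: AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often show substantial visual and semantic distortions, which hinders the practical use of video generation technologies in real-world scenarios. To address this challenge, the authors conduct a pioneering study on quality assessment of human-activity AGVs, focusing on visual quality evaluation and the identification of semantic distortions. First, they construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 3,200 AGVs produced by 8 popular text-to-video (T2V) models from 400 text prompts describing diverse human activities. They conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of the AGVs, and to identify semantic issues in human body parts. Based on Human-AGVQA, they benchmark the T2V models and analyze their strengths and weaknesses in generating different categories of human activity. Second, they develop an objective evaluation metric, the AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human-activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human-activity AGVs. Extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its effectiveness in assessing the quality of human-activity AGVs. The Human-AGVQA dataset and the GHVQ metric will be released publicly at this https URL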

Title: Imperceptible Adversarial Examples in the Physical World

Authors: Weilin Xu, Sebastian Szyller, Cory Cornelius, Luis Murillo Rojas, Marius Arvinte, Alvaro Velasquez, Jason Martin, Nageen Himayat
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16622
Pdf URL: https://arxiv.org/pdf/2411.16622
Copy Paste: [[2411.16622]] Imperceptible Adversarial Examples in the Physical World(https://arxiv.org/abs/2411.16622)
Keywords: generation
Abstract: Adversarial examples in the digital domain against deep learning-based computer vision models allow for perturbations that are imperceptible to human eyes. However, producing similar adversarial examples in the physical world has been difficult due to the non-differentiable image distortion functions in visual sensing systems. The existing algorithms for generating physically realizable adversarial examples often loosen their definition of adversarial examples by allowing unbounded perturbations, resulting in obvious or even strange visual patterns. In this work, we make adversarial examples imperceptible in the physical world using a straight-through estimator (STE, a.k.a. BPDA). We employ STE to overcome the non-differentiability -- applying exact, non-differentiable distortions in the forward pass of the backpropagation step, and using the identity function in the backward pass. Our differentiable rendering extension to STE also enables imperceptible adversarial patches in the physical world. Using printout photos, and experiments in the CARLA simulator, we show that STE enables fast generation of $\ell_\infty$ bounded adversarial examples despite the non-differentiable distortions. To the best of our knowledge, this is the first work demonstrating imperceptible adversarial examples bounded by small $\ell_\infty$ norms in the physical world that force zero classification accuracy in the global perturbation threat model and cause near-zero ($4.22\%$) AP50 in object detection in the patch perturbation threat model. We urge the community to re-evaluate the threat of adversarial examples in the physical world.
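The straight-through idea in the abstract (exact, non-differentiable distortion in the forward pass, identity in the backward pass) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: `quantize` is a hypothetical stand-in for a sensing distortion, the classifier is a hypothetical two-feature linear model, and the update is a plain signed-gradient step projected onto an $\ell_\infty$ ball.

```python
def quantize(x, levels=5):
    """Hypothetical non-differentiable distortion: clamp to [0, 1], snap to a grid."""
    return [round(min(max(v, 0.0), 1.0) * (levels - 1)) / (levels - 1) for v in x]


def score(w, b, x):
    """Toy linear classifier; a positive score means the original class is kept."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(g):
    return (g > 0) - (g < 0)


def ste_attack(x, w, b, eps=0.1, step=0.02, iters=20):
    """Search for an l_inf-bounded perturbation that survives the distortion."""
    x_adv = list(x)
    for _ in range(iters):
        # Forward pass: evaluate the classifier on the exactly distorted input.
        if score(w, b, quantize(x_adv)) < 0:
            break  # the distorted input is already misclassified
        # Backward pass (STE): treat d quantize / d x as the identity, so the
        # gradient of the linear score with respect to x_adv is just w.
        grad = w
        # Signed-gradient step that lowers the score, then projection onto the
        # l_inf ball of radius eps around x and onto the valid range [0, 1].
        x_adv = [
            min(max(xa - step * sign(g), xi - eps, 0.0), xi + eps, 1.0)
            for xa, g, xi in zip(x_adv, grad, x)
        ]
    return x_adv


# Illustrative values only (assumptions, not data from the paper).
x, w, b = [0.70, 0.30], [1.0, -1.0], -0.25
x_adv = ste_attack(x, w, b)
print("clean score:", score(w, b, quantize(x)))
print("adversarial score:", score(w, b, quantize(x_adv)))
print("max perturbation:", max(abs(a - c) for a, c in zip(x_adv, x)))
```

The forward evaluation always goes through the real distortion, so the stopping test reflects what the distorted classifier actually sees; only the gradient is approximated.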

Title: Exploring Discrete Flow Matching for 3D De Novo Molecule Generation

Authors: Ian Dunn, David R. Koes
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2411.16644
Pdf URL: https://arxiv.org/pdf/2411.16644
Copy Paste: [[2411.16644]] Exploring Discrete Flow Matching for 3D De Novo Molecule Generation(https://arxiv.org/abs/2411.16644)
Keywords: generation, generative
Abstract: Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Flow matching is a recently proposed generative modeling framework that has achieved impressive performance on a variety of tasks including those on biomolecular structures. The seminal flow matching framework was developed only for continuous data. However, de novo molecular design tasks require generating discrete data such as atomic elements or sequences of amino acid residues. Several discrete flow matching methods have been proposed recently to address this gap. In this work, we benchmark the performance of existing discrete flow matching methods for 3D de novo small molecule generation and provide explanations of their differing behavior. As a result, we present FlowMol-CTMC, an open-source model that achieves state-of-the-art performance for 3D de novo design with fewer learnable parameters than existing methods. Additionally, we propose the use of metrics that capture molecule quality beyond local chemical valency constraints and towards higher-order structural motifs. These metrics show that even though basic constraints are satisfied, the models tend to produce unusual and potentially problematic functional groups outside of the training data distribution. Code and trained models for reproducing this work are available at this https URL.

Title: DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

Authors: Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16657
Pdf URL: https://arxiv.org/pdf/2411.16657
Copy Paste: [[2411.16657]] DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation(https://arxiv.org/abs/2411.16657)
Keywords: generation
Abstract: Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method. First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module (SR3AI) for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.

Title: Factorized Visual Tokenization and Generation

Authors: Zechen Bai, Jianxiong Gao, Ziteng Gao, Pichao Wang, Zheng Zhang, Tong He, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16681
Pdf URL: https://arxiv.org/pdf/2411.16681
Copy Paste: [[2411.16681]] Factorized Visual Tokenization and Generation(https://arxiv.org/abs/2411.16681)
Keywords: generation
Abstract: Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy, promoting diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models like CLIP and DINO to infuse semantic richness into the learned representations. This design ensures our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted into auto-regressive image generation. this https URL
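The codebook factorization described in the abstract can be sketched as several small nearest-neighbor lookups in place of one lookup over a very large codebook. The sketch below is illustrative only and rests on assumptions not stated in the abstract: the feature vector is split into contiguous slices, one per sub-codebook (a product-quantization-style choice), and `nearest` and `factorized_quantize` are hypothetical helper names. It does not include the disentanglement regularization or the representation learning objective.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def factorized_quantize(feature, sub_codebooks):
    """Quantize each slice of `feature` against its own sub-codebook.

    Returns one index per sub-codebook and the concatenated reconstruction.
    Assumption: slice widths are given by the entry width of each sub-codebook.
    """
    indices, recon, start = [], [], 0
    for cb in sub_codebooks:
        dim = len(cb[0])
        idx = nearest(cb, feature[start:start + dim])
        indices.append(idx)
        recon.extend(cb[idx])
        start += dim
    return indices, recon


# Illustrative values only (assumptions, not from the paper).
sub_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],              # sub-codebook 0: 2 entries
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],  # sub-codebook 1: 3 entries
]
feature = [0.9, 0.8, 0.4, 0.6]
indices, recon = factorized_quantize(feature, sub_codebooks)

# Entries compared per lookup is the sum of sub-codebook sizes, while the
# number of distinct index tuples that can be expressed is their product.
compared = sum(len(cb) for cb in sub_codebooks)
expressible = 1
for cb in sub_codebooks:
    expressible *= len(cb)
print(indices, recon, compared, expressible)
```

With k sub-codebooks of n entries each, a lookup under this slicing assumption compares k·n entries while the index tuples span n^k combinations, which is one way to read the abstract's claim about reduced lookup complexity.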

Title: Generative Omnimatte: Learning to Decompose Video into Layers

Authors: Yao-Chih Lee, Erika Lu, Sarah Rumbley, Michal Geyer, Jia-Bin Huang, Tali Dekel, Forrester Cole
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16683
Pdf URL: https://arxiv.org/pdf/2411.16683
Copy Paste: [[2411.16683]] Generative Omnimatte: Learning to Decompose Video into Layers(https://arxiv.org/abs/2411.16683)
Keywords: generative
Abstract: Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.