Copy Paste: [[2411.15199]] Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation(https://arxiv.org/abs/2411.15199)
Keywords: generation, generative
Abstract: With the development of artificial intelligence, more and more attention has been put onto generative models, which represent the creativity, a very important aspect of intelligence. In recent years, diffusion models have been studied and proven to be more reasonable and effective than previous methods. However, common diffusion frameworks suffer from controllability problems. Although extra conditions have been considered by some work to guide the diffusion process for a specific target generation, it only controls the generation result but not its process. In this work, we propose a new adaptive framework, $\textit{Adaptively Controllable Diffusion (AC-Diff) Model}$, to automatically and fully control the generation process, including not only the type of generation result but also the length and parameters of the generation process. Both inputs and conditions will be first fed into a $\textit{Conditional Time-Step (CTS) Module}$ to determine the number of steps needed for a generation. Then according to the length of the process, the diffusion rate parameters will be estimated through our $\textit{Adaptive Hybrid Noise Schedule (AHNS) Module}$. We further train the network with the corresponding adaptive sampling mechanism to learn how to adjust itself according to the conditions for the overall performance improvement. To enable its practical applications, AC-Diff is expected to largely reduce the average number of generation steps and execution time while maintaining the same performance as done in the literature diffusion models.
Copy Paste: [[2411.15203]] Multimodal large language model for wheat breeding: a new exploration of smart breeding(https://arxiv.org/abs/2411.15203)
Keywords: generation
Abstract: UAV remote sensing technology has become a key technology in crop breeding, which can achieve high-throughput and non-destructive collection of crop phenotyping data. However, the multidisciplinary nature of breeding has brought technical barriers and efficiency challenges to knowledge mining. Therefore, it is important to develop a smart breeding goal tool to mine cross-domain multimodal data. Based on different pre-trained open-source multimodal large language models (MLLMs) (e.g., Qwen-VL, InternVL, Deepseek-VL), this study used supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF) technologies to inject cross-domain knowledge into MLLMs, thereby constructing multiple multimodal large language models for wheat breeding (WBLMs). The above WBLMs were evaluated using the newly created evaluation benchmark in this study. The results showed that the WBLM constructed using SFT, RAG and RLHF technologies and InternVL2-8B has leading performance. Then, subsequent experiments were conducted using the WBLM. Ablation experiments indicated that the combination of SFT, RAG, and RLHF technologies can improve the overall generation performance, enhance the generated quality, balance the timeliness and adaptability of the generated answer, and reduce hallucinations and biases. The WBLM performed best in wheat yield prediction using cross-domain data (remote sensing, phenotyping, weather, germplasm) simultaneously, with R2 and RMSE of 0.821 and 489.254 kg/ha, respectively. Furthermore, the WBLM can generate professional decision support answers for phenotyping estimation, environmental stress assessment, target germplasm screening, cultivation technique recommendation, and seed price query tasks.
Copy Paste: [[2411.15205]] DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh(https://arxiv.org/abs/2411.15205)
Keywords: generation
Abstract: Text-driven avatar generation has gained significant attention owing to its convenience. However, existing methods typically model the human body with all garments as a single 3D model, limiting its usability, such as clothing replacement, and reducing user control over the generation process. To overcome the limitations above, we propose DAGSM, a novel pipeline that generates disentangled human bodies and garments from the given text prompts. Specifically, we model each part (e.g., body, upper/lower clothes) of the clothed human as one GS-enhanced mesh (GSM), which is a traditional mesh attached with 2D Gaussians to better handle complicated textures (e.g., woolen, translucent clothes) and produce realistic cloth animations. During the generation, we first create the unclothed body, followed by a sequence of individual cloth generation based on the body, where we introduce a semantic-based algorithm to achieve better human-cloth and garment-garment separation. To improve texture quality, we propose a view-consistent texture refinement module, including a cross-view attention mechanism for texture style consistency and an incident-angle-weighted denoising (IAW-DE) strategy to update the appearance. Extensive experiments have demonstrated that DAGSM generates high-quality disentangled avatars, supports clothing replacement and realistic animation, and outperforms the baselines in visual quality.
Copy Paste: [[2411.15215]] S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning(https://arxiv.org/abs/2411.15215)
Keywords: generation
Abstract: Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. However, existing antibody specific models have a notable limitation that they lack explicit consideration for antibody structural information, despite the fact that both 1D sequence and 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporated with two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S$^2$ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S$^2$ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S$^2$ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody specific understanding and generation tasks. S$^2$ALM's ability to model comprehensive and generalized representations further positions its potential to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.
Title: Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry
Authors: Yoel Zimmermann, Adib Bazgir, Zartashia Afzal, Fariha Agbere, Qianxiang Ai, Nawaf Alampara, Alexander Al-Feghali, Mehrad Ansari, Dmytro Antypov, Amro Aswad, Jiaru Bai, Viktoriia Baibakova, Devi Dutta Biswajeet, Erik Bitzek, Joshua D. Bocarsly, Anna Borisova, Andres M Bran, L. Catherine Brinson, Marcel Moran Calderon, Alessandro Canalicchio, Victor Chen, Yuan Chiang, Defne Circi, Benjamin Charmes, Vikrant Chaudhary, Zizhang Chen, Min-Hsueh Chiu, Judith Clymo, Kedar Dabhadkar, Nathan Daelman, Archit Datar, Matthew L. Evans, Maryam Ghazizade Fard, Giuseppe Fisicaro, Abhijeet Sadashiv Gangan, Janine George, Jose D. Cojal Gonzalez, Michael Götte, Ankur K. Gupta, Hassan Harb, Pengyu Hong, Abdelrahman Ibrahim, Ahmed Ilyas, Alishba Imran, Kevin Ishimwe, Ramsey Issa, Kevin Maik Jablonka, Colin Jones, Tyler R. Josephson, Greg Juhasz, Sarthak Kapoor, Rongda Kang, Ghazal Khalighinejad, Sartaaj Khan, Sascha Klawohn, Suneel Kuman, Alvin Noe Ladines, Sarom Leang, Magdalena Lederbauer, Sheng-Lun Mark Liao, Hao Liu, Xuefeng Liu, Stanley Lo, Sandeep Madireddy, Piyush Ranjan Maharana, Shagun Maheshwari, Soroush Mahjoubi, José A. Márquez, Rob Mills, Trupti Mohanty, Bernadette Mohr, Seyed Mohamad Moosavi, Alexander Moßhammer, Amirhossein D. Naghdi, Aakash Naik, Oleksandr Narykov, Hampus Näsström, Xuan Vu Nguyen, Xinyi Ni, Dana O'Connor, Teslim Olayiwola, Federico Ottomano, Aleyna Beste Ozhan, Sebastian Pagel, Chiku Parida, Jaehee Park, Vraj Patel, Elena Patyukova, Martin Hoffmann Petersen, Luis Pinto, José M. Pizarro, Dieter Plessers, Tapashree Pradhan, Utkarsh Pratiush, Charishma Puli, Andrew Qin, Mahyar Rajabi, Francesco Ricci, Elliot Risch, Martiño Ríos-García
Copy Paste: [[2411.15221]] Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry(https://arxiv.org/abs/2411.15221)
Keywords: generation
Abstract: Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as brief papers in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year's hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and platforms for rapid prototyping custom applications in scientific research.
Copy Paste: [[2411.15236]] Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps(https://arxiv.org/abs/2411.15236)
Keywords: generation
Abstract: In text-to-image diffusion models, the cross-attention map of each text token indicates the specific image regions attended. Comparing these maps of syntactically related tokens provides insights into how well the generated image reflects the text prompt. For example, in the prompt, "a black car and a white clock", the cross-attention maps for "black" and "car" should focus on overlapping regions to depict a black car, while "car" and "clock" should not. Incorrect overlapping in the maps generally produces generation flaws such as missing objects and incorrect attribute binding. Our study makes the key observations investigating this issue in the existing text-to-image models:(1) the similarity in text embeddings between different tokens -- used as conditioning inputs -- can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to faithfully capture syntactic relations already within text attention maps. As a result, such syntactic relationships can be overlooked in cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization. Our approach leverages this inherent yet unexploited information within text attention maps to enhance image-text semantic alignment across diverse prompts, without relying on external guidance.
Copy Paste: [[2411.15245]] AnyText2: Visual Text Generation and Editing With Customizable Attributes(https://arxiv.org/abs/2411.15245)
Keywords: generation
Abstract: As the text-to-image (T2I) domain progresses, generating text that seamlessly integrates with visual content has garnered significant attention. However, even with accurate text generation, the inability to control font and color can greatly limit certain applications, and this issue remains insufficiently addressed. This paper introduces AnyText2, a novel method that enables precise control over multilingual text attributes in natural scene image generation and editing. Our approach consists of two main components. First, we propose a WriteNet+AttnX architecture that injects text rendering capabilities into a pre-trained T2I model. Compared to its predecessor, AnyText, our new approach not only enhances image realism but also achieves a 19.8% increase in inference speed. Second, we explore techniques for extracting fonts and colors from scene images and develop a Text Embedding Module that encodes these text attributes separately as conditions. As an extension of AnyText, this method allows for customization of attributes for each line of text, leading to improvements of 3.3% and 9.3% in text accuracy for Chinese and English, respectively. Through comprehensive experiments, we demonstrate the state-of-the-art performance of our method. The code and model will be made open-source in this https URL.
Abstract: Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to timestep-distilled DMs is challenging for ultra-fast ($\le2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets image generation $\le2$ steps for reward optimization, enhancing generalizability and efficiency. LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including PPO and DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights. See our webpage at this https URL.
Copy Paste: [[2411.15252]] LocRef-Diffusion:Tuning-Free Layout and Appearance-Guided Generation(https://arxiv.org/abs/2411.15252)
Keywords: generation
Abstract: Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances' appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve the appearance fidelity to reference images, we employ an appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout and appearance guided generation.
Copy Paste: [[2411.15262]] MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation(https://arxiv.org/abs/2411.15262)
Keywords: generation
Abstract: Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges by providing unique contributions: (1) movie-length videos featuring rich, coherent storylines and multi-scene narratives, (2) consistency of character appearance and audio across scenes, and (3) hierarchical data structure contains high-level movie information and detailed shot-level descriptions. Experiments demonstrate that MovieBench brings some new insights and challenges, such as maintaining character ID consistency across multiple scenes for various characters. The dataset will be public and continuously maintained, aiming to advance the field of long video generation. Data can be found at: this https URL.
摘要:视频生成模型的最新进展(如稳定视频扩散)显示出有希望的结果,但主要集中在短的单场景视频上。这些模型难以生成涉及多个场景、连贯的叙述和一致的角色的长视频。此外,没有专门用于分析、评估和训练长视频生成模型的公开数据集。在本文中,我们提出了 MovieBench:用于长视频生成的分层电影级数据集,它通过提供独特的贡献来解决这些挑战:(1)具有丰富、连贯的故事情节和多场景叙述的电影长度视频,(2)角色外观和音频在场景之间的一致性,以及(3)分层数据结构包含高级电影信息和详细的镜头级描述。实验表明,MovieBench 带来了一些新的见解和挑战,例如在多个场景中为各种角色保持角色 ID 的一致性。该数据集将公开并持续维护,旨在推动长视频生成领域的发展。数据可以在以下位置找到:此 https URL。
Title: Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI
Authors: Won Jun Kim, Hyungjin Chung, Jaemin Kim, Sangmin Lee, Byeongsu Sim, Jong Chul Ye
Copy Paste: [[2411.15265]] Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI(https://arxiv.org/abs/2411.15265)
Keywords: generation
Abstract: Gradient-based methods are a prototypical family of explainability techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrainted Gradients (FreeMCG), a novel method that serves as an improved basis for explainability of a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model's gradient projected onto the data manifold, requiring access only to the model's outputs. We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks, counterfactual explanation and feature attribution, we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.
Copy Paste: [[2411.15271]] EADReg: Probabilistic Correspondence Generation with Efficient Autoregressive Diffusion Model for Outdoor Point Cloud Registration(https://arxiv.org/abs/2411.15271)
Keywords: generation
Abstract: Diffusion models have shown the great potential in the point cloud registration (PCR) task, especially for enhancing the robustness to challenging cases. However, existing diffusion-based PCR methods primarily focus on instance-level scenarios and struggle with outdoor LiDAR points, where the sparsity, irregularity, and huge point scale inherent in LiDAR points pose challenges to establishing dense global point-to-point correspondences. To address this issue, we propose a novel framework named EADReg for efficient and robust registration of LiDAR point clouds based on autoregressive diffusion models. EADReg follows a coarse-to-fine registration paradigm. In the coarse stage, we employ a Bi-directional Gaussian Mixture Model (BGMM) to reject outlier points and obtain purified point cloud pairs. BGMM establishes correspondences between the Gaussian Mixture Models (GMMs) from the source and target frames, enabling reliable coarse registration based on filtered features and geometric information. In the fine stage, we treat diffusion-based PCR as an autoregressive process to generate robust point correspondences, which are then iteratively refined on upper layers. Despite common criticisms of diffusion-based methods regarding inference speed, EADReg achieves runtime comparable to convolutional-based methods. Extensive experiments on the KITTI and NuScenes benchmark datasets highlight the state-of-the-art performance of our proposed method. Codes will be released upon publication.
Abstract: Facial personalization represents a crucial downstream task in the domain of text-to-image generation. To preserve identity fidelity while ensuring alignment with user-defined prompts, current mainstream frameworks for facial personalization predominantly employ identity embedding mechanisms to associate identity information with textual embeddings. However, our experiments show that identity embeddings compromise the effectiveness of other tokens within the prompt, thereby hindering high prompt consistency, particularly when prompts involve multiple facial attributes. Moreover, previous works overlook the fact that their corresponding foundation models hold great potential to generate faces aligning to prompts well and can be easily leveraged to cure these ill-aligned attributes in personalized models. Building upon these insights, we propose FreeCure, a training-free framework that harnesses the intrinsic knowledge from the foundation models themselves to improve the prompt consistency of personalization models. First, by extracting cross-attention and semantic maps from the denoising process of foundation models, we identify easily localized attributes (e.g., hair, accessories, etc). Second, we enhance multiple attributes in the outputs of personalization models through a novel noise-blending strategy coupled with an inversion-based process. Our approach offers several advantages: it eliminates the need for training; it effectively facilitates the enhancement for a wide array of facial attributes in a non-intrusive manner; and it can be seamlessly integrated into existing popular personalization models. FreeCure has demonstrated significant improvements in prompt consistency across a diverse set of state-of-the-art facial personalization models while maintaining the integrity of original identity fidelity.
Copy Paste: [[2411.15279]] Don't Mesh with Me: Generating Constructive Solid Geometry Instead of Meshes by Fine-Tuning a Code-Generation LLM(https://arxiv.org/abs/2411.15279)
Keywords: generation
Abstract: While recent advancements in machine learning, such as LLMs, are revolutionizing software development and creative industries, they have had minimal impact on engineers designing mechanical parts, which remains largely a manual process. Existing approaches to generate 3D geometry most commonly use meshes as a 3D representation. While meshes are suitable for assets in video games or animations, they lack sufficient precision and adaptability for mechanical engineering purposes. This paper introduces a novel approach for the generation of 3D geometry that generates surface-based Constructive Solid Geometry (CSG) by leveraging a code-generation LLM. First, we create a dataset of 3D mechanical parts represented as code scripts by converting Boundary Representation geometry (BREP) into CSG-based Python scripts. Second, we create annotations in natural language using GPT-4. The resulting dataset is used to fine-tune a code-generation LLM. The fine-tuned LLM can complete geometries based on positional input and natural language in a plausible way, demonstrating geometric understanding.
摘要:虽然机器学习领域的最新进展(例如 LLM)正在彻底改变软件开发和创意产业,但它们对工程师设计机械零件的影响却微乎其微,因为机械零件设计在很大程度上仍是一个手动过程。现有的生成 3D 几何图形的方法最常使用网格作为 3D 表示。虽然网格适用于视频游戏或动画中的资产,但它们缺乏足够的精度和适应性,无法满足机械工程用途。本文介绍了一种生成 3D 几何图形的新方法,该方法利用代码生成 LLM 生成基于表面的构造性实体几何图形 (CSG)。首先,我们通过将边界表示几何图形 (BREP) 转换为基于 CSG 的 Python 脚本,创建以代码脚本表示的 3D 机械零件数据集。其次,我们使用 GPT-4 以自然语言创建注释。生成的数据集用于微调代码生成 LLM。经过微调的 LLM 可以根据位置输入和自然语言以合理的方式完成几何图形,从而展示几何理解。
Title: There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks
Authors: Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley
Copy Paste: [[2411.15288]] There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks(https://arxiv.org/abs/2411.15288)
Keywords: generation
Abstract: The Segment Anything Model (SAM) was originally designed for label-agnostic mask generation. Does this model also possess inherent semantic understanding, of value to broader visual tasks? In this work we follow a multi-staged approach towards exploring this question. We firstly quantify SAM's semantic capabilities by comparing base image encoder efficacy under classification tasks, in comparison with established models (CLIP and DINOv2). Our findings reveal a significant lack of semantic discriminability in SAM feature representations, limiting potential for tasks that require class differentiation. This initial result motivates our exploratory study that attempts to enable semantic information via in-context learning with lightweight fine-tuning where we observe that generalisability to unseen classes remains limited. Our observations culminate in the proposal of a training-free approach that leverages DINOv2 features, towards better endowing SAM with semantic understanding and achieving instance-level class differentiation through feature-based similarity. Our study suggests that incorporation of external semantic sources provides a promising direction for the enhancement of SAM's utility with respect to complex visual tasks that require semantic understanding.
摘要:任何分割模型 (SAM) 最初是为标签无关的掩码生成而设计的。这个模型是否也具有固有的语义理解,对更广泛的视觉任务有价值?在这项工作中,我们采用多阶段方法来探索这个问题。我们首先通过比较分类任务下的基本图像编码器效率与已建立的模型 (CLIP 和 DINOv2) 来量化 SAM 的语义能力。我们的研究结果表明,SAM 特征表示中存在严重的语义辨别能力不足,限制了需要类别区分的任务的潜力。这一初步结果激发了我们的探索性研究,该研究尝试通过轻量级微调的上下文学习来启用语义信息,我们观察到对看不见的类别的通用性仍然有限。我们的观察最终提出了一种利用 DINOv2 特征的无训练方法,旨在更好地赋予 SAM 语义理解并通过基于特征的相似性实现实例级类别区分。我们的研究表明,结合外部语义源为增强 SAM 在需要语义理解的复杂视觉任务中的实用性提供了一个有希望的方向。
Title: Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage
Authors: Soumil Datta, Shih-Chieh Dai, Leo Yu, Guanhong Tao
Copy Paste: [[2411.15367]] Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage(https://arxiv.org/abs/2411.15367)
Keywords: generation, generative
Abstract: Text-to-image diffusion models, such as Stable Diffusion, have shown exceptional potential in generating high-quality images. However, recent studies highlight concerns over the use of unauthorized data in training these models, which may lead to intellectual property infringement or privacy violations. A promising approach to mitigate these issues is to apply a watermark to images and subsequently check if generative models reproduce similar watermark features. In this paper, we examine the robustness of various watermark-based protection methods applied to text-to-image models. We observe that common image transformations are ineffective at removing the watermark effect. Therefore, we propose \tech{}, that leverages the diffusion process to conduct controlled image generation on the protected input, preserving the high-level features of the input while ignoring the low-level details utilized by watermarks. A small number of generated images are then used to fine-tune protected models. Our experiments on three datasets and 140 text-to-image diffusion models reveal that existing state-of-the-art protections are not robust against RATTAN.
Copy Paste: [[2411.15393]] Gradient-Free Classifier Guidance for Diffusion Model Sampling(https://arxiv.org/abs/2411.15393)
Keywords: generation
Abstract: Image generation using diffusion models have demonstrated outstanding learning capabilities, effectively capturing the full distribution of the training dataset. They are known to generate wide variations in sampled images, albeit with a trade-off in image fidelity. Guided sampling methods, such as classifier guidance (CG) and classifier-free guidance (CFG), focus sampling in well-learned high-probability regions to generate images of high fidelity, but each has its limitations. CG is computationally expensive due to the use of back-propagation for classifier gradient descent, while CFG, being gradient-free, is more efficient but compromises class label alignment compared to CG. In this work, we propose an efficient guidance method that fully utilizes a pre-trained classifier without using gradient descent. By using the classifier solely in inference mode, a time-adaptive reference class label and corresponding guidance scale are determined at each time step for guided sampling. Experiments on both class-conditioned and text-to-image generation diffusion models demonstrate that the proposed Gradient-free Classifier Guidance (GFCG) method consistently improves class prediction accuracy. We also show GFCG to be complementary to other guided sampling methods like CFG. When combined with the state-of-the-art Autoguidance (ATG), without additional computational overhead, it enhances image fidelity while preserving diversity. For ImageNet 512$\times$512, we achieve a record $\text{FD}_{\text{DINOv2}}$ of 23.09, while simultaneously attaining a higher classification Precision (94.3%) compared to ATG (90.2%)
Title: FG-CXR: A Radiologist-Aligned Gaze Dataset for Enhancing Interpretability in Chest X-Ray Report Generation
Authors: Trong Thang Pham, Ngoc-Vuong Ho, Nhat-Tan Bui, Thinh Phan, Patel Brijesh, Donald Adjeroh, Gianfranco Doretto, Anh Nguyen, Carol C. Wu, Hien Nguyen, Ngan Le
Copy Paste: [[2411.15413]] FG-CXR: A Radiologist-Aligned Gaze Dataset for Enhancing Interpretability in Chest X-Ray Report Generation(https://arxiv.org/abs/2411.15413)
Keywords: generation
Abstract: Developing an interpretable system for generating reports in chest X-ray (CXR) analysis is becoming increasingly crucial in Computer-aided Diagnosis (CAD) systems, enabling radiologists to comprehend the decisions made by these systems. Despite the growth of diverse datasets and methods focusing on report generation, there remains a notable gap in how closely these models' generated reports align with the interpretations of real radiologists. In this study, we tackle this challenge by initially introducing Fine-Grained CXR (FG-CXR) dataset, which provides fine-grained paired information between the captions generated by radiologists and the corresponding gaze attention heatmaps for each anatomy. Unlike existing datasets that include a raw sequence of gaze alongside a report, with significant misalignment between gaze location and report content, our FG-CXR dataset offers a more grained alignment between gaze attention and diagnosis transcript. Furthermore, our analysis reveals that simply applying black-box image captioning methods to generate reports cannot adequately explain which information in CXR is utilized and how long needs to attend to accurately generate reports. Consequently, we propose a novel explainable radiologist's attention generator network (Gen-XAI) that mimics the diagnosis process of radiologists, explicitly constraining its output to closely align with both radiologist's gaze attention and transcript. Finally, we perform extensive experiments to illustrate the effectiveness of our method. Our datasets and checkpoint is available at this https URL.
Copy Paste: [[2411.15420]] Semi-supervised Single-view 3D Reconstruction via Multi Shape Prior Fusion Strategy and Self-Attention(https://arxiv.org/abs/2411.15420)
Keywords: generation
Abstract: In the domain of single-view 3D reconstruction, traditional techniques have frequently relied on expensive and time-intensive 3D annotation data. Facing the challenge of annotation acquisition, semi-supervised learning strategies offer an innovative approach to reduce the dependence on labeled data. Despite these developments, the utilization of this learning paradigm in 3D reconstruction tasks remains relatively constrained. In this research, we created an innovative semi-supervised framework for 3D reconstruction that distinctively uniquely introduces a multi shape prior fusion strategy, intending to guide the creation of more realistic object structures. Additionally, to improve the quality of shape generation, we integrated a self-attention module into the traditional decoder. In benchmark tests on the ShapeNet dataset, our method substantially outperformed existing supervised learning methods at diverse labeled ratios of 1\%, 10\%, and 20\%. Moreover, it showcased excellent performance on the real-world Pix3D dataset. Through comprehensive experiments on ShapeNet, our framework demonstrated a 3.3\% performance improvement over the baseline. Moreover, stringent ablation studies further confirmed the notable effectiveness of our approach. Our code has been released on this https URL
摘要:在单视图 3D 重建领域,传统技术通常依赖于昂贵且耗时的 3D 注释数据。面对注释获取的挑战,半监督学习策略提供了一种创新方法来减少对标记数据的依赖。尽管取得了这些进展,但这种学习范式在 3D 重建任务中的利用仍然相对受限。在这项研究中,我们创建了一个创新的半监督 3D 重建框架,该框架独特地引入了多形状先验融合策略,旨在指导创建更逼真的对象结构。此外,为了提高形状生成的质量,我们将自注意力模块集成到传统解码器中。在 ShapeNet 数据集的基准测试中,我们的方法在 1%、10% 和 20% 的不同标记率下大大优于现有的监督学习方法。此外,它在现实世界的 Pix3D 数据集上表现出色。通过对 ShapeNet 的全面实验,我们的框架比基线性能提高了 3.3%。此外,严格的消融研究进一步证实了我们方法的显著有效性。我们的代码已在此 https URL 上发布
Title: Learning a local trading strategy: deep reinforcement learning for grid-scale renewable energy integration
Copy Paste: [[2411.15422]] Learning a local trading strategy: deep reinforcement learning for grid-scale renewable energy integration(https://arxiv.org/abs/2411.15422)
Keywords: generation
Abstract: Variable renewable generation increases the challenge of balancing power supply and demand. Grid-scale batteries co-located with generation can help mitigate this misalignment. This paper explores the use of reinforcement learning (RL) for operating grid-scale batteries co-located with solar power. Our results show RL achieves an average of 61% (and up to 96%) of the approximate theoretical optimal (non-causal) operation, outperforming advanced control methods on average. Our findings suggest RL may be preferred when future signals are hard to predict. Moreover, RL has two significant advantages compared to simpler rules-based control: (1) that solar energy is more effectively shifted towards high demand periods, and (2) increased diversity of battery dispatch across different locations, reducing potential ramping issues caused by super-position of many similar actions.
Copy Paste: [[2411.15435]] What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation(https://arxiv.org/abs/2411.15435)
Keywords: generation
Abstract: While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.
Copy Paste: [[2411.15436]] ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance(https://arxiv.org/abs/2411.15436)
Keywords: generation
Abstract: Diffusion models have shown impressive potential on talking head generation. While plausible appearance and talking effect are achieved, these methods still suffer from temporal, 3D or expression inconsistency due to the error accumulation and inherent limitation of single-image generation ability. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly employing multi-modal conditions to the diffusion process, our method learns to first model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency feature and contours that vary significantly along the time axis. Using a temporal consistent diffusion module, we learn to align TSD of the initial result to that of the video frame ground truth. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, rough head normal, and emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking head. Further, its reliable guidance complements the inaccuracy of other conditions, suppressing the accumulated error while improving the consistency on various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms the state-of-the-art methods on the generated appearance, 3D, expression and temporal consistency. Project page: this https URL
Copy Paste: [[2411.15439]] Twin Trigger Generative Networks for Backdoor Attacks against Object Detection(https://arxiv.org/abs/2411.15439)
Keywords: generative
Abstract: Object detectors, which are widely used in real-world applications, are vulnerable to backdoor attacks. This vulnerability arises because many users rely on datasets or pre-trained models provided by third parties due to constraints on data and resources. However, most research on backdoor attacks has focused on image classification, with limited investigation into object detection. Furthermore, the triggers for most existing backdoor attacks on object detection are manually generated, requiring prior knowledge and consistent patterns between the training and inference stages. This approach makes the attacks either easy to detect or difficult to adapt to various scenarios. To address these limitations, we propose novel twin trigger generative networks in the frequency domain to generate invisible triggers for implanting stealthy backdoors into models during training, and visible triggers for steady activation during inference, making the attack process difficult to trace. Specifically, for the invisible trigger generative network, we deploy a Gaussian smoothing layer and a high-frequency artifact classifier to enhance the stealthiness of backdoor implantation in object detectors. For the visible trigger generative network, we design a novel alignment loss to optimize the visible triggers so that they differ from the original patterns but still align with the malicious activation behavior of the invisible triggers. Extensive experimental results and analyses prove the possibility of using different triggers in the training stage and the inference stage, and demonstrate the attack effectiveness of our proposed visible trigger and invisible trigger generative networks, significantly reducing the mAP_0.5 of the object detectors by 70.0% and 84.5%, including YOLOv5 and YOLOv7 with different settings, respectively.
Copy Paste: [[2411.15453]] Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy(https://arxiv.org/abs/2411.15453)
Keywords: generation
Abstract: Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. Multimodal Large Language Models (MLLMs) have inferior instruction-following ability compared to LLMs. However, there is a significant gap in the instruction-following capabilities between the MLLMs and LLMs. In this study, we conduct a pilot experiment, which demonstrates that spatially down-sampling visual tokens significantly enhances the instruction-following capability of MLLMs. This is attributed to the substantial redundancy in visual modality. However, this intuitive method severely impairs the MLLM's multimodal understanding capability. In this paper, we propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap between MLLMs and LLMs by inhibiting the influence of irrelevant visual tokens during content generation, increasing the instruction-following ability of the MLLMs while retaining their multimodal understanding capacity. In VMTC module, the primary tokens are retained and the redundant tokens are condensed by token clustering and merging. In CMAI process, we aggregate text-to-image attentions by text-to-text attentions to obtain a text-to-image focus score. Attention inhibition is performed on the text-image token pairs with low scores. Our comprehensive experiments over instruction-following capabilities and VQA-V2, GQA, TextVQA, MME and MMBench five benchmarks, demonstrate that proposed strategy significantly enhances the instruction following capability of MLLMs while preserving the ability to understand and process multimodal inputs.
Copy Paste: [[2411.15466]] Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator(https://arxiv.org/abs/2411.15466)
Keywords: generation
Abstract: Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: this https URL
Abstract: A signed distance function (SDF) is a useful representation for continuous-space geometry and many related operations, including rendering, collision checking, and mesh generation. Hence, reconstructing SDF from image observations accurately and efficiently is a fundamental problem. Recently, neural implicit SDF (SDF-NeRF) techniques, trained using volumetric rendering, have gained a lot of attention. Compared to earlier truncated SDF (TSDF) fusion algorithms that rely on depth maps and voxelize continuous space, SDF-NeRF enables continuous-space SDF reconstruction with better geometric and photometric accuracy. However, the accuracy and convergence speed of scene-level SDF reconstruction require further improvements for many applications. With the advent of 3D Gaussian Splatting (3DGS) as an explicit representation with excellent rendering quality and speed, several works have focused on improving SDF-NeRF by introducing consistency losses on depth and surface normals between 3DGS and SDF-NeRF. However, loss-level connections alone lead to incremental improvements. We propose a novel neural implicit SDF called "SplatSDF" to fuse 3DGSandSDF-NeRF at an architecture level with significant boosts to geometric and photometric accuracy and convergence speed. Our SplatSDF relies on 3DGS as input only during training, and keeps the same complexity and efficiency as the original SDF-NeRF during inference. Our method outperforms state-of-the-art SDF-NeRF models on geometric and photometric evaluation by the time of submission.
Copy Paste: [[2411.15472]] KinMo: Kinematic-aware Human Motion Understanding and Generation(https://arxiv.org/abs/2411.15472)
Keywords: generation
Abstract: Controlling human motion based on text presents an important challenge in computer vision. Traditional approaches often rely on holistic action descriptions for motion synthesis, which struggle to capture subtle movements of local body parts. This limitation restricts the ability to isolate and manipulate specific movements. To address this, we propose a novel motion representation that decomposes motion into distinct body joint group movements and interactions from a kinematic perspective. We design an automatic dataset collection pipeline that enhances the existing text-motion benchmark by incorporating fine-grained local joint-group motion and interaction descriptions. To bridge the gap between text and motion domains, we introduce a hierarchical motion semantics approach that progressively fuses joint-level interaction information into the global action-level semantics for modality alignment. With this hierarchy, we introduce a coarse-to-fine motion synthesis procedure for various generation and editing downstream applications. Our quantitative and qualitative experiments demonstrate that the proposed formulation enhances text-motion retrieval by improving joint-spatial understanding, and enables more precise joint-motion generation and control. Project Page: {\small\url{this https URL}}
Copy Paste: [[2411.15490]] Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation(https://arxiv.org/abs/2411.15490)
Keywords: generation
Abstract: Acute ischemic stroke (AIS) requires time-critical management, with hours of delayed intervention leading to an irreversible disability of the patient. Since diffusion weighted imaging (DWI) using the magnetic resonance image (MRI) plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods. Here, we propose paired image-domain retrieval and text-domain augmentation (PIRTA), a cross-modal retrieval-augmented generation (RAG) framework for providing clinician-interpretative AIS radiology reports with improved factuality. PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports. By exploiting the retrieved radiology reports to augment the report generation process of the query image, we show by experiments with extensive in-house and public datasets that PIRTA can accurately retrieve relevant reports from 3D DWI images. This approach enables the generation of radiology reports with significantly higher accuracy compared to direct image-to-text generation using state-of-the-art multimodal language models.
Copy Paste: [[2411.15497]] AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation(https://arxiv.org/abs/2411.15497)
Keywords: generation, generative
Abstract: Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e. AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on DIOR, DIOR-R, and HRSC datasets are improved by 3.7\%, 4.3\%, and 2.43\%, respectively. The code is available at this https URL.
Copy Paste: [[2411.15509]] Interactive Visual Assessment for Text-to-Image Generation Models(https://arxiv.org/abs/2411.15509)
Keywords: generation, generative
Abstract: Visual generation models have achieved remarkable progress in computer graphics applications but still face significant challenges in real-world deployment. Current assessment approaches for visual generation tasks typically follow an isolated three-phase framework: test input collection, model output generation, and user assessment. These fashions suffer from fixed coverage, evolving difficulty, and data leakage risks, limiting their effectiveness in comprehensively evaluating increasingly complex generation models. To address these limitations, we propose DyEval, an LLM-powered dynamic interactive visual assessment framework that facilitates collaborative evaluation between humans and generative models for text-to-image systems. DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors, while adaptively generating hierarchical, fine-grained, and diverse textual inputs to continuously probe the capability boundaries of the models based on their feedback. Additionally, to provide interpretable analysis for users to further improve tested models, we develop a contextual reflection module that mines failure triggers of test inputs and reflects model potential failure patterns supporting in-depth analysis using the logical reasoning ability of LLM. Qualitative and quantitative experiments demonstrate that DyEval can effectively help users identify max up to 2.56 times generation failures than conventional methods, and uncover complex and rare failure patterns, such as issues with pronoun generation and specific cultural context generation. Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems across various domains.
Copy Paste: [[2411.15537]] MUNBa: Machine Unlearning via Nash Bargaining(https://arxiv.org/abs/2411.15537)
Keywords: generation
Abstract: Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto front, effectively avoiding the gradient conflicts. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving superior performance on several benchmarks. For example, in the challenging scenario of sample-wise forgetting, our algorithm approaches the gold standard retrain baseline. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.
摘要:机器反学习 (MU) 旨在选择性地从模型中消除有害行为,同时保留模型的整体效用。作为一个多任务学习问题,MU 涉及平衡与忘记特定概念/数据和保持一般性能相关的目标。这些遗忘和保留目标的简单整合可能会导致梯度冲突,阻碍 MU 算法达到最佳解决方案。为了解决梯度冲突问题,我们将 MU 重新表述为双人合作游戏,其中两个玩家,即遗忘玩家和保留玩家,通过他们的梯度提议做出贡献,以最大化他们的整体收益。为此,受纳什讨价还价理论的启发,我们得出了一个闭式解决方案,以引导模型走向帕累托前沿,有效地避免梯度冲突。我们对 MU 的表述保证了一个平衡解决方案,其中任何偏离最终状态的行为都会导致两个玩家的总体目标减少,从而确保每个目标的最优性。我们评估了我们的算法在图像分类和图像生成的各种任务上的有效性。使用 ResNet、视觉语言模型 CLIP 和文本到图像扩散模型进行的大量实验表明,我们的方法优于最先进的 MU 算法,在多个基准上取得了优异的表现。例如,在样本遗忘这一具有挑战性的场景中,我们的算法接近黄金标准的再训练基线。我们的结果还突出了遗忘精度、泛化保持和对抗攻击的鲁棒性的改进。
Title: Large Language Model with Region-guided Referring and Grounding for CT Report Generation
Copy Paste: [[2411.15539]] Large Language Model with Region-guided Referring and Grounding for CT Report Generation(https://arxiv.org/abs/2411.15539)
Keywords: generation
Abstract: Computed tomography (CT) report generation is crucial to assist radiologists in interpreting CT volumes, which can be time-consuming and labor-intensive. Existing methods primarily only consider the global features of the entire volume, making it struggle to focus on specific regions and potentially missing abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve the local high-resolution details with little computational overhead. Then the local features are integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model's referring and grounding capabilities while also improving the report's interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code will be made publicly available.
Copy Paste: [[2411.15540]] Optical-Flow Guided Prompt Optimization for Coherent Video Generation(https://arxiv.org/abs/2411.15540)
Keywords: generation
Abstract: While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish optical flow between random pairs of frames from real videos and generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.
Abstract: Deep neural networks exhibit vulnerability to adversarial examples that can transfer across different models. A particularly challenging problem is developing transferable targeted attacks that can mislead models into predicting specific target classes. While various methods have been proposed to enhance attack transferability, they often incur substantial computational costs while yielding limited improvements. Recent clean feature mixup methods use random clean features to perturb the feature space but lack optimization for disrupting adversarial examples, overlooking the advantages of attack-specific perturbations. In this paper, we propose Feature Tuning Mixup (FTM), a novel method that enhances targeted attack transferability by combining both random and optimized noises in the feature space. FTM introduces learnable feature perturbations and employs an efficient stochastic update strategy for optimization. These learnable perturbations facilitate the generation of more robust adversarial examples with improved transferability. We further demonstrate that attack performance can be enhanced through an ensemble of multiple FTM-perturbed surrogate models. Extensive experiments on the ImageNet-compatible dataset across various models demonstrate that our method achieves significant improvements over state-of-the-art methods while maintaining low computational cost.
Abstract: Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.
Copy Paste: [[2411.15584]] FLD+: Data-efficient Evaluation Metric for Generative Models(https://arxiv.org/abs/2411.15584)
Keywords: generative
Abstract: We introduce a new metric to assess the quality of generated images that is more reliable, data-efficient, compute-efficient, and adaptable to new domains than the previous metrics, such as Fréchet Inception Distance (FID). The proposed metric is based on normalizing flows, which allows for the computation of density (exact log-likelihood) of images from any domain. Thus, unlike FID, the proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradations, including noise, occlusion, diffusion steps, and generative model size. Additionally, because normalizing flow can be trained stably and efficiently, FLD+ achieves stable results with two orders of magnitude fewer images than FID (which requires more images to reliably compute Fréchet distance between features of large samples of real and generated images). We made FLD+ computationally even more efficient by applying normalizing flows to features extracted in a lower-dimensional latent space instead of using a pre-trained network. We also show that FLD+ can easily be retrained on new domains, such as medical images, unlike the networks behind previous metrics -- such as InceptionNetV3 pre-trained on ImageNet.
Copy Paste: [[2411.15585]] Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing(https://arxiv.org/abs/2411.15585)
Keywords: generation
Abstract: Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters. The limitation lies in the insufficient exploration of character morphologies, including the monotonousness of widely used synthetic training data and the sensitivity of the model to character morphologies. To address these issues, inspired by the human learning process of viewing and summarizing, we facilitate the contrastive learning-based STR framework in a self-motivated manner by leveraging synthetic and real unlabeled data without any human cost. In the viewing process, to compensate for the simplicity of synthetic data and enrich character morphology diversity, we propose an Online Generation Strategy to generate background-free samples with diverse character styles. By excluding background noise distractions, the model is encouraged to focus on character morphology and generalize the ability to recognize complex samples when trained with only simple synthetic data. To boost the summarizing process, we theoretically demonstrate the derivation error in the previous character contrastive loss, which mistakenly causes the sparsity in the intra-class distribution and exacerbates ambiguity on challenging samples. Therefore, a new Character Unidirectional Alignment Loss is proposed to correct this error and unify the representation of the same characters in all samples by aligning the character features in the student model with the reference features in the teacher model. Extensive experiment results show that our method achieves SOTA performance (94.7\% and 70.9\% average accuracy on common benchmarks and Union14M-Benchmark). Code will be available at this https URL.
Copy Paste: [[2411.15706]] Fixing the Perspective: A Critical Examination of Zero-1-to-3(https://arxiv.org/abs/2411.15706)
Keywords: generation
Abstract: Novel view synthesis is a fundamental challenge in image-to-3D generation, requiring the generation of target view images from a set of conditioning images and their relative poses. While recent approaches like Zero-1-to-3 have demonstrated promising results using conditional latent diffusion models, they face significant challenges in generating consistent and accurate novel views, particularly when handling multiple conditioning images. In this work, we conduct a thorough investigation of Zero-1-to-3's cross-attention mechanism within the Spatial Transformer of the diffusion 2D-conditional UNet. Our analysis reveals a critical discrepancy between Zero-1-to-3's theoretical framework and its implementation, specifically in the processing of image-conditional context. We propose two significant improvements: (1) a corrected implementation that enables effective utilization of the cross-attention mechanism, and (2) an enhanced architecture that can leverage multiple conditional views simultaneously. Our theoretical analysis and preliminary results suggest potential improvements in novel view synthesis consistency and accuracy.
Copy Paste: [[2411.15714]] ROOT: VLM based System for Indoor Scene Understanding and Beyond(https://arxiv.org/abs/2411.15714)
Keywords: generation
Abstract: Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. By utilizing this enriched data, we conduct various training recipes and finish SceneVLM. Our experiments demonstrate that \rootname facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at \url{this https URL}.
Copy Paste: [[2411.15720]] Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks(https://arxiv.org/abs/2411.15720)
Keywords: generation
Abstract: Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on the multi-modal semantic update using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario, demonstrate that our attacking strategy can effectively mislead models to generate targeted responses using only black-box attacks without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model developments.
Copy Paste: [[2411.15740]] LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration(https://arxiv.org/abs/2411.15740)
Keywords: restoration
Abstract: We introduce LTCF-Net, a novel network architecture designed for enhancing low-light images. Unlike Retinex-based methods, our approach utilizes two color spaces - LAB and YUV - to efficiently separate and process color information, by leveraging the separation of luminance from chromatic components in color images. In addition, our model incorporates the Transformer architecture to comprehensively understand image content while maintaining computational efficiency. To dynamically balance the brightness in output images, we also introduce a Fourier transform module that adjusts the luminance channel in the frequency domain. This mechanism could uniformly balance brightness across different regions while eliminating background noises, and thereby enhancing visual quality. By combining these innovative components, LTCF-Net effectively improves low-light image quality while keeping the model lightweight. Experimental results demonstrate that our method outperforms current state-of-the-art approaches across multiple evaluation metrics and datasets, achieving more natural color restoration and a balanced brightness distribution.
Copy Paste: [[2411.15743]] Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting(https://arxiv.org/abs/2411.15743)
Keywords: generation
Abstract: Time series forecasting is critical in numerous real-world applications, requiring accurate predictions of future values based on observed patterns. While traditional forecasting techniques work well in in-domain scenarios with ample data, they struggle when data is scarce or not available at all, motivating the emergence of zero-shot and few-shot learning settings. Recent advancements often leverage large-scale foundation models for such tasks, but these methods require extensive data and compute resources, and their performance may be hindered by ineffective learning from the available training set. This raises a fundamental question: What factors influence effective learning from data in time series forecasting? Toward addressing this, we propose using Fourier analysis to investigate how models learn from synthetic and real-world time series data. Our findings reveal that forecasters commonly suffer from poor learning from data with multiple frequencies and poor generalization to unseen frequencies, which impedes their predictive performance. To alleviate these issues, we present a novel synthetic data generation framework, designed to enhance real data or replace it completely by creating task-specific frequency information, requiring only the sampling rate of the target data. Our approach, Freq-Synth, improves the robustness of both foundation as well as nonfoundation forecast models in zero-shot and few-shot settings, facilitating more reliable time series forecasting under limited data scenarios.
Copy Paste: [[2411.15843]] Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing(https://arxiv.org/abs/2411.15843)
Keywords: generative
Abstract: Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the \textbf{inversion and invariance} control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which pivots closely to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, facial expression, etc. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.
摘要:利用流变换器的大量生成先验进行免调优图像编辑需要真实的反演来将图像投影到模型域中,并需要灵活的不变性控制机制来保留非目标内容。然而,现行的扩散反演在基于流的模型中表现不佳,不变性控制无法协调各种刚性和非刚性编辑任务。为了解决这些问题,我们系统地分析了基于流变换器的 \textbf{反演和不变性} 控制。具体而言,我们发现欧拉反演与 DDIM 具有相似的结构,但更容易受到近似误差的影响。因此,我们提出了一个两阶段反演,首先改进速度估计,然后补偿剩余误差,这与模型先验密切相关并有利于编辑。同时,我们提出了不变性控制,它可以在自适应层规范化中操纵文本特征,将文本提示的变化与图像语义联系起来。该机制可以同时保留非目标内容,同时允许刚性和非刚性操作,从而实现多种编辑类型,如视觉文本、数量、面部表情等。在多种场景下的实验验证了我们的框架实现了灵活、准确的编辑,释放了流变换器在多种图像编辑方面的潜力。
Title: PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
Copy Paste: [[2411.15867]] PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs(https://arxiv.org/abs/2411.15867)
Keywords: generation
Abstract: Panoramic Image Generation has emerged as an important task in image generation, driven by growing demands for large-scale visuals in creative and technical applications. While diffusion models have dominated this field, they face inherent limitations, including the multilevel-coherence challenge and implementation complexity, leading to suboptimal outcomes. In this paper, we introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task. Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations. This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability. PanoLlama demonstrates its effectiveness and versatility in our experiments, achieving the best overall performance while offering flexibility for multi-scale, multi-layout, and multi-guidance generation. It overcomes the challenges that diffusion-based methods fail to address, setting a new paradigm for panoramic image generation tasks. Code is available at this https URL.
Copy Paste: [[2411.15925]] Making Images from Images: Interleaving Denoising and Transformation(https://arxiv.org/abs/2411.15925)
Keywords: generation
Abstract: Simply by rearranging the regions of an image, we can create a new image of any subject matter. The definition of regions is user definable, ranging from regularly and irregularly-shaped blocks, concentric rings, or even individual pixels. Our method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other. By learning the image transforms, we allow any source image to be pre-specified; any existing image (e.g. the Mona Lisa) can be transformed to a novel subject. We formulate this process as a constrained optimization problem and address it through interleaving the steps of image diffusion with an energy minimization step. Unlike previous methods, increasing the number of regions actually makes the problem easier and improves results. We demonstrate our approach in both pixel and latent spaces. Creative extensions, such as using infinite copies of the source image and employing multiple source images, are also given.
Copy Paste: [[2411.15966]] Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors(https://arxiv.org/abs/2411.15966)
Keywords: generative
Abstract: In this work, we introduce a generative approach for pose-free reconstruction of $360^{\circ}$ scenes from a limited number of uncalibrated 2D images. Pose-free scene reconstruction from incomplete, unposed observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of unbounded scenes with known camera poses using diffusion priors, these methods rely on explicit camera embeddings for extrapolating unobserved regions. This reliance limits their application in pose-free settings, where view-specific data is only implicitly available. To address this, we propose an instruction-following RGBD diffusion model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We also propose a novel confidence measure for Gaussian representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent Gaussian representation. Evaluations on the MipNeRF360 dataset demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed reconstruction methods in complex $360^{\circ}$ scenes.
摘要:在这项工作中,我们介绍了一种从有限数量的未校准二维图像中无姿势重建 $360^{\circ}$ 场景的生成方法。从不完整、未摆姿势的观测中进行无姿势场景重建通常使用深度估计或 3D 基础先验进行正则化。虽然最近的进展已经能够使用扩散先验对具有已知相机姿势的无界场景进行稀疏视图重建,但这些方法依赖于显式相机嵌入来推断未观察到的区域。这种依赖限制了它们在无姿势设置中的应用,其中视图特定数据仅隐式可用。为了解决这个问题,我们提出了一种遵循指令的 RGBD 扩散模型,旨在修复缺失的细节并去除 3D 场景的新视图渲染和深度图中的伪影。我们还提出了一种新的高斯表示置信度测量,以便更好地检测这些伪影。通过在高斯 SLAM 启发过程中逐步整合这些新视图,我们实现了多视图一致的高斯表示。在 MipNeRF360 数据集上的评估表明,我们的方法超越了现有的无姿势技术,并且在复杂的 $360^{\circ}$ 场景中的表现可与最先进的姿势重建方法相媲美。
Title: CNNs for Style Transfer of Digital to Film Photography
Authors: Pierre Mackenzie, Mika Senghaas, Raphael Achddou
Copy Paste: [[2411.15967]] CNNs for Style Transfer of Digital to Film Photography(https://arxiv.org/abs/2411.15967)
Keywords: generation
Abstract: The use of deep learning in stylistic effect generation has seen increasing use over recent years. In this work, we use simple convolutional neural networks to model Cinestill800T film given a digital input. We test the effect of different loss functions, the addition of an input noise channel and the use of random scales of patches during training. We find that a combination of MSE/VGG loss gives the best colour production and that some grain can be produced, but it is not of a high quality, and no halation is produced. We contribute our dataset of aligned paired images taken with a film and digital camera for further work.
Copy Paste: [[2411.16027]] From Dashcam Videos to Driving Simulations: Stress Testing Automated Vehicles against Rare Events(https://arxiv.org/abs/2411.16027)
Keywords: generation
Abstract: Testing Automated Driving Systems (ADS) in simulation with realistic driving scenarios is important for verifying their performance. However, converting real-world driving videos into simulation scenarios is a significant challenge due to the complexity of interpreting high-dimensional video data and the time-consuming nature of precise manual scenario reconstruction. In this work, we propose a novel framework that automates the conversion of real-world car crash videos into detailed simulation scenarios for ADS testing. Our approach leverages prompt-engineered Video Language Models(VLM) to transform dashcam footage into SCENIC scripts, which define the environment and driving behaviors in the CARLA simulator, enabling the generation of realistic simulation scenarios. Importantly, rather than solely aiming for one-to-one scenario reconstruction, our framework focuses on capturing the essential driving behaviors from the original video while offering flexibility in parameters such as weather or road conditions to facilitate search-based testing. Additionally, we introduce a similarity metric that helps iteratively refine the generated scenario through feedback by comparing key features of driving behaviors between the real and simulated videos. Our preliminary results demonstrate substantial time efficiency, finishing the real-to-sim conversion in minutes with full automation and no human intervention, while maintaining high fidelity to the original driving events.
Copy Paste: [[2411.16079]] Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models(https://arxiv.org/abs/2411.16079)
Keywords: generation, generative
Abstract: Neural networks struggle with image classification when biases are learned and misleads correlations, affecting their generalization and performance. Previous methods require attribute labels (e.g. background, color) or utilizes Generative Adversarial Networks (GANs) to mitigate biases. We introduce DiffuBias, a novel pipeline for text-to-image generation that enhances classifier robustness by generating bias-conflict samples, without requiring training during the generation phase. Utilizing pretrained diffusion and image captioning models, DiffuBias generates images that challenge the biases of classifiers, using the top-$K$ losses from a biased classifier ($f_B$) to create more representative data samples. This method not only debiases effectively but also boosts classifier generalization capabilities. To the best of our knowledge, DiffuBias is the first approach leveraging a stable diffusion model to generate bias-conflict samples in debiasing tasks. Our comprehensive experimental evaluations demonstrate that DiffuBias achieves state-of-the-art performance on benchmark datasets. We also conduct a comparative analysis of various generative models in terms of carbon emissions and energy consumption to highlight the significance of computational efficiency.
Copy Paste: [[2411.16080]] Boosting 3D Object Generation through PBR Materials(https://arxiv.org/abs/2411.16080)
Keywords: generation
Abstract: Automatic 3D content creation has gained increasing attention recently, due to its potential in various applications such as video games, film industry, and AR/VR. Recent advancements in diffusion models and multimodal models have notably improved the quality and efficiency of 3D object generation given a single RGB image. However, 3D objects generated even by state-of-the-art methods are still unsatisfactory compared to human-created assets. Considering only textures instead of materials makes these methods encounter challenges in photo-realistic rendering, relighting, and flexible appearance editing. And they also suffer from severe misalignment between geometry and high-frequency texture details. In this work, we propose a novel approach to boost the quality of generated 3D objects from the perspective of Physics-Based Rendering (PBR) materials. By analyzing the components of PBR materials, we choose to consider albedo, roughness, metalness, and bump maps. For albedo and bump maps, we leverage Stable Diffusion fine-tuned on synthetic data to extract these values, with novel usages of these fine-tuned models to obtain 3D consistent albedo UV and bump UV for generated objects. In terms of roughness and metalness maps, we adopt a semi-automatic process to provide room for interactive adjustment, which we believe is more practical. Extensive experiments demonstrate that our model is generally beneficial for various state-of-the-art generation methods, significantly boosting the quality and realism of their generated 3D objects, with natural relighting effects and substantially improved geometry.
摘要:自动 3D 内容创建最近受到越来越多的关注,因为它在视频游戏、电影行业和 AR/VR 等各种应用领域中都具有潜力。扩散模型和多模态模型的最新进展显着提高了给定单个 RGB 图像的 3D 对象生成的质量和效率。然而,即使是通过最先进的方法生成的 3D 对象与人造资产相比仍然不令人满意。只考虑纹理而不是材质使得这些方法在照片级真实感渲染、重新照明和灵活的外观编辑方面遇到挑战。而且它们还存在几何形状和高频纹理细节之间严重错位的问题。在这项工作中,我们提出了一种从基于物理的渲染 (PBR) 材料的角度提高生成的 3D 对象质量的新方法。通过分析 PBR 材料的成分,我们选择考虑反照率、粗糙度、金属性和凹凸贴图。对于反照率和凹凸贴图,我们利用在合成数据上微调的稳定扩散来提取这些值,并采用这些微调模型的新颖用法来为生成的对象获取 3D 一致的反照率 UV 和凹凸 UV。在粗糙度和金属度贴图方面,我们采用半自动化流程来提供交互式调整的空间,我们认为这更为实用。大量实验表明,我们的模型通常对各种最先进的生成方法都有好处,可以显著提高其生成的 3D 对象的质量和真实感,具有自然的重新照明效果和显著改善的几何形状。
Title: AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity
Copy Paste: [[2411.16087]] AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity(https://arxiv.org/abs/2411.16087)
Keywords: generative, quality assessment
Abstract: Recently, AI-generated images (AIGIs) created by given prompts (initial prompts) have garnered widespread attention. Nevertheless, due to technical nonproficiency, they often suffer from poor perception quality and Text-to-Image misalignment. Therefore, assessing the perception quality and alignment quality of AIGIs is crucial to improving the generative model's performance. Existing assessment methods overly rely on the initial prompts in the task prompt design and use the same prompts to guide both perceptual and alignment quality evaluation, overlooking the distinctions between the two tasks. To address this limitation, we propose a novel quality assessment method for AIGIs named TSP-MGS, which designs task-specific prompts and measures multi-granularity similarity between AIGIs and the prompts. Specifically, task-specific prompts are first constructed to describe perception and alignment quality degrees separately, and the initial prompt is introduced for detailed quality perception. Then, the coarse-grained similarity between AIGIs and task-specific prompts is calculated, which facilitates holistic quality awareness. In addition, to improve the understanding of AIGI details, the fine-grained similarity between the image and the initial prompt is measured. Finally, precise quality prediction is acquired by integrating the multi-granularity similarities. Experiments on the commonly used AGIQA-1K and AGIQA-3K benchmarks demonstrate the superiority of the proposed TSP-MGS.
Copy Paste: [[2411.16123]] Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain(https://arxiv.org/abs/2411.16123)
Keywords: generation
Abstract: Leveraging pre-trained models with tailored prompts for in-context learning has proven highly effective in NLP tasks. Building on this success, recent studies have applied a similar approach to the Segment Anything Model (SAM) within a ``one-shot" framework, where only a single reference image and its label are employed. However, these methods face limitations in the medical domain, primarily due to SAM's essential requirement for visual prompts and the over-reliance on pixel similarity for generating them. This dependency may lead to (1) inaccurate prompt generation and (2) clustering of point prompts, resulting in suboptimal outcomes. To address these challenges, we introduce \textbf{Med-PerSAM}, a novel and straightforward one-shot framework designed for the medical domain. Med-PerSAM uses only visual prompt engineering and eliminates the need for additional training of the pretrained SAM or human intervention, owing to our novel automated prompt generation process. By integrating our lightweight warping-based prompt tuning model with SAM, we enable the extraction and iterative refinement of visual prompts, enhancing the performance of the pre-trained SAM. This advancement is particularly meaningful in the medical domain, where creating visual prompts poses notable challenges for individuals lacking medical expertise. Our model outperforms various foundational models and previous SAM-based approaches across diverse 2D medical imaging datasets.
摘要:利用预先训练的模型和定制的提示进行情境学习已被证明在 NLP 任务中非常有效。基于这一成功,最近的研究在“一次性”框架内将类似的方法应用于 Segment Anything Model (SAM),其中仅使用单个参考图像及其标签。然而,这些方法在医学领域面临限制,主要是因为 SAM 对视觉提示的基本要求以及过度依赖像素相似性来生成它们。这种依赖性可能导致 (1) 提示生成不准确和 (2) 点提示聚类,从而导致结果不理想。为了应对这些挑战,我们引入了 \textbf{Med-PerSAM},这是一种专为医学领域设计的新颖而直接的一次性框架。由于我们新颖的自动提示生成过程,Med-PerSAM 仅使用视觉提示工程,并且无需对预训练的 SAM 进行额外训练或人工干预。通过将我们基于变形的轻量级提示调整模型与 SAM 集成,我们可以提取和迭代细化视觉提示,从而提高预训练 SAM 的性能。这一进步在医学领域尤其有意义,因为对于缺乏医学专业知识的人来说,创建视觉提示是一项重大挑战。我们的模型在各种 2D 医学成像数据集上的表现优于各种基础模型和之前基于 SAM 的方法。
Title: TreeFormer: Single-view Plant Skeleton Estimation via Tree-constrained Graph Generation
Abstract: Accurate estimation of plant skeletal structure (e.g., branching structure) from images is essential for smart agriculture and plant science. Unlike human skeletons with fixed topology, plant skeleton estimation presents a unique challenge, i.e., estimating arbitrary tree graphs from images. While recent graph generation methods successfully infer thin structures from images, it is challenging to constrain the output graph strictly to a tree structure. To this problem, we present TreeFormer, a plant skeleton estimator via tree-constrained graph generation. Our approach combines learning-based graph generation with traditional graph algorithms to impose the constraints during the training loop. Specifically, our method projects an unconstrained graph onto a minimum spanning tree (MST) during the training loop and incorporates this prior knowledge into the gradient descent optimization by suppressing unwanted feature values. Experiments show that our method accurately estimates target plant skeletal structures for multiple domains: Synthetic tree patterns, real botanical roots, and grapevine branches. Our implementations are available at this https URL.
Copy Paste: [[2411.16133]] Context Awareness Gate For Retrieval Augmented Generation(https://arxiv.org/abs/2411.16133)
Keywords: generation
Abstract: Retrieval Augmented Generation (RAG) has emerged as a widely adopted approach to mitigate the limitations of large language models (LLMs) in answering domain-specific questions. Previous research has predominantly focused on improving the accuracy and quality of retrieved data chunks to enhance the overall performance of the generation pipeline. However, despite ongoing advancements, the critical issue of retrieving irrelevant information -- which can impair the ability of the model to utilize its internal knowledge effectively -- has received minimal attention. In this work, we investigate the impact of retrieving irrelevant information in open-domain question answering, highlighting its significant detrimental effect on the quality of LLM outputs. To address this challenge, we propose the Context Awareness Gate (CAG) architecture, a novel mechanism that dynamically adjusts the LLMs' input prompt based on whether the user query necessitates external context retrieval. Additionally, we introduce the Vector Candidates method, a core mathematical component of CAG that is statistical, LLM-independent, and highly scalable. We further examine the distributions of relationships between contexts and questions, presenting a statistical analysis of these distributions. This analysis can be leveraged to enhance the context retrieval process in Retrieval Augmented Generation (RAG) systems.
Copy Paste: [[2411.16157]] MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model(https://arxiv.org/abs/2411.16157)
Keywords: generation
Abstract: We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on arbitrary reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset comprising up to 1.2 million scenes, equipped with well-aligned metric depth. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released at this https URL.
摘要:我们推出了 MVGenMaster,这是一种通过 3D 先验增强的多视图扩散模型,可解决多功能新视图合成 (NVS) 任务。MVGenMaster 利用使用度量深度和相机姿势扭曲的 3D 先验,显着增强了 NVS 中的泛化和 3D 一致性。我们的模型具有简单而有效的管道,可以通过单个前向过程生成多达 100 个以任意参考视图和相机姿势为条件的新视图。此外,我们还开发了一个全面的大规模多视图图像数据集,其中包含多达 120 万个场景,并配备了高度对齐的度量深度。此外,我们提出了几种训练和模型修改,以使用扩大的数据集来增强模型。在域内和域外基准测试中的广泛评估证明了我们提出的方法和数据公式的有效性。模型和代码将在此 https URL 上发布。
Copy Paste: [[2411.16164]] Text-to-Image Synthesis: A Decade Survey(https://arxiv.org/abs/2411.16164)
Keywords: generation, generative
Abstract: When humans read a specific text, they often visualize the corresponding images, and we hope that computers can do the same. Text-to-image synthesis (T2I), which focuses on generating high-quality images from textual descriptions, has become a significant aspect of Artificial Intelligence Generated Content (AIGC) and a transformative direction in artificial intelligence research. Foundation models play a crucial role in T2I. In this survey, we review over 440 recent works on T2I. We start by briefly introducing how GANs, autoregressive models, and diffusion models have been used for image generation. Building on this foundation, we discuss the development of these models for T2I, focusing on their generative capabilities and diversity when conditioned on text. We also explore cutting-edge research on various aspects of T2I, including performance, controllability, personalized generation, safety concerns, and consistency in content and spatial relationships. Furthermore, we summarize the datasets and evaluation metrics commonly used in T2I research. Finally, we discuss the potential applications of T2I within AIGC, along with the challenges and future research opportunities in this field.
Copy Paste: [[2411.16167]] BadSFL: Backdoor Attack against Scaffold Federated Learning(https://arxiv.org/abs/2411.16167)
Keywords: generative
Abstract: Federated learning (FL) enables the training of deep learning models on distributed clients to preserve data privacy. However, this learning paradigm is vulnerable to backdoor attacks, where malicious clients can upload poisoned local models to embed backdoors into the global model, leading to attacker-desired predictions. Existing backdoor attacks mainly focus on FL with independently and identically distributed (IID) scenarios, while real-world FL training data are typically non-IID. Current strategies for non-IID backdoor attacks suffer from limitations in maintaining effectiveness and durability. To address these challenges, we propose a novel backdoor attack method, \name, specifically designed for the FL framework using the scaffold aggregation algorithm in non-IID settings. \name leverages a Generative Adversarial Network (GAN) based on the global model to complement the training set, achieving high accuracy on both backdoor and benign samples. It utilizes a specific feature as the backdoor trigger to ensure stealthiness, and exploits the Scaffold's control variate to predict the global model's convergence direction, ensuring the backdoor's persistence. Extensive experiments on three benchmark datasets demonstrate the high effectiveness, stealthiness, and durability of \name. Notably, our attack remains effective over 60 rounds in the global model and up to 3 times longer than existing baseline attacks after stopping the injection of malicious updates.
Copy Paste: [[2411.16171]] Image Generation Diversity Issues and How to Tame Them(https://arxiv.org/abs/2411.16171)
Keywords: generation, generative
Abstract: Generative methods now produce outputs nearly indistinguishable from real data but often fail to fully capture the data distribution. Unlike quality issues, diversity limitations in generative models are hard to detect visually, requiring specific metrics for assessment. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model's output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art models superpassing 77% of the diversity of the training data. To address this limitation, we introduce Diversity-Aware Diffusion Models (DiADM), a novel approach that improves diversity of unconditional diffusion models without loss of image quality. We do this by disentangling diversity from image quality by using a diversity aware module that uses pseudo-unconditional features as input. We provide a Python package offering unified feature extraction and metric computation to further facilitate the evaluation of generative models this https URL.
Abstract: Underwater images suffer from colour shifts, low contrast, and haziness due to light absorption, refraction, scattering and restoring these images has warranted much attention. In this work, we present Unsupervised Underwater Neural Radiance Field U2NeRF, a transformer-based architecture that learns to render and restore novel views conditioned on multi-view geometry simultaneously. Due to the absence of supervision, we attempt to implicitly bake restoring capabilities onto the NeRF pipeline and disentangle the predicted color into several components - scene radiance, direct transmission map, backscatter transmission map, and global background light, and when combined reconstruct the underwater image in a self-supervised manner. In addition, we release an Underwater View Synthesis UVS dataset consisting of 12 underwater scenes, containing both synthetically-generated and real-world data. Our experiments demonstrate that when optimized on a single scene, U2NeRF outperforms several baselines by as much LPIPS 11%, UIQM 5%, UCIQE 4% (on average) and showcases improved rendering and restoration capabilities. Code will be made available upon acceptance.
Copy Paste: [[2411.16185]] Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation(https://arxiv.org/abs/2411.16185)
Keywords: generation
Abstract: Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM's generated mesh ensures high clarity, discarding LRM's predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123's SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods.
摘要:从单幅图像生成 3D 网格是一项重要但不适定的任务。现有方法主要采用 2D 多视图扩散模型来生成中间多视图图像,并使用大型重建模型 (LRM) 创建最终网格。然而,多视图图像表现出局部不一致性,并且网格通常缺乏对输入图像的保真度或看起来很模糊。我们提出了 Fancy123,它具有两个增强模块和一个反投影操作,分别解决上述三个问题。外观增强模块对 2D 多视图图像进行变形以重新对齐未对齐的像素,从而获得更好的多视图一致性。保真度增强模块对 3D 网格进行变形以匹配输入图像。将输入图像和变形的多视图图像反投影到 LRM 生成的网格上可确保高清晰度,从而丢弃 LRM 预测的模糊网格颜色。大量的定性和定量实验验证了 Fancy123 的 SoTA 性能有显着改善。此外,这两个增强模块是即插即用的,可以在推理时工作,从而可以无缝集成到各种现有的单图像到 3D 方法中。
Title: VIRES: Video Instance Repainting with Sketch and Text Guidance
Authors: Shuchen Weng, Haojie Zheng, Peixuan Zhan, Yuchen Hong, Han Jiang, Si Li, Boxin Shi
Copy Paste: [[2411.16199]] VIRES: Video Instance Repainting with Sketch and Text Guidance(https://arxiv.org/abs/2411.16199)
Keywords: generation, generative
Abstract: We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings.
Copy Paste: [[2411.16201]] Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models(https://arxiv.org/abs/2411.16201)
Keywords: generation
Abstract: High-quality video-text preference data is crucial for Multimodal Large Language Models (MLLMs) alignment. However, existing preference data is very scarce. Obtaining VQA preference data for preference training is costly, and manually annotating responses is highly unreliable, which could result in low-quality pairs. Meanwhile, AI-generated responses controlled by temperature adjustment lack diversity. To address these issues, we propose a high-quality VQA preference dataset, called \textit{\textbf{M}ultiple \textbf{M}ultimodal \textbf{A}rtificial \textbf{I}ntelligence \textbf{P}reference Datasets in \textbf{V}QA} (\textbf{MMAIP-V}), which is constructed by sampling from the response distribution set and using an external scoring function for response evaluation. Furthermore, to fully leverage the preference knowledge in MMAIP-V and ensure sufficient optimization, we propose \textit{\textbf{Iter}ative \textbf{W}eak-to-\textbf{S}trong \textbf{R}einforcement \textbf{L}earning from \textbf{AI} \textbf{F}eedback for video MLLMs} (\textbf{Iter-W2S-RLAIF}), a framework that gradually enhances MLLMs' alignment capabilities by iteratively updating the reference model and performing parameter extrapolation. Finally, we propose an unbiased and information-complete evaluation scheme in VQA evaluation. Experiments demonstrate that MMAIP-V is beneficial for MLLMs in preference learning and Iter-W2S-RLAIF fully exploits the alignment information in MMAIP-V. We believe that the proposed automatic VQA preference data generation pipeline based on AI feedback can greatly promote future work in the MLLMs alignment. \textbf{Code and dataset are available} \href{this https URL}{MMAIP-V\_Iter-W2S-RLAIF-702F}.
摘要:高质量的视频文本偏好数据对于多模态大型语言模型 (MLLM) 对齐至关重要。然而,现有的偏好数据非常稀缺。获取用于偏好训练的 VQA 偏好数据成本高昂,手动注释响应非常不可靠,这可能导致低质量配对。同时,受温度调节控制的 AI 生成的响应缺乏多样性。为了解决这些问题,我们提出了一个高质量的 VQA 偏好数据集,称为 \textit{\textbf{M}ultiple \textbf{M}ultimodal \textbf{A}artificial \textbf{I}ntelligence \textbf{P}reference Datasets in \textbf{V}QA} (\textbf{MMAIP-V}),它是通过从响应分布集中采样并使用外部评分函数进行响应评估而构建的。此外,为了充分利用 MMAIP-V 中的偏好知识并确保充分优化,我们提出了 \textit{\textbf{Iter}ative \textbf{W}eak-to-\textbf{S}trong \textbf{R}einforcement \textbf{L}learning from \textbf{AI} \textbf{F}eedback for video MLLMs}(\textbf{Iter-W2S-RLAIF}),该框架通过迭代更新参考模型和执行参数外推来逐步增强 MLLM 的对齐能力。最后,我们在 VQA 评估中提出了一种无偏且信息完整的评估方案。实验表明,MMAIP-V 对 MLLM 的偏好学习有益,而 Iter-W2S-RLAIF 充分利用了 MMAIP-V 中的对齐信息。我们相信,基于 AI 反馈的自动 VQA 偏好数据生成流程可以极大地促进未来 MLLMs 对齐方面的工作。\textbf{代码和数据集可用} \href{此 https URL}{MMAIP-V\_Iter-W2S-RLAIF-702F}。
Title: SMGDiff: Soccer Motion Generation using diffusion probabilistic models
Authors: Hongdi Yang, Chengyang Li, Zhenxuan Wu, Gaozheng Li, Jingya Wang, Jingyi Yu, Zhuo Su, Lan Xu
Abstract: Soccer is a globally renowned sport with significant applications in video games and VR/AR. However, generating realistic soccer motions remains challenging due to the intricate interactions between the human player and the ball. In this paper, we introduce SMGDiff, a novel two-stage framework for generating real-time and user-controllable soccer motions. Our key idea is to integrate real-time character control with a powerful diffusion-based generative model, ensuring high-quality and diverse output motion. In the first stage, we instantly transform coarse user controls into diverse global trajectories of the character. In the second stage, we employ a transformer-based autoregressive diffusion model to generate soccer motions based on trajectory conditioning. We further incorporate a contact guidance module during inference to optimize the contact details for realistic ball-foot interactions. Moreover, we contribute a large-scale soccer motion dataset consisting of over 1.08 million frames of diverse soccer motions. Extensive experiments demonstrate that our SMGDiff significantly outperforms existing methods in terms of motion quality and condition alignment.
Copy Paste: [[2411.16217]] Mixed Degradation Image Restoration via Local Dynamic Optimization and Conditional Embedding(https://arxiv.org/abs/2411.16217)
Keywords: restoration
Abstract: Multiple-in-one image restoration (IR) has made significant progress, aiming to handle all types of single degraded image restoration with a single model. However, in real-world scenarios, images often suffer from combinations of multiple degradation factors. Existing multiple-in-one IR models encounter challenges related to degradation diversity and prompt singularity when addressing this issue. In this paper, we propose a novel multiple-in-one IR model that can effectively restore images with both single and mixed degradations. To address degradation diversity, we design a Local Dynamic Optimization (LDO) module which dynamically processes degraded areas of varying types and granularities. To tackle the prompt singularity issue, we develop an efficient Conditional Feature Embedding (CFE) module that guides the decoder in leveraging degradation-type-related features, significantly improving the model's performance in mixed degradation restoration scenarios. To validate the effectiveness of our model, we introduce a new dataset containing both single and mixed degradation elements. Experimental results demonstrate that our proposed model achieves state-of-the-art (SOTA) performance not only on mixed degradation tasks but also on classic single-task restoration benchmarks.
摘要:多合一图像恢复 (IR) 取得了重大进展,旨在用单个模型处理所有类型的单一退化图像恢复。然而,在现实世界中,图像通常会受到多种退化因素组合的影响。现有的多合一 IR 模型在解决此问题时遇到了与退化多样性和瞬时奇异性相关的挑战。在本文中,我们提出了一种新颖的多合一 IR 模型,可以有效地恢复具有单一和混合退化的图像。为了解决退化多样性问题,我们设计了一个局部动态优化 (LDO) 模块,可以动态处理不同类型和粒度的退化区域。为了解决瞬时奇异性问题,我们开发了一个高效的条件特征嵌入 (CFE) 模块,该模块指导解码器利用与退化类型相关的特征,从而显着提高模型在混合退化恢复场景中的性能。为了验证我们模型的有效性,我们引入了一个包含单一和混合退化元素的新数据集。实验结果表明,我们提出的模型不仅在混合退化任务上而且在经典的单任务恢复基准上也达到了最先进(SOTA)的性能。
Title: Weakly supervised image segmentation for defect-based grading of fresh produce
Copy Paste: [[2411.16219]] Weakly supervised image segmentation for defect-based grading of fresh produce(https://arxiv.org/abs/2411.16219)
Keywords: quality assessment
Abstract: Implementing image-based machine learning in agriculture is often limited by scarce data and annotations, making it hard to achieve high-quality model predictions. This study tackles the issue of postharvest quality assessment of bananas in decentralized supply chains. We propose a method to detect and segment surface defects in banana images using panoptic segmentation to quantify defect size and number. Instead of time-consuming pixel-level annotations, we use weak supervision with coarse labels. A dataset of 476 smartphone images of bananas was collected under real-world field conditions and annotated for bruises and scars. Using the Segment Anything Model (SAM), a recently published foundation model for image segmentation, we generated dense annotations from coarse bounding boxes to train a segmentation model, significantly reducing manual effort while achieving a panoptic quality score of 77.6%. This demonstrates SAM's potential for low-effort, accurate segmentation in agricultural settings with limited data.
摘要:在农业中实施基于图像的机器学习通常受到数据和注释稀缺的限制,因此很难实现高质量的模型预测。本研究解决了分散供应链中香蕉收获后质量评估的问题。我们提出了一种使用全景分割来检测和分割香蕉图像中的表面缺陷的方法,以量化缺陷的大小和数量。我们使用带有粗标签的弱监督,而不是耗时的像素级注释。在现实世界的田间条件下收集了 476 张智能手机拍摄的香蕉图像数据集,并对瘀伤和疤痕进行了注释。使用最近发布的图像分割基础模型 Segment Anything Model (SAM),我们从粗边界框生成了密集注释来训练分割模型,显著减少了手动工作量,同时实现了 77.6% 的全景质量得分。这证明了 SAM 在数据有限的农业环境中实现低工作量、准确分割的潜力。
Title: Diagnosis of diabetic retinopathy using machine learning & deep learning technique
Authors: Eric Shah, Jay Patel, Mr.Vishal Katheriya, Parth Pataliya
Copy Paste: [[2411.16250]] Diagnosis of diabetic retinopathy using machine learning & deep learning technique(https://arxiv.org/abs/2411.16250)
Keywords: generation
Abstract: Fundus images are widely used for diagnosing various eye diseases, such as diabetic retinopathy, glaucoma, and age-related macular degeneration. However, manual analysis of fundus images is time-consuming and prone to errors. In this report, we propose a novel method for fundus detection using object detection and machine learning classification techniques. We use a YOLO_V8 to perform object detection on fundus images and locate the regions of interest (ROIs) such as optic disc, optic cup and lesions. We then use machine learning SVM classification algorithms to classify the ROIs into different DR stages based on the presence or absence of pathological signs such as exudates, microaneurysms, and haemorrhages etc. Our method achieves 84% accuracy and efficiency for fundus detection and can be applied for retinal fundus disease triage, especially in remote areas around the world.
摘要:眼底图像广泛用于诊断各种眼部疾病,例如糖尿病视网膜病变、青光眼和年龄相关性黄斑变性。然而,手动分析眼底图像非常耗时并且容易出错。在本报告中,我们提出了一种使用物体检测和机器学习分类技术进行眼底检测的新方法。我们使用 YOLO_V8 对眼底图像执行物体检测并定位感兴趣区域 (ROI),例如视神经乳头、视神经杯和病变。然后,我们使用机器学习 SVM 分类算法根据是否存在病理体征(例如渗出液、微动脉瘤和出血等)将 ROI 分为不同的 DR 阶段。我们的方法在眼底检测方面实现了 84% 的准确率和效率,并且可以应用于视网膜眼底疾病分类,尤其是在世界各地的偏远地区。
Title: DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation
Authors: Yuxuan Yang, Jingyao Wang, Tao Geng, Wenwen Qiang, Changwen Zheng, Fuchun Sun
Copy Paste: [[2411.16301]] DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation(https://arxiv.org/abs/2411.16301)
Keywords: generation, generative
Abstract: Interior design is a complex and creative discipline involving aesthetics, functionality, ergonomics, and materials science. Effective solutions must meet diverse requirements, typically producing multiple deliverables such as renderings and design drawings from various perspectives. Consequently, interior design processes are often inefficient and demand significant creativity. With advances in machine learning, generative models have emerged as a promising means of improving efficiency by creating designs from text descriptions or sketches. However, few generative works focus on interior design, leading to substantial discrepancies between outputs and practical needs, such as differences in size, spatial scope, and the lack of controllable generation quality. To address these challenges, we propose DiffDesign, a controllable diffusion model with meta priors for efficient interior design generation. Specifically, we utilize the generative priors of a 2D diffusion model pre-trained on a large image dataset as our rendering backbone. We further guide the denoising process by disentangling cross-attention control over design attributes, such as appearance, pose, and size, and introduce an optimal transfer-based alignment module to enforce view consistency. Simultaneously, we construct an interior design-specific dataset, DesignHelper, consisting of over 400 solutions across more than 15 spatial types and 15 design styles. This dataset helps fine-tune DiffDesign. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness and robustness of DiffDesign.
Copy Paste: [[2411.16312]] EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training(https://arxiv.org/abs/2411.16312)
Keywords: super-resolution
Abstract: Leveraging the overfitting property of deep neural networks (DNNs) is trending in video delivery systems to enhance quality within bandwidth limits. Existing approaches transmit overfitted super-resolution (SR) model streams for low-resolution (LR) bitstreams, which are used to reconstruct high-resolution (HR) videos at the decoder. Although these approaches show promising results, the huge computational costs of training a large number of video frames limit their practical applications. To overcome this challenge, we propose an efficient patch sampling method named EPS for video SR network overfitting, which identifies the most valuable training patches from video frames. To this end, we first present two low-complexity Discrete Cosine Transform (DCT)-based spatial-temporal features to measure the complexity score of each patch directly. By analyzing the histogram distribution of these features, we then categorize all possible patches into different clusters and select training patches from the cluster with the highest spatial-temporal information. The number of sampled patches is adaptive based on the video content, addressing the trade-off between training complexity and efficiency. Our method reduces the number of patches for the training to 4% to 25%, depending on the resolution and number of clusters, while maintaining high video quality and significantly enhancing training efficiency. Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
Copy Paste: [[2411.16318]] One Diffusion to Generate Them All(https://arxiv.org/abs/2411.16318)
Keywords: generation
Abstract: We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction such as text-to-image, multiview generation, ID preservation, depth estimation and camera pose estimation despite relatively small training dataset. Our code and checkpoint are freely available at this https URL
Copy Paste: [[2411.16325]] Luminance Component Analysis for Exposure Correction(https://arxiv.org/abs/2411.16325)
Keywords: restoration
Abstract: Exposure correction methods aim to adjust the luminance while maintaining other luminance-unrelated information. However, current exposure correction methods have difficulty in fully separating luminance-related and luminance-unrelated components, leading to distortions in color, loss of detail, and requiring extra restoration procedures. Inspired by principal component analysis (PCA), this paper proposes an exposure correction method called luminance component analysis (LCA). LCA applies the orthogonal constraint to a U-Net structure to decouple luminance-related and luminance-unrelated features. With decoupled luminance-related features, LCA adjusts only the luminance-related components while keeping the luminance-unrelated components unchanged. To optimize the orthogonal constraint problem, LCA employs a geometric optimization algorithm, which converts the constrained problem in Euclidean space to an unconstrained problem in orthogonal Stiefel manifolds. Extensive experiments show that LCA can decouple the luminance feature from the RGB color space. Moreover, LCA achieves the best PSNR (21.33) and SSIM (0.88) in the exposure correction dataset with 28.72 FPS.
Copy Paste: [[2411.16327]] CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain(https://arxiv.org/abs/2411.16327)
Keywords: generation
Abstract: Infrared (IR) imaging offers advantages in several fields due to its unique ability of capturing content in extreme light conditions. However, the demanding hardware requirements of high-resolution IR sensors limit its widespread application. As an alternative, visible light can be used to synthesize IR images but this causes a loss of fidelity in image details and introduces inconsistencies due to lack of contextual awareness of the scene. This stems from a combination of using visible light with a standard dynamic range, especially under extreme lighting, and a lack of contextual awareness can result in pseudo-thermal-crossover artifacts. This occurs when multiple objects with similar temperatures appear indistinguishable in the training data, further exacerbating the loss of fidelity. To solve this challenge, this paper proposes CapHDR2IR, a novel framework incorporating vision-language models using high dynamic range (HDR) images as inputs to generate IR images. HDR images capture a wider range of luminance variations, ensuring reliable IR image generation in different light conditions. Additionally, a dense caption branch integrates semantic understanding, resulting in more meaningful and discernible IR outputs. Extensive experiments on the HDRT dataset show that the proposed CapHDR2IR achieves state-of-the-art performance compared with existing general domain transfer methods and those tailored for visible-to-infrared image translation.
Copy Paste: [[2411.16375]] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing(https://arxiv.org/abs/2411.16375)
Keywords: generation
Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at this https URL
Copy Paste: [[2411.16405]] Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN(https://arxiv.org/abs/2411.16405)
Keywords: generation, generative
Abstract: The generation of handwritten music sheets is a crucial step toward enhancing Optical Music Recognition (OMR) systems, which rely on large and diverse datasets for optimal performance. However, handwritten music sheets, often found in archives, present challenges for digitisation due to their fragility, varied handwriting styles, and image quality. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. We provide a comprehensive evaluation of three GAN models - DCGAN, ProGAN, and CycleWGAN - comparing their ability to generate diverse and high-quality handwritten music images. The proposed CycleWGAN model, which enhances style transfer and training stability, significantly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.
Copy Paste: [[2411.16425]] TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation(https://arxiv.org/abs/2411.16425)
Keywords: generation
Abstract: The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such a goal-oriented exploration heavily relies on the ability to perceive, understand, and reason based on the spatial information of the environment. However, current LLM-based approaches convert visual observations to language descriptions and reason in the linguistic space, leading to the loss of spatial information. In this paper, we introduce TopV-Nav, a MLLM-based method that directly reasons on the top-view map with complete spatial information. To fully unlock the MLLM's spatial reasoning potential in top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct semantically-rich top-view map. It enables the agent to directly utilize spatial information contained in the top-view map to conduct thorough reasoning. Besides, we design a Dynamic Map Scaling (DMS) mechanism to dynamically zoom top-view map at preferred scales, enhancing local fine-grained reasoning. Additionally, we devise a Target-Guided Navigation (TGN) mechanism to predict and to utilize target locations, facilitating global and human-like exploration. Experiments on MP3D and HM3D benchmarks demonstrate the superiority of our TopV-Nav, e.g., $+3.9\%$ SR and $+2.0\%$ SPL absolute improvements on HM3D.
Copy Paste: [[2411.16427]] Unsupervised Event Outlier Detection in Continuous Time(https://arxiv.org/abs/2411.16427)
Keywords: generative
Abstract: Event sequence data record the occurrences of events in continuous time. Event sequence forecasting based on temporal point processes (TPPs) has been extensively studied, but outlier or anomaly detection, especially without any supervision from humans, is still underexplored. In this work, we develop, to the best our knowledge, the first unsupervised outlier detection approach to detecting abnormal events. Our novel unsupervised outlier detection framework is based on ideas from generative adversarial networks (GANs) and reinforcement learning (RL). We train a 'generator' that corrects outliers in the data with a 'discriminator' that learns to discriminate the corrected data from the real data, which may contain outliers. A key insight is that if the generator made a mistake in the correction, it would generate anomalies that are different from the anomalies in the real data, so it serves as data augmentation for the discriminator learning. Different from typical GAN-based outlier detection approaches, our method employs the generator to detect outliers in an online manner. The experimental results show that our method can detect event outliers more accurately than the state-of-the-art approaches.
摘要:事件序列数据记录连续时间中事件的发生。基于时间点过程 (TPP) 的事件序列预测已得到广泛研究,但异常值或异常检测(尤其是在没有任何人类监督的情况下)仍未得到充分探索。在这项工作中,我们尽我们所知开发了第一种无监督异常值检测方法来检测异常事件。我们新颖的无监督异常值检测框架基于生成对抗网络 (GAN) 和强化学习 (RL) 的思想。我们训练一个“生成器”,该生成器用于纠正数据中的异常值,并使用一个“鉴别器”,该鉴别器学习区分已纠正的数据和可能包含异常值的实际数据。一个关键的见解是,如果生成器在纠正过程中犯了错误,它将生成与实际数据中的异常不同的异常,因此它可以作为鉴别器学习的数据增强。与典型的基于 GAN 的异常值检测方法不同,我们的方法使用生成器以在线方式检测异常值。实验结果表明,我们的方法比最先进的方法可以更准确地检测事件异常值。
Title: SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis
Authors: Hyojun Go, Byeongjun Park, Jiho Jang, Jin-Young Kim, Soonwoo Kwon, Changick Kim
Copy Paste: [[2411.16443]] SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis(https://arxiv.org/abs/2411.16443)
Keywords: generation
Abstract: Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks-including object editing, novel view synthesis, and camera pose estimation-within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
摘要:基于文本的 3D 场景生成和编辑具有通过直观的用户交互简化内容创建的巨大潜力。虽然最近的进展利用 3D 高斯分层 (3DGS) 实现高保真和实时渲染,但现有方法通常是专门的和以任务为中心的,缺乏用于生成和编辑的统一框架。在本文中,我们介绍了 SplatFlow,这是一个全面的框架,通过支持直接生成和编辑 3DGS 来解决这一差距。SplatFlow 包含两个主要组件:多视图整流 (RF) 模型和高斯分层解码器 (GSDecoder)。多视图 RF 模型在潜在空间中运行,根据文本提示同时生成多视图图像、深度和相机姿势,从而解决现实世界环境中多样化场景规模和复杂相机轨迹等挑战。然后,GSDecoder 通过前馈 3DGS 方法将这些潜在输出有效地转换为 3DGS 表示。 SplatFlow 利用无需训练的反演和修复技术,实现了无缝 3DGS 编辑,并在统一框架内支持广泛的 3D 任务(包括对象编辑、新视图合成和相机姿势估计),而无需额外的复杂管道。我们在 MVImgNet 和 DL3DV-7K 数据集上验证了 SplatFlow 的功能,证明了其在各种 3D 生成、编辑和修复任务中的多功能性和有效性。
Title: VQ-SGen: A Vector Quantized Stroke Representation for Sketch Generation
Copy Paste: [[2411.16446]] VQ-SGen: A Vector Quantized Stroke Representation for Sketch Generation(https://arxiv.org/abs/2411.16446)
Keywords: generation
Abstract: This paper presents VQ-SGen, a novel algorithm for high-quality sketch generation. Recent approaches have often framed the task as pixel-based generation either as a whole or part-by-part, neglecting the intrinsic and contextual relationships among individual strokes, such as the shape and spatial positioning of both proximal and distant strokes. To overcome these limitations, we propose treating each stroke within a sketch as an entity and introducing a vector-quantized (VQ) stroke representation for fine-grained sketch generation. Our method follows a two-stage framework - in the first stage, we decouple each stroke's shape and location information to ensure the VQ representation prioritizes stroke shape learning. In the second stage, we feed the precise and compact representation into an auto-decoding Transformer to incorporate stroke semantics, positions, and shapes into the generation process. By utilizing tokenized stroke representation, our approach generates strokes with high fidelity and facilitates novel applications, such as conditional generation and semantic-aware stroke editing. Comprehensive experiments demonstrate our method surpasses existing state-of-the-art techniques, underscoring its effectiveness. The code and model will be made publicly available upon publication.
Copy Paste: [[2411.16468]] Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency(https://arxiv.org/abs/2411.16468)
Keywords: restoration
Abstract: As a very common type of video, face videos often appear in movies, talk shows, live broadcasts, and other scenes. Real-world online videos are often plagued by degradations such as blurring and quantization noise, due to the high compression ratio caused by high communication costs and limited transmission bandwidth. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite the significant advancement in video face enhancement, current methods still suffer from $i)$ long processing time and $ii)$ inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome the above two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method develops upon a 3D-VQGAN backbone associated with spatial-temporal codebooks recording high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage \Rmnum{1}, we learn the model with a regularizer mitigating the codebook collapse problem. In Stage \Rmnum{2}, we learn two transformers to lookup code from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method surpasses the current state-of-the-art blind face video restoration and de-flickering methods on both efficiency and effectiveness. Code is available at \url{this https URL}.
Copy Paste: [[2411.16498]] Multi-Resolution Generative Modeling of Human Motion from Limited Data(https://arxiv.org/abs/2411.16498)
Keywords: generation, generative
Abstract: We present a generative model that learns to synthesize human motion from limited training sequences. Our framework provides conditional generation and blending across multiple temporal resolutions. The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture. Our model contains a set of generative and adversarial networks, along with embedding modules, each tailored for generating motions at specific frame rates while exerting control over their content and details. Notably, our approach also extends to the synthesis of co-speech gestures, demonstrating its ability to generate synchronized gestures from speech inputs, even with limited paired data. Through direct synthesis of SMPL pose parameters, our approach avoids test-time adjustments to fit human body meshes. Experimental results showcase our model's ability to achieve extensive coverage of training examples, while generating diverse motions, as indicated by local and global diversity metrics.
Abstract: In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that these latent features ought to be high-dimensional vectors which require model fine tuning to handle. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that leverages image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG), where the task is to generate a clinician's report detailing their observations from a set of radiological images, such as X-rays. We argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text-space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image feature encoder models, and without ever directly "showing" the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics compared with other retrieval-based RRG methods, while attaining competitive results compared to other fine-tuned vision-language RRG models. We further present results of our experiments with various components of LaB-RAG to better understand our method. Finally, we critique the use of a popular RRG metric, arguing it is possible to artificially inflate its results without true data-leakage.
Copy Paste: [[2411.16550]] Representation Collapsing Problems in Vector Quantization(https://arxiv.org/abs/2411.16550)
Keywords: generative
Abstract: Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we investigate representation collapse in vector quantization - a critical degradation where codebook tokens or latent embeddings lose their discriminative power by converging to a limited subset of values. This collapse fundamentally compromises the model's ability to capture diverse data patterns. By leveraging both synthetic and real datasets, we identify the severity of each type of collapses and triggering conditions. Our analysis reveals that restricted initialization and limited encoder capacity result in tokens collapse and embeddings collapse. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapsing problems in vector quantization.
Copy Paste: [[2411.16567]] Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches(https://arxiv.org/abs/2411.16567)
Keywords: generative
Abstract: This paper presents an innovative approach to enhancing few-shot learning by integrating data augmentation with model fine-tuning in a framework designed to tackle the challenges posed by small-sample data. Recognizing the critical limitations of traditional machine learning models that require large datasets-especially in fields such as drug discovery, target recognition, and malicious traffic detection-this study proposes a novel strategy that leverages Generative Adversarial Networks (GANs) and advanced optimization techniques to improve model performance with limited data. Specifically, the paper addresses the noise and bias issues introduced by data augmentation methods, contrasting them with model-based approaches, such as fine-tuning and metric learning, which rely heavily on related datasets. By combining Markov Chain Monte Carlo (MCMC) sampling and discriminative model ensemble strategies within a GAN framework, the proposed model adjusts generative and discriminative distributions to simulate a broader range of relevant data. Furthermore, it employs MHLoss and a reparameterized GAN ensemble to enhance stability and accelerate convergence, ultimately leading to improved classification performance on small-sample images and structured datasets. Results confirm that the MhERGAN algorithm developed in this research is highly effective for few-shot learning, offering a practical solution that bridges data scarcity with high-performing model adaptability and generalization.
摘要:本文提出了一种创新方法,通过在旨在应对小样本数据挑战的框架中将数据增强与模型微调相结合来增强小样本学习。认识到需要大量数据集的传统机器学习模型的严重局限性——尤其是在药物发现、目标识别和恶意流量检测等领域——本研究提出了一种新策略,利用生成对抗网络 (GAN) 和高级优化技术来提高有限数据下的模型性能。具体来说,本文解决了数据增强方法引入的噪声和偏差问题,并将其与微调和度量学习等严重依赖相关数据集的基于模型的方法进行了对比。通过在 GAN 框架内结合马尔可夫链蒙特卡罗 (MCMC) 采样和判别模型集成策略,所提出的模型可以调整生成和判别分布以模拟更广泛的相关数据。此外,它还采用了 MHLoss 和重新参数化的 GAN 集成来增强稳定性并加速收敛,最终提高了小样本图像和结构化数据集的分类性能。结果证实,本研究开发的 MhERGAN 算法对于小样本学习非常有效,提供了一种实用的解决方案,既解决了数据稀缺问题,又实现了高性能的模型适应性和泛化能力。
Title: Rethinking Diffusion for Text-Driven Human Motion Generation
Copy Paste: [[2411.16575]] Rethinking Diffusion for Text-Driven Human Motion Generation(https://arxiv.org/abs/2411.16575)
Keywords: generation
Abstract: Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly dominated human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as limited discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous space generation nature of diffusion-based methods makes them well-suited to address these limitations and with even potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model enabled to perform bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we also propose more robust evaluation methods to fairly assess different-based methods. Extensive experiments on benchmark human motion generation datasets demonstrate that our method excels previous methods and achieves state-of-the-art performances.
Copy Paste: [[2411.16602]] Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models(https://arxiv.org/abs/2411.16602)
Keywords: generation
Abstract: Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.
Copy Paste: [[2411.16619]] Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric(https://arxiv.org/abs/2411.16619)
Keywords: generation, quality assessment
Abstract: AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often exhibit substantial visual and semantic distortions, hindering the practical application of video generation technologies in real-world scenarios. To address this challenge, we conduct a pioneering study on human activity AGV quality assessment, focusing on visual quality evaluation and the identification of semantic distortions. First, we construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 3,200 AGVs derived from 8 popular text-to-video (T2V) models using 400 text prompts that describe diverse human activities. We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. Based on Human-AGVQA, we benchmark the performance of T2V models and analyze their strengths and weaknesses in generating different categories of human activities. Second, we develop an objective evaluation metric, named AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human activity AGVs. The extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its efficacy in assessing the quality of human activity AGVs. The Human-AGVQA dataset and GHVQ metric will be released in public at this https URL
Copy Paste: [[2411.16622]] Imperceptible Adversarial Examples in the Physical World(https://arxiv.org/abs/2411.16622)
Keywords: generation
Abstract: Adversarial examples in the digital domain against deep learning-based computer vision models allow for perturbations that are imperceptible to human eyes. However, producing similar adversarial examples in the physical world has been difficult due to the non-differentiable image distortion functions in visual sensing systems. The existing algorithms for generating physically realizable adversarial examples often loosen their definition of adversarial examples by allowing unbounded perturbations, resulting in obvious or even strange visual patterns. In this work, we make adversarial examples imperceptible in the physical world using a straight-through estimator (STE, a.k.a. BPDA). We employ STE to overcome the non-differentiability -- applying exact, non-differentiable distortions in the forward pass of the backpropagation step, and using the identity function in the backward pass. Our differentiable rendering extension to STE also enables imperceptible adversarial patches in the physical world. Using printout photos, and experiments in the CARLA simulator, we show that STE enables fast generation of $\ell_\infty$ bounded adversarial examples despite the non-differentiable distortions. To the best of our knowledge, this is the first work demonstrating imperceptible adversarial examples bounded by small $\ell_\infty$ norms in the physical world that force zero classification accuracy in the global perturbation threat model and cause near-zero ($4.22\%$) AP50 in object detection in the patch perturbation threat model. We urge the community to re-evaluate the threat of adversarial examples in the physical world.
Copy Paste: [[2411.16644]] Exploring Discrete Flow Matching for 3D De Novo Molecule Generation(https://arxiv.org/abs/2411.16644)
Keywords: generation, generative
Abstract: Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Flow matching is a recently proposed generative modeling framework that has achieved impressive performance on a variety of tasks including those on biomolecular structures. The seminal flow matching framework was developed only for continuous data. However, de novo molecular design tasks require generating discrete data such as atomic elements or sequences of amino acid residues. Several discrete flow matching methods have been proposed recently to address this gap. In this work we benchmark the performance of existing discrete flow matching methods for 3D de novo small molecule generation and provide explanations of their differing behavior. As a result we present FlowMol-CTMC, an open-source model that achieves state of the art performance for 3D de novo design with fewer learnable parameters than existing methods. Additionally, we propose the use of metrics that capture molecule quality beyond local chemical valency constraints and towards higher-order structural motifs. These metrics show that even though basic constraints are satisfied, the models tend to produce unusual and potentially problematic functional groups outside of the training data distribution. Code and trained models for reproducing this work are available at \url{this https URL}.
摘要:产生新分子结构的深度生成模型有可能促进化学发现。流匹配是一种最近提出的生成建模框架,在包括生物分子结构在内的各种任务上都取得了令人印象深刻的表现。精髓流匹配框架仅针对连续数据开发。然而,从头分子设计任务需要生成离散数据,例如原子元素或氨基酸残基序列。最近提出了几种离散流匹配方法来弥补这一差距。在这项工作中,我们对现有离散流匹配方法在 3D 从头小分子生成中的性能进行了基准测试,并解释了它们的不同行为。因此,我们提出了 FlowMol-CTMC,这是一种开源模型,它以比现有方法更少的可学习参数实现了 3D 从头设计的最先进性能。此外,我们建议使用指标来捕捉超越局部化学价态约束和更高阶结构基序的分子质量。这些指标表明,即使满足基本约束,模型也倾向于在训练数据分布之外产生不寻常且可能有问题的功能组。用于重现这项工作的代码和训练模型可以在 \url{此 https URL} 中找到。
Title: DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
Copy Paste: [[2411.16657]] DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation(https://arxiv.org/abs/2411.16657)
Keywords: generation
Abstract: Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.
Copy Paste: [[2411.16681]] Factorized Visual Tokenization and Generation(https://arxiv.org/abs/2411.16681)
Keywords: generation
Abstract: Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy, promoting diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models like CLIP and DINO to infuse semantic richness into the learned representations. This design ensures our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted into auto-regressive image generation. this https URL
Copy Paste: [[2411.16683]] Generative Omnimatte: Learning to Decompose Video into Layers(https://arxiv.org/abs/2411.16683)
Keywords: generative
Abstract: Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.