2024-12-18

Title: SceneDiffuser: Efficient and Controllable Driving Simulation Initialization and Rollout

Authors: Chiyu Max Jiang, Yijing Bai, Andre Cornman, Christopher Davis, Xiukun Huang, Hong Jeon, Sakshum Kulshrestha, John Lambert, Shuangyu Li, Xuanyu Zhou, Carlos Fuertes, Chang Yuan, Mingxing Tan, Yin Zhou, Dragomir Anguelov
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.12129
Pdf URL: https://arxiv.org/pdf/2412.12129
Copy Paste: [[2412.12129]] SceneDiffuser: Efficient and Controllable Driving Simulation Initialization and Rollout(https://arxiv.org/abs/2412.12129)
Keywords: diffusion
Abstract: Realistic and interactive scene simulation is a key prerequisite for autonomous vehicle (AV) development. In this work, we present SceneDiffuser, a scene-level diffusion prior designed for traffic simulation. It offers a unified framework that addresses two key stages of simulation: scene initialization, which involves generating initial traffic layouts, and scene rollout, which encompasses the closed-loop simulation of agent behaviors. While diffusion models have been proven effective in learning realistic and multimodal agent distributions, several challenges remain, including controllability, maintaining realism in closed-loop simulations, and ensuring inference efficiency. To address these issues, we introduce amortized diffusion for simulation. This novel diffusion denoising paradigm amortizes the computational cost of denoising over future simulation steps, significantly reducing the cost per rollout step (16x fewer inference steps) while also mitigating closed-loop errors. We further enhance controllability through the introduction of generalized hard constraints, a simple yet effective inference-time constraint mechanism, as well as language-based constrained scene generation via few-shot prompting of a large language model (LLM). Our investigations into model scaling reveal that increased computational resources significantly improve overall simulation realism. We demonstrate the effectiveness of our approach on the Waymo Open Sim Agents Challenge, achieving top open-loop performance and the best closed-loop performance among diffusion models.

Title: PyOD 2: A Python Library for Outlier Detection with LLM-powered Model Selection

Authors: Sihan Chen, Zhuangzhuang Qian, Wingchun Siu, Xingcan Hu, Jiaqi Li, Shawn Li, Yuehan Qin, Tiankai Yang, Zhuo Xiao, Wanghao Ye, Yichi Zhang, Yushun Dong, Yue Zhao
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.12154
Pdf URL: https://arxiv.org/pdf/2412.12154
Copy Paste: [[2412.12154]] PyOD 2: A Python Library for Outlier Detection with LLM-powered Model Selection(https://arxiv.org/abs/2412.12154)
Keywords: anomaly
Abstract: Outlier detection (OD), also known as anomaly detection, is a critical machine learning (ML) task with applications in fraud detection, network intrusion detection, clickstream analysis, recommendation systems, and social network moderation. Among open-source libraries for outlier detection, the Python Outlier Detection (PyOD) library is the most widely adopted, with over 8,500 GitHub stars, 25 million downloads, and diverse industry usage. However, PyOD currently faces three limitations: (1) insufficient coverage of modern deep learning algorithms, (2) fragmented implementations across PyTorch and TensorFlow, and (3) no automated model selection, making it hard for non-experts. To address these issues, we present PyOD Version 2 (PyOD 2), which integrates 12 state-of-the-art deep learning models into a unified PyTorch framework and introduces a large language model (LLM)-based pipeline for automated OD model selection. These improvements simplify OD workflows, provide access to 45 algorithms, and deliver robust performance on various datasets. In this paper, we demonstrate how PyOD 2 streamlines the deployment and automation of OD models and sets a new standard in both research and industry. PyOD 2 is accessible at this https URL. This study aligns with the Web Mining and Content Analysis track, addressing topics such as the robustness of Web mining methods and the quality of algorithmically-generated Web data.

Title: What Makes In-context Learning Effective for Mathematical Reasoning: A Theoretical Analysis

Authors: Jiayu Liu, Zhenya Huang, Chaokun Wang, Xunpeng Huang, Chengxiang Zhai, Enhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12157
Pdf URL: https://arxiv.org/pdf/2412.12157
Copy Paste: [[2412.12157]] What Makes In-context Learning Effective for Mathematical Reasoning: A Theoretical Analysis(https://arxiv.org/abs/2412.12157)
Keywords: in-context
Abstract: Owing to the capability of in-context learning, large language models (LLMs) have shown impressive performance across diverse mathematical reasoning benchmarks. However, we find that few-shot demonstrations can sometimes hurt performance and that their effect on LLMs' reasoning abilities remains unreliable. To this end, in this paper, we aim to theoretically analyze the impact of in-context demonstrations on LLMs' reasoning performance. We prove that the reasoning efficacy (measured by empirical prediction loss) can be bounded by an LLM-oriented semantic similarity and an inference stability of demonstrations, which is general for both one-shot and few-shot scenarios. Based on this finding, we propose a straightforward, generalizable, and low-complexity demonstration selection method named LMS3. It adaptively selects the most pertinent samples for different LLMs and includes a novel demonstration rejection mechanism to automatically filter out samples that are unsuitable for few-shot learning. Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify that LMS3 achieves consistent improvements on all datasets, which existing methods have been unable to accomplish.
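The selection-plus-rejection idea in the abstract above can be illustrated with a generic sketch. This is not LMS3: the abstract does not give its scoring formula, so plain cosine similarity stands in for the "LLM-oriented semantic similarity", and the function names, the `min_score` threshold, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_demonstrations(query_vec, candidates, k=2, min_score=0.5):
    """Return up to k candidate names ranked by similarity to the query.

    candidates: list of (name, embedding) pairs.
    Candidates scoring below min_score are rejected (an assumed stand-in
    for a rejection mechanism), so the result may be empty, which would
    correspond to falling back to zero-shot prompting.
    """
    scored = [(cosine(query_vec, vec), name) for name, vec in candidates]
    kept = [(s, n) for s, n in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in kept[:k]]

# Toy 2-D embeddings, purely illustrative.
query = [1.0, 0.0]
pool = [("demo_a", [0.9, 0.1]), ("demo_b", [0.0, 1.0]), ("demo_c", [0.7, 0.7])]
selected = select_demonstrations(query, pool, k=2, min_score=0.5)
```

Here `demo_b` is orthogonal to the query and is filtered out by the threshold rather than merely ranked last, which is the point of having a rejection step in addition to top-k ranking.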

Title: Multi-Surrogate-Teacher Assistance for Representation Alignment in Fingerprint-based Indoor Localization

Authors: Son Minh Nguyen, Linh Duy Tran, Duc Viet Le, Paul J.M Havinga
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12189
Pdf URL: https://arxiv.org/pdf/2412.12189
Copy Paste: [[2412.12189]] Multi-Surrogate-Teacher Assistance for Representation Alignment in Fingerprint-based Indoor Localization(https://arxiv.org/abs/2412.12189)
Keywords: generative
Abstract: Despite remarkable progress in knowledge transfer across visual and textual domains, extending these achievements to indoor localization, particularly for learning transferable representations among Received Signal Strength (RSS) fingerprint datasets, remains a challenge. This is due to inherent discrepancies among these RSS datasets, chiefly variations in building structure and in the input number and disposition of WiFi anchors. Accordingly, specialized networks, which lack the ability to discern transferable representations, readily incorporate environment-sensitive clues into the learning process, hence limiting their potential when applied to specific RSS datasets. In this work, we propose a plug-and-play (PnP) framework of knowledge transfer, facilitating the exploitation of transferable representations for specialized networks directly on target RSS datasets through two main phases. Initially, we design an Expert Training phase, which features multiple surrogate generative teachers, all serving as a global adapter that homogenizes the input disparities among independent source RSS datasets while preserving their unique characteristics. In a subsequent Expert Distilling phase, we introduce a triplet of underlying constraints that requires minimizing the differences in essential knowledge between the specialized network and surrogate teachers through refining its representation learning on the target dataset. This process implicitly fosters a representational alignment that is less sensitive to specific environmental dynamics. Extensive experiments conducted on three benchmark WiFi RSS fingerprint datasets underscore the effectiveness of the framework, which significantly exerts the full potential of specialized networks in localization.

Title: No Free Lunch for Defending Against Prefilling Attack by In-Context Learning

Authors: Zhiyu Xue, Guangliang Liu, Bocheng Chen, Kristen Marie Johnson, Ramtin Pedarsani
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12192
Pdf URL: https://arxiv.org/pdf/2412.12192
Copy Paste: [[2412.12192]] No Free Lunch for Defending Against Prefilling Attack by In-Context Learning(https://arxiv.org/abs/2412.12192)
Keywords: in-context
Abstract: The security of Large Language Models (LLMs) has become an important research topic since the emergence of ChatGPT. Though there have been various effective methods to defend against jailbreak attacks, prefilling attacks remain an unsolved and popular threat against open-sourced LLMs. In-Context Learning (ICL) offers a computationally efficient defense against various jailbreak attacks, yet no effective ICL methods have been developed to counter prefilling attacks. In this paper, we: (1) show that ICL can effectively defend against prefilling jailbreak attacks by employing adversative sentence structures within demonstrations; (2) characterize the effectiveness of this defense through the lens of model size, number of demonstrations, over-defense, integration with other jailbreak attacks, and the presence of safety alignment. Given the experimental results and our analysis, we conclude that there is no free lunch for defending against prefilling jailbreak attacks with ICL. On the one hand, current safety alignment methods fail to mitigate prefilling jailbreak attacks, but adversative structures within ICL demonstrations provide robust defense across various model sizes and complex jailbreak attacks. On the other hand, LLMs exhibit similar over-defensiveness when utilizing ICL demonstrations with adversative structures, and this behavior appears to be independent of model size.

Title: Are Large Language Models Useful for Time Series Data Analysis?

Authors: Francis Tang, Ying Ding
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12219
Pdf URL: https://arxiv.org/pdf/2412.12219
Copy Paste: [[2412.12219]] Are Large Language Models Useful for Time Series Data Analysis?(https://arxiv.org/abs/2412.12219)
Keywords: anomaly
Abstract: Time series data plays a critical role across diverse domains such as healthcare, energy, and finance, where tasks like classification, anomaly detection, and forecasting are essential for informed decision-making. Recently, large language models (LLMs) have gained prominence for their ability to handle complex data and extract meaningful insights. This study investigates whether LLMs are effective for time series data analysis by comparing their performance with non-LLM-based approaches across three tasks: classification, anomaly detection, and forecasting. Through a series of experiments using GPT4TS and autoregressive models, we evaluate their performance on benchmark datasets and assess their accuracy, precision, and ability to generalize. Our findings indicate that while LLM-based methods excel in specific tasks like anomaly detection, their benefits are less pronounced in others, such as forecasting, where simpler models sometimes perform comparably or better. This research highlights the role of LLMs in time series analysis and lays the groundwork for future studies to systematically explore their applications and limitations in handling temporal data.

Title: Can video generation replace cinematographers? Research on the cinematic language of generated video

Authors: Xiaozhe Li, Kai WU, Siyi Yang, YiZhan Qu, Guohua.Zhang, Zhiyu Chen, Jiayao Li, Jiangchuan Mu, Xiaobin Hu, Wen Fang, Mingliang Xiong, Hao Deng, Qingwen Liu, Gang Li, Bin He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12223
Pdf URL: https://arxiv.org/pdf/2412.12223
Copy Paste: [[2412.12223]] Can video generation replace cinematographers? Research on the cinematic language of generated video(https://arxiv.org/abs/2412.12223)
Keywords: diffusion
Abstract: Recent advancements in text-to-video (T2V) generation have leveraged diffusion models to enhance the visual coherence of videos generated from textual descriptions. However, most research has primarily focused on object motion, with limited attention given to cinematic language in videos, which is crucial for cinematographers to convey emotion and narrative pacing. To address this limitation, we propose a threefold approach to enhance the ability of T2V models to generate controllable cinematic language. Specifically, we introduce a cinematic language dataset that encompasses shot framing, angle, and camera movement, enabling models to learn diverse cinematic styles. Building on this, to facilitate robust cinematic alignment evaluation, we present CameraCLIP, a model fine-tuned on the proposed dataset that excels in understanding complex cinematic language in generated videos and can further provide valuable guidance in the multi-shot composition process. Finally, we propose CLIPLoRA, a cost-guided dynamic LoRA composition method that facilitates smooth transitions and realistic blending of cinematic language by dynamically fusing multiple pre-trained cinematic LoRAs within a single video. Our experiments demonstrate that CameraCLIP outperforms existing models in assessing the alignment between cinematic language and video, achieving an R@1 score of 0.81. Additionally, CLIPLoRA improves the ability for multi-shot composition, potentially bridging the gap between automatically generated videos and those shot by professional cinematographers.
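For readers unfamiliar with the R@1 figure quoted above, the following is a minimal sketch of the standard Recall@K retrieval metric. The abstract does not describe the evaluation protocol (for example, the retrieval direction), so the similarity matrix, the assumption that the correct match for query i sits at index i, and the function name are illustrative only.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct item is among the top-k scores.

    similarity[i][j] is the score between query i and candidate j; the
    correct candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy scores: queries 0 and 2 rank their own match first, query 1 does not.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

Under this definition, an R@1 of 0.81 would mean that the top-ranked candidate is the correct one for 81% of queries.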

Title: You Only Submit One Image to Find the Most Suitable Generative Model

Authors: Zhi Zhou, Lan-Zhe Guo, Peng-Xiao Song, Yu-Feng Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12232
Pdf URL: https://arxiv.org/pdf/2412.12232
Copy Paste: [[2412.12232]] You Only Submit One Image to Find the Most Suitable Generative Model(https://arxiv.org/abs/2412.12232)
Keywords: generative
Abstract: Deep generative models have achieved promising results in image generation, and various generative model hubs, e.g., Hugging Face and Civitai, have been developed that enable model developers to upload models and users to download models. However, these model hubs lack advanced model management and identification mechanisms, resulting in users only searching for models through text matching, download sorting, etc., making it difficult to efficiently find the model that best meets user requirements. In this paper, we propose a novel setting called Generative Model Identification (GMI), which aims to enable the user to identify the most appropriate generative model(s) for the user's requirements from a large number of candidate models efficiently. To the best of our knowledge, this setting has not been studied before. We introduce a comprehensive solution consisting of three pivotal modules: a weighted Reduced Kernel Mean Embedding (RKME) framework for capturing the generated image distribution and the relationship between images and prompts, a pre-trained vision-language model aimed at addressing dimensionality challenges, and an image interrogator designed to tackle cross-modality issues. Extensive empirical results demonstrate the proposal is both efficient and effective. For example, users only need to submit a single example image to describe their requirements, and the model platform can achieve an average top-4 identification accuracy of more than 80%.

Title: OmniPrism: Learning Disentangled Visual Concept for Image Generation

Authors: Yangyang Li, Daqing Liu, Wu Liu, Allen He, Xinchen Liu, Yongdong Zhang, Guoqing Jin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12242
Pdf URL: https://arxiv.org/pdf/2412.12242
Copy Paste: [[2412.12242]] OmniPrism: Learning Disentangled Visual Concept for Image Generation(https://arxiv.org/abs/2412.12242)
Keywords: diffusion
Abstract: Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes. However, existing methods are typically constrained to single-aspect concept generation or are easily disrupted by irrelevant concepts in multi-aspect concept scenarios, leading to concept confusion and hindering creative generation. To address this, we propose OmniPrism, a visual concept disentangling approach for creative image generation. Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts. We utilize the rich semantic space of a multimodal extractor to achieve concept disentanglement from given images and concept guidance. To disentangle concepts with different semantics, we construct a paired concept disentangled dataset (PCD-200K), where each pair shares the same concept such as content, style, and composition. We learn disentangled concept representations through our contrastive orthogonal disentangled (COD) training pipeline, which are then injected into additional diffusion cross-attention layers for generation. A set of block embeddings is designed to adapt each block's concept domain in the diffusion models. Extensive experiments demonstrate that our method can generate high-quality, concept-disentangled results with high fidelity to text prompts and desired concepts.

Title: Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers

Authors: Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12276
Pdf URL: https://arxiv.org/pdf/2412.12276
Copy Paste: [[2412.12276]] Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers(https://arxiv.org/abs/2412.12276)
Keywords: in-context
Abstract: Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which begs the question of how. In this paper, we propose the concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to and predictive of ICL performance. Our empirical insights shed light on the success and failure modes of large language models via their representations.

Title: Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Authors: Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12278
Pdf URL: https://arxiv.org/pdf/2412.12278
Copy Paste: [[2412.12278]] Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content(https://arxiv.org/abs/2412.12278)
Keywords: foundation model, generative
Abstract: Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and synthEtic videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.

Title: F-RBA: A Federated Learning-based Framework for Risk-based Authentication

Authors: Hamidreza Fereidouni, Abdelhakim Senhaji Hafid, Dimitrios Makrakis, Yaser Baseri
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12324
Pdf URL: https://arxiv.org/pdf/2412.12324
Copy Paste: [[2412.12324]] F-RBA: A Federated Learning-based Framework for Risk-based Authentication(https://arxiv.org/abs/2412.12324)
Keywords: anomaly
Abstract: The proliferation of Internet services has led to an increasing need to protect private data. User authentication serves as a crucial mechanism to ensure data security. Although robust authentication forms the cornerstone of remote service security, it can still leave users vulnerable to credential disclosure, device-theft attacks, session hijacking, and inadequate adaptive security measures. Risk-based Authentication (RBA) emerges as a potential solution, offering a multi-level authentication approach that enhances user experience without compromising security. In this paper, we propose a Federated Risk-based Authentication (F-RBA) framework that leverages Federated Learning to ensure privacy-centric training, keeping user data local while distributing learning across devices. Whereas traditional approaches rely on centralized storage, F-RBA introduces a distributed architecture where risk assessment occurs locally on users' devices. The framework's core innovation lies in its similarity-based feature engineering approach, which addresses the heterogeneous data challenges inherent in federated settings, a significant advancement for distributed authentication. By facilitating real-time risk evaluation across devices while maintaining unified user profiles, F-RBA achieves a balance between data protection, security, and scalability. Through its federated approach, F-RBA addresses the cold-start challenge in risk model creation, enabling swift adaptation to new users without compromising security. Empirical evaluation using a real-world multi-user dataset demonstrates the framework's effectiveness, achieving a superior true positive rate for detecting suspicious logins compared to conventional unsupervised anomaly detection models. This research introduces a new paradigm for privacy-focused RBA in distributed digital environments, facilitating advancements in federated security systems.
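The "keep user data local while distributing learning across devices" pattern mentioned above is commonly realized with federated averaging. The sketch below shows that generic pattern on a toy one-parameter model; it is not the F-RBA training procedure, which the abstract does not specify, and every name and data value in it is hypothetical.

```python
def local_update(weight, data, lr=0.1):
    """One gradient step of least-squares on a 1-D model y = w * x.

    Runs on the client; only the updated weight leaves the device.
    """
    grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
    return weight - lr * grad

def federated_round(global_w, client_datasets, lr=0.1):
    """Average client updates, weighted by each client's sample count."""
    total = sum(len(d) for d in client_datasets)
    new_w = 0.0
    for data in client_datasets:
        new_w += local_update(global_w, data, lr) * len(data) / total
    return new_w

# Two clients holding private (x, y) samples drawn from y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
]
w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
```

The server only ever sees per-client weights, never the raw samples, and the averaged model still converges to the shared underlying slope of 2.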

Title: BioRAGent: A Retrieval-Augmented Generation System for Showcasing Generative Query Expansion and Domain-Specific Search for Scientific Q&A

Authors: Samy Ateia, Udo Kruschwitz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12358
Pdf URL: https://arxiv.org/pdf/2412.12358
Copy Paste: [[2412.12358]] BioRAGent: A Retrieval-Augmented Generation System for Showcasing Generative Query Expansion and Domain-Specific Search for Scientific Q&A(https://arxiv.org/abs/2412.12358)
Keywords: generative
Abstract: We present BioRAGent, an interactive web-based retrieval-augmented generation (RAG) system for biomedical question answering. The system uses large language models (LLMs) for query expansion, snippet extraction, and answer generation while maintaining transparency through citation links to the source documents and displaying generated queries for further editing. Building on our successful participation in the BioASQ 2024 challenge, we demonstrate how few-shot learning with LLMs can be effectively applied for a professional search setting. The system supports both direct short paragraph style responses and responses with inline citations. Our demo is available online, and the source code is publicly accessible through GitHub.

Title: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering

Authors: Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.12359
Pdf URL: https://arxiv.org/pdf/2412.12359
Copy Paste: [[2412.12359]] Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering(https://arxiv.org/abs/2412.12359)
Keywords: in-context
Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, equips MLLMs with abilities like instruction following and in-context learning. In contrast, the visual modality enhances performance in downstream tasks by leveraging rich semantic content, spatial information, and grounding capabilities. These intrinsic modalities work synergistically across various visual tasks. Our research initially reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning. This imbalance occurs when using both full fine-tuning and parameter-efficient fine-tuning (PEFT) methods. We then found that re-balancing these modalities can significantly reduce the number of trainable parameters required, inspiring a direction for further optimizing visual instruction tuning. We introduce Modality Linear Representation-Steering (MoReS) to achieve the goal. MoReS effectively re-balances the intrinsic modalities throughout the model, where the key idea is to steer visual representations through linear transformations in the visual subspace across each model layer. To validate our solution, we composed LLaVA Steering, a suite of models integrated with the proposed MoReS method. Evaluation results show that the composed LLaVA Steering models require, on average, 500 times fewer trainable parameters than LoRA needs while still achieving comparable performance across three visual benchmarks and eight visual question-answering tasks. Last, we present the LLaVA Steering Factory, an in-house developed platform that enables researchers to quickly customize various MLLMs with component-based architecture for seamlessly integrating state-of-the-art models, and evaluate their intrinsic modality imbalance.

Title: Efficient Scaling of Diffusion Transformers for Text-to-Image Generation

Authors: Hao Li, Shamit Lal, Zhiheng Li, Yusheng Xie, Ying Wang, Yang Zou, Orchid Majumder, R. Manmatha, Zhuowen Tu, Stefano Ermon, Stefano Soatto, Ashwin Swaminathan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12391
Pdf URL: https://arxiv.org/pdf/2412.12391
Copy Paste: [[2412.12391]] Efficient Scaling of Diffusion Transformers for Text-to-Image Generation(https://arxiv.org/abs/2412.12391)
Keywords: diffusion
Abstract: We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets up to 600M images. We find that U-ViT, a pure self-attention based DiT model, provides a simpler design and scales more effectively in comparison with cross-attention based DiT variants, which allows straightforward expansion for extra conditions and other modalities. We find that a 2.3B U-ViT model can outperform SDXL UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and enhanced long captions improve text-image alignment performance and learning efficiency.

Title: Causally Consistent Normalizing Flow

Authors: Qingyang Zhou, Kangjie Lu, Meng Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.12401
Pdf URL: https://arxiv.org/pdf/2412.12401
Copy Paste: [[2412.12401]] Causally Consistent Normalizing Flow(https://arxiv.org/abs/2412.12401)
Keywords: generative
Abstract: Causal inconsistency arises when the underlying causal graphs captured by generative models like Normalizing Flows (NFs) are inconsistent with those specified in causal models like Structural Causal Models (SCMs). This inconsistency can cause unwanted issues including the unfairness problem. Prior works to achieve causal consistency inevitably compromise the expressiveness of their models by disallowing hidden layers. In this work, we introduce a new approach: Causally Consistent Normalizing Flow (CCNF). To the best of our knowledge, CCNF is the first causally consistent generative model that can approximate any distribution with multiple layers. CCNF relies on two novel constructs: a sequential representation of SCMs and partial causal transformations. These constructs allow CCNF to inherently maintain causal consistency without sacrificing expressiveness. CCNF can handle all forms of causal inference tasks, including interventions and counterfactuals. Through experiments, we show that CCNF outperforms current approaches in causal inference. We also empirically validate the practical utility of CCNF by applying it to real-world datasets and show how CCNF addresses challenges like unfairness effectively.

Title: DeepSN: A Sheaf Neural Framework for Influence Maximization

Authors: Asela Hevapathige, Qing Wang, Ahad N. Zehmakan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12416
Pdf URL: https://arxiv.org/pdf/2412.12416
Copy Paste: [[2412.12416]] DeepSN: A Sheaf Neural Framework for Influence Maximization(https://arxiv.org/abs/2412.12416)
Keywords: diffusion
Abstract: Influence maximization is a key topic in data mining, with broad applications in social network analysis and viral marketing. In recent years, researchers have increasingly turned to machine learning techniques to address this problem. They have developed methods to learn the underlying diffusion processes in a data-driven manner, which enhances the generalizability of the solution, and have designed optimization objectives to identify the optimal seed set. Nonetheless, two fundamental gaps remain unsolved: (1) Graph Neural Networks (GNNs) are increasingly used to learn diffusion models, but in their traditional form, they often fail to capture the complex dynamics of influence diffusion; and (2) designing optimization objectives is challenging due to combinatorial explosion when solving this problem. To address these challenges, we propose a novel framework, DeepSN. Our framework employs sheaf neural diffusion to learn diverse influence patterns in a data-driven, end-to-end manner, providing enhanced separability in capturing diffusion characteristics. We also propose an optimization technique that accounts for overlapping influence between vertices, which helps to reduce the search space and identify the optimal seed set effectively and efficiently. Finally, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate the effectiveness of our framework.
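To make "accounting for overlapping influence between vertices" concrete, here is a sketch of the classic greedy marginal-gain heuristic for seed selection. It is a textbook baseline, not the DeepSN optimization technique (which the abstract does not detail), and the fixed `reach` sets are a hypothetical simplification of a learned diffusion model.

```python
def greedy_seed_selection(reach, budget):
    """Pick up to `budget` seeds by largest marginal gain in covered vertices.

    reach maps each vertex to the set of vertices it is assumed to
    influence (including itself). Counting only newly covered vertices
    discounts the overlap between a candidate and already chosen seeds.
    """
    seeds, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for v in sorted(reach):  # sorted for deterministic tie-breaking
            if v in seeds:
                continue
            gain = len(reach[v] - covered)
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # no remaining vertex adds coverage
            break
        seeds.append(best)
        covered |= reach[best]
    return seeds, covered

# Toy influence sets: "b" reaches three vertices, but all overlap with "a".
reach = {
    "a": {"a", "b", "c", "d"},
    "b": {"b", "c", "d"},
    "e": {"e", "f"},
    "f": {"f"},
}
seeds, covered = greedy_seed_selection(reach, budget=2)
```

Ranking by raw reach alone would pick "a" and "b"; discounting overlap picks "a" and "e" instead, which covers all six vertices.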

Title: LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers

Authors: Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Yanyu Li, Yifan Gong, Kai Zhang, Hao Tan, Jason Kuen, Henghui Ding, Zhihao Shu, Wei Niu, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12444
Pdf URL: https://arxiv.org/pdf/2412.12444
Copy Paste: [[2412.12444]] LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers(https://arxiv.org/abs/2412.12444)
Keywords: diffusion, generative
Abstract: Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks, demonstrating superior performance and efficacy across various applications. The promising results come at the cost of slow inference, as each denoising step requires running the whole transformer model with a large number of parameters. In this paper, we show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. Furthermore, we show that the lower bound of similarity between outputs at consecutive steps is notably high, and this similarity can be linearly approximated using the inputs. To verify our demonstrations, we propose \textbf{LazyDiT}, a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations. Specifically, we incorporate lazy learning layers into the model, effectively trained to maximize laziness, enabling dynamic skipping of redundant computations. Experimental results show that LazyDiT outperforms the DDIM sampler across multiple diffusion transformer models at various resolutions. Furthermore, we implement our method on mobile devices, achieving better performance than DDIM with similar latency.

Title: PromptDet: A Lightweight 3D Object Detection Framework with LiDAR Prompts

Authors: Kun Guo, Qiang Ling
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12460
Pdf URL: https://arxiv.org/pdf/2412.12460
Copy Paste: [[2412.12460]] PromptDet: A Lightweight 3D Object Detection Framework with LiDAR Prompts(https://arxiv.org/abs/2412.12460)
Keywords: foundation model
Abstract: Multi-camera 3D object detection aims to detect and localize objects in 3D space using multiple cameras, and has attracted growing attention due to its cost-effectiveness. However, these methods often struggle with the lack of accurate depth estimation caused by the natural weakness of the camera in ranging. Recently, multi-modal fusion and knowledge distillation methods for 3D object detection have been proposed to solve this problem, but they are time-consuming during the training phase and memory-intensive. In light of this, we propose PromptDet, a lightweight yet effective 3D object detection framework motivated by the success of prompt learning in 2D foundation models. Our proposed framework, PromptDet, comprises two integral components: a general camera-based detection module, exemplified by models like BEVDet and BEVDepth, and a LiDAR-assisted prompter. The LiDAR-assisted prompter leverages the LiDAR points as a complementary signal, enriched with a minimal set of additional trainable parameters. Notably, our framework is flexible due to our prompt-like design, which can not only be used as a lightweight multi-modal fusion method but also as a camera-only method for 3D object detection during the inference phase. Extensive experiments on nuScenes validate the effectiveness of the proposed PromptDet. As a multi-modal detector, PromptDet improves the mAP and NDS by at most 22.8\% and 21.1\% with fewer than 2\% extra parameters compared with the camera-only baseline. Without LiDAR points, PromptDet still achieves an improvement of at most 2.4\% mAP and 4.0\% NDS with almost no impact on camera detection inference time.

Title: Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy

Authors: Aditya Ganeshan, Thibault Groueix, Paul Guerrero, Radomír Měch, Matthew Fisher, Daniel Ritchie
Subjects: cs.CV, cs.AI, cs.GR, cs.HC
Abstract URL: https://arxiv.org/abs/2412.12463
Pdf URL: https://arxiv.org/pdf/2412.12463
Copy Paste: [[2412.12463]] Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy(https://arxiv.org/abs/2412.12463)
Keywords: diffusion, generative
Abstract: Pattern images are everywhere in the digital and physical worlds, and tools to edit them are valuable. But editing pattern images is tricky: desired edits are often programmatic, that is, structure-aware edits that alter the underlying program which generates the pattern. One could attempt to infer this underlying program, but current methods for doing so struggle with complex images and produce unorganized programs that make editing tedious. In this work, we introduce a novel approach to perform programmatic edits on pattern images. By using a pattern analogy -- a pair of simple patterns to demonstrate the intended edit -- and a learning-based generative model to execute these edits, our method allows users to intuitively edit patterns. To enable this paradigm, we introduce SplitWeave, a domain-specific language that, combined with a framework for sampling synthetic pattern analogies, enables the creation of a large, high-quality synthetic training dataset. We also present TriFuser, a Latent Diffusion Model (LDM) designed to overcome critical issues that arise when naively deploying LDMs to this task. Extensive experiments on real-world, artist-sourced patterns reveal that our method faithfully performs the demonstrated edit while also generalizing to related pattern styles beyond its training distribution.

Title: Transferable and Forecastable User Targeting Foundation Model

Authors: Bin Dou, Baokun Wang, Yun Zhu, Xiaotong Lin, Yike Xu, Xiaorui Huang, Yang Chen, Yun Liu, Shaoshuai Han, Yongchao Liu, Tianyi Zhang, Yu Cheng, Weiqiang Wang, Chuntao Hong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12468
Pdf URL: https://arxiv.org/pdf/2412.12468
Copy Paste: [[2412.12468]] Transferable and Forecastable User Targeting Foundation Model(https://arxiv.org/abs/2412.12468)
Keywords: foundation model
Abstract: User targeting, the process of selecting targeted users from a pool of candidates for non-expert marketers, has garnered substantial attention with the advancements in digital marketing. However, existing user targeting methods encounter two significant challenges: (i) Poor cross-domain and cross-scenario transferability and generalization, and (ii) Insufficient forecastability in real-world applications. These limitations hinder their applicability across diverse industrial scenarios. In this work, we propose FIND, an industrial-grade, transferable, and forecastable user targeting foundation model. To enhance cross-domain transferability, our framework integrates heterogeneous multi-scenario user data, aligning them with one-sentence targeting demand inputs through contrastive pre-training. For improved forecastability, the text description of each user is derived based on anticipated future behaviors, while user representations are constructed from historical information. Experimental results demonstrate that our approach significantly outperforms existing baselines in cross-domain, real-world user targeting scenarios, showcasing the superior capabilities of FIND. Moreover, our method has been successfully deployed on the Alipay platform and is widely utilized across various scenarios.

Title: A Method for Enhancing Generalization of Adam by Multiple Integrations

Authors: Long Jin, Han Nong, Liangming Chen, Zhenming Su
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.12473
Pdf URL: https://arxiv.org/pdf/2412.12473
Copy Paste: [[2412.12473]] A Method for Enhancing Generalization of Adam by Multiple Integrations(https://arxiv.org/abs/2412.12473)
Keywords: diffusion
Abstract: The insufficient generalization of adaptive moment estimation (Adam) has hindered its broader application. Recent studies have shown that flat minima in loss landscapes are highly associated with improved generalization. Inspired by the filtering effect of integration operations on high-frequency signals, we propose multiple integral Adam (MIAdam), a novel optimizer that integrates a multiple integral term into Adam. This multiple integral term effectively filters out sharp minima encountered during optimization, guiding the optimizer towards flatter regions and thereby enhancing generalization capability. We provide a theoretical explanation for the improvement in generalization through the diffusion theory framework and analyze the impact of the multiple integral term on the optimizer's convergence. Experimental results demonstrate that MIAdam not only enhances generalization and robustness against label noise but also maintains the rapid convergence characteristic of Adam, outperforming Adam and its variants in state-of-the-art benchmarks.

Title: Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues

Authors: Yan Zhang, Gangyan Zeng, Huawen Shen, Daiqing Wu, Yu Zhou, Can Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12502
Pdf URL: https://arxiv.org/pdf/2412.12502
Copy Paste: [[2412.12502]] Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues(https://arxiv.org/abs/2412.12502)
Keywords: generative
Abstract: Video text-based visual question answering (Video TextVQA) is a practical task that aims to answer questions by jointly reasoning textual and visual information in a given video. Inspired by the development of TextVQA in image domain, existing Video TextVQA approaches leverage a language model (e.g. T5) to process text-rich multiple frames and generate answers auto-regressively. Nevertheless, the spatio-temporal relationships among visual entities (including scene text and objects) will be disrupted and models are susceptible to interference from unrelated information, resulting in irrational reasoning and inaccurate answering. To tackle these challenges, we propose the TEA (stands for ``\textbf{T}rack th\textbf{E} \textbf{A}nswer'') method that better extends the generative TextVQA framework from image to video. TEA recovers the spatio-temporal relationships in a complementary way and incorporates OCR-aware clues to enhance the quality of reasoning questions. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. TEA outperforms existing TextVQA methods, video-language pretraining methods and video large language models by great margins.

Title: DocFusion: A Unified Framework for Document Parsing Tasks

Authors: Mingxu Chai, Ziyu Shen, Chong Zhang, Yue Zhang, Xiao Wang, Shihan Dou, Jihua Kang, Jiazheng Zhang, Qi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12505
Pdf URL: https://arxiv.org/pdf/2412.12505
Copy Paste: [[2412.12505]] DocFusion: A Unified Framework for Document Parsing Tasks(https://arxiv.org/abs/2412.12505)
Keywords: generative
Abstract: Document parsing is essential for analyzing complex document structures and extracting fine-grained information, supporting numerous downstream applications. However, existing methods often require integrating multiple independent models to handle various parsing tasks, leading to high complexity and maintenance overhead. To address this, we propose DocFusion, a lightweight generative model with only 0.28B parameters. It unifies task representations and achieves collaborative training through an improved objective function. Experiments reveal and leverage the mutually beneficial interaction among recognition tasks, and integrating recognition data significantly enhances detection performance. The final results demonstrate that DocFusion achieves state-of-the-art (SOTA) performance across four key tasks.

Title: Invisible Watermarks: Attacks and Robustness

Authors: Dongjun Hwang, Sungwon Woo, Tom Gao, Raymond Luo, Sunghwan Baek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12511
Pdf URL: https://arxiv.org/pdf/2412.12511
Copy Paste: [[2412.12511]] Invisible Watermarks: Attacks and Robustness(https://arxiv.org/abs/2412.12511)
Keywords: generative
Abstract: As Generative AI continues to become more accessible, the case for robust detection of generated images in order to combat misinformation is stronger than ever. Invisible watermarking methods act as identifiers of generated content, embedding image- and latent-space messages that are robust to many forms of perturbations. The majority of current research investigates full-image attacks against images with a single watermarking method applied. We introduce novel improvements to watermarking robustness as well as minimizing degradation on image quality during attack. Firstly, we examine the application of both image-space and latent-space watermarking methods on a single image, where we propose a custom watermark remover network which preserves one of the watermarking modalities while completely removing the other during decoding. Then, we investigate localized blurring attacks (LBA) on watermarked images based on the GradCAM heatmap acquired from the watermark decoder in order to reduce the amount of degradation to the target image. Our evaluation suggests that 1) implementing the watermark remover model to preserve one of the watermark modalities when decoding the other modality slightly improves on the baseline performance, and that 2) LBA degrades the image significantly less compared to uniform blurring of the entire image. Code is available at: this https URL

Title: Solid-SQL: Enhanced Schema-linking based In-context Learning for Robust Text-to-SQL

Authors: Geling Liu, Yunzhi Tan, Ruichao Zhong, Yuanzhen Xie, Lingchen Zhao, Qian Wang, Bo Hu, Zang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12522
Pdf URL: https://arxiv.org/pdf/2412.12522
Copy Paste: [[2412.12522]] Solid-SQL: Enhanced Schema-linking based In-context Learning for Robust Text-to-SQL(https://arxiv.org/abs/2412.12522)
Keywords: in-context
Abstract: Recently, large language models (LLMs) have significantly improved the performance of text-to-SQL systems. Nevertheless, many state-of-the-art (SOTA) approaches have overlooked the critical aspect of system robustness. Our experiments reveal that while LLM-driven methods excel on standard datasets, their accuracy is notably compromised when faced with adversarial perturbations. To address this challenge, we propose a robust text-to-SQL solution, called Solid-SQL, designed to integrate with various LLMs. We focus on the pre-processing stage, training a robust schema-linking model enhanced by LLM-based data augmentation. Additionally, we design a two-round, structural similarity-based example retrieval strategy for in-context learning. Our method achieves SOTA SQL execution accuracy levels of 82.1% and 58.9% on the general Spider and Bird benchmarks, respectively. Furthermore, experimental results show that Solid-SQL delivers an average improvement of 11.6% compared to baselines on the perturbed Spider-Syn, Spider-Realistic, and Dr. Spider benchmarks.

Title: Addressing Small and Imbalanced Medical Image Datasets Using Generative Models: A Comparative Study of DDPM and PGGANs with Random and Greedy K Sampling

Authors: Iman Khazrak, Shakhnoza Takhirova, Mostafa M. Rezaee, Mehrdad Yadollahi, Robert C. Green II, Shuteng Niu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12532
Pdf URL: https://arxiv.org/pdf/2412.12532
Copy Paste: [[2412.12532]] Addressing Small and Imbalanced Medical Image Datasets Using Generative Models: A Comparative Study of DDPM and PGGANs with Random and Greedy K Sampling(https://arxiv.org/abs/2412.12532)
Keywords: diffusion, generative
Abstract: The development of accurate medical image classification models is often constrained by privacy concerns and data scarcity for certain conditions, leading to small and imbalanced datasets. To address these limitations, this study explores the use of generative models, such as Denoising Diffusion Probabilistic Models (DDPM) and Progressive Growing Generative Adversarial Networks (PGGANs), for dataset augmentation. The research introduces a framework to assess the impact of synthetic images generated by DDPM and PGGANs on the performance of four models: a custom CNN, Untrained VGG16, Pretrained VGG16, and Pretrained ResNet50. Experiments were conducted using Random Sampling and Greedy K Sampling to create small, imbalanced datasets. The synthetic images were evaluated using Frechet Inception Distance (FID) and compared to original datasets through classification metrics. The results show that DDPM consistently generated more realistic images with lower FID scores and significantly outperformed PGGANs in improving classification metrics across all models and datasets. Incorporating DDPM-generated images into the original datasets increased accuracy by up to 6%, enhancing model robustness and stability, particularly in imbalanced scenarios. Random Sampling demonstrated superior stability, while Greedy K Sampling offered diversity at the cost of higher FID scores. This study highlights the efficacy of DDPM in augmenting small, imbalanced medical image datasets, improving model performance by balancing the dataset and expanding its size.

Title: Stiefel Flow Matching for Moment-Constrained Structure Elucidation

Authors: Austin Cheng, Alston Lo, Kin Long Kelvin Lee, Santiago Miret, Alán Aspuru-Guzik
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2412.12540
Pdf URL: https://arxiv.org/pdf/2412.12540
Copy Paste: [[2412.12540]] Stiefel Flow Matching for Moment-Constrained Structure Elucidation(https://arxiv.org/abs/2412.12540)
Keywords: diffusion, generative
Abstract: Molecular structure elucidation is a fundamental step in understanding chemical phenomena, with applications in identifying molecules in natural products, lab syntheses, forensic samples, and the interstellar medium. We consider the task of predicting a molecule's all-atom 3D structure given only its molecular formula and moments of inertia, motivated by the ability of rotational spectroscopy to measure these moments. While existing generative models can conditionally sample 3D structures with approximately correct moments, this soft conditioning fails to leverage the many digits of precision afforded by experimental rotational spectroscopy. To address this, we first show that the space of $n$-atom point clouds with a fixed set of moments of inertia is embedded in the Stiefel manifold $\mathrm{St}(n, 4)$. We then propose Stiefel Flow Matching as a generative model for elucidating 3D structure under exact moment constraints. Additionally, we learn simpler and shorter flows by finding approximate solutions for equivariant optimal transport on the Stiefel manifold. Empirically, enforcing exact moment constraints allows Stiefel Flow Matching to achieve higher success rates and faster sampling than Euclidean diffusion models, even on high-dimensional manifolds corresponding to large molecules in the GEOM dataset.

Title: Consistent Diffusion: Denoising Diffusion Model with Data-Consistent Training for Image Restoration

Authors: Xinlong Cheng, Tiantian Cao, Guoan Cheng, Bangxuan Huang, Xinghan Tian, Ye Wang, Xiaoyu He, Weixin Li, Tianfan Xue, Xuan Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12550
Pdf URL: https://arxiv.org/pdf/2412.12550
Copy Paste: [[2412.12550]] Consistent Diffusion: Denoising Diffusion Model with Data-Consistent Training for Image Restoration(https://arxiv.org/abs/2412.12550)
Keywords: diffusion
Abstract: In this work, we address the limitations of denoising diffusion models (DDMs) in image restoration tasks, particularly the shape and color distortions that can compromise image quality. While DDMs have demonstrated promising performance in many applications such as text-to-image synthesis, their effectiveness in image restoration is often hindered by these distortions. We observe that these issues arise from inconsistencies between the training and testing data used by DDMs. Based on this observation, we propose a novel training method, named data-consistent training, which allows the DDMs to access images with accumulated errors during training, so that the model learns to correct these errors. Experimental results show that, across five image restoration tasks, our method achieves significant improvements over state-of-the-art methods while effectively minimizing distortions and preserving image fidelity.

Title: SAModified: A Foundation Model-Based Zero-Shot Approach for Refining Noisy Land-Use Land-Cover Maps

Authors: Sparsh Pekhale, Rakshith Sathish, Sathisha Basavaraju, Divya Sharma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12552
Pdf URL: https://arxiv.org/pdf/2412.12552
Copy Paste: [[2412.12552]] SAModified: A Foundation Model-Based Zero-Shot Approach for Refining Noisy Land-Use Land-Cover Maps(https://arxiv.org/abs/2412.12552)
Keywords: foundation model
Abstract: Land-use and land cover (LULC) analysis is critical in remote sensing, with wide-ranging applications across diverse fields such as agriculture, utilities, and urban planning. However, automating LULC map generation using machine learning is rendered challenging due to noisy labels. Typically, the ground truths (e.g. ESRI LULC, MapBioMass) have noisy labels that hamper the model's ability to learn to accurately classify the pixels. Further, these erroneous labels can significantly distort the performance metrics of a model, leading to misleading evaluations. Traditionally, the ambiguous labels are rectified using unsupervised algorithms. These algorithms struggle not only with scalability but also with generalization across different geographies. To overcome these challenges, we propose a zero-shot approach using the foundation model, Segment Anything Model (SAM), to automatically delineate different land parcels/regions and leverage them to relabel the unsure pixels by using the local label statistics within each detected region. We achieve a significant reduction in label noise and an improvement in the performance of the downstream segmentation model by $\approx 5\%$ when trained with denoised labels.

Title: Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers

Authors: Vaden Masrani, Mohammad Akbari, David Ming Xuan Yue, Ahmad Rezaei, Yong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12563
Pdf URL: https://arxiv.org/pdf/2412.12563
Copy Paste: [[2412.12563]] Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers(https://arxiv.org/abs/2412.12563)
Keywords: self-supervised
Abstract: In the era of costly pre-training of large language models, ensuring the intellectual property rights of model owners, and ensuring that said models are responsibly deployed, is becoming increasingly important. To this end, we propose model watermarking via passthrough layers, which are added to existing pre-trained networks and trained using a self-supervised loss such that the model produces high-entropy output when prompted with a unique private key, and acts normally otherwise. Unlike existing model watermarking methods, our method is fully task-agnostic, and can be applied to both classification and sequence-to-sequence tasks without requiring advanced access to downstream fine-tuning datasets. We evaluate the proposed passthrough layers on a wide range of downstream tasks, and show experimentally that our watermarking method achieves a near-perfect watermark extraction accuracy and false-positive rate in most cases without damaging original model performance. Additionally, we show our method is robust to downstream fine-tuning, fine-pruning, and layer removal attacks, and can be trained in a fraction of the time required to train the original model. Code is available in the paper.

Title: PBVS 2024 Solution: Self-Supervised Learning and Sampling Strategies for SAR Classification in Extreme Long-Tail Distribution

Authors: Yuhyun Kim, Minwoo Kim, Hyobin Park, Jinwook Jung, Dong-Geol Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12565
Pdf URL: https://arxiv.org/pdf/2412.12565
Copy Paste: [[2412.12565]] PBVS 2024 Solution: Self-Supervised Learning and Sampling Strategies for SAR Classification in Extreme Long-Tail Distribution(https://arxiv.org/abs/2412.12565)
Keywords: self-supervised
Abstract: The Multimodal Learning Workshop (PBVS 2024) aims to improve the performance of automatic target recognition (ATR) systems by leveraging both Synthetic Aperture Radar (SAR) data, which is difficult to interpret but remains unaffected by weather conditions and visible light, and Electro-Optical (EO) data for simultaneous learning. The subtask, known as the Multi-modal Aerial View Imagery Challenge - Classification, focuses on predicting the class label of a low-resolution aerial image based on a set of SAR-EO image pairs and their respective class labels. The provided dataset consists of SAR-EO pairs, characterized by a severe long-tail distribution with over a 1000-fold difference between the largest and smallest classes, making typical long-tail methods difficult to apply. Additionally, the domain disparity between the SAR and EO datasets complicates the effectiveness of standard multimodal methods. To address these significant challenges, we propose a two-stage learning approach that utilizes self-supervised techniques, combined with multimodal learning and inference through SAR-to-EO translation for effective EO utilization. In the final testing phase of the PBVS 2024 Multi-modal Aerial View Image Challenge - Classification (SAR Classification) task, our model achieved an accuracy of 21.45%, an AUC of 0.56, and a total score of 0.30, placing us 9th in the competition.

Title: ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12571
Pdf URL: https://arxiv.org/pdf/2412.12571
Copy Paste: [[2412.12571]] ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers(https://arxiv.org/abs/2412.12571)
Keywords: diffusion, in-context
Abstract: Recent research (arXiv:2410.15027, arXiv:2410.23775) has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building upon this foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive visual generation framework that leverages pretrained diffusion transformers in their original form, requiring no additional tuning, adapters, or modifications. Users can interact with ChatDiT to create interleaved text-image articles, multi-page picture books, edit images, design IP derivatives, or develop character design settings, all through free-form natural language across one or more conversational rounds. At its core, ChatDiT employs a multi-agent system comprising three key components: an Instruction-Parsing agent that interprets user-uploaded images and instructions, a Strategy-Planning agent that devises single-step or multi-step generation actions, and an Execution agent that performs these actions using an in-context toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench (arXiv:2412.11767), comprising 100 real-world design tasks and 275 cases with diverse instructions and varying numbers of input and target images. Despite its simplicity and training-free approach, ChatDiT surpasses all competitors, including those specifically designed and trained on extensive multi-task datasets. We further identify key limitations of pretrained DiTs in zero-shot adapting to tasks. We release all code, agents, results, and intermediate outputs to facilitate further research at this https URL

Title: Process-Supervised Reward Models for Clinical Note Generation: A Scalable Approach Guided by Domain Expertise

Authors: Hanyin Wang, Qiping Xu, Bolun Liu, Guleid Hussein, Hariprasad Korsapati, Mohamad El Labban, Kingsley Iheasirim, Mohamed Hassan, Gokhan Anil, Brian Bartlett, Jimeng Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12583
Pdf URL: https://arxiv.org/pdf/2412.12583
Copy Paste: [[2412.12583]] Process-Supervised Reward Models for Clinical Note Generation: A Scalable Approach Guided by Domain Expertise(https://arxiv.org/abs/2412.12583)
Keywords: generative
Abstract: Process-supervised reward models (PRMs), which verify large language model (LLM) outputs step-by-step, have achieved significant success in mathematical and coding problems. However, their application to other domains remains largely unexplored. In this work, we train a PRM to provide step-level reward signals for clinical notes generated by LLMs from patient-doctor dialogues. Guided by real-world clinician expertise, we carefully designed step definitions for clinical notes and utilized Gemini-Pro 1.5 to automatically generate process supervision data at scale. Our proposed PRM, trained on the LLaMA-3.1 8B instruct model, demonstrated superior performance compared to Gemini-Pro 1.5 and an outcome-supervised reward model (ORM) across two key evaluations: (1) the accuracy of selecting gold-reference samples from error-containing samples, achieving 98.8% (versus 61.3% for ORM and 93.8% for Gemini-Pro 1.5), and (2) the accuracy of selecting physician-preferred notes, achieving 56.2% (compared to 51.2% for ORM and 50.0% for Gemini-Pro 1.5). Additionally, we conducted ablation studies to determine optimal loss functions and data selection strategies, along with physician reader studies to explore predictors of downstream Best-of-N performance. Our promising results suggest the potential of PRMs to extend beyond the clinical domain, offering a scalable and effective solution for diverse generative tasks.

Title: A Simple and Efficient Baseline for Zero-Shot Generative Classification

Authors: Zipeng Qi, Buhua Liu, Shiyan Zhang, Bao Li, Zhiqiang Xu, Haoyi Xiong, Zeke Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12594
Pdf URL: https://arxiv.org/pdf/2412.12594
Copy Paste: [[2412.12594]] A Simple and Efficient Baseline for Zero-Shot Generative Classification(https://arxiv.org/abs/2412.12594)
Keywords: diffusion, generative
Abstract: Large diffusion models have become mainstream generative models in both academic studies and industrial AIGC applications. Recently, a number of works have further explored how to employ the power of large diffusion models as zero-shot classifiers. While recent zero-shot diffusion-based classifiers have made performance advances on benchmark datasets, they still suffer from extremely slow classification (e.g., ~1000 seconds to classify a single image on ImageNet). This slow speed largely rules out existing zero-shot diffusion-based classifiers for practical applications. In this paper, we propose an embarrassingly simple and efficient zero-shot Gaussian Diffusion Classifier (GDC) via pretrained text-to-image diffusion models and DINOv2. The proposed GDC can not only significantly surpass previous zero-shot diffusion-based classifiers by over 10 points (61.40% to 71.44%) on ImageNet, but also speed up classification of a single image on ImageNet by more than 30,000 times (1000 to 0.03 seconds). Additionally, it provides a probabilistic interpretation of the results. Our extensive experiments further demonstrate that GDC can achieve highly competitive zero-shot classification performance over various datasets and can promisingly self-improve with stronger diffusion models. To the best of our knowledge, the proposed GDC is the first zero-shot diffusion-based classifier that exhibits both competitive accuracy and practical efficiency.

Title: PO3AD: Predicting Point Offsets toward Better 3D Point Cloud Anomaly Detection

Authors: Jianan Ye, Weiguang Zhao, Xi Yang, Guangliang Cheng, Kaizhu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12617
Pdf URL: https://arxiv.org/pdf/2412.12617
Copy Paste: [[2412.12617]] PO3AD: Predicting Point Offsets toward Better 3D Point Cloud Anomaly Detection(https://arxiv.org/abs/2412.12617)
Keywords: anomaly
Abstract: Point cloud anomaly detection under the anomaly-free setting poses significant challenges as it requires accurately capturing the features of 3D normal data to identify deviations indicative of anomalies. Current efforts focus on devising reconstruction tasks, such as acquiring normal data representations by restoring normal samples from altered, pseudo-anomalous counterparts. Our findings reveal that distributing attention equally across normal and pseudo-anomalous data tends to dilute the model's focus on anomalous deviations. The challenge is further compounded by the inherently disordered and sparse nature of 3D point cloud data. In response to those predicaments, we introduce an innovative approach that emphasizes learning point offsets, targeting more informative pseudo-abnormal points, thus fostering more effective distillation of normal data representations. We also have crafted an augmentation technique that is steered by normal vectors, facilitating the creation of credible pseudo anomalies that enhance the efficiency of the training process. Our comprehensive experimental evaluation on the Anomaly-ShapeNet and Real3D-AD datasets evidences that our proposed method outperforms existing state-of-the-art approaches, achieving an average enhancement of 9.0% and 1.4% in the AUC-ROC detection metric across these datasets, respectively.

Title: Jailbreaking? One Step Is Enough!

Authors: Weixiong Zheng, Peijian Zeng, Yiwei Li, Hongyan Wu, Nankai Lin, Junhao Chen, Aimin Yang, Yongmei Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12621
Pdf URL: https://arxiv.org/pdf/2412.12621
Copy Paste: [[2412.12621]] Jailbreaking? One Step Is Enough!(https://arxiv.org/abs/2412.12621)
Keywords: in-context
Abstract: Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs. Examining jailbreak prompts helps uncover the shortcomings of LLMs. However, current jailbreak methods and the target model's defenses are engaged in an independent and adversarial process, resulting in the need for frequent attack iterations and redesigning attacks for different models. To address these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as the "defense" intention against harmful content. Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task. The attacking model considers that it is guiding the target model to deal with harmful content, while the target model thinks it is performing a defensive task, creating an illusion of cooperation between the two. Additionally, to enhance the model's confidence and guidance in "defensive" intentions, we adopt in-context learning (ICL) with a small number of attack examples and construct a corresponding dataset of attack examples. Extensive evaluations demonstrate that the REDA method enables cross-model attacks without the need to redesign attack strategies for different models, enables successful jailbreak in one iteration, and outperforms existing methods on both open-source and closed-source models.

Title: Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation

Authors: Andong Chen, Yuchen Song, Kehai Chen, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12627
Pdf URL: https://arxiv.org/pdf/2412.12627
Copy Paste: [[2412.12627]] Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation(https://arxiv.org/abs/2412.12627)
Keywords: diffusion
Abstract: Visual information has been introduced for enhancing machine translation (MT), and its effectiveness heavily relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence, thereby advancing multimodal MT. In particular, we build heuristic human feedback with reinforcement learning to ensure the consistency of the generated image with the source sentence without the supervision of image annotation, which breaks the bottleneck of using visual information in MT. Furthermore, the proposed method enables imaginative visual information to be integrated into large-scale text-only MT in addition to multimodal MT. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT, especially achieving an average improvement of more than 14 BLEU points on Multi30K multimodal MT benchmarks.

Title: RDPI: A Refine Diffusion Probability Generation Method for Spatiotemporal Data Imputation

Authors: Zijin Liu, Xiang Zhao, You Song
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12642
Pdf URL: https://arxiv.org/pdf/2412.12642
Copy Paste: [[2412.12642]] RDPI: A Refine Diffusion Probability Generation Method for Spatiotemporal Data Imputation(https://arxiv.org/abs/2412.12642)
Keywords: diffusion
Abstract: Spatiotemporal data imputation plays a crucial role in various fields such as traffic flow monitoring, air quality assessment, and climate prediction. However, spatiotemporal data collected by sensors often suffer from temporal incompleteness, and the sparse and uneven distribution of sensors leads to missing data in the spatial dimension. Among existing methods, autoregressive approaches are prone to error accumulation, while simple conditional diffusion models fail to adequately capture the spatiotemporal relationships between observed and missing data. To address these issues, we propose a novel two-stage Refined Diffusion Probability Imputation (RDPI) framework based on an initial network and a conditional diffusion model. In the initial stage, deterministic imputation methods are used to generate preliminary estimates of the missing data. In the refinement stage, residuals are treated as the diffusion target, and observed values are innovatively incorporated into the forward process. This results in a conditional diffusion model better suited for spatiotemporal data imputation, bridging the gap between the preliminary estimates and the true values. Experiments on multiple datasets demonstrate that RDPI not only achieves state-of-the-art imputation accuracy but also significantly reduces sampling computational costs.
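
The two-stage idea described above (a deterministic first guess, then a model that only has to learn the residual) can be sketched on a toy one-dimensional series. This is an illustration under stated assumptions, not the RDPI code: linear interpolation stands in for the paper's initial network, and `residual_targets` only shows what the refinement stage would be trained to predict. Both function names are hypothetical.

```python
def linear_impute(series):
    """Stage 1 stand-in: fill None gaps by linear interpolation.

    Gaps at either end copy the nearest observed value.
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = series[left] + t * (series[right] - series[left])
    return filled


def residual_targets(truth, preliminary, missing_mask):
    """Stage 2 target: truth minus preliminary estimate at held-out positions."""
    return [t - p if m else 0.0 for t, p, m in zip(truth, preliminary, missing_mask)]


if __name__ == "__main__":
    truth = [1.0, 2.5, 3.0]
    observed = [1.0, None, 3.0]
    prelim = linear_impute(observed)
    print(prelim, residual_targets(truth, prelim, [False, True, False]))
```

Because the residuals are typically much smaller in magnitude than the raw signal, a generative model has an easier target, which is one plausible reading of why the abstract reports lower sampling cost.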

Title: LLM-based Discriminative Reasoning for Knowledge Graph Question Answering

Authors: Mufan Xu, Kehai Chen, Xuefeng Bai, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12643
Pdf URL: https://arxiv.org/pdf/2412.12643
Copy Paste: [[2412.12643]] LLM-based Discriminative Reasoning for Knowledge Graph Question Answering(https://arxiv.org/abs/2412.12643)
Keywords: generative
Abstract: Large language models (LLMs) based on generative pre-trained Transformers have achieved remarkable performance on knowledge graph question-answering (KGQA) tasks. However, LLMs often produce ungrounded subgraph planning or reasoning results in KGQA due to the hallucinatory behavior brought by the generative paradigm, which may hinder the advancement of LLM-based KGQA models. To deal with this issue, we propose a novel LLM-based Discriminative Reasoning (LDR) method to explicitly model the subgraph retrieval and answer inference process. By adopting discriminative strategies, the proposed LDR method not only enhances the capability of LLMs to retrieve question-related subgraphs but also alleviates the issue of ungrounded reasoning brought by the generative paradigm of LLMs. Experimental results show that the proposed approach outperforms multiple strong comparison methods and achieves state-of-the-art performance on two widely used benchmarks, WebQSP and CWQ.

Title: Progressive Monitoring of Generative Model Training Evolution

Authors: Vidya Prasad, Anna Vilanova, Nicola Pezzotti
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.12755
Pdf URL: https://arxiv.org/pdf/2412.12755
Copy Paste: [[2412.12755]] Progressive Monitoring of Generative Model Training Evolution(https://arxiv.org/abs/2412.12755)
Keywords: generative
Abstract: While deep generative models (DGMs) have gained popularity, their susceptibility to biases and other inefficiencies that lead to undesirable outcomes remains an issue. With their growing complexity, there is a critical need for early detection of issues to achieve desired results and optimize resources. Hence, we introduce a progressive analysis framework to monitor the training process of DGMs. Our method utilizes dimensionality reduction techniques to facilitate the inspection of latent representations, the generated and real distributions, and their evolution across training iterations. This monitoring allows us to pause and fix the training method if the representations or distributions progress undesirably. This approach allows for the analysis of a model's training dynamics and the timely identification of biases and failures, minimizing computational loads. We demonstrate how our method supports identifying and mitigating biases early in training a Generative Adversarial Network (GAN) and improving the quality of the generated data distribution.

Title: Towards a Training Free Approach for 3D Scene Editing

Authors: Vivek Madhavaram, Shivangana Rawat, Chaitanya Devaguptapu, Charu Sharma, Manohar Kaul
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12766
Pdf URL: https://arxiv.org/pdf/2412.12766
Copy Paste: [[2412.12766]] Towards a Training Free Approach for 3D Scene Editing(https://arxiv.org/abs/2412.12766)
Keywords: diffusion, foundation model
Abstract: Text-driven diffusion models have shown remarkable capabilities in editing images. However, when editing 3D scenes, existing works mostly rely on training a NeRF for 3D editing. Recent NeRF editing methods perform edit operations by deploying 2D diffusion models and projecting these edits into 3D space. They require strong positional priors alongside the text prompt to identify the edit location. These methods operate on small 3D scenes and are tailored to a particular scene. They require training for each specific edit and cannot be used for real-time edits. To address these limitations, we propose a novel method, FreeEdit, to make edits in a training-free manner using mesh representations as a substitute for NeRF. Training-free methods are now a possibility because of advances in the foundation model space. We leverage these models to bring a training-free alternative and introduce solutions for insertion, replacement and deletion. We consider insertion, replacement and deletion as basic blocks for performing intricate edits with certain combinations of these operations. Given a text prompt and a 3D scene, our model is capable of identifying what object should be inserted, replaced or deleted and the location where the edit should be performed. We also introduce a novel algorithm as part of FreeEdit to find the optimal location on the grounding object for placement. We evaluate our model by comparing it with baseline models on a wide range of scenes using quantitative and qualitative metrics and showcase the merits of our method with respect to others.

Title: Guided and Variance-Corrected Fusion with One-shot Style Alignment for Large-Content Image Generation

Authors: Shoukun Sun, Min Xian, Tiankai Yao, Fei Xu, Luca Capriotti
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12771
Pdf URL: https://arxiv.org/pdf/2412.12771
Copy Paste: [[2412.12771]] Guided and Variance-Corrected Fusion with One-shot Style Alignment for Large-Content Image Generation(https://arxiv.org/abs/2412.12771)
Keywords: diffusion
Abstract: Producing large images using small diffusion models is gaining increasing popularity, as the cost of training large models could be prohibitive. A common approach involves jointly generating a series of overlapped image patches and obtaining large images by merging adjacent patches. However, results from existing methods often exhibit obvious artifacts, e.g., seams and inconsistent objects and styles. To address the issues, we proposed Guided Fusion (GF), which mitigates the negative impact from distant image regions by applying a weighted average to the overlapping regions. Moreover, we proposed Variance-Corrected Fusion (VCF), which corrects data variance at post-averaging, generating more accurate fusion for the Denoising Diffusion Probabilistic Model. Furthermore, we proposed a one-shot Style Alignment (SA), which generates a coherent style for large images by adjusting the initial input noise without adding extra computational burden. Extensive experiments demonstrated that the proposed fusion methods improved the quality of the generated image significantly. As a plug-and-play module, the proposed method can be widely applied to enhance other fusion-based methods for large image generation.
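
To make the fusion step concrete, here is a minimal sketch of weighted averaging over an overlapping region, together with the variance shrinkage that averaging causes. It is not the paper's formulation: the weights are arbitrary, and the correction assumes the overlapping values are independent with equal variance around a known mean, which is a simplification introduced for this example.

```python
import math


def weighted_fusion(values, weights):
    """Weighted average of the values that overlapping patches give one pixel."""
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def variance_ratio(weights):
    """Variance of the weighted average relative to one input's variance.

    Holds for independent, equal-variance inputs (an assumption here).
    """
    total = sum(weights)
    return sum(w * w for w in weights) / (total * total)


def variance_corrected_fusion(values, weights, mean=0.0):
    """Rescale the fused value around `mean` to undo the variance shrinkage."""
    fused = weighted_fusion(values, weights)
    scale = 1.0 / math.sqrt(variance_ratio(weights))
    return mean + (fused - mean) * scale


if __name__ == "__main__":
    # A pixel covered by two patches; the nearer patch gets more weight.
    print(weighted_fusion([2.0, 4.0], [1.0, 3.0]))
    print(variance_ratio([1.0, 1.0]))
```

With equal weights on two patches the averaged noise keeps only half its variance, which illustrates why a plain average can leave merged regions statistically different from what the denoiser expects.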

Title: Rethinking Diffusion-Based Image Generators for Fundus Fluorescein Angiography Synthesis on Limited Data

Authors: Chengzhou Yu (South China University of Technology), Huihui Fang (Pazhou Laboratory), Hongqiu Wang (The Hong Kong University of Science and Technology (Guangzhou)), Ting Deng (South China University of Technology), Qing Du (South China University of Technology), Yanwu Xu (South China University of Technology), Weihua Yang (Shenzhen Eye Hospital)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12778
Pdf URL: https://arxiv.org/pdf/2412.12778
Copy Paste: [[2412.12778]] Rethinking Diffusion-Based Image Generators for Fundus Fluorescein Angiography Synthesis on Limited Data(https://arxiv.org/abs/2412.12778)
Keywords: diffusion, generative
Abstract: Fundus imaging is a critical tool in ophthalmology, with different imaging modalities offering unique advantages. For instance, fundus fluorescein angiography (FFA) can accurately identify eye diseases. However, traditional invasive FFA involves the injection of sodium fluorescein, which can cause discomfort and risks. Generating corresponding FFA images from non-invasive fundus images holds significant practical value but also presents challenges. First, limited datasets constrain the performance and effectiveness of models. Second, previous studies have primarily focused on generating FFA for single diseases or single modalities, often resulting in poor performance for patients with various ophthalmic conditions. To address these issues, we propose a novel latent diffusion model-based framework, Diffusion, which introduces a fine-tuning protocol to overcome the challenge of limited medical data and unleash the generative capabilities of diffusion models. Furthermore, we designed a new approach to tackle the challenges of generating across different modalities and disease types. On limited datasets, our framework achieves state-of-the-art results compared to existing methods, offering significant potential to enhance ophthalmic diagnostics and patient care. Our code will be released soon to support further research in this field.

Title: Is it the end of (generative) linguistics as we know it?

Authors: Cristiano Chesi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12797
Pdf URL: https://arxiv.org/pdf/2412.12797
Copy Paste: [[2412.12797]] Is it the end of (generative) linguistics as we know it?(https://arxiv.org/abs/2412.12797)
Keywords: generative
Abstract: A significant debate has emerged in response to a paper written by Steven Piantadosi (Piantadosi, 2023) and uploaded to the LingBuzz platform, the open archive for generative linguistics. Piantadosi's dismissal of Chomsky's approach is ruthless, but generative linguists deserve it. In this paper, I will adopt three idealized perspectives -- computational, theoretical, and experimental -- to focus on two fundamental issues that lend partial support to Piantadosi's critique: (a) the evidence challenging the Poverty of Stimulus (PoS) hypothesis and (b) the notion of simplicity as conceived within mainstream Minimalism. In conclusion, I argue that, to reclaim a central role in language studies, generative linguistics -- representing a prototypical theoretical perspective on language -- needs a serious update leading to (i) more precise, consistent, and complete formalizations of foundational intuitions and (ii) the establishment and utilization of a standardized dataset of crucial empirical evidence to evaluate the theory's adequacy. On the other hand, ignoring the formal perspective leads to major drawbacks in both computational and experimental approaches. Neither descriptive nor explanatory adequacy can be easily achieved without the precise formulation of general principles that can be challenged empirically.

Title: ComprehendEdit: A Comprehensive Dataset and Evaluation Framework for Multimodal Knowledge Editing

Authors: Yaohui Ma, Xiaopeng Hong, Shizhou Zhang, Huiyun Li, Zhilin Zhu, Wei Luo, Zhiheng Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12821
Pdf URL: https://arxiv.org/pdf/2412.12821
Copy Paste: [[2412.12821]] ComprehendEdit: A Comprehensive Dataset and Evaluation Framework for Multimodal Knowledge Editing(https://arxiv.org/abs/2412.12821)
Keywords: in-context
Abstract: Large multimodal language models (MLLMs) have revolutionized natural language processing and visual understanding, but often contain outdated or inaccurate information. Current multimodal knowledge editing evaluations are limited in scope and potentially biased, focusing on narrow tasks and failing to assess the impact on in-domain samples. To address these issues, we introduce ComprehendEdit, a comprehensive benchmark comprising eight diverse tasks from multiple datasets. We propose two novel metrics: Knowledge Generalization Index (KGI) and Knowledge Preservation Index (KPI), which evaluate editing effects on in-domain samples without relying on AI-synthetic samples. Based on insights from our framework, we establish Hierarchical In-Context Editing (HICE), a baseline method employing a two-stage approach that balances performance across all metrics. This study provides a more comprehensive evaluation framework for multimodal knowledge editing, reveals unique challenges in this field, and offers a baseline method demonstrating improved performance. Our work opens new perspectives for future research and provides a foundation for developing more robust and effective editing techniques for MLLMs. The ComprehendEdit benchmark and implementation code are available at this https URL.

Title: Boosting Fine-Grained Visual Anomaly Detection with Coarse-Knowledge-Aware Adversarial Learning

Authors: Qingqing Fang, Qinliang Su, Wenxi Lv, Wenchao Xu, Jianxing Yu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12850
Pdf URL: https://arxiv.org/pdf/2412.12850
Copy Paste: [[2412.12850]] Boosting Fine-Grained Visual Anomaly Detection with Coarse-Knowledge-Aware Adversarial Learning(https://arxiv.org/abs/2412.12850)
Keywords: anomaly
Abstract: Many unsupervised visual anomaly detection methods train an auto-encoder to reconstruct normal samples and then leverage the reconstruction error map to detect and localize the anomalies. However, due to the powerful modeling and generalization ability of neural networks, some anomalies can also be well reconstructed, resulting in unsatisfactory detection and localization accuracy. In this paper, a small coarsely-labeled anomaly dataset is first collected. Then, a coarse-knowledge-aware adversarial learning method is developed to align the distribution of reconstructed features with that of normal features. The alignment can effectively suppress the auto-encoder's reconstruction ability on anomalies and thus improve the detection accuracy. Considering that anomalies often only occupy very small areas in anomalous images, a patch-level adversarial learning strategy is further developed. Although no patch-level anomalous information is available, we rigorously prove that by simply viewing any patch features from anomalous images as anomalies, the proposed knowledge-aware method can also align the distribution of reconstructed patch features with the normal ones. Experimental results on four medical datasets and two industrial datasets demonstrate the effectiveness of our method in improving the detection and localization performance.

Title: Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera

Authors: Zhengdi Yu, Stefanos Zafeiriou, Tolga Birdal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12861
Pdf URL: https://arxiv.org/pdf/2412.12861
Copy Paste: [[2412.12861]] Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera(https://arxiv.org/abs/2412.12861)
Keywords: generative
Abstract: We propose Dyn-HaMR, to the best of our knowledge, the first approach to reconstruct 4D global hand motion from monocular videos recorded by dynamic cameras in the wild. Reconstructing accurate 3D hand meshes from monocular videos is a crucial task for understanding human behaviour, with significant applications in augmented and virtual reality (AR/VR). However, existing methods for monocular hand reconstruction typically rely on a weak perspective camera model, which simulates hand motion within a limited camera frustum. As a result, these approaches struggle to recover the full 3D global trajectory and often produce noisy or incorrect depth estimations, particularly when the video is captured by dynamic or moving cameras, which is common in egocentric scenarios. Our Dyn-HaMR consists of a multi-stage, multi-objective optimization pipeline, that factors in (i) simultaneous localization and mapping (SLAM) to robustly estimate relative camera motion, (ii) an interacting-hand prior for generative infilling and to refine the interaction dynamics, ensuring plausible recovery under (self-)occlusions, and (iii) hierarchical initialization through a combination of state-of-the-art hand tracking methods. Through extensive evaluations on both in-the-wild and indoor datasets, we show that our approach significantly outperforms state-of-the-art methods in terms of 4D global mesh recovery. This establishes a new benchmark for hand motion reconstruction from monocular video with moving cameras. Our project page is at this https URL.

Title: ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction

Authors: Zhongjie Duan, Qianyi Zhao, Cen Chen, Daoyuan Chen, Wenmeng Zhou, Yaliang Li, Yingda Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12888
Pdf URL: https://arxiv.org/pdf/2412.12888
Copy Paste: [[2412.12888]] ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction(https://arxiv.org/abs/2412.12888)
Keywords: diffusion, generative
Abstract: The emergence of diffusion models has significantly advanced image synthesis. Recent studies of model interaction and self-corrective reasoning approaches in large language models offer new insights for enhancing text-to-image models. Inspired by these studies, we propose a novel method called ArtAug for enhancing text-to-image models. To the best of our knowledge, ArtAug is the first method that improves image synthesis models via interactions with understanding models. In the interactions, we leverage human preferences implicitly learned by image understanding models to provide fine-grained suggestions for image synthesis models. The interactions can modify the image content to make it aesthetically pleasing, such as adjusting exposure, changing shooting angles, and adding atmospheric effects. The enhancements brought by the interaction are iteratively fused into the synthesis model itself through an additional enhancement module. This enables the synthesis model to directly produce aesthetically pleasing images without any extra computational cost. In the experiments, we train the ArtAug enhancement module on existing text-to-image models. Various evaluation metrics consistently demonstrate that ArtAug enhances the generative capabilities of text-to-image models without incurring additional computational costs. The source code and models will be released publicly.

Title: An Agentic Approach to Automatic Creation of P&ID Diagrams from Natural Language Descriptions

Authors: Shreeyash Gowaikar, Srinivasan Iyengar, Sameer Segal, Shivkumar Kalyanaraman
Subjects: cs.LG, cs.CE, cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2412.12898
Pdf URL: https://arxiv.org/pdf/2412.12898
Copy Paste: [[2412.12898]] An Agentic Approach to Automatic Creation of P&ID Diagrams from Natural Language Descriptions(https://arxiv.org/abs/2412.12898)
Keywords: generative
Abstract: Piping and Instrumentation Diagrams (P&IDs) are foundational to the design, construction, and operation of workflows in the engineering and process industries. However, their manual creation is often labor-intensive and error-prone, and it lacks robust mechanisms for error detection and correction. While recent advancements in Generative AI, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs), have demonstrated significant potential across various domains, their application in automating the generation of engineering workflows remains underexplored. In this work, we introduce a novel copilot for automating the generation of P&IDs from natural language descriptions. Leveraging a multi-step agentic workflow, our copilot provides a structured and iterative approach to diagram creation directly from natural language prompts. We demonstrate the feasibility of the generation process by evaluating the soundness and completeness of the workflow, and show improved results compared to vanilla zero-shot and few-shot generation approaches.

Title: Unsupervised Region-Based Image Editing of Denoising Diffusion Models

Authors: Zixiang Li, Yue Song, Renshuai Tao, Xiaohong Jia, Yao Zhao, Wei Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12912
Pdf URL: https://arxiv.org/pdf/2412.12912
Copy Paste: [[2412.12912]] Unsupervised Region-Based Image Editing of Denoising Diffusion Models(https://arxiv.org/abs/2412.12912)
Keywords: diffusion
Abstract: Although diffusion models have achieved remarkable success in the field of image generation, their latent space remains under-explored. Current methods for identifying semantics within latent space often rely on external supervision, such as textual information and segmentation masks. In this paper, we propose a method to identify semantic attributes in the latent space of pre-trained diffusion models without any further training. By projecting the Jacobian of the targeted semantic region into a low-dimensional subspace which is orthogonal to the non-masked regions, our approach facilitates precise semantic discovery and control over local masked areas, eliminating the need for annotations. We conducted extensive experiments across multiple datasets and various architectures of diffusion models, achieving state-of-the-art performance. In particular, for some specific face attributes, the performance of our proposed method even surpasses that of supervised approaches, demonstrating its superior ability in editing local image properties.

Title: Synthetic Data Generation for Anomaly Detection on Table Grapes

Authors: Ionut Marian Motoi, Valerio Belli, Alberto Carpineto, Daniele Nardi, Thomas Alessandro Ciarfuglia
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.12949
Pdf URL: https://arxiv.org/pdf/2412.12949
Copy Paste: [[2412.12949]] Synthetic Data Generation for Anomaly Detection on Table Grapes(https://arxiv.org/abs/2412.12949)
Keywords: anomaly
Abstract: Early detection of illnesses and pest infestations in fruit cultivation is critical for maintaining yield quality and plant health. Computer vision and robotics are increasingly employed for the automatic detection of such issues, particularly using data-driven solutions. However, the rarity of these problems makes acquiring and processing the necessary data to train such algorithms a significant obstacle. One solution to this scarcity is the generation of synthetic high-quality anomalous samples. While numerous methods exist for this task, most require highly trained individuals for setup. This work addresses the challenge of generating synthetic anomalies in an automatic fashion that requires only an initial collection of normal and anomalous samples from the user - a task that is straightforward for farmers. We demonstrate the approach in the context of table grape cultivation. Specifically, based on the observation that normal berries present relatively smooth surfaces, while defects result in more complex textures, we introduce a Dual-Canny Edge Detection (DCED) filter. This filter emphasizes the additional texture indicative of diseases, pest infestations, or other defects. Using segmentation masks provided by the Segment Anything Model, we then select and seamlessly blend anomalous berries onto normal ones. We show that the proposed dataset augmentation technique improves the accuracy of an anomaly classifier for table grapes and that the approach can be generalized to other fruit types.

Title: Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

Authors: Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.12953
Pdf URL: https://arxiv.org/pdf/2412.12953
Copy Paste: [[2412.12953]] Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning(https://arxiv.org/abs/2412.12953)
Keywords: diffusion
Abstract: Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models are becoming larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at this https URL.
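
One way to read "noise-conditioned routing" with "expert caching" is that the router looks only at the noise level, so the chosen experts for a given level can be computed once and reused. The sketch below illustrates that reading only; the router weights, the log-sigma embedding, and the top-2 selection are hypothetical and are not taken from the MoDE code.

```python
import math
from functools import lru_cache

# Hypothetical router: one weight row per expert, applied to (log sigma, 1.0).
ROUTER_WEIGHTS = [
    (1.0, 0.0),    # expert 0: prefers high noise
    (-1.0, 0.0),   # expert 1: prefers low noise
    (0.0, 0.25),   # expert 2: noise-independent bias
    (0.5, 0.0),    # expert 3: mildly prefers high noise
]


def noise_embedding(sigma):
    """Toy embedding of the noise level (an assumption for this sketch)."""
    return (math.log(sigma), 1.0)


@lru_cache(maxsize=None)
def route(sigma, top_k=2):
    """Return the indices of the top-k experts for a noise level.

    The result depends only on sigma, so lru_cache acts as the expert cache:
    repeated denoising steps at the same level skip the router entirely.
    """
    features = noise_embedding(sigma)
    logits = [sum(w * f for w, f in zip(row, features)) for row in ROUTER_WEIGHTS]
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return tuple(ranked[:top_k])


if __name__ == "__main__":
    for sigma in (10.0, 0.1, 10.0):
        print(sigma, route(sigma))
    print(route.cache_info())
```

Since a sampler visits a fixed, small set of noise levels, a cache like this removes the routing computation after the first pass, which is the kind of saving the abstract attributes to expert caching.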

Title: ArchesWeather & ArchesWeatherGen: a deterministic and generative model for efficient ML weather forecasting

Authors: Guillaume Couairon, Renu Singh, Anastase Charantonis, Christian Lessig, Claire Monteleoni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.12971
Pdf URL: https://arxiv.org/pdf/2412.12971
Copy Paste: [[2412.12971]] ArchesWeather & ArchesWeatherGen: a deterministic and generative model for efficient ML weather forecasting(https://arxiv.org/abs/2412.12971)
Keywords: diffusion, generative
Abstract: Weather forecasting plays a vital role in today's society, from agriculture and logistics to predicting the output of renewable energies, and preparing for extreme weather events. Deep learning weather forecasting models trained with the next state prediction objective on ERA5 have shown great success compared to numerical global circulation models. However, for a wide range of applications, being able to provide representative samples from the distribution of possible future weather states is critical. In this paper, we propose a methodology to leverage deterministic weather models in the design of probabilistic weather models, leading to improved performance and reduced computing costs. We first introduce \textbf{ArchesWeather}, a transformer-based deterministic model that improves upon Pangu-Weather by removing overrestrictive inductive priors. We then design a probabilistic weather model called \textbf{ArchesWeatherGen} based on flow matching, a modern variant of diffusion models, that is trained to project ArchesWeather's predictions to the distribution of ERA5 weather states. ArchesWeatherGen is a true stochastic emulator of ERA5 and surpasses IFS ENS and NeuralGCM on all WeatherBench headline variables (except for NeuralGCM's geopotential). Our work also aims to democratize the use of deterministic and generative machine learning models in weather forecasting research, with academic computing resources. All models are trained at 1.5° resolution, with a training budget of $\sim$9 V100 days for ArchesWeather and $\sim$45 V100 days for ArchesWeatherGen. For inference, ArchesWeatherGen generates 15-day weather trajectories at a rate of 1 minute per ensemble member on an A100 GPU card. To make our work fully reproducible, our code and models are open source, including the complete pipeline for data preparation, training, and evaluation, at this https URL.

Title: Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance

Authors: Wenhao Sun, Benlei Cui, Jingqun Tang, Xue-Mei Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12974
Pdf URL: https://arxiv.org/pdf/2412.12974
Copy Paste: [[2412.12974]] Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance(https://arxiv.org/abs/2412.12974)
Keywords: diffusion, generative
Abstract: Recently, diffusion models have emerged as promising newcomers in the field of generative models, shining brightly in image generation. However, when employed for object removal tasks, they still encounter issues such as generating random artifacts and the incapacity to repaint foreground object areas with appropriate content after removal. To tackle these problems, we propose Attentive Eraser, a tuning-free method to empower pre-trained diffusion models for stable and effective object removal. Firstly, in light of the observation that the self-attention maps influence the structure and shape details of the generated images, we propose Attention Activation and Suppression (ASS), which re-engineers the self-attention mechanism within the pre-trained diffusion models based on the given mask, thereby prioritizing the background over the foreground object during the reverse generation process. Moreover, we introduce Self-Attention Redirection Guidance (SARG), which utilizes the self-attention redirected by ASS to guide the generation process, effectively removing foreground objects within the mask while simultaneously generating content that is both plausible and coherent. Experiments demonstrate the stability and effectiveness of Attentive Eraser in object removal across a variety of pre-trained diffusion models, outperforming even training-based methods. Furthermore, Attentive Eraser can be implemented in various diffusion model architectures and checkpoints, enabling excellent scalability. Code is available at this https URL.

Title: Unlocking LLMs: Addressing Scarce Data and Bias Challenges in Mental Health

Authors: Vivek Kumar, Eirini Ntoutsi, Pushpraj Singh Rajawat, Giacomo Medda, Diego Reforgiato Recupero
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.12981
Pdf URL: https://arxiv.org/pdf/2412.12981
Copy Paste: [[2412.12981]] Unlocking LLMs: Addressing Scarce Data and Bias Challenges in Mental Health(https://arxiv.org/abs/2412.12981)
Keywords: in-context
Abstract: Large language models (LLMs) have shown promising capabilities in healthcare analysis but face several challenges like hallucinations, parroting, and bias manifestation. These challenges are exacerbated in complex, sensitive, and low-resource domains. Therefore, in this work we introduce IC-AnnoMI, an expert-annotated motivational interviewing (MI) dataset built upon AnnoMI by generating in-context conversational dialogues leveraging LLMs, particularly ChatGPT. IC-AnnoMI employs targeted prompts accurately engineered through cues and tailored information, taking into account therapy style (empathy, reflection), contextual relevance, and false semantic change. Subsequently, the dialogues are annotated by experts, strictly adhering to the Motivational Interviewing Skills Code (MISC), focusing on both the psychological and linguistic dimensions of MI dialogues. We comprehensively evaluate the IC-AnnoMI dataset and ChatGPT's emotional reasoning ability and understanding of domain intricacies by modeling novel classification tasks employing several classical machine learning and current state-of-the-art transformer approaches. Finally, we discuss the effects of progressive prompting strategies and the impact of augmented data in mitigating the biases manifested in IC-AnnoMI. Our contributions provide the MI community with not only a comprehensive dataset but also valuable insights for using LLMs in empathetic text generation for conversational therapy in supervised settings.

Title: A New Adversarial Perspective for LiDAR-based 3D Object Detection

Authors: Shijun Zheng, Weiquan Liu, Yu Guo, Yu Zang, Siqi Shen, Cheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13017
Pdf URL: https://arxiv.org/pdf/2412.13017
Copy Paste: [[2412.13017]] A New Adversarial Perspective for LiDAR-based 3D Object Detection(https://arxiv.org/abs/2412.13017)
Keywords: generative
Abstract: Autonomous vehicles (AVs) rely on LiDAR sensors for environmental perception and decision-making in driving scenarios. However, ensuring the safety and reliability of AVs in complex environments remains a pressing challenge. To address this issue, we introduce a real-world dataset (ROLiD) comprising LiDAR-scanned point clouds of two random objects: water mist and smoke. In this paper, we introduce a novel adversarial perspective by proposing an attack framework that utilizes water mist and smoke to simulate environmental interference. Specifically, we propose a point cloud sequence generation method using a motion and content decomposition generative adversarial network named PCS-GAN to simulate the distribution of random objects. Furthermore, leveraging the simulated LiDAR scanning characteristics implemented with Range Image, we examine the effects of introducing random object perturbations at various positions on the target vehicle. Extensive experiments demonstrate that adversarial perturbations based on random objects effectively deceive vehicle detection and reduce the recognition rate of 3D object detection models.

Title: Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach

Authors: Hugo Math, Rainer Lienhart, Robin Schön
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.13041
Pdf URL: https://arxiv.org/pdf/2412.13041
Copy Paste: [[2412.13041]] Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach(https://arxiv.org/abs/2412.13041)
Keywords: self-supervised
Abstract: In this paper, we draw an analogy between processing natural languages and processing multivariate event streams from vehicles in order to predict $\textit{when}$ and $\textit{what}$ error pattern is most likely to occur in the future for a given car. Our approach leverages the temporal dynamics and contextual relationships of our event data from a fleet of cars. Event data is composed of discrete values of error codes as well as continuous values such as time and mileage. Modelled by two causal Transformers, we can anticipate vehicle failures and malfunctions before they happen. Thus, we introduce $\textit{CarFormer}$, a Transformer model trained via a new self-supervised learning strategy, and $\textit{EPredictor}$, an autoregressive Transformer decoder model capable of predicting $\textit{when}$ and $\textit{what}$ error pattern will most likely occur after some error code apparition. Despite the challenges of high cardinality of event types, their unbalanced frequency of appearance and limited labelled data, our experimental results demonstrate the excellent predictive ability of our novel model. Specifically, with sequences of $160$ error codes on average, our model is able with only half of the error codes to achieve $80\%$ F1 score for predicting $\textit{what}$ error pattern will occur and achieves an average absolute error of $58.4 \pm 13.2$h $\textit{when}$ forecasting the time of occurrence, thus enabling confident predictive maintenance and enhancing vehicle safety.

Title: CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval

Authors: Mohammad Mahdi Abootorabi, Ehsaneddin Asgari
Subjects: cs.CL, cs.IR, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.13071
Pdf URL: https://arxiv.org/pdf/2412.13071
Copy Paste: [[2412.13071]] CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval(https://arxiv.org/abs/2412.13071)
Keywords: self-supervised
Abstract: This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP's audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based retrieval approaches in specific scenarios.
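
Illustration: the retrieval metrics named in the abstract can all be derived from the rank of the correct item for each query. The sketch below assumes 1-based ranks and assumes that meanR denotes the mean rank; the function name `retrieval_metrics` and the example ranks are hypothetical, not taken from the paper.

```python
def retrieval_metrics(ranks):
    """Compute HITS@1, MRR, and mean rank from 1-based ranks.

    ranks[i] is the position of the correct item in the ranked result
    list for query i (1 means the correct item was retrieved first).
    """
    n = len(ranks)
    hits_at_1 = sum(1 for r in ranks if r == 1) / n   # fraction ranked first
    mrr = sum(1.0 / r for r in ranks) / n             # mean reciprocal rank
    mean_rank = sum(ranks) / n                        # lower is better
    return hits_at_1, mrr, mean_rank


# Hypothetical ranks for four queries.
hits_at_1, mrr, mean_rank = retrieval_metrics([1, 2, 1, 4])
```

Higher HITS@1 and MRR indicate better retrieval, while a lower mean rank is better, which is worth keeping in mind when comparing the three numbers.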

Title: Prompt Augmentation for Self-supervised Text-guided Image Manipulation

Authors: Rumeysa Bodur, Binod Bhattarai, Tae-Kyun Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13081
Pdf URL: https://arxiv.org/pdf/2412.13081
Copy Paste: [[2412.13081]] Prompt Augmentation for Self-supervised Text-guided Image Manipulation(https://arxiv.org/abs/2412.13081)
Keywords: diffusion, self-supervised
Abstract: Text-guided image editing finds applications in various creative and practical fields. While recent studies in image generation have advanced the field, they often struggle with the dual challenges of coherent image transformation and context preservation. In response, our work introduces prompt augmentation, a method amplifying a single input prompt into several target prompts, strengthening textual context and enabling localised image editing. Specifically, we use the augmented prompts to delineate the intended manipulation area. We propose a Contrastive Loss tailored to driving effective image editing by displacing edited areas and drawing preserved regions closer. Acknowledging the continuous nature of image manipulations, we further refine our approach by incorporating the similarity concept, creating a Soft Contrastive Loss. The new losses are incorporated into the diffusion model, demonstrating improved or competitive image editing results on public datasets and generated images over state-of-the-art approaches.

Title: F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

Authors: Lu Liu, Huiyu Duan, Qiang Hu, Liu Yang, Chunlei Cai, Tianxiao Ye, Huayu Liu, Xiaoyun Zhang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13155
Pdf URL: https://arxiv.org/pdf/2412.13155
Copy Paste: [[2412.13155]] F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration(https://arxiv.org/abs/2412.13155)
Keywords: generative
Abstract: Artificial intelligence generative models exhibit remarkable capabilities in content creation, particularly in face image generation, customization, and restoration. However, current AI-generated faces (AIGFs) often fall short of human preferences due to unique distortions, unrealistic details, and unexpected identity shifts, underscoring the need for a comprehensive quality evaluation framework for AIGFs. To address this need, we introduce FaceQ, a large-scale, comprehensive database of AI-generated Face images with fine-grained Quality annotations reflecting human preferences. The FaceQ database comprises 12,255 images generated by 29 models across three tasks: (1) face generation, (2) face customization, and (3) face restoration. It includes 32,742 mean opinion scores (MOSs) from 180 annotators, assessed across multiple dimensions: quality, authenticity, identity (ID) fidelity, and text-image correspondence. Using the FaceQ database, we establish F-Bench, a benchmark for comparing and evaluating face generation, customization, and restoration models, highlighting strengths and weaknesses across various prompts and evaluation dimensions. Additionally, we assess the performance of existing image quality assessment (IQA), face quality assessment (FQA), AI-generated content image quality assessment (AIGCIQA), and preference evaluation metrics, manifesting that these standard metrics are relatively ineffective in evaluating authenticity, ID fidelity, and text-image correspondence. The FaceQ database will be publicly available upon publication.

Title: Locate n' Rotate: Two-stage Openable Part Detection with Foundation Model Priors

Authors: Siqi Li, Xiaoxue Chen, Haoyu Cheng, Guyue Zhou, Hao Zhao, Guanzhong Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13173
Pdf URL: https://arxiv.org/pdf/2412.13173
Copy Paste: [[2412.13173]] Locate n' Rotate: Two-stage Openable Part Detection with Foundation Model Priors(https://arxiv.org/abs/2412.13173)
Keywords: foundation model
Abstract: Detecting the openable parts of articulated objects is crucial for downstream applications in intelligent robotics, such as pulling a drawer. This task poses a multitasking challenge due to the necessity of understanding object categories and motion. Most existing methods are either category-specific or trained on specific datasets, lacking generalization to unseen environments and objects. In this paper, we propose a Transformer-based Openable Part Detection (OPD) framework named Multi-feature Openable Part Detection (MOPD) that incorporates perceptual grouping and geometric priors, outperforming previous methods. In the first stage of the framework, we introduce a perceptual grouping feature model that provides perceptual grouping feature priors for openable part detection, enhancing detection results through a cross-attention mechanism. In the second stage, a geometric understanding feature model offers geometric feature priors for predicting motion parameters. Compared to existing methods, our proposed approach shows better performance in both detection and motion parameter prediction. Codes and models are publicly available at this https URL.

Title: Move-in-2D: 2D-Conditioned Human Motion Generation

Authors: Hsin-Ping Huang, Yang Zhou, Jui-Hsien Wang, Difan Liu, Feng Liu, Ming-Hsuan Yang, Zhan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13185
Pdf URL: https://arxiv.org/pdf/2412.13185
Copy Paste: [[2412.13185]] Move-in-2D: 2D-Conditioned Human Motion Generation(https://arxiv.org/abs/2412.13185)
Keywords: diffusion
Abstract: Generating realistic human videos remains a challenging task, with the most effective methods currently relying on a human motion sequence as a control signal. Existing approaches often use existing motion extracted from other videos, which restricts applications to specific motion types and global scene matching. We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image, allowing for diverse motion that adapts to different scenes. Our approach utilizes a diffusion model that accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene. To train this model, we collect a large-scale video dataset featuring single-human activities, annotating each video with the corresponding human motion as the target output. Experiments demonstrate that our method effectively predicts human motion that aligns with the scene image after projection. Furthermore, we show that the generated motion sequence improves human motion quality in video synthesis tasks.

Title: StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models

Authors: Yunzhi Yan, Zhen Xu, Haotong Lin, Haian Jin, Haoyu Guo, Yida Wang, Kun Zhan, Xianpeng Lang, Hujun Bao, Xiaowei Zhou, Sida Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13188
Pdf URL: https://arxiv.org/pdf/2412.13188
Copy Paste: [[2412.13188]] StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models(https://arxiv.org/abs/2412.13188)
Keywords: diffusion, generative
Abstract: This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. Recent advancements in neural scene representation have achieved notable success in rendering high-quality autonomous driving scenes, but the performance significantly degrades as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions, which fully exploits the generative prior for novel view synthesis, while preserving precise camera control. Moreover, the utilization of pixel-level LiDAR conditions allows us to make accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on Waymo Open Dataset and PandaSet demonstrate that our model enables flexible control over viewpoint changes, enlarging the view synthesis regions for satisfying rendering, which outperforms existing methods.

Title: GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

Authors: Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tianwei Lin, Zhizhong Su, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13193
Pdf URL: https://arxiv.org/pdf/2412.13193
Copy Paste: [[2412.13193]] GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding(https://arxiv.org/abs/2412.13193)
Keywords: self-supervised, foundation model
Abstract: 3D Semantic Occupancy Prediction is fundamental for spatial understanding as it provides a comprehensive semantic cognition of surrounding environments. However, prevalent approaches primarily rely on extensive labeled data and computationally intensive voxel-based modeling, restricting the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner. Through aligning rendered Gaussian features with diverse knowledge from pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance, achieving 11.70 mIoU while reducing training duration by approximately 50%. These experimental results highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents. Code is available at this https URL.
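
Illustration: the mIoU figure quoted in the abstract is a mean intersection-over-union across semantic classes, usually reported as a percentage. The sketch below shows the basic computation on flat label lists; the function name `mean_iou` and the toy labels are hypothetical, and the benchmark's exact protocol (for example ignored classes or visibility masks) is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        # Voxels where prediction and ground truth both equal class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Voxels where either prediction or ground truth equals class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical per-voxel class labels for a six-voxel scene.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is 0.5, which would be reported as 50.0 mIoU on a percentage scale.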

Title: Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

Authors: Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.13194
Pdf URL: https://arxiv.org/pdf/2412.13194
Copy Paste: [[2412.13194]] Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents(https://arxiv.org/abs/2412.13194)
Keywords: foundation model
Abstract: The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the generalization capability of foundation models. Such a generalist agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited due to the quantity and diversity of human-annotated instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator, an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice with context information of the environment such as user demos or even just the name of the website itself for Internet-browsing agents. Then, the agent policy attempts those tasks with thoughts and actual grounded operations in the real world with resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and this http URL. To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalizes to real-world human-annotated benchmarks with SOTA performances. Our open-source checkpoints and code can be found at this https URL.

Title: CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

Authors: Gaoyang Zhang, Bingtao Fu, Qingnan Fan, Qi Zhang, Runxing Liu, Hong Gu, Huaqi Zhang, Xinguo Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13195
Pdf URL: https://arxiv.org/pdf/2412.13195
Copy Paste: [[2412.13195]] CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models(https://arxiv.org/abs/2412.13195)
Keywords: diffusion
Abstract: Text-to-image diffusion models excel at generating photorealistic images, but commonly struggle to render accurate spatial relationships described in text prompts. We identify two core issues underlying this common failure: 1) the ambiguous nature of spatial-related data in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We address these issues with CoMPaSS, a versatile training framework that enhances spatial understanding of any T2I diffusion model. CoMPaSS solves the ambiguity of spatial-related data with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially-accurate training data through a set of principled spatial constraints. To better exploit the curated high-quality spatial priors, CoMPaSS further introduces a Token ENcoding ORdering (TENOR) module, effectively compensating for the shortcoming of text encoders. Extensive experiments on four popular open-weight T2I diffusion models covering both UNet- and MMDiT-based architectures demonstrate the effectiveness of CoMPaSS by setting a new state of the art with substantial relative gains across well-known benchmarks on spatial relationships generation, including VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%). Code will be available at this https URL.