2024-11-19

Title: Boundary Attention Constrained Zero-Shot Layout-To-Image Generation

Authors: Huancheng Chen, Jingtao Li, Weiming Zhuang, Haris Vikalo, Lingjuan Lyu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10495
Pdf URL: https://arxiv.org/pdf/2411.10495
Copy Paste: [[2411.10495]] Boundary Attention Constrained Zero-Shot Layout-To-Image Generation(https://arxiv.org/abs/2411.10495)
Keywords: diffusion
Abstract: Recent text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting. To address these challenges, several studies developed layout-to-image (L2I) approaches that incorporate layout instructions into text-to-image models. However, existing L2I methods typically require either fine-tuning pretrained parameters or training additional control modules for the diffusion models. In this work, we propose a novel zero-shot L2I approach, BACON (Boundary Attention Constrained generation), which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions, and then compute loss functions to optimize latent features during the diffusion reverse process. To enhance spatial controllability and mitigate semantic failures in complex layout instructions, we leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features. Comprehensive experimental results on both L2I and non-L2I pretrained diffusion models demonstrate that our method outperforms existing zero-shot L2I techniques both quantitatively and qualitatively in terms of image composition on the DrawBench and HRS benchmarks.

Title: Prompt-Guided Environmentally Consistent Adversarial Patch

Authors: Chaoqun Li, Huanqian Yan, Lifeng Zhou, Tairan Chen, Zhuodong Liu, Hang Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10498
Pdf URL: https://arxiv.org/pdf/2411.10498
Copy Paste: [[2411.10498]] Prompt-Guided Environmentally Consistent Adversarial Patch(https://arxiv.org/abs/2411.10498)
Keywords: diffusion
Abstract: Adversarial attacks in the physical world pose a significant threat to the security of vision-based systems, such as facial recognition and autonomous driving. Existing adversarial patch methods primarily focus on improving attack performance, but they often produce patches that are easily detectable by humans and struggle to achieve environmental consistency, i.e., blending patches into the environment. This paper introduces a novel approach for generating adversarial patches, which addresses both the visual naturalness and environmental consistency of the patches. We propose Prompt-Guided Environmentally Consistent Adversarial Patch (PG-ECAP), a method that aligns the patch with the environment to ensure seamless integration into the environment. The approach leverages diffusion models to generate patches that are both environmentally consistent and effective in evading detection. To further enhance the naturalness and consistency, we introduce two alignment losses: Prompt Alignment Loss and Latent Space Alignment Loss, ensuring that the generated patch maintains its adversarial properties while fitting naturally within its environment. Extensive experiments in both digital and physical domains demonstrate that PG-ECAP outperforms existing methods in attack success rate and environmental consistency.

Title: FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on

Authors: Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Chengming Xu, Jinlong Peng, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10499
Pdf URL: https://arxiv.org/pdf/2411.10499
Copy Paste: [[2411.10499]] FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on(https://arxiv.org/abs/2411.10499)
Keywords: diffusion
Abstract: Although image-based virtual try-on has made considerable progress, emerging approaches still encounter challenges in producing high-fidelity and robust fitting images across diverse scenarios. These methods often struggle with issues such as texture-aware maintenance and size-aware fitting, which hinder their overall effectiveness. To address these limitations, we propose a novel garment perception enhancement technique, termed FitDiT, designed for high-fidelity virtual try-on using Diffusion Transformers (DiT) allocating more parameters and attention to high-resolution features. First, to further improve texture-aware maintenance, we introduce a garment texture extractor that incorporates garment priors evolution to fine-tune garment features, helping to better capture rich details such as stripes, patterns, and text. Additionally, we introduce frequency-domain learning by customizing a frequency distance loss to enhance high-frequency garment details. To tackle the size-aware fitting issue, we employ a dilated-relaxed mask strategy that adapts to the correct length of garments, preventing the generation of garments that fill the entire mask area during cross-category try-on. Equipped with the above design, FitDiT surpasses all baselines in both qualitative and quantitative evaluations. It excels in producing well-fitting garments with photorealistic and intricate details, while also achieving competitive inference times of 4.57 seconds for a single 1024x768 image after DiT structure slimming, outperforming existing methods.

Title: OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models

Authors: Mathis Koroglu, Hugo Caselles-Dupré, Guillaume Jeanneret Sanmiguel, Matthieu Cord
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10501
Pdf URL: https://arxiv.org/pdf/2411.10501
Copy Paste: [[2411.10501]] OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models(https://arxiv.org/abs/2411.10501)
Keywords: diffusion
Abstract: We consider the problem of text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on providing user-defined controls, such as binary masks or camera movement embeddings. We propose OnlyFlow, an approach that leverages the optical flow first extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented through an optical flow estimation model applied on the input video, which is then fed to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow compares favorably to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.

Title: Everything is a Video: Unifying Modalities through Next-Frame Prediction

Authors: G. Thomas Hudson, Dean Slack, Thomas Winterbottom, Jamie Sterling, Chenghao Xiao, Junjie Shentu, Noura Al Moubayed
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10503
Pdf URL: https://arxiv.org/pdf/2411.10503
Copy Paste: [[2411.10503]] Everything is a Video: Unifying Modalities through Next-Frame Prediction(https://arxiv.org/abs/2411.10503)
Keywords: foundation model
Abstract: Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder scalability and flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and effective knowledge transfer across tasks. Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text, demonstrating the model's ability to generalize across modalities with minimal adaptation. We show that task reformulation can significantly simplify multimodal model design across various tasks, laying the groundwork for more generalized multimodal foundation models.

Title: DR-BFR: Degradation Representation with Diffusion Models for Blind Face Restoration

Authors: Xinmin Qiu, Bonan Li, Zicheng Zhang, Congying Han, Tiande Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10508
Pdf URL: https://arxiv.org/pdf/2411.10508
Copy Paste: [[2411.10508]] DR-BFR: Degradation Representation with Diffusion Models for Blind Face Restoration(https://arxiv.org/abs/2411.10508)
Keywords: diffusion
Abstract: Blind face restoration (BFR) is fundamentally challenged by the extensive range of degradation types and degrees that impact model generalization. Recent advancements in diffusion models have made considerable progress in this field. Nevertheless, a critical limitation is their lack of awareness of specific degradation, leading to potential issues such as unnatural details and inaccurate textures. In this paper, we equip diffusion models with the capability to decouple various degradation as a degradation prompt from low-quality (LQ) face images via unsupervised contrastive learning with reconstruction loss, and demonstrate that this capability significantly improves performance, particularly in terms of the naturalness of the restored images. Our novel restoration scheme, named DR-BFR, guides the denoising of Latent Diffusion Models (LDM) by incorporating Degradation Representation (DR) and content features from LQ images. DR-BFR comprises two modules: 1) Degradation Representation Module (DRM): This module extracts degradation representation with content-irrelevant features from LQ faces and estimates a reasonable distribution in the degradation space through contrastive learning and a specially designed LQ reconstruction. 2) Latent Diffusion Restoration Module (LDRM): This module perceives both degradation features and content features in the latent space, enabling the restoration of high-quality images from LQ inputs. Our experiments demonstrate that the proposed DR-BFR significantly outperforms state-of-the-art methods quantitatively and qualitatively across various datasets. The DR effectively distinguishes between various degradations in blind face inverse problems and provides a reasonably powerful prompt to LDM.

Title: SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Authors: Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.10510
Pdf URL: https://arxiv.org/pdf/2411.10510
Copy Paste: [[2411.10510]] SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers(https://arxiv.org/abs/2411.10510)
Keywords: diffusion, generative
Abstract: Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.
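Note (illustrative, not from the paper): the abstract describes caching and reusing layer outputs across adjacent timesteps based on calibration errors. A minimal sketch of that general idea follows. The function names, the per-step error list, and the simple "reuse when the error is below a threshold" rule are all assumptions made for illustration; the abstract does not specify SmoothCache's actual decision rule.

```python
from typing import Callable, List


def build_cache_schedule(calib_errors: List[float], threshold: float) -> List[bool]:
    """Decide, per timestep, whether to reuse the previous layer output.

    calib_errors[t] is a hypothetical relative change of the layer output
    between step t-1 and step t, measured on a calibration set.
    Index 0 has no predecessor, so the first step is always computed.
    """
    schedule = [False]
    for err in calib_errors[1:]:
        schedule.append(err < threshold)  # assumed rule: small change -> reuse
    return schedule


def run_layer_with_cache(
    layer: Callable[[float], float],
    inputs: List[float],
    schedule: List[bool],
) -> List[float]:
    """Evaluate `layer` over timesteps, reusing the cached output when scheduled."""
    outputs = []
    cached = None
    for x, reuse in zip(inputs, schedule):
        if reuse and cached is not None:
            out = cached          # skip the expensive call
        else:
            out = layer(x)        # recompute and refresh the cache
            cached = out
        outputs.append(out)
    return outputs
```

In this toy form, every reused step saves one layer evaluation, at the cost of feeding a slightly stale output forward.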

Title: On the Privacy Risk of In-context Learning

Authors: Haonan Duan, Adam Dziedzic, Mohammad Yaghini, Nicolas Papernot, Franziska Boenisch
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2411.10512
Pdf URL: https://arxiv.org/pdf/2411.10512
Copy Paste: [[2411.10512]] On the Privacy Risk of In-context Learning(https://arxiv.org/abs/2411.10512)
Keywords: in-context
Abstract: Large language models (LLMs) are excellent few-shot learners. They can perform a wide variety of tasks purely based on natural language prompts provided to them. These prompts contain data of a specific downstream task -- often the private dataset of a party, e.g., a company that wants to leverage the LLM for their purposes. We show that deploying prompted models presents a significant privacy risk for the data used within the prompt by instantiating a highly effective membership inference attack. We also observe that the privacy risk of prompted models exceeds fine-tuned models at the same utility levels. After identifying the model's sensitivity to their prompts -- in the form of a significantly higher prediction confidence on the prompted data -- as a cause for the increased risk, we propose ensembling as a mitigation strategy. By aggregating over multiple different versions of a prompted model, membership inference risk can be decreased.
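Note (illustrative, not from the paper): the abstract proposes aggregating over multiple versions of a prompted model to reduce membership inference risk. The sketch below assumes the aggregation is a plain average of class probabilities, which the abstract does not state; the function name and the numbers are hypothetical.

```python
from typing import List


def ensemble_probs(per_prompt_probs: List[List[float]]) -> List[float]:
    """Average class probabilities over several prompted models.

    per_prompt_probs[i][c] is the probability that prompted model i
    assigns to class c for one query. Averaging is an assumed
    aggregation rule chosen for illustration.
    """
    n_models = len(per_prompt_probs)
    n_classes = len(per_prompt_probs[0])
    return [
        sum(probs[c] for probs in per_prompt_probs) / n_models
        for c in range(n_classes)
    ]


# Hypothetical query that appears in only the first model's prompt:
# that model is overconfident, the other two are not.
averaged = ensemble_probs([[0.9, 0.1], [0.5, 0.5], [0.4, 0.6]])
```

The point of the toy numbers is that the unusually high confidence of the one model that saw the example in its prompt is diluted in the average, which weakens the confidence signal a membership inference attack relies on.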

Title: Any2Any: Incomplete Multimodal Retrieval with Conformal Prediction

Authors: Po-han Li, Yunhao Yang, Mohammad Omama, Sandeep Chinchali, Ufuk Topcu
Subjects: cs.CV, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2411.10513
Pdf URL: https://arxiv.org/pdf/2411.10513
Copy Paste: [[2411.10513]] Any2Any: Incomplete Multimodal Retrieval with Conformal Prediction(https://arxiv.org/abs/2411.10513)
Keywords: generative
Abstract: Autonomous agents perceive and interpret their surroundings by integrating multimodal inputs, such as vision, audio, and LiDAR. These perceptual modalities support retrieval tasks, such as place recognition in robotics. However, current multimodal retrieval systems encounter difficulties when parts of the data are missing due to sensor failures or inaccessibility, such as silent videos or LiDAR scans lacking RGB information. We propose Any2Any, a novel retrieval framework that addresses scenarios where both query and reference instances have incomplete modalities. Unlike previous methods limited to the imputation of two modalities, Any2Any handles any number of modalities without training generative models. It calculates pairwise similarities with cross-modal encoders and employs a two-stage calibration process with conformal prediction to align the similarities. Any2Any enables effective retrieval across multimodal datasets, e.g., text-LiDAR and text-time series. It achieves a Recall@5 of 35% on the KITTI dataset, which is on par with baseline models with complete modalities.

Title: "On the goals of linguistic theory": Revisiting Chomskyan theories in the era of AI

Authors: Eva Portelance, Masoud Jasbi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10533
Pdf URL: https://arxiv.org/pdf/2411.10533
Copy Paste: [[2411.10533]] "On the goals of linguistic theory": Revisiting Chomskyan theories in the era of AI(https://arxiv.org/abs/2411.10533)
Keywords: generative
Abstract: Theoretical linguistics seeks to explain what human language is, and why. Linguists and cognitive scientists have proposed different theoretical models of what language is, as well as cognitive factors that shape it, and allow humans to 'produce', 'understand', and 'acquire' natural languages. However, humans may no longer be the only ones learning to 'generate', 'parse', and 'learn' natural language: artificial intelligence (AI) models such as large language models are proving to have impressive linguistic capabilities. Many are thus questioning what role, if any, such models should play in helping theoretical linguistics reach its ultimate research goals? In this paper, we propose to answer this question, by reiterating the tenets of generative linguistics, a leading school of thought in the field, and by considering how AI models as theories of language relate to each of these important concepts. Specifically, we consider three foundational principles, finding roots in the early works of Noam Chomsky: (1) levels of theoretical adequacy; (2) procedures for linguistic theory development; (3) language learnability and Universal Grammar. In our discussions of each principle, we give special attention to two types of AI models: neural language models and neural grammar induction models. We will argue that such models, in particular neural grammar induction models, do have a role to play, but that this role is largely modulated by the stance one takes regarding each of these three guiding principles.

Title: Does Prompt Formatting Have Any Impact on LLM Performance?

Authors: Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X Wang, Sadid Hasan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10541
Pdf URL: https://arxiv.org/pdf/2411.10541
Copy Paste: [[2411.10541]] Does Prompt Formatting Have Any Impact on LLM Performance?(https://arxiv.org/abs/2411.10541)
Keywords: in-context
Abstract: In the realm of Large Language Models (LLMs), prompt optimization is crucial for model performance. Although previous research has explored aspects like rephrasing prompt contexts, using various prompting techniques (like in-context learning and chain-of-thought), and ordering few-shot examples, our understanding of LLM sensitivity to prompt templates remains limited. Therefore, this paper examines the impact of different prompt templates on LLM performance. We formatted the same contexts into various human-readable templates, including plain text, Markdown, JSON, and YAML, and evaluated their impact across tasks like natural language reasoning, code generation, and translation using OpenAI's GPT models. Experiments show that GPT-3.5-turbo's performance varies by up to 40% in a code translation task depending on the prompt template, while larger models like GPT-4 are more robust to these variations. Our analysis highlights the need to reconsider the use of fixed prompt templates, as different formats can significantly affect model performance.
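Note (illustrative, not from the paper): the abstract describes rendering the same context into plain text, Markdown, JSON, and YAML templates. The sketch below shows one way such renderings could be produced. The field names and template layouts are hypothetical and are not the paper's templates; the YAML branch emits simple `key: "value"` lines by hand rather than using a YAML library.

```python
import json
from typing import Dict


def render_prompt(fields: Dict[str, str], fmt: str) -> str:
    """Render the same prompt fields in one of four assumed template styles."""
    if fmt == "plain":
        return "\n".join(f"{key}: {value}" for key, value in fields.items())
    if fmt == "markdown":
        return "\n\n".join(f"## {key}\n{value}" for key, value in fields.items())
    if fmt == "json":
        return json.dumps(fields, indent=2)
    if fmt == "yaml":
        # JSON-style double-quoted strings are used as YAML scalars here.
        return "\n".join(f"{key}: {json.dumps(value)}" for key, value in fields.items())
    raise ValueError(f"unknown format: {fmt}")


# Hypothetical prompt content, identical across all four renderings.
example_fields = {
    "task": "Translate the function from Python to Java.",
    "input": "def add(a, b): return a + b",
}
rendered = {fmt: render_prompt(example_fields, fmt)
            for fmt in ("plain", "markdown", "json", "yaml")}
```

Each rendered string carries the same information, so any difference in model output between them can be attributed to formatting alone.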

Title: SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism

Authors: Priyansh Bhatnagar, Linfeng Wen, Mingu Kang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.10543
Pdf URL: https://arxiv.org/pdf/2411.10543
Copy Paste: [[2411.10543]] SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism(https://arxiv.org/abs/2411.10543)
Keywords: generative
Abstract: Extensive efforts have been made to boost the performance in the domain of language models by introducing various attention-based transformers. However, the inclusion of linear layers with large dimensions contributes to significant computational and memory overheads. The escalating computational demands of these models necessitate the development of various compression techniques to ensure their deployment on devices, particularly in resource-constrained environments. In this paper, we propose a novel compression methodology that dynamically determines the rank of each layer using a soft thresholding mechanism, which clips the singular values with a small magnitude in a differentiable form. This approach automates the decision-making process to identify the optimal degree of compression for each layer. We have successfully applied the proposed technique to attention-based architectures, including BERT for discriminative tasks and GPT2 and TinyLlama for generative tasks. Additionally, we have validated our method on Mamba, a recently proposed state-space model. Our experiments demonstrate that the proposed technique achieves a speed-up of 1.33X to 1.72X in the encoder/decoder with a 50% reduction in total parameters.
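Note (illustrative, not from the paper): the abstract mentions a soft thresholding mechanism that clips small singular values to determine each layer's rank. The sketch below uses the standard soft-thresholding operator max(s - tau, 0) on a list of singular values. Whether the paper uses exactly this form, and how the threshold tau is learned, are not stated in the abstract; the function names and values are hypothetical.

```python
from typing import List


def soft_threshold(singular_values: List[float], tau: float) -> List[float]:
    """Shrink non-negative singular values by tau, clipping at zero."""
    return [max(s - tau, 0.0) for s in singular_values]


def effective_rank(singular_values: List[float], tau: float) -> int:
    """Count singular values that survive soft thresholding."""
    return sum(1 for s in soft_threshold(singular_values, tau) if s > 0.0)


# Hypothetical singular values of one layer's weight matrix.
sigma = [5.0, 2.0, 0.3, 0.1]
shrunk = soft_threshold(sigma, 0.5)
rank = effective_rank(sigma, 0.5)
```

With these toy values, the two smallest singular values fall below the threshold and are zeroed, so the layer could be stored as a rank-2 factorization instead of rank 4.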

Title: Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera

Authors: Jaewoo Heo, Kuan-Chieh Wang, Karen Liu, Serena Yeung-Levy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10582
Pdf URL: https://arxiv.org/pdf/2411.10582
Copy Paste: [[2411.10582]] Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera(https://arxiv.org/abs/2411.10582)
Keywords: diffusion
Abstract: Motion capture technologies have transformed numerous fields, from the film and gaming industries to sports science and healthcare, by providing a tool to capture and analyze human movement in great detail. The holy grail in the topic of monocular global human mesh and motion reconstruction (GHMR) is to achieve accuracy on par with traditional multi-view capture on any monocular videos captured with a dynamic camera, in-the-wild. This is a challenging task as the monocular input has inherent depth ambiguity, and the moving camera adds additional complexity as the rendered human motion is now a product of both human and camera movement. Not accounting for this confusion, existing GHMR methods often output motions that are unrealistic, e.g. unaccounted root translation of the human causes foot sliding. We present DiffOpt, a novel 3D global HMR method using Diffusion Optimization. Our key insight is that recent advances in human motion generation, such as the motion diffusion model (MDM), contain a strong prior of coherent human motion. The core of our method is to optimize the initial motion reconstruction using the MDM prior. This step can lead to more globally coherent human motion. Our optimization jointly optimizes the motion prior loss and reprojection loss to correctly disentangle the human and camera motions. We validate DiffOpt with video sequences from the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild (EMDB) and Egobody, and demonstrate superior global human motion recovery capability over other state-of-the-art global HMR methods most prominently in long video settings.

Title: Drift-Resilient TabPFN: In-Context Learning Temporal Distribution Shifts on Tabular Data

Authors: Kai Helli, David Schnurr, Noah Hollmann, Samuel Müller, Frank Hutter
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.10634
Pdf URL: https://arxiv.org/pdf/2411.10634
Copy Paste: [[2411.10634]] Drift-Resilient TabPFN: In-Context Learning Temporal Distribution Shifts on Tabular Data(https://arxiv.org/abs/2411.10634)
Keywords: in-context
Abstract: While most ML models expect independent and identically distributed data, this assumption is often violated in real-world scenarios due to distribution shifts, resulting in the degradation of machine learning model performance. Until now, no tabular method has consistently outperformed classical supervised learning, which ignores these shifts. To address temporal distribution shifts, we present Drift-Resilient TabPFN, a fresh approach based on In-Context Learning with a Prior-Data Fitted Network that learns the learning algorithm itself: it accepts the entire training dataset as input and makes predictions on the test set in a single forward pass. Specifically, it learns to approximate Bayesian inference on synthetic datasets drawn from a prior that specifies the model's inductive bias. This prior is based on structural causal models (SCM), which gradually shift over time. To model shifts of these causal models, we use a secondary SCM, that specifies changes in the primary model parameters. The resulting Drift-Resilient TabPFN can be applied to unseen data, runs in seconds on small to moderately sized datasets and needs no hyperparameter tuning. Comprehensive evaluations across 18 synthetic and real-world datasets demonstrate large performance improvements over a wide range of baselines, such as XGB, CatBoost, TabPFN, and applicable methods featured in the Wild-Time benchmark. Compared to the strongest baselines, it improves accuracy from 0.688 to 0.744 and ROC AUC from 0.786 to 0.832 while maintaining stronger calibration. This approach could serve as significant groundwork for further research on out-of-distribution prediction.

Title: IntentGPT: Few-shot Intent Discovery with Large Language Models

Authors: Juan A. Rodriguez, Nicholas Botzer, David Vazquez, Christopher Pal, Marco Pedersoli, Issam Laradji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10670
Pdf URL: https://arxiv.org/pdf/2411.10670
Copy Paste: [[2411.10670]] IntentGPT: Few-shot Intent Discovery with Large Language Models(https://arxiv.org/abs/2411.10670)
Keywords: in-context
Abstract: In today's digitally driven world, dialogue systems play a pivotal role in enhancing user interactions, from customer service to virtual assistants. In these dialogues, it is important to identify users' goals automatically to resolve their needs promptly. This has necessitated the integration of models that perform Intent Detection. However, users' intents are diverse and dynamic, making it challenging to maintain a fixed set of predefined intents. As a result, a more practical approach is to develop a model capable of identifying new intents as they emerge. We address the challenge of Intent Discovery, an area that has drawn significant attention in recent research efforts. Existing methods need to train on a substantial amount of data for correctly identifying new intents, demanding significant human effort. To overcome this, we introduce IntentGPT, a novel training-free method that effectively prompts Large Language Models (LLMs) such as GPT-4 to discover new intents with minimal labeled data. IntentGPT comprises an In-Context Prompt Generator, which generates informative prompts for In-Context Learning, an Intent Predictor for classifying and discovering user intents from utterances, and a Semantic Few-Shot Sampler that selects relevant few-shot examples and a set of known intents to be injected into the prompt. Our experiments show that IntentGPT outperforms previous methods that require extensive domain-specific data and fine-tuning, in popular benchmarks, including CLINC and BANKING, among others.

Title: From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling

Authors: Jinhong Lin, Cheng-En Wu, Huanran Li, Jifan Zhang, Yu Hen Hu, Pedro Morgado
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10685
Pdf URL: https://arxiv.org/pdf/2411.10685
Copy Paste: [[2411.10685]] From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling(https://arxiv.org/abs/2411.10685)
Keywords: self-supervised
Abstract: Masked Image Modeling (MIM) has emerged as a powerful self-supervised learning paradigm for visual representation learning, enabling models to acquire rich visual representations by predicting masked portions of images from their visible regions. While this approach has shown promising results, we hypothesize that its effectiveness may be limited by optimization challenges during early training stages, where models are expected to learn complex image distributions from partial observations before developing basic visual processing capabilities. To address this limitation, we propose a prototype-driven curriculum learning framework that structures the learning process to progress from prototypical examples to more complex variations in the dataset. Our approach introduces a temperature-based annealing scheme that gradually expands the training distribution, enabling more stable and efficient learning trajectories. Through extensive experiments on ImageNet-1K, we demonstrate that our curriculum learning strategy significantly improves both training efficiency and representation quality while requiring substantially fewer training epochs compared to standard Masked Auto-Encoding. Our findings suggest that carefully controlling the order of training examples plays a crucial role in self-supervised visual learning, providing a practical solution to the early-stage optimization challenges in MIM.

Title: MaskMedPaint: Masked Medical Image Inpainting with Diffusion Models for Mitigation of Spurious Correlations

Authors: Qixuan Jin, Walter Gerych, Marzyeh Ghassemi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10686
Pdf URL: https://arxiv.org/pdf/2411.10686
Copy Paste: [[2411.10686]] MaskMedPaint: Masked Medical Image Inpainting with Diffusion Models for Mitigation of Spurious Correlations(https://arxiv.org/abs/2411.10686)
Keywords: diffusion
Abstract: Spurious features associated with class labels can lead image classifiers to rely on shortcuts that don't generalize well to new domains. This is especially problematic in medical settings, where biased models fail when applied to different hospitals or systems. In such cases, data-driven methods to reduce spurious correlations are preferred, as clinicians can directly validate the modified images. While Denoising Diffusion Probabilistic Models (Diffusion Models) show promise for natural images, they are impractical for medical use due to the difficulty of describing spurious medical features. To address this, we propose Masked Medical Image Inpainting (MaskMedPaint), which uses text-to-image diffusion models to augment training images by inpainting areas outside key classification regions to match the target domain. We demonstrate that MaskMedPaint enhances generalization to target domains across both natural (Waterbirds, iWildCam) and medical (ISIC 2018, Chest X-ray) datasets, given limited unlabeled target images.

Title: Diffusion-based Layer-wise Semantic Reconstruction for Unsupervised Out-of-Distribution Detection

Authors: Ying Yang, De Cheng, Chaowei Fang, Yubiao Wang, Changzhe Jiao, Lechao Cheng, Nannan Wang
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.10701
Pdf URL: https://arxiv.org/pdf/2411.10701
Copy Paste: [[2411.10701]] Diffusion-based Layer-wise Semantic Reconstruction for Unsupervised Out-of-Distribution Detection(https://arxiv.org/abs/2411.10701)
Keywords: diffusion, generative
Abstract: Unsupervised out-of-distribution (OOD) detection aims to identify out-of-domain data by learning only from unlabeled In-Distribution (ID) training samples, which is crucial for developing a safe real-world machine learning system. Current reconstruction-based methods provide a good alternative approach by measuring the reconstruction error between the input and its corresponding generative counterpart in the pixel/feature space. However, such generative methods face a key dilemma: improving the reconstruction power of the generative model while keeping a compact representation of the ID data. To address this issue, we propose the diffusion-based layer-wise semantic reconstruction approach for unsupervised OOD detection. The innovation of our approach is that we leverage the diffusion model's intrinsic data reconstruction ability to distinguish ID samples from OOD samples in the latent feature space. Moreover, to set up a comprehensive and discriminative feature representation, we devise a multi-layer semantic feature extraction strategy. By distorting the extracted features with Gaussian noise and applying the diffusion model for feature reconstruction, the separation of ID and OOD samples is implemented according to the reconstruction errors. Extensive experimental results on multiple benchmarks built upon various datasets demonstrate that our method achieves state-of-the-art performance in terms of detection accuracy and speed. Code is available at

Title: A Regularized LSTM Method for Detecting Fake News Articles

Authors: Tanjina Sultana Camelia, Faizur Rahman Fahim, Md. Musfique Anwar
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2411.10713
Pdf URL: https://arxiv.org/pdf/2411.10713
Copy Paste: [[2411.10713]] A Regularized LSTM Method for Detecting Fake News Articles(https://arxiv.org/abs/2411.10713)
Keywords: diffusion
Abstract: Nowadays, the rapid diffusion of fake news poses a significant problem, as it can spread misinformation and confusion. This paper aims to develop an advanced machine learning solution for detecting fake news articles. Leveraging a comprehensive dataset of news articles, including 23,502 fake news articles and 21,417 accurate news articles, we implemented and evaluated three machine-learning models. Our dataset, curated from diverse sources, provides rich textual content categorized into title, text, subject, and Date features. These features are essential for training robust classification models to distinguish between fake and authentic news articles. The initial model employed a Long Short-Term Memory (LSTM) network, achieving an accuracy of 94%. The second model improved upon this by incorporating additional regularization techniques and fine-tuning hyperparameters, resulting in a 97% accuracy. The final model combined the strengths of previous architectures with advanced optimization strategies, achieving a peak accuracy of 98%. These results demonstrate the effectiveness of our approach in identifying fake news with high precision. Implementing these models showcases significant advancements in natural language processing and machine learning techniques, contributing valuable tools for combating misinformation. Our work highlights the potential for deploying such models in real-world applications, providing a reliable method for automated fake news detection and enhancing the credibility of news dissemination.

Title: Multi Scale Graph Neural Network for Alzheimer's Disease

Authors: Anya Chauhan, Ayush Noori, Zhaozhi Li, Yingnan He, Michelle M Li, Marinka Zitnik, Sudeshna Das
Subjects: cs.LG, q-bio.NC, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.10720
Pdf URL: https://arxiv.org/pdf/2411.10720
Copy Paste: [[2411.10720]] Multi Scale Graph Neural Network for Alzheimer's Disease(https://arxiv.org/abs/2411.10720)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a complex, progressive neurodegenerative disorder characterized by extracellular Aβ plaques, neurofibrillary tau tangles, glial activation, and neuronal degeneration, involving multiple cell types and pathways. Current models often overlook the cellular context of these pathways. To address this, we developed a multiscale graph neural network (GNN) model, ALZ PINNACLE, using brain omics data from donors spanning the entire aging to AD spectrum. ALZ PINNACLE is based on the PINNACLE GNN framework, which learns context-aware protein, cell type, and tissue representations within a unified latent space. ALZ PINNACLE was trained on 14,951 proteins, 206,850 protein interactions, 7 cell types, and 48 cell subtypes or states. After pretraining, we investigated the learned embedding of APOE, the largest genetic risk factor for AD, across different cell types. Notably, APOE embeddings showed high similarity in microglial, neuronal, and CD8 cells, suggesting a similar role of APOE in these cell types. Fine tuning the model on AD risk genes revealed cell type contexts predictive of the role of APOE in AD. Our results suggest that ALZ PINNACLE may provide a valuable framework for uncovering novel insights into AD neurobiology.

Title: On-device Anomaly Detection in Conveyor Belt Operations

Authors: Luciano S. Martinez-Rau, Yuxuan Zhang, Bengt Oelmann, Sebastian Bader
Subjects: cs.LG, cs.CE, eess.SP
Abstract URL: https://arxiv.org/abs/2411.10729
Pdf URL: https://arxiv.org/pdf/2411.10729
Copy Paste: [[2411.10729]] On-device Anomaly Detection in Conveyor Belt Operations(https://arxiv.org/abs/2411.10729)
Keywords: anomaly
Abstract: Mining 4.0 leverages advancements in automation, digitalization, and interconnected technologies from Industry 4.0 to address the unique challenges of the mining sector, enhancing efficiency, safety, and sustainability. Conveyor belts are crucial in mining operations by enabling the continuous and efficient movement of bulk materials over long distances, which directly impacts productivity. While detecting anomalies in specific conveyor belt components, such as idlers, pulleys, and belt surfaces, has been widely studied, identifying the root causes of these failures remains critical due to factors like changing production conditions and operator errors. Continuous monitoring of mining conveyor belt work cycles for anomaly detection is still at an early stage and requires robust solutions. This study proposes two distinctive pattern recognition approaches for real-time anomaly detection in the operational cycles of mining conveyor belts, combining feature extraction, threshold-based cycle detection, and tiny machine-learning classification. Both approaches outperformed a state-of-the-art technique on two datasets for duty cycle classification in terms of F1-scores. The first approach, with 97.3% and 80.2% for normal and abnormal cycles, respectively, reaches the highest performance in the first dataset while the second approach excels on the second dataset, scoring 91.3% and 67.9%. Implemented on two low-power microcontrollers, the methods demonstrated efficient, real-time operation with energy consumption of 13.3 and 20.6 μJ during inference. These results offer valuable insights for detecting mechanical failure sources, supporting targeted preventive maintenance, and optimizing production cycles.

Title: TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition

Authors: Jeonghyeok Do, Munchurl Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10745
Pdf URL: https://arxiv.org/pdf/2411.10745
Copy Paste: [[2411.10745]] TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition(https://arxiv.org/abs/2411.10745)
Keywords: diffusion, generative
Abstract: We present the first diffusion-based method for zero-shot action recognition from skeleton inputs. In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. Previous methods focus on direct alignment between skeleton and text latent spaces, but the modality gaps between these spaces hinder robust generalization learning. Motivated by the remarkable performance of text-to-image diffusion models, we leverage their alignment capabilities between different modalities mostly by focusing on the training process during reverse diffusion rather than using their generative power. Based on this, our framework is designed as a Triplet Diffusion for Skeleton-Text Matching (TDSM) method which aligns skeleton features with text prompts through reverse diffusion, embedding the prompts into the unified skeleton-text latent space to achieve robust matching. To enhance discriminative power, we introduce a novel triplet diffusion (TD) loss that encourages our TDSM to favor correct skeleton-text matches while pushing apart incorrect ones. Our TDSM significantly outperforms very recent state-of-the-art methods by large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.

Title: Steam Turbine Anomaly Detection: An Unsupervised Learning Approach Using Enhanced Long Short-Term Memory Variational Autoencoder

Authors: Weiming Xu, Peng Zhang
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.10765
Pdf URL: https://arxiv.org/pdf/2411.10765
Copy Paste: [[2411.10765]] Steam Turbine Anomaly Detection: An Unsupervised Learning Approach Using Enhanced Long Short-Term Memory Variational Autoencoder(https://arxiv.org/abs/2411.10765)
Keywords: anomaly
Abstract: As core thermal power generation equipment, steam turbines incur significant expenses and adverse effects on operation when facing interruptions like downtime, maintenance, and damage. Accurate anomaly detection is the prerequisite for ensuring the safe and stable operation of steam turbines. However, challenges in steam turbine anomaly detection, including inherent anomalies, lack of temporal information analysis, and high-dimensional data complexity, limit the effectiveness of existing methods. To address these challenges, we propose an Enhanced Long Short-Term Memory Variational Autoencoder using Deep Advanced Features and Gaussian Mixture Model (ELSTMVAE-DAF-GMM) for precise unsupervised anomaly detection in unlabeled datasets. Specifically, LSTMVAE, integrating LSTM with VAE, is used to project high-dimensional time-series data to a low-dimensional phase space. The Deep Autoencoder-Local Outlier Factor (DAE-LOF) sample selection mechanism is used to eliminate inherent anomalies during training, further improving the model's precision and reliability. The novel deep advanced features (DAF) hybridize latent embeddings and reconstruction discrepancies from the LSTMVAE model and provide a more comprehensive data representation within a continuous and structured phase space, significantly enhancing anomaly detection by synergizing temporal dynamics with data pattern variations. These DAF are incorporated into GMM to ensure robust and effective unsupervised anomaly detection. We use real operating data from industrial steam turbines and conduct both comparison and ablation experiments, demonstrating superior anomaly detection outcomes characterized by high accuracy and minimal false alarm rates compared with existing methods.

Title: Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer

Authors: Shitong Shao, Zikai Zhou, Tian Ye, Lichen Bai, Zhiqiang Xu, Zeke Xie
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10781
Pdf URL: https://arxiv.org/pdf/2411.10781
Copy Paste: [[2411.10781]] Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer(https://arxiv.org/abs/2411.10781)
Keywords: diffusion, generative
Abstract: Text-to-image diffusion models (DMs) develop at an unprecedented pace, supported by thorough theoretical exploration and empirical analysis. Unfortunately, the discrepancy between DMs and autoregressive models (ARMs) complicates the path toward achieving the goal of unified vision and language generation. Recently, the masked generative Transformer (MGT) has emerged as a promising intermediary between DM and ARM by predicting randomly masked image tokens (i.e., masked image modeling), combining the efficiency of DM with the discrete token nature of ARM. However, we find that comprehensive analyses of inference for MGT are virtually non-existent, and thus we aim to present positive design choices to fill this gap. We modify and re-design a set of DM-based inference techniques for MGT and further elucidate their performance on MGT. We also discuss an approach to correcting the token distribution to enhance inference. Extensive experiments and empirical analyses lead to concrete and effective design choices, and these design choices can be merged to achieve further performance gains. For instance, in terms of enhanced inference, we achieve winning rates of approximately 70% compared to vanilla sampling on HPS v2 with the recent SOTA MGT Meissonic. Our contributions have the potential to further enhance the capabilities and future development of MGTs.

Title: C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation

Authors: Jeonghyeok Do, Jaehyup Lee, Munchurl Kim
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.10788
Pdf URL: https://arxiv.org/pdf/2411.10788
Copy Paste: [[2411.10788]] C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation(https://arxiv.org/abs/2411.10788)
Keywords: diffusion
Abstract: Synthetic Aperture Radar (SAR) imagery provides robust environmental and temporal coverage (e.g., during clouds, seasons, day-night cycles), yet its noise and unique structural patterns pose interpretation challenges, especially for non-experts. SAR-to-EO (Electro-Optical) image translation (SET) has emerged to make SAR images more perceptually interpretable. However, traditional approaches trained from scratch on limited SAR-EO datasets are prone to overfitting. To address these challenges, we introduce Confidence Diffusion for SAR-to-EO Translation, called C-DiffSET, a framework leveraging a pretrained Latent Diffusion Model (LDM) extensively trained on natural images, thus enabling effective adaptation to the EO domain. Remarkably, we find that the pretrained VAE encoder aligns SAR and EO images in the same latent space, even with varying noise levels in SAR inputs. To further improve pixel-wise fidelity for SET, we propose a confidence-guided diffusion (C-Diff) loss that mitigates artifacts from temporal discrepancies, such as appearing or disappearing objects, thereby enhancing structural accuracy. C-DiffSET achieves state-of-the-art (SOTA) results on multiple datasets, significantly outperforming very recent image-to-image translation methods and SET methods by large margins.

Title: Anatomy-Guided Radiology Report Generation with Pathology-Aware Regional Prompts

Authors: Yijian Gao, Dominic Marshall, Xiaodan Xing, Junzhi Ning, Giorgos Papanastasiou, Guang Yang, Matthieu Komorowski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10789
Pdf URL: https://arxiv.org/pdf/2411.10789
Copy Paste: [[2411.10789]] Anatomy-Guided Radiology Report Generation with Pathology-Aware Regional Prompts(https://arxiv.org/abs/2411.10789)
Keywords: generative
Abstract: Radiology reporting generative AI holds significant potential to alleviate clinical workloads and streamline medical care. However, achieving high clinical accuracy is challenging, as radiological images often feature subtle lesions and intricate structures. Existing systems often fall short, largely due to their reliance on fixed-size, patch-level image features and insufficient incorporation of pathological information. This can result in the neglect of such subtle patterns and inconsistent descriptions of crucial pathologies. To address these challenges, we propose an innovative approach that leverages pathology-aware regional prompts to explicitly integrate anatomical and pathological information of various scales, significantly enhancing the precision and clinical relevance of generated reports. We develop an anatomical region detector that extracts features from distinct anatomical areas, coupled with a novel multi-label lesion detector that identifies global pathologies. Our approach emulates the diagnostic process of radiologists, producing clinically accurate reports with comprehensive diagnostic capabilities. Experimental results show that our model outperforms previous state-of-the-art methods on most natural language generation and clinical efficacy metrics, with formal expert evaluations affirming its potential to enhance radiology practice.

Title: Test-time Conditional Text-to-Image Synthesis Using Diffusion Models

Authors: Tripti Shukla, Srikrishna Karanam, Balaji Vasan Srinivasan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10800
Pdf URL: https://arxiv.org/pdf/2411.10800
Copy Paste: [[2411.10800]] Test-time Conditional Text-to-Image Synthesis Using Diffusion Models(https://arxiv.org/abs/2411.10800)
Keywords: diffusion
Abstract: We consider the problem of conditional text-to-image synthesis with diffusion models. Most recent works need to either finetune specific parts of the base diffusion model or introduce new trainable parameters, leading to deployment inflexibility due to the need for training. To address this gap in the current literature, we propose TINTIN (Test-time Conditional Text-to-Image Synthesis using Diffusion Models), a new training-free, test-time-only algorithm to condition text-to-image diffusion model outputs on conditioning factors such as color palettes and edge maps. In particular, we propose to interpret noise predictions during denoising as gradients of an energy-based model, leading to a flexible approach to manipulate the noise by matching predictions inferred from them to the ground truth conditioning input. This results in, to the best of our knowledge, the first approach to control model outputs with input color palettes, which we realize using a novel color distribution matching loss. We also show that this test-time noise manipulation can be easily extended to other types of conditioning, e.g., edge maps. We conduct extensive experiments using a variety of text prompts, color palettes, and edge maps and demonstrate significant improvement over the current state-of-the-art, both qualitatively and quantitatively.

Title: Stable Continual Reinforcement Learning via Diffusion-based Trajectory Replay

Authors: Feng Chen, Fuguang Han, Cong Guan, Lei Yuan, Zhilong Zhang, Yang Yu, Zongzhang Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.10809
Pdf URL: https://arxiv.org/pdf/2411.10809
Copy Paste: [[2411.10809]] Stable Continual Reinforcement Learning via Diffusion-based Trajectory Replay(https://arxiv.org/abs/2411.10809)
Keywords: diffusion, generative
Abstract: Given the inherent non-stationarity prevalent in real-world applications, continual Reinforcement Learning (RL) aims to equip the agent with the capability to address a series of sequentially presented decision-making tasks. Within this problem setting, a pivotal challenge is catastrophic forgetting, wherein the agent is prone to losing the decision-making knowledge associated with previously encountered tasks when learning a new one. In recent work, generative replay methods have shown substantial potential by employing generative models to replay the data distribution of past tasks. Compared to storing the data from past tasks directly, this category of methods circumvents the growing storage overhead and possible data privacy concerns. However, constrained by the expressive capacity of generative models, existing generative replay methods face challenges in faithfully reconstructing the data distribution of past tasks, particularly in scenarios with a myriad of tasks or high-dimensional data. Inspired by the success of diffusion models in various generative tasks, this paper introduces a novel continual RL algorithm, DISTR (Diffusion-based Trajectory Replay), that employs a diffusion model to memorize the high-return trajectory distribution of each encountered task and wakes up these distributions during policy learning on new tasks. In addition, considering the impracticality of replaying all past data each time, a prioritization mechanism is proposed to prioritize the trajectory replay of pivotal tasks in our method. Empirical experiments on the popular continual RL benchmark Continual World demonstrate that our proposed method obtains a favorable balance between stability and plasticity, surpassing various existing continual RL baselines in average success rate.

Title: Conformation Generation using Transformer Flows

Authors: Sohil Atul Shah, Vladlen Koltun
Subjects: cs.LG, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2411.10817
Pdf URL: https://arxiv.org/pdf/2411.10817
Copy Paste: [[2411.10817]] Conformation Generation using Transformer Flows(https://arxiv.org/abs/2411.10817)
Keywords: generative
Abstract: Estimating three-dimensional conformations of a molecular graph allows insight into the molecule's biological and chemical functions. Fast generation of valid conformations is thus central to molecular modeling. Recent advances in graph-based deep networks have accelerated conformation generation from hours to seconds. However, current network architectures do not scale well to large molecules. Here we present ConfFlow, a flow-based model for conformation generation based on transformer networks. In contrast with existing approaches, ConfFlow directly samples in the coordinate space without enforcing any explicit physical constraints. The generative procedure is highly interpretable and is akin to force field updates in molecular dynamics simulation. When applied to the generation of large molecule conformations, ConfFlow improves accuracy by up to 40% relative to state-of-the-art learning-based methods. The source code is made available at this https URL.

Title: One-Layer Transformer Provably Learns One-Nearest Neighbor In Context

Authors: Zihao Li, Yuan Cao, Cheng Gao, Yihan He, Han Liu, Jason M. Klusowski, Jianqing Fan, Mengdi Wang
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2411.10830
Pdf URL: https://arxiv.org/pdf/2411.10830
Copy Paste: [[2411.10830]] One-Layer Transformer Provably Learns One-Nearest Neighbor In Context(https://arxiv.org/abs/2411.10830)
Keywords: in-context
Abstract: Transformers have achieved great success in recent years. Interestingly, transformers have shown particularly strong in-context learning capability -- even without fine-tuning, they are still able to solve unseen tasks well purely based on task-specific prompts. In this paper, we study the capability of one-layer transformers in learning one of the most classical nonparametric estimators, the one-nearest neighbor prediction rule. Under a theoretical framework where the prompt contains a sequence of labeled training data and unlabeled test data, we show that, although the loss function is nonconvex when trained with gradient descent, a single softmax attention layer can successfully learn to behave like a one-nearest neighbor classifier. Our result gives a concrete example of how transformers can be trained to implement nonparametric machine learning algorithms, and sheds light on the role of softmax attention in transformer models.
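As a rough illustration of why softmax attention can behave like a one-nearest neighbor rule, the sketch below hand-sets the attention score to the negative squared distance scaled by a sharpness parameter. This is an assumption made for illustration, not the parameterization or training procedure analyzed in the paper; the names `softmax_attention_predict`, `one_nn_predict`, and `beta` are hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softmax_attention_predict(query, keys, labels, beta):
    """Softmax-weighted average of training labels.

    Scores are -beta * squared distance (an illustrative choice), so a
    large beta concentrates almost all weight on the nearest key.
    """
    scores = [-beta * squared_distance(query, key) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, labels)) / total


def one_nn_predict(query, keys, labels):
    """Label of the single closest training point."""
    dists = [squared_distance(query, key) for key in keys]
    return labels[dists.index(min(dists))]


# Toy prompt: three labeled points and one unlabeled query.
keys = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.5)]
labels = [1.0, -1.0, 1.0]
query = (0.9, 0.8)

sharp = softmax_attention_predict(query, keys, labels, beta=50.0)
flat = softmax_attention_predict(query, keys, labels, beta=0.0)
nearest = one_nn_predict(query, keys, labels)
```

With `beta=0.0` the prediction is the plain average of all labels, while a large `beta` makes the attention output track the nearest neighbor's label.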

Title: Automatic Discovery and Assessment of Interpretable Systematic Errors in Semantic Segmentation

Authors: Jaisidh Singh, Sonam Singh, Amit Arvind Kale, Harsh K Gandhi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10845
Pdf URL: https://arxiv.org/pdf/2411.10845
Copy Paste: [[2411.10845]] Automatic Discovery and Assessment of Interpretable Systematic Errors in Semantic Segmentation(https://arxiv.org/abs/2411.10845)
Keywords: foundation model
Abstract: This paper presents a novel method for discovering systematic errors in segmentation models. For instance, a systematic error can be a sufficiently large number of pedestrians (the target class) that the model misclassifies as parking meters. With the rapid deployment of these models in critical applications such as autonomous driving, it is vital to detect and interpret these systematic errors. However, the key challenge is automatically discovering such failures on unlabelled data and forming interpretable semantic sub-groups for intervention. For this, we leverage multimodal foundation models to retrieve errors and use conceptual linkage along with erroneous nature to study the systematic nature of these errors. We demonstrate that such errors are present in SOTA segmentation models (UperNet ConvNeXt and UperNet Swin) trained on the Berkeley Deep Drive dataset, and we benchmark the approach qualitatively and quantitatively, showing its effectiveness by discovering coherent systematic errors for these models. Our work opens up an avenue for model analysis and intervention that has so far been underexplored in semantic segmentation.

Title: Large Vision-Language Models for Remote Sensing Visual Question Answering

Authors: Surasakdi Siripong, Apirak Chaiyapan, Thanakorn Phonchai
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.10857
Pdf URL: https://arxiv.org/pdf/2411.10857
Copy Paste: [[2411.10857]] Large Vision-Language Models for Remote Sensing Visual Question Answering(https://arxiv.org/abs/2411.10857)
Keywords: generative
Abstract: Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions. Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions. In this paper, we propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process. Our approach consists of a two-step training strategy: domain-adaptive pretraining and prompt-based finetuning. This method enables the LVLM to generate natural language answers by conditioning on both visual and textual inputs, without the need for predefined answer categories. We evaluate our model on the RSVQAxBEN dataset, demonstrating superior performance compared to state-of-the-art baselines. Additionally, a human evaluation study shows that our method produces answers that are more accurate, relevant, and fluent. The results highlight the potential of generative LVLMs in advancing the field of remote sensing analysis.

Title: See-Saw Generative Mechanism for Scalable Recursive Code Generation with Generative AI

Authors: Ruslan Idelfonso Magaña Vsevolodovna
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2411.10861
Pdf URL: https://arxiv.org/pdf/2411.10861
Copy Paste: [[2411.10861]] See-Saw Generative Mechanism for Scalable Recursive Code Generation with Generative AI(https://arxiv.org/abs/2411.10861)
Keywords: generative
Abstract: The generation of complex, large-scale code projects using generative AI models presents challenges due to token limitations, dependency management, and iterative refinement requirements. This paper introduces the See-Saw generative mechanism, a novel methodology for dynamic and recursive code generation. The proposed approach alternates between main code updates and dependency generation to ensure alignment and functionality. By dynamically optimizing token usage and incorporating key elements of the main code into the generation of dependencies, the method enables efficient and scalable code generation for projects requiring hundreds of interdependent files. The mechanism ensures that all code components are synchronized and functional, enabling scalable and efficient project generation. Experimental validation demonstrates the method's capability to manage dependencies effectively while maintaining coherence and minimizing computational overhead.

Title: Improvement in Facial Emotion Recognition using Synthetic Data Generated by Diffusion Model

Authors: Arnab Kumar Roy, Hemant Kumar Kathania, Adhitiya Sharma
Subjects: cs.CV, cs.HC, eess.IV
Abstract URL: https://arxiv.org/abs/2411.10863
Pdf URL: https://arxiv.org/pdf/2411.10863
Copy Paste: [[2411.10863]] Improvement in Facial Emotion Recognition using Synthetic Data Generated by Diffusion Model(https://arxiv.org/abs/2411.10863)
Keywords: diffusion, generative
Abstract: Facial Emotion Recognition (FER) plays a crucial role in computer vision, with significant applications in human-computer interaction, affective computing, and areas such as mental health monitoring and personalized learning environments. However, a major challenge in the FER task is the class imbalance commonly found in available datasets, which can hinder both model performance and generalization. In this paper, we tackle the issue of data imbalance by incorporating synthetic data augmentation and leveraging the ResEmoteNet model to enhance overall performance on the facial emotion recognition task. We employed Stable Diffusion 2 and Stable Diffusion 3 Medium models to generate synthetic facial emotion data, augmenting the training sets of the FER2013 and RAF-DB benchmark datasets. Training ResEmoteNet with these augmented datasets resulted in substantial performance improvements, achieving accuracies of 96.47% on FER2013 and 99.23% on RAF-DB. These findings show an absolute improvement of 16.68% on FER2013 and 4.47% on RAF-DB, highlight the efficacy of synthetic data augmentation in strengthening FER models, and underscore the potential of advanced generative models in FER research and applications. The source code for ResEmoteNet is available at this https URL

Title: MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation

Authors: Ansh Shah, K Madhava Krishna
Subjects: cs.CV, cs.AI, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2411.10886
Pdf URL: https://arxiv.org/pdf/2411.10886
Copy Paste: [[2411.10886]] MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation(https://arxiv.org/abs/2411.10886)
Keywords: diffusion, generative
Abstract: Recovering metric depth from a single image remains a fundamental challenge in computer vision, requiring both scene understanding and accurate scaling. While deep learning has advanced monocular depth estimation, current models often struggle with unfamiliar scenes and layouts, particularly in zero-shot scenarios and when predicting scale-ergodic metric depth. We present MetricGold, a novel approach that harnesses the rich priors of generative diffusion models to improve metric depth estimation. Building upon recent advances in MariGold, DDVM and Depth Anything V2, respectively, our method combines latent diffusion, log-scaled metric depth representation, and synthetic data training. MetricGold achieves efficient training on a single RTX 3090 within two days using photo-realistic synthetic data from HyperSIM, VirtualKitti, and TartanAir. Our experiments demonstrate robust generalization across diverse datasets, producing sharper and higher quality metric depth estimates compared to existing approaches.

Title: Watermarking Generative Categorical Data

Authors: Bochao Gu, Hengzhi He, Guang Cheng
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10898
Pdf URL: https://arxiv.org/pdf/2411.10898
Copy Paste: [[2411.10898]] Watermarking Generative Categorical Data(https://arxiv.org/abs/2411.10898)
Keywords: generative
Abstract: In this paper, we propose a novel statistical framework for watermarking generative categorical data. Our method systematically embeds pre-agreed secret signals by splitting the data distribution into two components and modifying one distribution based on a deterministic relationship with the other, ensuring the watermark is embedded at the distribution-level. To verify the watermark, we introduce an insertion inverse algorithm and detect its presence by measuring the total variation distance between the inverse-decoded data and the original distribution. Unlike previous categorical watermarking methods, which primarily focus on embedding watermarks into a given dataset, our approach operates at the distribution-level, allowing for verification from a statistical distributional perspective. This makes it particularly well-suited for the modern paradigm of synthetic data generation, where the underlying data distribution, rather than specific data points, is of primary importance. The effectiveness of our method is demonstrated through both theoretical analysis and empirical validation.
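The detection statistic mentioned in the abstract, the total variation distance between two categorical distributions, can be sketched as follows. This covers only the distance computation; the paper's embedding and inverse-decoding steps and any decision threshold are not reproduced, and the helper names and toy data are hypothetical.

```python
from collections import Counter


def empirical_distribution(samples):
    """Relative frequency of each category in a list of samples."""
    counts = Counter(samples)
    n = len(samples)
    return {category: count / n for category, count in counts.items()}


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q.

    Categories missing from one distribution are treated as probability 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in support)


# Toy reference distribution and two samples to compare against it.
reference = {"a": 0.5, "b": 0.3, "c": 0.2}
matching = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
shifted = ["a"] * 8 + ["b"] * 2

tv_matching = total_variation(empirical_distribution(matching), reference)
tv_shifted = total_variation(empirical_distribution(shifted), reference)
```

A small distance indicates that the sample is consistent with the reference distribution, and a larger one indicates a shift; how large a distance should count as a detection is left open here.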

Title: SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment

Authors: Quan Ze Chen, K.J. Kevin Feng, Chan Young Park, Amy X. Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10912
Pdf URL: https://arxiv.org/pdf/2411.10912
Copy Paste: [[2411.10912]] SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment(https://arxiv.org/abs/2411.10912)
Keywords: in-context
Abstract: Alignment of large language models (LLMs) to societal values should account for pluralistic values from diverse groups. One technique uses in-context learning for inference-time alignment, but only considers similarity when drawing few-shot examples, not accounting for cross-group differences in value prioritization. We propose SPICA, a framework for pluralistic alignment that accounts for group-level differences during in-context example retrieval. SPICA introduces three designs to facilitate pluralistic alignment: scenario banks, group-informed metrics, and in-context alignment prompts. From an evaluation of SPICA on an alignment task collecting inputs from four demographic groups ($n = 544$), our metrics retrieve in-context examples that more closely match observed preferences, with the best prompt configuration using multiple contrastive responses to demonstrate examples. In an end-to-end evaluation ($n = 80$), we observe that SPICA-aligned models are higher rated than a baseline similarity-only retrieval approach, with groups seeing up to a +0.16 point improvement on a 5 point scale. Additionally, gains from SPICA were more uniform, with all groups benefiting from alignment rather than only some. Finally, we find that while a group-agnostic approach can effectively align to aggregated values, it is not most suited for aligning to divergent groups.

Title: Generating Compositional Scenes via Text-to-image RGBA Instance Generation

Authors: Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10913
Pdf URL: https://arxiv.org/pdf/2411.10913
Copy Paste: [[2411.10913]] Generating Compositional Scenes via Text-to-image RGBA Instance Generation(https://arxiv.org/abs/2411.10913)
Keywords: diffusion, generative
Abstract: Text-to-image diffusion generative models can generate high-quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space, and scene manipulation abilities. In this work, we propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity. To ensure control over instance attributes, we devise a novel training paradigm to adapt a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components in realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse and high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach makes it possible to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.

Title: LLM-assisted Physical Invariant Extraction for Cyber-Physical Systems Anomaly Detection

Authors: Danial Abshari, Chenglong Fu, Meera Sridhar
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10918
Pdf URL: https://arxiv.org/pdf/2411.10918
Copy Paste: [[2411.10918]] LLM-assisted Physical Invariant Extraction for Cyber-Physical Systems Anomaly Detection(https://arxiv.org/abs/2411.10918)
Keywords: generative, anomaly
Abstract: Modern industrial infrastructures rely heavily on Cyber-Physical Systems (CPS), but these are vulnerable to cyber-attacks with potentially catastrophic effects. To reduce these risks, anomaly detection methods based on physical invariants have been developed. However, these methods often require domain-specific expertise to manually define invariants, making them costly and difficult to scale. To address this limitation, we propose a novel approach to extract physical invariants from CPS testbeds for anomaly detection. Our insight is that CPS design documentation often contains semantically rich descriptions of physical procedures, which can profile inter-correlated dynamics among system components. Leveraging the built-in physics and engineering knowledge of recent generative AI models, we aim to automate this traditionally manual process, improving scalability and reducing costs. This work focuses on designing and optimizing a Retrieval-Augmented-Generation (RAG) workflow with a customized prompting system tailored for CPS documentation, enabling accurate extraction of semantic information and inference of physical invariants from complex, multimodal content. Then, rather than directly applying the inferred invariants for anomaly detection, we introduce an innovative statistics-based learning approach that integrates these invariants into the training dataset. This method addresses limitations such as hallucination and concept drift, enhancing the reliability of the model. We evaluate our approach on a real-world public CPS security dataset that contains 86 data points and 58 attack cases. The results show that our approach achieves a high precision of 0.923, accurately detecting anomalies while minimizing false alarms.

Title: Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment

Authors: Arushi Gupta, Rafal Kocielnik, Jiayun Wang, Firdavs Nasriddinov, Cherine Yang, Elyssa Wong, Anima Anandkumar, Andrew Hung
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.10919
Pdf URL: https://arxiv.org/pdf/2411.10919
Copy Paste: [[2411.10919]] Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment(https://arxiv.org/abs/2411.10919)
Keywords: self-supervised
Abstract: During surgical training, real-time feedback from trainers to trainees is important for preventing errors and enhancing long-term skill acquisition. Accurately predicting the effectiveness of this feedback, specifically whether it leads to a change in trainee behavior, is crucial for developing methods for improving surgical training and education. However, relying on human annotations to assess feedback effectiveness is laborious and prone to biases, underscoring the need for an automated, scalable, and objective method. Creating such an automated system poses challenges, as it requires an understanding of both the verbal feedback delivered by the trainer and the visual context of the real-time surgical scene. To address this, we propose a method that integrates information from transcribed verbal feedback and corresponding surgical video to predict feedback effectiveness. Our findings show that both transcribed feedback and surgical video are individually predictive of trainee behavior changes, and their combination achieves an AUROC of 0.70+/-0.02, improving prediction accuracy by up to 6.6%. Additionally, we introduce self-supervised fine-tuning as a strategy for enhancing surgical video representation learning, which is scalable and further enhances prediction performance. Our results demonstrate the potential of multi-modal learning to advance the automated assessment of surgical feedback.

Title: Constrained Diffusion with Trust Sampling

Authors: William Huang, Yifeng Jiang, Tom Van Wouwe, C. Karen Liu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.10932
Pdf URL: https://arxiv.org/pdf/2411.10932
Copy Paste: [[2411.10932]] Constrained Diffusion with Trust Sampling(https://arxiv.org/abs/2411.10932)
Keywords: diffusion, generative
Abstract: Diffusion models have demonstrated significant promise in various generative tasks; however, they often struggle to satisfy challenging constraints. Our approach addresses this limitation by rethinking training-free loss-guided diffusion from an optimization perspective. We formulate a series of constrained optimizations throughout the inference process of a diffusion model. In each optimization, we allow the sample to take multiple steps along the gradient of the proxy constraint function until we can no longer trust the proxy, according to the variance at each diffusion level. Additionally, we estimate the state manifold of the diffusion model to allow for early termination when the sample starts to wander away from the state manifold at each diffusion step. Trust sampling effectively balances following the unconditional diffusion model against adhering to the loss guidance, enabling more flexible and accurate constrained generation. We demonstrate the efficacy of our method through extensive experiments on complex tasks, and in the drastically different domains of image and 3D motion generation, showing significant improvements over existing methods in terms of generation quality. Our implementation is available at this https URL.

Title: Iterative Camera-LiDAR Extrinsic Optimization via Surrogate Diffusion

Authors: Ni Ou, Zhuo Chen, Xinru Zhang, Junzheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10936
Pdf URL: https://arxiv.org/pdf/2411.10936
Copy Paste: [[2411.10936]] Iterative Camera-LiDAR Extrinsic Optimization via Surrogate Diffusion(https://arxiv.org/abs/2411.10936)
Keywords: diffusion
Abstract: Cameras and LiDAR are essential sensors for autonomous vehicles. Camera-LiDAR data fusion compensates for deficiencies of stand-alone sensors but relies on precise extrinsic calibration. Many learning-based calibration methods predict extrinsic parameters in a single step. Driven by the growing demand for higher accuracy, a few approaches utilize multi-range models or integrate multiple methods to improve extrinsic parameter predictions, but these strategies incur extended training times and require additional storage for separate models. To address these issues, we propose a single-model iterative approach based on surrogate diffusion to significantly enhance the capacity of individual calibration methods. By applying a buffering technique that we propose, the inference time of our surrogate diffusion is 43.7% less than that of multi-range models. Additionally, we create a calibration network as our denoiser, featuring both projection-first and encoding-first branches for effective point feature extraction. Extensive experiments demonstrate that our diffusion model outperforms other single-model iterative methods and delivers competitive results compared to multi-range models. Our denoiser exceeds state-of-the-art calibration methods, reducing the rotation error by 24.5% compared to the second-best method. Furthermore, with the proposed diffusion applied, it achieves 20.4% less rotation error and 9.6% less translation error.

Title: Anomaly Detection for People with Visual Impairments Using an Egocentric 360-Degree Camera

Authors: Inpyo Song, Sanghyeon Lee, Minjun Joo, Jangwon Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10945
Pdf URL: https://arxiv.org/pdf/2411.10945
Copy Paste: [[2411.10945]] Anomaly Detection for People with Visual Impairments Using an Egocentric 360-Degree Camera(https://arxiv.org/abs/2411.10945)
Keywords: anomaly
Abstract: Recent advancements in computer vision have led to a renewed interest in developing assistive technologies for individuals with visual impairments. Although extensive research has been conducted in the field of computer vision-based assistive technologies, most of the focus has been on understanding contexts in images, rather than addressing their physical safety and security concerns. To address this challenge, we propose the first step towards detecting anomalous situations for visually impaired people by observing their entire surroundings using an egocentric 360-degree camera. We first introduce a novel egocentric 360-degree video dataset called VIEW360 (Visually Impaired Equipped with Wearable 360-degree camera), which contains abnormal activities that visually impaired individuals may encounter, such as shoulder surfing and pickpocketing. Furthermore, we propose a new architecture called the FDPN (Frame and Direction Prediction Network), which facilitates frame-level prediction of abnormal events and identification of their directions. Finally, we evaluate our approach on our VIEW360 dataset and the publicly available UCF-Crime and ShanghaiTech datasets, demonstrating state-of-the-art performance.

Title: Direct and Explicit 3D Generation from a Single Image

Authors: Haoyu Wu, Meher Gitika Karumuri, Chuhang Zou, Seungbae Bang, Yuelong Li, Dimitris Samaras, Sunil Hadap
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10947
Pdf URL: https://arxiv.org/pdf/2411.10947
Copy Paste: [[2411.10947]] Direct and Explicit 3D Generation from a Single Image(https://arxiv.org/abs/2411.10947)
Keywords: diffusion
Abstract: Current image-to-3D approaches suffer from high computational costs and lack scalability for high-resolution outputs. In contrast, we introduce a novel framework to directly generate explicit surface geometry and texture using multi-view 2D depth and RGB images along with 3D Gaussian features using a repurposed Stable Diffusion model. We introduce a depth branch into U-Net for efficient and high quality multi-view, cross-domain generation and incorporate epipolar attention into the latent-to-pixel decoder for pixel-level multi-view consistency. By back-projecting the generated depth pixels into 3D space, we create a structured 3D representation that can be either rendered via Gaussian splatting or extracted to high-quality meshes, thereby leveraging additional novel view synthesis loss to further improve our performance. Extensive experiments demonstrate that our method surpasses existing baselines in geometry and texture quality while achieving significantly faster generation time.

Title: Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

Authors: Zeping Yu, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10950
Pdf URL: https://arxiv.org/pdf/2411.10950
Copy Paste: [[2411.10950]] Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering(https://arxiv.org/abs/2411.10950)
Keywords: in-context
Abstract: Understanding the mechanisms behind Large Language Models (LLMs) is crucial for designing improved models and strategies. While recent studies have yielded valuable insights into the mechanisms of textual LLMs, the mechanisms of Multi-modal Large Language Models (MLLMs) remain underexplored. In this paper, we apply mechanistic interpretability methods to analyze the visual question answering (VQA) mechanisms in the first MLLM, Llava. We compare the mechanisms between VQA and textual QA (TQA) in color answering tasks and find that: a) VQA exhibits a mechanism similar to the in-context learning mechanism observed in TQA; b) the visual features exhibit significant interpretability when projecting the visual embeddings into the embedding space; and c) Llava enhances the existing capabilities of the corresponding textual LLM Vicuna during visual instruction tuning. Based on these findings, we develop an interpretability tool to help users and researchers identify important visual locations for final predictions, aiding in the understanding of visual hallucination. Our method demonstrates faster and more effective results compared to existing interpretability approaches. Code: this https URL

Title: TeG: Temporal-Granularity Method for Anomaly Detection with Attention in Smart City Surveillance

Authors: Erkut Akdag, Egor Bondarev, Peter H. N. De With
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11003
Pdf URL: https://arxiv.org/pdf/2411.11003
Copy Paste: [[2411.11003]] TeG: Temporal-Granularity Method for Anomaly Detection with Attention in Smart City Surveillance(https://arxiv.org/abs/2411.11003)
Keywords: anomaly
Abstract: Anomaly detection in video surveillance has recently gained interest from the research community. The temporal duration of anomalies varies within video streams, leading to complications in learning the temporal dynamics of specific events. This paper presents a temporal-granularity method for an anomaly detection model (TeG) in real-world surveillance, combining spatio-temporal features at different time-scales. The TeG model employs multi-head cross-attention blocks and multi-head self-attention blocks for this purpose. Additionally, we extend the UCF-Crime dataset with new anomaly types relevant to a Smart City research project. The TeG model is deployed and validated in a city surveillance system, achieving successful real-time results in industrial settings.

Title: Time Step Generating: A Universal Synthesized Deepfake Image Detector

Authors: Ziyue Zeng, Haoyuan Liu, Dingjie Peng, Luoxu Jing, Hiroshi Watanabe
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11016
Pdf URL: https://arxiv.org/pdf/2411.11016
Copy Paste: [[2411.11016]] Time Step Generating: A Universal Synthesized Deepfake Image Detector(https://arxiv.org/abs/2411.11016)
Keywords: diffusion, generative
Abstract: Currently, high-fidelity text-to-image models are being developed at an accelerating pace. Among them, Diffusion Models have led to a remarkable improvement in the quality of image generation, making it very challenging to distinguish between real and synthesized images. This simultaneously raises serious concerns regarding privacy and security. Some methods have been proposed to distinguish images generated by diffusion models through reconstruction. However, the inversion and denoising processes are time-consuming and heavily reliant on the pre-trained generative model. Consequently, if the pre-trained generative model encounters out-of-domain data, the detection performance declines. To address this issue, we propose a universal synthetic image detector, Time Step Generating (TSG), which does not rely on pre-trained models' reconstruction ability, specific datasets, or sampling algorithms. Our method utilizes a pre-trained diffusion model's network as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step t of the network input, we can effectively extract these distinguishing detail features. Then, those features can be passed through a classifier (i.e., ResNet), which efficiently detects whether an image is synthetic or real. We test the proposed TSG on the large-scale GenImage benchmark, and it achieves significant improvements in both accuracy and generalizability.

Title: StableV2V: Stablizing Shape Consistency in Video-to-Video Editing

Authors: Chang Liu, Rui Li, Kaidong Zhang, Yunwei Lan, Dong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11045
Pdf URL: https://arxiv.org/pdf/2411.11045
Copy Paste: [[2411.11045]] StableV2V: Stablizing Shape Consistency in Video-to-Video Editing(https://arxiv.org/abs/2411.11045)
Keywords: generative
Abstract: Recent advancements in generative AI have significantly promoted content creation and editing, and prevailing studies further extend this progress to video editing. In doing so, these studies mainly transfer the inherent motion patterns from the source videos to the edited ones, and results that are inconsistent with user prompts are often observed, due to the lack of particular alignments between the delivered motions and edited contents. To address this limitation, we present a shape-consistent video editing method, namely StableV2V, in this paper. Our method decomposes the entire editing pipeline into several sequential procedures, where it edits the first video frame, then establishes an alignment between the delivered motions and user prompts, and eventually propagates the edited contents to all other frames based on such alignment. Furthermore, we curate a testing benchmark, namely DAVIS-Edit, for a comprehensive evaluation of video editing, considering various types of prompts and difficulties. Experimental results and analyses illustrate the superior performance, visual consistency, and inference efficiency of our method compared to existing state-of-the-art studies.

Title: D-Cube: Exploiting Hyper-Features of Diffusion Model for Robust Medical Classification

Authors: Minhee Jang, Juheon Son, Thanaporn Viriyasaranon, Junho Kim, Jang-Hwan Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11087
Pdf URL: https://arxiv.org/pdf/2411.11087
Copy Paste: [[2411.11087]] D-Cube: Exploiting Hyper-Features of Diffusion Model for Robust Medical Classification(https://arxiv.org/abs/2411.11087)
Keywords: diffusion
Abstract: The integration of deep learning technologies in medical imaging aims to enhance the efficiency and accuracy of cancer diagnosis, particularly for pancreatic and breast cancers, which present significant diagnostic challenges due to their high mortality rates and complex imaging characteristics. This paper introduces Diffusion-Driven Diagnosis (D-Cube), a novel approach that leverages hyper-features from a diffusion model combined with contrastive learning to improve cancer diagnosis. D-Cube employs advanced feature selection techniques that utilize the robust representational capabilities of diffusion models, enhancing classification performance on medical datasets under challenging conditions such as data imbalance and limited sample availability. The feature selection process optimizes the extraction of clinically relevant features, significantly improving classification accuracy and demonstrating resilience in imbalanced and limited data scenarios. Experimental results validate the effectiveness of D-Cube across multiple medical imaging modalities, including CT, MRI, and X-ray, showing superior performance compared to existing baseline models. D-Cube represents a new strategy in cancer detection, employing advanced deep learning techniques to achieve state-of-the-art diagnostic accuracy and efficiency.

Title: Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method

Authors: Yan Zheng, Zhenxiao Liang, Xiaoyan Cong, Lanqing guo, Yuehao Wang, Peihao Wang, Zhangyang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11135
Pdf URL: https://arxiv.org/pdf/2411.11135
Copy Paste: [[2411.11135]] Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method(https://arxiv.org/abs/2411.11135)
Keywords: diffusion
Abstract: We explore the oscillatory behavior observed in inversion methods applied to large-scale text-to-image diffusion models, with a focus on the "Flux" model. By employing a fixed-point-inspired iterative approach to invert real-world images, we observe that the solution does not achieve convergence, instead oscillating between distinct clusters. Through both toy experiments and real-world diffusion models, we demonstrate that these oscillating clusters exhibit notable semantic coherence. We offer theoretical insights, showing that this behavior arises from oscillatory dynamics in rectified flow models. Building on this understanding, we introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing. Furthermore, we provide quantitative results demonstrating the effectiveness of our method for tasks such as image enhancement, makeup transfer, reconstruction quality, and guided sampling quality. Higher-quality examples of videos and images are available at this https URL.

Title: Infinite Width Limits of Self Supervised Neural Networks

Authors: Maximilian Fleissner, Gautham Govind Anil, Debarghya Ghoshdastidar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11176
Pdf URL: https://arxiv.org/pdf/2411.11176
Copy Paste: [[2411.11176]] Infinite Width Limits of Self Supervised Neural Networks(https://arxiv.org/abs/2411.11176)
Keywords: self-supervised
Abstract: The NTK is a widely used tool in the theoretical analysis of deep learning, allowing us to look at supervised deep neural networks through the lenses of kernel regression. Recently, several works have investigated kernel models for self-supervised learning, hypothesizing that these also shed light on the behaviour of wide neural networks by virtue of the NTK. However, it remains an open question to what extent this connection is mathematically sound -- it is a commonly encountered misbelief that the kernel behaviour of wide neural networks emerges irrespective of the loss function it is trained on. In this paper, we bridge the gap between the NTK and self-supervised learning, focusing on two-layer neural networks trained under the Barlow Twins loss. We prove that the NTK of Barlow Twins indeed becomes constant as the width of the network approaches infinity. Our analysis technique is different from previous works on the NTK and may be of independent interest. Overall, our work provides a first rigorous justification for the use of classic kernel theory to understand self-supervised learning of wide neural networks. Building on this result, we derive generalization error bounds for kernelized Barlow Twins and connect them to neural networks of finite width.
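Note: for readers unfamiliar with the objective analyzed above, the following is a minimal NumPy sketch of the Barlow Twins loss as it is commonly stated (a cross-correlation matrix between two embedding views pushed toward the identity). It is not taken from this paper; the function name `barlow_twins_loss`, the weight `lam`, and the stabilizer `eps` are illustrative assumptions, and the paper's exact two-layer setup may differ.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-12):
    """Barlow Twins objective for two views of shape (batch, dim).

    lam weights the off-diagonal (redundancy) term; its value here is
    an illustrative assumption.
    """
    n = z_a.shape[0]
    # Standardize each feature over the batch dimension.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (dim, dim).
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy term
    return on_diag + lam * off_diag


# Illustrative inputs: two decorrelated, unit-variance features.
z = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
loss_same = barlow_twins_loss(z, z)      # views agree, features decorrelated
loss_flipped = barlow_twins_loss(z, -z)  # views perfectly anti-correlated
```

The loss is near zero when the two views agree and the features are decorrelated, and grows when the diagonal of the cross-correlation matrix moves away from one.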

Title: Enhanced Anime Image Generation Using USE-CMHSA-GAN

Authors: J. Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11179
Pdf URL: https://arxiv.org/pdf/2411.11179
Copy Paste: [[2411.11179]] Enhanced Anime Image Generation Using USE-CMHSA-GAN(https://arxiv.org/abs/2411.11179)
Keywords: generative
Abstract: With the growing popularity of ACG (Anime, Comics, and Games) culture, generating high-quality anime character images has become an important research topic. This paper introduces a novel Generative Adversarial Network model, USE-CMHSA-GAN, designed to produce high-quality anime character images. The model builds upon the traditional DCGAN framework, incorporating USE and CMHSA modules to enhance feature extraction capabilities for anime character images. Experiments were conducted on the anime-face-dataset, and the results demonstrate that USE-CMHSA-GAN outperforms other benchmark models, including DCGAN, VAE-GAN, and WGAN, in terms of FID and IS scores, indicating superior image quality. These findings suggest that USE-CMHSA-GAN is highly effective for anime character image generation and provides new insights for further improving the quality of generative models.

Title: AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers

Authors: Jake Grigsby, Justin Sasek, Samyak Parajuli, Daniel Adebi, Amy Zhang, Yuke Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11188
Pdf URL: https://arxiv.org/pdf/2411.11188
Copy Paste: [[2411.11188]] AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers(https://arxiv.org/abs/2411.11188)
Keywords: in-context
Abstract: Language models trained on diverse datasets unlock generalization by in-context learning. Reinforcement Learning (RL) policies can achieve a similar effect by meta-learning within the memory of a sequence model. However, meta-RL research primarily focuses on adapting to minor variations of a single task. It is difficult to scale towards more general behavior without confronting challenges in multi-task optimization, and few solutions are compatible with meta-RL's goal of learning from large training sets of unlabeled tasks. To address this challenge, we revisit the idea that multi-task RL is bottlenecked by imbalanced training losses created by uneven return scales across different tasks. We build upon recent advancements in Transformer-based (in-context) meta-RL and evaluate a simple yet scalable solution where both an agent's actor and critic objectives are converted to classification terms that decouple optimization from the current scale of returns. Large-scale comparisons in Meta-World ML45, Multi-Game Procgen, Multi-Task POPGym, Multi-Game Atari, and BabyAI find that this design unlocks significant progress in online multi-task adaptation and memory problems without explicit task labels.

Title: SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach

Authors: Ruoxi Sun, Jiamin Chang, Hammond Pearce, Chaowei Xiao, Bo Li, Qi Wu, Surya Nepal, Minhui Xue
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11195
Pdf URL: https://arxiv.org/pdf/2411.11195
Copy Paste: [[2411.11195]] SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach(https://arxiv.org/abs/2411.11195)
Keywords: foundation model
Abstract: Multimodal foundation models (MFMs) represent a significant advancement in artificial intelligence, combining diverse data modalities to enhance learning and understanding across a wide range of applications. However, this integration also brings unique safety and security challenges. In this paper, we conceptualize cybersafety and cybersecurity in the context of multimodal learning and present a comprehensive Systematization of Knowledge (SoK) to unify these concepts in MFMs, identifying key threats to these models. We propose a taxonomy framework grounded in information theory, evaluating and categorizing threats through the concepts of channel capacity, signal, noise, and bandwidth. This approach provides a novel framework that unifies model safety and system security in MFMs, offering a more comprehensive and actionable understanding of the risks involved. We used this to explore existing defense mechanisms, and identified gaps in current research - particularly, a lack of protection for alignment between modalities and a need for more systematic defense methods. Our work contributes to a deeper understanding of the security and safety landscape in MFMs, providing researchers and practitioners with valuable insights for improving the robustness and reliability of these models.

Title: Stealing Training Graphs from Graph Neural Networks

Authors: Minhua Lin, Enyan Dai, Junjie Xu, Jinyuan Jia, Xiang Zhang, Suhang Wang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2411.11197
Pdf URL: https://arxiv.org/pdf/2411.11197
Copy Paste: [[2411.11197]] Stealing Training Graphs from Graph Neural Networks(https://arxiv.org/abs/2411.11197)
Keywords: diffusion
Abstract: Graph Neural Networks (GNNs) have shown promising results in modeling graphs in various tasks. The training of GNNs, especially on specialized tasks such as bioinformatics, demands extensive expert annotations, which are expensive and usually contain sensitive information of data providers. The trained GNN models are often shared for deployment in the real world. As neural networks can memorize the training samples, the model parameters of GNNs have a high risk of leaking private training data. Our theoretical analysis shows the strong connections between trained GNN parameters and the training graphs used, confirming the training graph leakage issue. However, explorations into training data leakage from trained GNNs are rather limited. Therefore, we investigate a novel problem of stealing graphs from trained GNNs. To obtain high-quality graphs that resemble the target training set, a graph diffusion model with diffusion noise optimization is deployed as a graph generator. Furthermore, we propose a selection method that effectively leverages GNN model parameters to identify training graphs from samples generated by the graph diffusion model. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed framework in stealing training graphs from the trained GNN.

Title: Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition

Authors: Tiancheng Lin, Jinglei Zhang, Yi Xu, Kai Chen, Rui Zhang, Chang-Wen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11219
Pdf URL: https://arxiv.org/pdf/2411.11219
Copy Paste: [[2411.11219]] Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition(https://arxiv.org/abs/2411.11219)
Keywords: self-supervised
Abstract: Context-aware methods have achieved remarkable advancements in supervised scene text recognition by leveraging semantic priors from words. Considering the heterogeneity of text and background in STR, we propose that such contextual priors can be reinterpreted as the relations between textual elements, serving as effective self-supervised labels for representation learning. However, textual relations are restricted to the finite size of the dataset due to lexical dependencies, which causes an over-fitting problem and thus compromises the representation quality. To address this, our work introduces a unified framework of Relational Contrastive Learning and Masked Image Modeling for STR (RCMSTR), which explicitly models the enriched textual relations. For the RCL branch, we first introduce the relational rearrangement module to cultivate new relations on the fly. Based on this, we further conduct relational contrastive learning to model the intra- and inter-hierarchical relations for frames, sub-words and words. On the other hand, MIM can naturally boost the context information via masking, where we find that the block masking strategy is more effective for STR. For the effective integration of RCL and MIM, we also introduce a novel decoupling design aimed at mitigating the impact of masked images on contrastive learning. Additionally, to enhance the compatibility of MIM with CNNs, we propose the adoption of sparse convolutions and directly sharing the weights with dense convolutions in training. The proposed RCMSTR demonstrates superior performance in various evaluation protocols for different STR-related downstream tasks, outperforming the existing state-of-the-art self-supervised STR techniques. Ablation studies and qualitative experimental results further validate the effectiveness of our method. The code and pre-trained models will be available at this https URL.

Title: Efficient Transfer Learning for Video-language Foundation Models

Authors: Haoxing Chen, Zizheng Huang, Yan Hong, Yanshuo Wang, Zhongcai Lyu, Zhuoer Xu, Jun Lan, Zhangxuan Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11223
Pdf URL: https://arxiv.org/pdf/2411.11223
Copy Paste: [[2411.11223]] Efficient Transfer Learning for Video-language Foundation Models(https://arxiv.org/abs/2411.11223)
Keywords: foundation model
Abstract: Pre-trained vision-language models provide a robust foundation for efficient transfer learning across various downstream tasks. In the field of video action recognition, mainstream approaches often introduce additional parameter modules to capture temporal information. While the increased model capacity brought by these additional parameters helps better fit the video-specific inductive biases, existing methods require learning a large number of parameters and are prone to catastrophic forgetting of the original generalizable knowledge. In this paper, we propose a simple yet effective Multi-modal Spatio-Temporal Adapter (MSTA) to improve the alignment between representations in the text and vision branches, achieving a balance between general knowledge and task-specific knowledge. Furthermore, to mitigate over-fitting and enhance generalizability, we introduce a spatio-temporal description-guided consistency constraint. This constraint involves feeding template inputs (i.e., "a video of {cls}") into the trainable language branch, while LLM-generated spatio-temporal descriptions are input into the pre-trained language branch, enforcing consistency between the outputs of the two branches. This mechanism prevents over-fitting to downstream tasks and improves the distinguishability of the trainable branch within the spatio-temporal semantic space. We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning. Compared to many state-of-the-art methods, our MSTA achieves outstanding performance across all evaluations, while using only 2-7% of the trainable parameters in the original model. Code will be available at this https URL.

Title: MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis

Authors: Yingjie Zhou, Zicheng Zhang, Jiezhang Cao, Jun Jia, Yanwei Jiang, Farong Wen, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11235
Pdf URL: https://arxiv.org/pdf/2411.11235
Copy Paste: [[2411.11235]] MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis(https://arxiv.org/abs/2411.11235)
Keywords: generative
Abstract: Artificial Intelligence (AI) has demonstrated significant capabilities in various fields, and in areas such as human-computer interaction (HCI), embodied intelligence, and the design and animation of virtual digital humans, both practitioners and users are increasingly concerned with AI's ability to understand and express emotion. Consequently, the question of whether AI can accurately interpret human emotions remains a critical challenge. To date, two primary classes of AI models have been involved in human emotion analysis: generative models and Multimodal Large Language Models (MLLMs). To assess the emotional capabilities of these two classes of models, this study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions, generated by 12 Text-to-Image (T2I) models. Unlike previous works, MEMO-Bench provides a framework for evaluating both T2I models and MLLMs in the context of sentiment analysis. Additionally, a progressive evaluation approach is employed, moving from coarse-grained to fine-grained metrics, to offer a more detailed and comprehensive assessment of the sentiment analysis capabilities of MLLMs. The experimental results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones. Meanwhile, although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy, particularly in fine-grained emotion analysis. The MEMO-Bench will be made publicly available to support further research in this area.

Title: ZeFaV: Boosting Large Language Models for Zero-shot Fact Verification

Authors: Son T. Luu, Hiep Nguyen, Trung Vo, Le-Minh Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11247
Pdf URL: https://arxiv.org/pdf/2411.11247
Copy Paste: [[2411.11247]] ZeFaV: Boosting Large Language Models for Zero-shot Fact Verification(https://arxiv.org/abs/2411.11247)
Keywords: in-context
Abstract: In this paper, we propose ZeFaV, a zero-shot fact-checking verification framework that enhances the performance of large language models on the fact verification task. ZeFaV leverages the in-context learning ability of large language models to extract the relations among the entities within a claim, reorganize the information from the evidence in a relationally logical form, and combine this information with the original evidence to generate the context from which our fact-checking model provides verdicts for the input claims. We conducted empirical experiments to evaluate our approach on two multi-hop fact-checking datasets, HoVer and FEVEROUS, and achieved promising results comparable to other state-of-the-art fact verification methods.

Title: Effective Predictive Modeling for Emergency Department Visits and Evaluating Exogenous Variables Impact: Using Explainable Meta-learning Gradient Boosting

Authors: Mehdi Neshat, Michael Phipps, Nikhil Jha, Danial Khojasteh, Michael Tong, Amir Gandomi
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2411.11275
Pdf URL: https://arxiv.org/pdf/2411.11275
Copy Paste: [[2411.11275]] Effective Predictive Modeling for Emergency Department Visits and Evaluating Exogenous Variables Impact: Using Explainable Meta-learning Gradient Boosting(https://arxiv.org/abs/2411.11275)
Keywords: foundation model
Abstract: Over an extensive duration, administrators and clinicians have endeavoured to predict Emergency Department (ED) visits with precision, aiming to optimise resource distribution. Despite the proliferation of diverse AI-driven models tailored for precise prognostication, this task persists as a formidable challenge, besieged by constraints such as restrained generalisability, susceptibility to overfitting and underfitting, scalability issues, and complex fine-tuning hyper-parameters. In this study, we introduce a novel Meta-learning Gradient Booster (Meta-ED) approach for precisely forecasting daily ED visits and leveraging a comprehensive dataset of exogenous variables, including socio-demographic characteristics, healthcare service use, chronic diseases, diagnosis, and climate parameters spanning 23 years from Canberra Hospital in ACT, Australia. The proposed Meta-ED consists of four foundational learners (Catboost, Random Forest, Extra Tree, and lightGBoost) alongside a dependable top-level learner, Multi-Layer Perceptron (MLP), combining the unique capabilities of varied base models (sub-learners). Our study assesses the efficacy of the Meta-ED model through an extensive comparative analysis involving 23 models. The evaluation outcomes reveal a notable superiority of Meta-ED over the other models in accuracy at 85.7% (95% CI: 85.4%, 86.0%) and across a spectrum of 10 evaluation metrics. Notably, when compared with prominent techniques, XGBoost, Random Forest (RF), AdaBoost, LightGBoost, and Extra Tree (ExT), Meta-ED showcases substantial accuracy enhancements of 58.6%, 106.3%, 22.3%, 7.0%, and 15.7%, respectively. Furthermore, incorporating weather-related features demonstrates a 3.25% improvement in the prediction accuracy of visitors' numbers. The encouraging outcomes of our study underscore Meta-ED as a foundation model for the precise prediction of daily ED visitors.

Title: Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets, Techniques and Applications

Authors: Scarlett Raine, Frederic Maire, Niko Suenderhauf, Tobias Fischer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11287
Pdf URL: https://arxiv.org/pdf/2411.11287
Copy Paste: [[2411.11287]] Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets, Techniques and Applications(https://arxiv.org/abs/2411.11287)
Keywords: self-supervised
Abstract: Underwater surveys provide long-term data for informing management strategies, monitoring coral reef health, and estimating blue carbon stocks. Advances in broad-scale survey methods, such as robotic underwater vehicles, have increased the range of marine surveys but generate large volumes of imagery requiring analysis. Computer vision methods such as semantic segmentation aid automated image analysis, but typically rely on fully supervised training with extensive labelled data. While ground truth label masks for tasks like street scene segmentation can be quickly and affordably generated by non-experts through crowdsourcing services like Amazon Mechanical Turk, ecology presents greater challenges. The complexity of underwater images, coupled with the specialist expertise needed to accurately identify species at the pixel level, makes this process costly, time-consuming, and heavily dependent on domain experts. In recent years, some works have performed automated analysis of underwater imagery, and a smaller number of studies have focused on weakly supervised approaches which aim to reduce the expert-provided labelled data required. This survey focuses on approaches which reduce dependency on human expert input, while reviewing the prior and related approaches to position these works in the wider field of underwater perception. Further, we offer an overview of coastal ecosystems and the challenges of underwater imagery. We provide background on weakly and self-supervised deep learning and integrate these elements into a taxonomy that centres on the intersection of underwater monitoring, computer vision, and deep learning, while motivating approaches for weakly supervised deep learning with reduced dependency on domain expert data annotations. Lastly, the survey examines available datasets and platforms, and identifies gaps, barriers, and opportunities for automating underwater surveys.

Title: SADDE: Semi-supervised Anomaly Detection with Dependable Explanations

Authors: Yachao Yuan, Yu Huang, Yali Yuan, Jin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11293
Pdf URL: https://arxiv.org/pdf/2411.11293
Copy Paste: [[2411.11293]] SADDE: Semi-supervised Anomaly Detection with Dependable Explanations(https://arxiv.org/abs/2411.11293)
Keywords: anomaly
Abstract: Semi-supervised learning holds a pivotal position in anomaly detection applications, yet identifying anomaly patterns with a limited number of labeled samples poses a significant challenge. Furthermore, the absence of interpretability poses major obstacles to the practical adoption of semi-supervised frameworks. The majority of existing interpretation techniques are tailored for supervised/unsupervised frameworks or non-security domains, falling short in providing dependable interpretations. In this research paper, we introduce SADDE, a general framework designed to accomplish two primary objectives: (1) to render the anomaly detection process interpretable and enhance the credibility of interpretation outcomes, and (2) to assign high-confidence pseudo labels to unlabeled samples, thereby boosting the performance of anomaly detection systems when supervised data is scarce. To achieve the first objective, we devise a cutting-edge interpretation method that utilizes both global and local interpreters to furnish trustworthy explanations. For the second objective, we conceptualize a novel two-stage semi-supervised learning framework tailored for network anomaly detection, ensuring that the model predictions of both stages align with specific constraints. We apply SADDE to two illustrative network anomaly detection tasks and conduct extensive evaluations in comparison with notable prior works. The experimental findings underscore that SADDE is capable of delivering precise detection results alongside dependable interpretations for semi-supervised network anomaly detection systems. The source code for SADDE is accessible at: this https URL.

Title: Enhancing Decision Transformer with Diffusion-Based Trajectory Branch Generation

Authors: Zhihong Liu, Long Qian, Zeyang Liu, Lipeng Wan, Xingyu Chen, Xuguang Lan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11327
Pdf URL: https://arxiv.org/pdf/2411.11327
Copy Paste: [[2411.11327]] Enhancing Decision Transformer with Diffusion-Based Trajectory Branch Generation(https://arxiv.org/abs/2411.11327)
Keywords: diffusion
Abstract: Decision Transformer (DT) can learn effective policy from offline datasets by converting the offline reinforcement learning (RL) into a supervised sequence modeling task, where the trajectory elements are generated auto-regressively conditioned on the return-to-go (RTG). However, the sequence modeling learning approach tends to learn policies that converge on the sub-optimal trajectories within the dataset, for lack of bridging data to move to better trajectories, even if the condition is set to the highest RTG. To address this issue, we introduce Diffusion-Based Trajectory Branch Generation (BG), which expands the trajectories of the dataset with branches generated by a diffusion model. The trajectory branch is generated based on the segment of the trajectory within the dataset, and leads to trajectories with higher returns. We concatenate the generated branch with the trajectory segment as an expansion of the trajectory. After expanding, DT has more opportunities to learn policies to move to better trajectories, preventing it from converging to the sub-optimal trajectories. After processing with BG, DT outperforms state-of-the-art sequence modeling methods on D4RL benchmark, demonstrating the effectiveness of adding branches to the dataset without further modifications.

Title: Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge

Authors: Qinglong Cao, Ding Wang, Xirui Li, Yuntian Chen, Chao Ma, Xiaokang Yang
Subjects: cs.CV, stat.AP
Abstract URL: https://arxiv.org/abs/2411.11343
Pdf URL: https://arxiv.org/pdf/2411.11343
Copy Paste: [[2411.11343]] Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge(https://arxiv.org/abs/2411.11343)
Keywords: diffusion
Abstract: Video diffusion models have exhibited tremendous progress in various video generation tasks. However, existing models struggle to capture latent physical knowledge, failing to infer physical phenomena that are challenging to articulate with natural language. Generating videos that follow the fundamental physical laws is still an open challenge. To address this challenge, we propose a novel method to teach video diffusion models with latent physical phenomenon knowledge, enabling the accurate generation of physically informed phenomena. Specifically, we first pretrain Masked Autoencoders (MAE) to reconstruct the physical phenomena, resulting in output embeddings that encapsulate latent physical phenomenon knowledge. Leveraging these embeddings, we generate the pseudo-language prompt features based on the aligned spatial relationships between CLIP vision and language encoders. In particular, given that diffusion models typically use CLIP's language encoder for text prompt embeddings, our approach integrates the CLIP visual features informed by latent physical knowledge into a quaternion hidden space. This enables the modeling of spatial relationships to produce physical knowledge-informed pseudo-language prompts. By incorporating these prompt features and fine-tuning the video diffusion model in a parameter-efficient manner, the physical knowledge-informed videos are successfully generated. We validate our method extensively through both numerical simulations and real-world observations of physical phenomena, demonstrating its remarkable performance across diverse scenarios.

Title: TL-CLIP: A Power-specific Multimodal Pre-trained Visual Foundation Model for Transmission Line Defect Recognition

Authors: Ke Zhang, Zhaoye Zheng, Yurong Guo, Jiacun Wang, Jiyuan Yang, Yangjie Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11370
Pdf URL: https://arxiv.org/pdf/2411.11370
Copy Paste: [[2411.11370]] TL-CLIP: A Power-specific Multimodal Pre-trained Visual Foundation Model for Transmission Line Defect Recognition(https://arxiv.org/abs/2411.11370)
Keywords: foundation model
Abstract: Transmission line defect recognition models have traditionally used general pre-trained weights as the initial basis for their training. These models often suffer from weak generalization capability due to the lack of domain knowledge in the pre-training dataset. To address this issue, we propose a two-stage transmission-line-oriented contrastive language-image pre-training (TL-CLIP) framework, which lays a more effective foundation for transmission line defect recognition. The pre-training process employs a novel power-specific multimodal algorithm assisted by two power-specific pre-training tasks to better model the power-related semantic knowledge contained in the inspection data. To fine-tune the pre-trained model, we develop a transfer learning strategy, namely fine-tuning with pre-training objective (FTP), to alleviate the overfitting problem caused by limited inspection data. Experimental results demonstrate that the proposed method significantly improves the performance of transmission line defect recognition in both classification and detection tasks, indicating clear advantages over traditional pre-trained models in transmission line inspection scenarios.

Title: LeC$^2$O-NeRF: Learning Continuous and Compact Large-Scale Occupancy for Urban Scenes

Authors: Zhenxing Mi, Dan Xu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.11374
Pdf URL: https://arxiv.org/pdf/2411.11374
Copy Paste: [[2411.11374]] LeC$^2$O-NeRF: Learning Continuous and Compact Large-Scale Occupancy for Urban Scenes(https://arxiv.org/abs/2411.11374)
Keywords: self-supervised
Abstract: In NeRF, a critical problem is to effectively estimate the occupancy to guide empty-space skipping and point sampling. Grid-based methods work well for small-scale scenes. However, on large-scale scenes, they are limited by predefined bounding boxes, grid resolutions, and high memory usage for grid updates, and thus struggle to speed up training for large-scale, irregularly bounded and complex urban scenes without sacrificing accuracy. In this paper, we propose to learn a continuous and compact large-scale occupancy network, which can classify 3D points as occupied or unoccupied points. We train this occupancy network end-to-end together with the radiance field in a self-supervised manner by three designs. First, we propose a novel imbalanced occupancy loss to regularize the occupancy network. It makes the occupancy network effectively control the ratio of unoccupied and occupied points, motivated by the prior that most of 3D scene points are unoccupied. Second, we design an imbalanced architecture containing a large scene network and a small empty space network to separately encode occupied and unoccupied points classified by the occupancy network. This imbalanced structure can effectively model the imbalanced nature of occupied and unoccupied regions. Third, we design an explicit density loss to guide the occupancy network, making the density of unoccupied points smaller. As far as we know, we are the first to learn a continuous and compact occupancy of large-scale NeRF by a network. In our experiments, our occupancy network can quickly learn more compact, accurate and smooth occupancy compared to the occupancy grid. With our learned occupancy as guidance for empty space skipping on challenging large-scale benchmarks, our method consistently obtains higher accuracy compared to the occupancy grid, and our method can speed up state-of-the-art NeRF methods without sacrificing accuracy.

Title: CLUE-MARK: Watermarking Diffusion Models using CLWE

Authors: Kareem Shehata, Aashish Kolluri, Prateek Saxena
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11434
Pdf URL: https://arxiv.org/pdf/2411.11434
Copy Paste: [[2411.11434]] CLUE-MARK: Watermarking Diffusion Models using CLWE(https://arxiv.org/abs/2411.11434)
Keywords: diffusion
Abstract: As AI-generated images become widespread, reliable watermarking is essential for content verification, copyright enforcement, and combating disinformation. Existing techniques rely on heuristic approaches and lack formal guarantees of undetectability, making them vulnerable to steganographic attacks that can expose or erase the watermark. Additionally, these techniques often degrade output quality by introducing perceptible changes, which is not only undesirable but an important barrier to adoption in practice. In this work, we introduce CLUE-Mark, the first provably undetectable watermarking scheme for diffusion models. CLUE-Mark requires no changes to the model being watermarked, is computationally efficient, and because it is provably undetectable is guaranteed to have no impact on model output quality. Our approach leverages the Continuous Learning With Errors (CLWE) problem -- a cryptographically hard lattice problem -- to embed watermarks in the latent noise vectors used by diffusion models. By proving undetectability via reduction to a cryptographically hard problem, we ensure that the watermark is imperceptible not only to human observers or ad hoc heuristics, but to any efficient detector that does not have the secret key. CLUE-Mark allows multiple keys to be embedded, enabling traceability of images to specific users without altering model parameters. Empirical evaluations on state-of-the-art diffusion models confirm that CLUE-Mark achieves high recoverability, preserves image quality, and is robust to minor perturbations such as JPEG compression and brightness adjustments. Uniquely, CLUE-Mark can be neither detected nor removed by recent steganographic attacks.

Title: The ADUULM-360 Dataset -- A Multi-Modal Dataset for Depth Estimation in Adverse Weather

Authors: Markus Schön, Jona Ruof, Thomas Wodtko, Michael Buchholz, Klaus Dietmayer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11455
Pdf URL: https://arxiv.org/pdf/2411.11455
Copy Paste: [[2411.11455]] The ADUULM-360 Dataset -- A Multi-Modal Dataset for Depth Estimation in Adverse Weather(https://arxiv.org/abs/2411.11455)
Keywords: self-supervised
Abstract: Depth estimation is an essential task toward full scene understanding since it allows the projection of rich semantic information captured by cameras into 3D space. While the field has gained much attention recently, datasets for depth estimation lack scene diversity or sensor modalities. This work presents the ADUULM-360 dataset, a novel multi-modal dataset for depth estimation. The ADUULM-360 dataset covers all established autonomous driving sensor modalities: cameras, lidars, and radars. It comprises a frontal-facing stereo setup, six surround cameras covering the full 360 degrees, two high-resolution long-range lidar sensors, and five long-range radar sensors. It is also the first depth estimation dataset that contains diverse scenes in good and adverse weather conditions. We conduct extensive experiments using state-of-the-art self-supervised depth estimation methods under different training tasks, such as monocular training, stereo training, and full surround training. Discussing these results, we demonstrate common limitations of state-of-the-art methods, especially in adverse weather conditions, which hopefully will inspire future research in this area. Our dataset, development kit, and trained baselines are available at this https URL.

Title: MGNiceNet: Unified Monocular Geometric Scene Understanding

Authors: Markus Schön, Michael Buchholz, Klaus Dietmayer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11466
Pdf URL: https://arxiv.org/pdf/2411.11466
Copy Paste: [[2411.11466]] MGNiceNet: Unified Monocular Geometric Scene Understanding(https://arxiv.org/abs/2411.11466)
Keywords: self-supervised
Abstract: Monocular geometric scene understanding combines panoptic segmentation and self-supervised depth estimation, focusing on real-time application in autonomous vehicles. We introduce MGNiceNet, a unified approach that uses a linked kernel formulation for panoptic segmentation and self-supervised depth estimation. MGNiceNet is based on the state-of-the-art real-time panoptic segmentation method RT-K-Net and extends the architecture to cover both panoptic segmentation and self-supervised monocular depth estimation. To this end, we introduce a tightly coupled self-supervised depth estimation predictor that explicitly uses information from the panoptic path for depth prediction. Furthermore, we introduce a panoptic-guided motion masking method to improve depth estimation without relying on video panoptic segmentation annotations. We evaluate our method on two popular autonomous driving datasets, Cityscapes and KITTI. Our model shows state-of-the-art results compared to other real-time methods and closes the gap to computationally more demanding methods. Source code and trained models are available at this https URL.

Title: MVLight: Relightable Text-to-3D Generation via Light-conditioned Multi-View Diffusion

Authors: Dongseok Shim, Yichun Shi, Kejie Li, H. Jin Kim, Peng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11475
Pdf URL: https://arxiv.org/pdf/2411.11475
Copy Paste: [[2411.11475]] MVLight: Relightable Text-to-3D Generation via Light-conditioned Multi-View Diffusion(https://arxiv.org/abs/2411.11475)
Keywords: diffusion, generative
Abstract: Recent advancements in text-to-3D generation, building on the success of high-performance text-to-image generative models, have made it possible to create imaginative and richly textured 3D objects from textual descriptions. However, a key challenge remains in effectively decoupling light-independent and lighting-dependent components to enhance the quality of generated 3D models and their relighting performance. In this paper, we present MVLight, a novel light-conditioned multi-view diffusion model that explicitly integrates lighting conditions directly into the generation process. This enables the model to synthesize high-quality images that faithfully reflect the specified lighting environment across multiple camera views. By applying this capability to Score Distillation Sampling (SDS), we can effectively synthesize 3D models with improved geometric precision and relighting capabilities. We validate the effectiveness of MVLight through extensive experiments and a user study.

Title: Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

Authors: Chenhang Cui, Gelei Deng, An Zhang, Jingnan Zheng, Yicong Li, Lianli Gao, Tianwei Zhang, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11496
Pdf URL: https://arxiv.org/pdf/2411.11496
Copy Paste: [[2411.11496]] Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models(https://arxiv.org/abs/2411.11496)
Keywords: generative
Abstract: Recent advances in Large Vision-Language Models (LVLMs) have showcased strong reasoning abilities across multiple modalities, achieving significant breakthroughs in various real-world applications. Despite this great success, the safety guardrail of LVLMs may not cover the unforeseen domains introduced by the visual modality. Existing studies primarily focus on eliciting LVLMs to generate harmful responses via carefully crafted image-based jailbreaks designed to bypass alignment defenses. In this study, we reveal that a safe image can be exploited to achieve the same jailbreak consequence when combined with additional safe images and prompts. This stems from two fundamental properties of LVLMs: universal reasoning capabilities and safety snowball effect. Building on these insights, we propose Safety Snowball Agent (SSA), a novel agent-based framework leveraging agents' autonomous and tool-using abilities to jailbreak LVLMs. SSA operates through two principal stages: (1) initial response generation, where tools generate or retrieve jailbreak images based on potential harmful intents, and (2) harmful snowballing, where refined subsequent prompts induce progressively harmful outputs. Our experiments demonstrate that SSA can use nearly any image to induce LVLMs to produce unsafe content, achieving high jailbreak success rates against the latest LVLMs. Unlike prior works that exploit alignment flaws, SSA leverages the inherent properties of LVLMs, presenting a profound challenge for enforcing safety in generative multimodal systems. Our code is available at this https URL.

Title: LaVin-DiT: Large Vision Diffusion Transformer

Authors: Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11505
Pdf URL: https://arxiv.org/pdf/2411.11505
Copy Paste: [[2411.11505]] LaVin-DiT: Large Vision Diffusion Transformer(https://arxiv.org/abs/2411.11505)
Keywords: diffusion, foundation model, generative, in-context
Abstract: This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models will be open-sourced.

Title: Learning a Neural Association Network for Self-supervised Multi-Object Tracking

Authors: Shuai Li, Michael Burke, Subramanian Ramamoorthy, Juergen Gall
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11514
Pdf URL: https://arxiv.org/pdf/2411.11514
Copy Paste: [[2411.11514]] Learning a Neural Association Network for Self-supervised Multi-Object Tracking(https://arxiv.org/abs/2411.11514)
Keywords: self-supervised
Abstract: This paper introduces a novel framework to learn data association for multi-object tracking in a self-supervised manner. Fully-supervised learning methods are known to achieve excellent tracking performance, but acquiring identity-level annotations is tedious and time-consuming. Motivated by the fact that in real-world scenarios object motion can usually be represented by a Markov process, we present a novel expectation maximization (EM) algorithm that trains a neural network to associate detections for tracking, without requiring prior knowledge of their temporal correspondences. At the core of our method lies a neural Kalman filter, with an observation model conditioned on associations of detections parameterized by a neural network. Given a batch of frames as input, data associations between detections from adjacent frames are predicted by a neural network followed by a Sinkhorn normalization that determines the assignment probabilities of detections to states. Kalman smoothing is then used to obtain the marginal probability of observations given the inferred states, producing a training objective to maximize this marginal probability using gradient descent. The proposed framework is fully differentiable, allowing the underlying neural model to be trained end-to-end. We evaluate our approach on the challenging MOT17 and MOT20 datasets and achieve state-of-the-art results in comparison to self-supervised trackers using public detections. We furthermore demonstrate the capability of the learned model to generalize across datasets.
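Illustrative sketch (not from the paper): the abstract mentions a Sinkhorn normalization that turns network outputs into assignment probabilities. The snippet below shows only the generic Sinkhorn iteration on a small, made-up score matrix. The function name `sinkhorn`, the iteration count, and the example scores are assumptions for illustration; the authors' network, Kalman filter, and training objective are not reproduced here.

```python
import math


def sinkhorn(scores, n_iters=50):
    """Generic Sinkhorn normalization of a square score matrix.

    Exponentiates the scores so every entry is positive, then alternates
    row and column normalization. For a square positive matrix this
    converges toward a doubly stochastic matrix, which can be read as
    soft assignment probabilities. Hypothetical helper, not the paper's code.
    """
    m = [[math.exp(s) for s in row] for row in scores]
    n_cols = len(m[0])
    for _ in range(n_iters):
        # Row step: make each row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make each column sum to 1.
        col_sums = [sum(row[j] for row in m) for j in range(n_cols)]
        m = [[v / col_sums[j] for j, v in enumerate(row)] for row in m]
    return m


# Made-up pairwise scores between three detections and three states.
example_scores = [
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.4],
    [0.1, 0.3, 1.8],
]
assignment = sinkhorn(example_scores)
```

Each row of `assignment` can then be read as a soft distribution over states for one detection. Because every step is a differentiable arithmetic operation, a normalization of this kind is compatible with the end-to-end training the abstract describes.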

Title: Cascaded Diffusion Models for 2D and 3D Microscopy Image Synthesis to Enhance Cell Segmentation

Authors: Rüveyda Yilmaz, Kaan Keven, Yuli Wu, Johannes Stegmaier
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.11515
Pdf URL: https://arxiv.org/pdf/2411.11515
Copy Paste: [[2411.11515]] Cascaded Diffusion Models for 2D and 3D Microscopy Image Synthesis to Enhance Cell Segmentation(https://arxiv.org/abs/2411.11515)
Keywords: diffusion
Abstract: Automated cell segmentation in microscopy images is essential for biomedical research, yet conventional methods are labor-intensive and prone to error. While deep learning-based approaches have proven effective, they often require large annotated datasets, which are scarce due to the challenges of manual annotation. To overcome this, we propose a novel framework for synthesizing densely annotated 2D and 3D cell microscopy images using cascaded diffusion models. Our method synthesizes 2D and 3D cell masks from sparse 2D annotations using multi-level diffusion models and NeuS, a 3D surface reconstruction approach. Following that, a pretrained 2D Stable Diffusion model is finetuned to generate realistic cell textures and the final outputs are combined to form cell populations. We show that training a segmentation model with a combination of our synthetic data and real data improves cell segmentation performance by up to 9% across multiple datasets. Additionally, the FID scores indicate that the synthetic data closely resembles real data. The code for our proposed approach will be available at this https URL\_diffusion.

Title: SeqProFT: Applying LoRA Finetuning for Sequence-only Protein Property Predictions

Authors: Shuo Zhang, Jian K. Liu
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.11530
Pdf URL: https://arxiv.org/pdf/2411.11530
Copy Paste: [[2411.11530]] SeqProFT: Applying LoRA Finetuning for Sequence-only Protein Property Predictions(https://arxiv.org/abs/2411.11530)
Keywords: self-supervised
Abstract: Protein language models (PLMs) are capable of learning the relationships between protein sequences and functions by treating amino acid sequences as textual data in a self-supervised manner. However, fine-tuning these models typically demands substantial computational resources and time, with results that may not always be optimized for specific tasks. To overcome these challenges, this study employs the LoRA method to perform end-to-end fine-tuning of the ESM-2 model specifically for protein property prediction tasks, utilizing only sequence information. Additionally, a multi-head attention mechanism is integrated into the downstream network to combine sequence features with contact map information, thereby enhancing the model's comprehension of protein sequences. Experiments on a broad set of classification and regression tasks demonstrate that the fine-tuned model achieves strong performance and faster convergence.
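
Illustrative sketch (not taken from the paper): LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) * B A, with B usually initialized to zero. The toy 4x4 matrix, rank 1, and the helper names `matmul` and `lora_weight` are assumptions for illustration and are unrelated to the actual ESM-2 layers.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A for W (d_out x d_in), B (d_out x r), A (r x d_in)."""
    r = len(a)  # rank of the update equals the number of rows in A
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical frozen 4x4 weight (identity) and a rank-1 adapter.
w = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[0.1, 0.2, 0.3, 0.4]]              # r x d_in, trainable
b_init = [[0.0], [0.0], [0.0], [0.0]]   # d_out x r, zero at initialization
b_trained = [[1.0], [0.0], [0.0], [0.0]]  # made-up values after some training

w_start = lora_weight(w, a, b_init, alpha=2.0)
w_adapted = lora_weight(w, a, b_trained, alpha=2.0)

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out.
n_lora_params = len(a) * (len(w[0]) + len(w))
```

With B at zero the adapted layer reproduces the pretrained one exactly, and in this toy case only 8 adapter values are trained instead of all 16 weights.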

Title: Generative Spatio-temporal GraphNet for Transonic Wing Pressure Distribution Forecasting

Authors: Gabriele Immordino, Andrea Vaiuso, Andrea Da Ronch, Marcello Righi
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2411.11592
Pdf URL: https://arxiv.org/pdf/2411.11592
Copy Paste: [[2411.11592]] Generative Spatio-temporal GraphNet for Transonic Wing Pressure Distribution Forecasting(https://arxiv.org/abs/2411.11592)
Keywords: generative
Abstract: This study presents a framework for predicting unsteady transonic wing pressure distributions, integrating an autoencoder architecture with graph convolutional networks and graph-based temporal layers to model time dependencies. The framework compresses high-dimensional pressure distribution data into a lower-dimensional latent space using an autoencoder, ensuring efficient data representation while preserving essential features. Within this latent space, graph-based temporal layers are employed to predict future wing pressures based on past data, effectively capturing temporal dependencies and improving predictive accuracy. This combined approach leverages the strengths of autoencoders for dimensionality reduction, graph convolutional networks for handling unstructured grid data, and temporal layers for modeling time-based sequences. The effectiveness of the proposed framework is validated through its application to the Benchmark Super Critical Wing test case, achieving accuracy comparable to computational fluid dynamics, while significantly reducing prediction time. This framework offers a scalable, computationally efficient solution for the aerodynamic analysis of unsteady phenomena.

Title: Leveraging Computational Pathology AI for Noninvasive Optical Imaging Analysis Without Retraining

Authors: Danny Barash, Emilie Manning, Aidan Van Vleck, Omri Hirsch, Kyi Lei Aye, Jingxi Li, Philip O. Scumpia, Aydogan Ozcan, Sumaira Aasi, Kerri E. Rieger, Kavita Y. Sarin, Oren Freifeld, Yonatan Winetraub
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11613
Pdf URL: https://arxiv.org/pdf/2411.11613
Copy Paste: [[2411.11613]] Leveraging Computational Pathology AI for Noninvasive Optical Imaging Analysis Without Retraining(https://arxiv.org/abs/2411.11613)
Keywords: foundation model
Abstract: Noninvasive optical imaging modalities can probe a patient's tissue in 3D and over time generate gigabytes of clinically relevant data per sample. There is a need for AI models to analyze this data and assist clinical workflow. The lack of expert labelers and the large dataset required (>100,000 images) for model training and tuning are the main hurdles in creating foundation models. In this paper we introduce FoundationShift, a method to apply any AI model from computational pathology without retraining. We show our method is more accurate than state-of-the-art models (SAM, MedSAM, SAM-Med2D, CellProfiler, Hover-Net, PLIP, UNI and ChatGPT), with multiple imaging modalities (OCT and RCM). This is achieved without the need for model retraining or fine-tuning. Applying our method to noninvasive in vivo images could enable physicians to readily incorporate optical imaging modalities into their clinical practice, providing real time tissue analysis and improving patient care.

Title: Chapter 7 Review of Data-Driven Generative AI Models for Knowledge Extraction from Scientific Literature in Healthcare

Authors: Leon Kopitar, Primoz Kocbek, Lucija Gosak, Gregor Stiglic
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11635
Pdf URL: https://arxiv.org/pdf/2411.11635
Copy Paste: [[2411.11635]] Chapter 7 Review of Data-Driven Generative AI Models for Knowledge Extraction from Scientific Literature in Healthcare(https://arxiv.org/abs/2411.11635)
Keywords: generative
Abstract: This review examines the development of abstractive NLP-based text summarization approaches and compares them to existing techniques for extractive summarization. A brief history of text summarization from the 1950s to the introduction of pre-trained language models such as Bidirectional Encoder Representations from Transformer (BERT) and Generative Pre-training Transformers (GPT) is presented. In total, 60 studies were identified in PubMed and Web of Science, of which 29 were excluded and 24 were read and evaluated for eligibility, resulting in the use of seven studies for further analysis. This chapter also includes a section with examples including an example of a comparison between GPT-3 and state-of-the-art GPT-4 solutions in scientific text summarisation. Natural language processing has not yet reached its full potential in the generation of brief textual summaries. As there are acknowledged concerns that must be addressed, we can expect the gradual introduction of such models in practice.

Title: TSINR: Capturing Temporal Continuity via Implicit Neural Representations for Time Series Anomaly Detection

Authors: Mengxuan Li, Ke Liu, Hongyang Chen, Jiajun Bu, Hongwei Wang, Haishuai Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11641
Pdf URL: https://arxiv.org/pdf/2411.11641
Copy Paste: [[2411.11641]] TSINR: Capturing Temporal Continuity via Implicit Neural Representations for Time Series Anomaly Detection(https://arxiv.org/abs/2411.11641)
Keywords: anomaly
Abstract: Time series anomaly detection aims to identify unusual patterns in data or deviations from systems' expected behavior. Reconstruction-based methods, which learn point-wise representations via unsupervised learning, are the mainstream approach to this task. However, the unlabeled anomaly points in training data may cause these reconstruction-based methods to learn and reconstruct anomalous data, making it difficult to capture normal patterns. In this paper, we propose a time series anomaly detection method based on implicit neural representation (INR) reconstruction, named TSINR, to address this challenge. Due to the property of spectral bias, TSINR prioritizes low-frequency signals and performs worse on high-frequency abnormal data. Specifically, we adopt INR to parameterize time series data as a continuous function and employ a transformer-based architecture to predict the INR of given data. As a result, the proposed TSINR method captures temporal continuity and is thus more sensitive to discontinuous anomaly data. In addition, we further design a novel form of INR continuous function to learn inter- and intra-channel information, and leverage a pre-trained large language model to amplify the intense fluctuations in anomalies. Extensive experiments demonstrate that TSINR achieves superior overall performance on both univariate and multivariate time series anomaly detection benchmarks compared to other state-of-the-art reconstruction-based methods. Our code is available.

Title: Conceptwm: A Diffusion Model Watermark for Concept Protection

Authors: Liangqi Lei, Keke Gai, Jing Yu, Liehuang Zhu, Qi Wu
Subjects: cs.CR, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2411.11688
Pdf URL: https://arxiv.org/pdf/2411.11688
Copy Paste: [[2411.11688]] Conceptwm: A Diffusion Model Watermark for Concept Protection(https://arxiv.org/abs/2411.11688)
Keywords: diffusion
Abstract: The personalization techniques of diffusion models succeed in generating specific concepts but also pose threats to copyright protection and illegal use. Model Watermarking is an effective method to prevent the unauthorized use of subject-driven or style-driven image generation, safeguarding concept copyrights. However, under the goal of concept-oriented protection, current watermarking schemes typically add watermarks to all images rather than applying them in a refined manner targeted at specific concepts. Additionally, the personalization techniques of diffusion models can easily remove watermarks. Existing watermarking methods struggle to achieve fine-grained watermark embedding with a few images of a specific concept and to prevent removal of watermarks through personalized fine-tuning. Therefore, we introduce a novel concept-oriented watermarking framework that seamlessly embeds imperceptible watermarks into the concept of diffusion models. We conduct extensive experiments and ablation studies to verify our framework. Our code is available at this https URL.

Title: Robust Reinforcement Learning under Diffusion Models for Data with Jumps

Authors: Chenyang Jiang, Donggyu Kim, Alejandra Quintos, Yazhen Wang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.11697
Pdf URL: https://arxiv.org/pdf/2411.11697
Copy Paste: [[2411.11697]] Robust Reinforcement Learning under Diffusion Models for Data with Jumps(https://arxiv.org/abs/2411.11697)
Keywords: diffusion
Abstract: Reinforcement Learning (RL) has proven effective in solving complex decision-making tasks across various domains, but challenges remain in continuous-time settings, particularly when state dynamics are governed by stochastic differential equations (SDEs) with jump components. In this paper, we address this challenge by introducing the Mean-Square Bipower Variation Error (MSBVE) algorithm, which enhances robustness and convergence in scenarios involving significant stochastic noise and jumps. We first revisit the Mean-Square TD Error (MSTDE) algorithm, commonly used in continuous-time RL, and highlight its limitations in handling jumps in state dynamics. The proposed MSBVE algorithm minimizes the mean-square quadratic variation error, offering improved performance over MSTDE in environments characterized by SDEs with jumps. Simulations and formal proofs demonstrate that the MSBVE algorithm reliably estimates the value function in complex settings, surpassing MSTDE's performance when faced with jump processes. These findings underscore the importance of alternative error metrics to improve the resilience and effectiveness of RL algorithms in continuous-time frameworks.

Title: Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

Authors: Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Bo Du, Dacheng Tao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.11727
Pdf URL: https://arxiv.org/pdf/2411.11727
Copy Paste: [[2411.11727]] Aligning Few-Step Diffusion Models with Dense Reward Difference Learning(https://arxiv.org/abs/2411.11727)
Keywords: diffusion
Abstract: Aligning diffusion models with downstream objectives is essential for their practical applications. However, standard alignment methods often struggle with step generalization when directly applied to few-step diffusion models, leading to inconsistent performance across different denoising step scenarios. To address this, we introduce Stepwise Diffusion Policy Optimization (SDPO), a novel alignment method tailored for few-step diffusion models. Unlike prior approaches that rely on a single sparse reward from only the final step of each denoising trajectory for trajectory-level optimization, SDPO incorporates dense reward feedback at every intermediate step. By learning the differences in dense rewards between paired samples, SDPO facilitates stepwise optimization of few-step diffusion models, ensuring consistent alignment across all denoising steps. To promote stable and efficient training, SDPO introduces an online reinforcement learning framework featuring several novel strategies designed to effectively exploit the stepwise granularity of dense rewards. Experimental results demonstrate that SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations, underscoring its robust step generalization capabilities. Code is available at this https URL.

Title: BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Authors: Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2411.11745
Pdf URL: https://arxiv.org/pdf/2411.11745
Copy Paste: [[2411.11745]] BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration(https://arxiv.org/abs/2411.11745)
Keywords: generative
Abstract: Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks. Yet the substantial memory footprint of LLMs significantly hinders their deployment. In this paper, we improve the accessibility of LLMs through BitMoD, an algorithm-hardware co-design solution that enables efficient LLM acceleration at low weight precision. On the algorithm side, BitMoD introduces fine-grained data type adaptation that uses a different numerical data type to quantize a group of (e.g., 128) weights. Through the careful design of these new data types, BitMoD is able to quantize LLM weights to very low precision (e.g., 4 bits and 3 bits) while maintaining high accuracy. On the hardware side, BitMoD employs a bit-serial processing element to easily support multiple numerical precisions and data types; our hardware design includes two key innovations: First, it employs a unified representation to process different weight data types, thus reducing the hardware cost. Second, it adopts a bit-serial dequantization unit to rescale the per-group partial sum with minimal hardware overhead. Our evaluation on six representative LLMs demonstrates that BitMoD significantly outperforms state-of-the-art LLM quantization and acceleration methods. For discriminative tasks, BitMoD can quantize LLM weights to 4-bit with $<\!0.5\%$ accuracy loss on average. For generative tasks, BitMoD is able to quantize LLM weights to 3-bit while achieving better perplexity than prior LLM quantization schemes. Combining the superior model performance with an efficient accelerator design, BitMoD achieves an average of $1.69\times$ and $1.48\times$ speedups compared to prior LLM accelerators ANT and OliVe, respectively.
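
Illustrative sketch (not taken from the paper): per-group data type adaptation can be pictured as trying several candidate value grids on each small group of weights and keeping whichever gives the lowest squared reconstruction error. The two grids in `GRIDS`, the group size of 4, the weight values, and the function names are all hypothetical; they are not BitMoD's actual data types or selection rule.

```python
def quantize_group(weights, grid):
    """Scale the grid to the group's largest magnitude, then round to the nearest level."""
    max_w = max(abs(w) for w in weights)
    max_g = max(abs(g) for g in grid)
    scale = max_w / max_g if max_w > 0 else 1.0
    deq = [scale * min(grid, key=lambda g: abs(w - scale * g)) for w in weights]
    err = sum((w - d) ** 2 for w, d in zip(weights, deq))
    return deq, err


def quantize_adaptive(weights, grids, group_size):
    """Pick, per group, the grid with the smallest squared error."""
    out, chosen = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        best = min(grids, key=lambda name: quantize_group(group, grids[name])[1])
        out.extend(quantize_group(group, grids[best])[0])
        chosen.append(best)
    return out, chosen


# Hypothetical 7-level grids standing in for two low-precision data types.
GRIDS = {
    "uniform": [-3, -2, -1, 0, 1, 2, 3],
    "power_of_two": [-4, -2, -1, 0, 1, 2, 4],
}
weights = [0.3, -0.2, 0.1, -0.3, 0.22, -0.18, 0.8, 0.4]
deq, chosen = quantize_adaptive(weights, GRIDS, group_size=4)
```

In this toy input the first group is evenly spread and fits the uniform grid, while the second has one large value and fits the power-of-two grid better, so the per-group choice never does worse than committing to a single grid for all weights.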

Title: LLM-IE: A Python Package for Generative Information Extraction with Large Language Models

Authors: Enshuo Hsu, Kirk Roberts
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11779
Pdf URL: https://arxiv.org/pdf/2411.11779
Copy Paste: [[2411.11779]] LLM-IE: A Python Package for Generative Information Extraction with Large Language Models(https://arxiv.org/abs/2411.11779)
Keywords: generative
Abstract: Objectives: Despite the recent adoption of large language models (LLMs) for biomedical information extraction, challenges in prompt engineering and algorithms persist, with no dedicated software available. To address this, we developed LLM-IE: a Python package for building complete information extraction pipelines. Our key innovation is an interactive LLM agent to support schema definition and prompt design. Materials and Methods: The LLM-IE supports named entity recognition, entity attribute extraction, and relation extraction tasks. We benchmarked on the i2b2 datasets and conducted a system evaluation. Results: The sentence-based prompting algorithm resulted in the best performance while requiring a longer inference time. System evaluation provided intuitive visualization. Discussion: LLM-IE was designed from practical NLP experience in healthcare and has been adopted in internal projects. It should hold great value to the biomedical NLP community. Conclusion: We developed a Python package, LLM-IE, that provides building blocks for robust information extraction pipeline construction.

Title: Generative World Explorer

Authors: Taiming Lu, Tianmin Shu, Alan Yuille, Daniel Khashabi, Jieneng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11844
Pdf URL: https://arxiv.org/pdf/2411.11844
Copy Paste: [[2411.11844]] Generative World Explorer(https://arxiv.org/abs/2411.11844)
Keywords: generative
Abstract: Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world. In contrast, humans can $\textit{imagine}$ unseen parts of the world through a mental exploration and $\textit{revise}$ their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the $\textit{Generative World Explorer (Genex)}$, an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train $\textit{Genex}$, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) $\textit{Genex}$ can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.