2024-12-12

Title: Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation

Authors: Dongjie Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.07797
Pdf URL: https://arxiv.org/pdf/2412.07797
Copy Paste: [[2412.07797]] Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation(https://arxiv.org/abs/2412.07797)
Keywords: transformer
Abstract: In the field of text-to-motion generation, Bert-type Masked Models (MoMask, MMM) currently produce higher-quality outputs compared to GPT-type autoregressive models (T2M-GPT). However, these Bert-type models often lack the streaming output capability required for applications in video game and multimedia environments, a feature inherent to GPT-type models. Additionally, they demonstrate weaker performance in out-of-distribution generation. To surpass the quality of BERT-type models while leveraging a GPT-type structure, without adding extra refinement models that complicate scaling data, we propose a novel architecture, Mogo (Motion Only Generate Once), which generates high-quality lifelike 3D human motions by training a single transformer model. Mogo consists of only two main components: 1) RVQ-VAE, a hierarchical residual vector quantization variational autoencoder, which discretizes continuous motion sequences with high precision; 2) Hierarchical Causal Transformer, responsible for generating the base motion sequences in an autoregressive manner while simultaneously inferring residuals across different layers. Experimental results demonstrate that Mogo can generate continuous and cyclic motion sequences up to 260 frames (13 seconds), surpassing the 196 frames (10 seconds) length limitation of existing datasets like HumanML3D. On the HumanML3D test set, Mogo achieves a FID score of 0.079, outperforming both the GPT-type model T2M-GPT (FID = 0.116), AttT2M (FID = 0.112) and the BERT-type model MMM (FID = 0.080). Furthermore, our model achieves the best quantitative performance in out-of-distribution generation.

Title: Boosting Alignment for Post-Unlearning Text-to-Image Generative Models

Authors: Myeongseob Ko, Henry Li, Zhun Wang, Jonathan Patsenker, Jiachen T. Wang, Qinbin Li, Ming Jin, Dawn Song, Ruoxi Jia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07808
Pdf URL: https://arxiv.org/pdf/2412.07808
Copy Paste: [[2412.07808]] Boosting Alignment for Post-Unlearning Text-to-Image Generative Models(https://arxiv.org/abs/2412.07808)
Keywords: diffusion, generative
Abstract: Large-scale generative models have shown impressive image-generation capabilities, propelled by massive data. However, this often inadvertently leads to the generation of harmful or inappropriate content and raises copyright concerns. Driven by these concerns, machine unlearning has become crucial to effectively purge undesirable knowledge from models. While existing literature has studied various unlearning techniques, these often suffer from either poor unlearning quality or degradation in text-image alignment after unlearning, due to the competitive nature of these objectives. To address these challenges, we propose a framework that seeks an optimal model update at each unlearning iteration, ensuring monotonic improvement on both objectives. We further derive the characterization of such an update. In addition, we design procedures to strategically diversify the unlearning and remaining datasets to boost performance improvement. Our evaluation demonstrates that our method effectively removes target classes from recent diffusion-based generative models and concepts from stable diffusion models while maintaining close alignment with the models' original trained states, thus outperforming state-of-the-art baselines. Our code will be made available at \url{this https URL}.

Title: Multi-Response Preference Optimization with Augmented Ranking Dataset

Authors: Hansle Gwon, Imjin Ahn, Young-Hak Kim, Sanghyun Park, Tae Joon Jun
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.07812
Pdf URL: https://arxiv.org/pdf/2412.07812
Copy Paste: [[2412.07812]] Multi-Response Preference Optimization with Augmented Ranking Dataset(https://arxiv.org/abs/2412.07812)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have been remarkable, with new models consistently surpassing their predecessors. These advancements are underpinned by extensive research on various training mechanisms. Among these, Preference Optimization has played a significant role in improving the performance of LLMs by incorporating human preferences into the training process. However, constructing preference optimization datasets is challenging and the optimization process is highly sensitive to the dataset quality. In this study, we propose a novel approach to augment Preference Optimization datasets. Additionally, we introduce a Multi-response-based Preference Optimization training method that enables the simultaneous learning of multiple responses.

Title: Intelligent System for Automated Molecular Patent Infringement Assessment

Authors: Yaorui Shi, Sihang Li, Taiyan Zhang, Xi Fang, Jiankun Wang, Zhiyuan Liu, Guojiang Zhao, Zhengdan Zhu, Zhifeng Gao, Renxin Zhong, Linfeng Zhang, Guolin Ke, Weinan E, Hengxing Cai, Xiang Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07819
Pdf URL: https://arxiv.org/pdf/2412.07819
Copy Paste: [[2412.07819]] Intelligent System for Automated Molecular Patent Infringement Assessment(https://arxiv.org/abs/2412.07819)
Keywords: large language model
Abstract: Automated drug discovery offers significant potential for accelerating the development of novel therapeutics by substituting labor-intensive human workflows with machine-driven processes. However, a critical bottleneck persists in the inability of current automated frameworks to assess whether newly designed molecules infringe upon existing patents, posing significant legal and financial risks. We introduce PatentFinder, a novel tool-enhanced and multi-agent framework that accurately and comprehensively evaluates small molecules for patent infringement. It incorporates both heuristic and model-based tools tailored for decomposed subtasks, featuring: MarkushParser, which is capable of optical chemical structure recognition of molecular and Markush structures, and MarkushMatcher, which enhances large language models' ability to extract substituent groups from molecules accurately. On our benchmark dataset MolPatent-240, PatentFinder outperforms baseline approaches that rely solely on large language models, demonstrating a 13.8\% increase in F1-score and a 12\% rise in accuracy. Experimental results demonstrate that PatentFinder mitigates label bias to produce balanced predictions and autonomously generates detailed, interpretable patent infringement reports. This work not only addresses a pivotal challenge in automated drug discovery but also demonstrates the potential of decomposing complex scientific tasks into manageable subtasks for specialized, tool-augmented agents.

Title: Hyperband-based Bayesian Optimization for Black-box Prompt Selection

Authors: Lennart Schneider, Martin Wistuba, Aaron Klein, Jacek Golebiowski, Giovanni Zappella, Felice Antonio Merra
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07820
Pdf URL: https://arxiv.org/pdf/2412.07820
Copy Paste: [[2412.07820]] Hyperband-based Bayesian Optimization for Black-box Prompt Selection(https://arxiv.org/abs/2412.07820)
Keywords: large language model
Abstract: Optimal prompt selection is crucial for maximizing large language model (LLM) performance on downstream tasks. As the most powerful models are proprietary and can only be invoked via an API, users often manually refine prompts in a black-box setting by adjusting instructions and few-shot examples until they achieve good performance as measured on a validation set. Recent methods addressing static black-box prompt selection face significant limitations: They often fail to leverage the inherent structure of prompts, treating instructions and few-shot exemplars as a single block of text. Moreover, they often lack query-efficiency by evaluating prompts on all validation instances, or risk sub-optimal selection of a prompt by using random subsets of validation instances. We introduce HbBoPs, a novel Hyperband-based Bayesian optimization method for black-box prompt selection addressing these key limitations. Our approach combines a structural-aware deep kernel Gaussian Process to model prompt performance with Hyperband as a multi-fidelity scheduler to select the number of validation instances for prompt evaluations. The structural-aware modeling approach utilizes separate embeddings for instructions and few-shot exemplars, enhancing the surrogate model's ability to capture prompt performance and predict which prompt to evaluate next in a sample-efficient manner. Together with Hyperband as a multi-fidelity scheduler we further enable query-efficiency by adaptively allocating resources across different fidelity levels, keeping the total number of validation instances prompts are evaluated on low. Extensive evaluation across ten benchmarks and three LLMs demonstrate that HbBoPs outperforms state-of-the-art methods.

Title: 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Authors: Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, Jieneng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.07825
Pdf URL: https://arxiv.org/pdf/2412.07825
Copy Paste: [[2412.07825]] 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark(https://arxiv.org/abs/2412.07825)
Keywords: robust
Abstract: 3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their capabilities to perform 3D spatial reasoning on diverse natural images are less studied. In this work we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct robust and thorough evaluation of 3D spatial reasoning capabilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench includes two subsets with 3D spatial reasoning questions on paired images with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, uncovering their limitations in various aspects of 3D awareness, such as height, orientation, location, and multi-object reasoning, as well as their degraded performance on images with uncommon camera viewpoints. Our 3DSRBench provide valuable findings and insights about the future development of LMMs with strong 3D reasoning capabilities. Our project page and dataset is available this https URL.

Title: Evaluating the Potential of Federated Learning for Maize Leaf Disease Prediction

Authors: Thalita Mendonça Antico, Larissa F. Rodrigues Moreira, Rodrigo Moreira
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07872
Pdf URL: https://arxiv.org/pdf/2412.07872
Copy Paste: [[2412.07872]] Evaluating the Potential of Federated Learning for Maize Leaf Disease Prediction(https://arxiv.org/abs/2412.07872)
Keywords: privacy, federate
Abstract: The diagnosis of diseases in food crops based on machine learning seemed satisfactory and suitable for use on a large scale. The Convolutional Neural Networks (CNNs) perform accurately in the disease prediction considering the image capture of the crop leaf, being extensively enhanced in the literature. These machine learning techniques fall short in data privacy, as they require sharing the data in the training process with a central server, disregarding competitive or regulatory concerns. Thus, Federated Learning (FL) aims to support distributed training to address recognized gaps in centralized training. As far as we know, this paper inaugurates the use and evaluation of FL applied in maize leaf diseases. We evaluated the performance of five CNNs trained under the distributed paradigm and measured their training time compared to the classification performance. In addition, we consider the suitability of distributed training considering the volume of network traffic and the number of parameters of each CNN. Our results indicate that FL potentially enhances data privacy in heterogeneous domains.

Title: Comparative Analysis of Deep Learning Approaches for Harmful Brain Activity Detection Using EEG

Authors: Shivraj Singh Bhatti, Aryan Yadav, Mitali Monga, Neeraj Kumar
Subjects: cs.LG, cs.AI, eess.SP, q-bio.NC
Abstract URL: https://arxiv.org/abs/2412.07878
Pdf URL: https://arxiv.org/pdf/2412.07878
Copy Paste: [[2412.07878]] Comparative Analysis of Deep Learning Approaches for Harmful Brain Activity Detection Using EEG(https://arxiv.org/abs/2412.07878)
Keywords: robust, transformer
Abstract: The classification of harmful brain activities, such as seizures and periodic discharges, play a vital role in neurocritical care, enabling timely diagnosis and intervention. Electroencephalography (EEG) provides a non-invasive method for monitoring brain activity, but the manual interpretation of EEG signals are time-consuming and rely heavily on expert judgment. This study presents a comparative analysis of deep learning architectures, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and EEGNet, applied to the classification of harmful brain activities using both raw EEG data and time-frequency representations generated through Continuous Wavelet Transform (CWT). We evaluate the performance of these models use multimodal data representations, including high-resolution spectrograms and waveform data, and introduce a multi-stage training strategy to improve model robustness. Our results show that training strategies, data preprocessing, and augmentation techniques are as critical to model success as architecture choice, with multi-stage TinyViT and EfficientNet demonstrating superior performance. The findings underscore the importance of robust training regimes in achieving accurate and efficient EEG classification, providing valuable insights for deploying AI models in clinical practice.

Title: Pix2Poly: A Sequence Prediction Method for End-to-end Polygonal Building Footprint Extraction from Remote Sensing Imagery

Authors: Yeshwanth Kumar Adimoolam, Charalambos Poullis, Melinos Averkiou
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.07899
Pdf URL: https://arxiv.org/pdf/2412.07899
Copy Paste: [[2412.07899]] Pix2Poly: A Sequence Prediction Method for End-to-end Polygonal Building Footprint Extraction from Remote Sensing Imagery(https://arxiv.org/abs/2412.07899)
Keywords: extraction, transformer, generative
Abstract: Extraction of building footprint polygons from remotely sensed data is essential for several urban understanding tasks such as reconstruction, navigation, and mapping. Despite significant progress in the area, extracting accurate polygonal building footprints remains an open problem. In this paper, we introduce Pix2Poly, an attention-based end-to-end trainable and differentiable deep neural network capable of directly generating explicit high-quality building footprints in a ring graph format. Pix2Poly employs a generative encoder-decoder transformer to produce a sequence of graph vertex tokens whose connectivity information is learned by an optimal matching network. Compared to previous graph learning methods, ours is a truly end-to-end trainable approach that extracts high-quality building footprints and road networks without requiring complicated, computationally intensive raster loss functions and intricate training pipelines. Upon evaluating Pix2Poly on several complex and challenging datasets, we report that Pix2Poly outperforms state-of-the-art methods in several vector shape quality metrics while being an entirely explicit method. Our code is available at this https URL.

Title: Score Change of Variables

Authors: Stephen Robbins
Subjects: cs.LG, cs.AI, math.PR
Abstract URL: https://arxiv.org/abs/2412.07904
Pdf URL: https://arxiv.org/pdf/2412.07904
Copy Paste: [[2412.07904]] Score Change of Variables(https://arxiv.org/abs/2412.07904)
Keywords: diffusion
Abstract: We derive a general change of variables formula for score functions, showing that for a smooth, invertible transformation $\mathbf{y} = \phi(\mathbf{x})$, the transformed score function $\nabla_{\mathbf{y}} \log q(\mathbf{y})$ can be expressed directly in terms of $\nabla_{\mathbf{x}} \log p(\mathbf{x})$. Using this result, we develop two applications: First, we establish a reverse-time Itô lemma for score-based diffusion models, allowing the use of $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ to reverse an SDE in the transformed space without directly learning $\nabla_{\mathbf{y}} \log q_t(\mathbf{y})$. This approach enables training diffusion models in one space but sampling in another, effectively decoupling the forward and reverse processes. Second, we introduce generalized sliced score matching, extending traditional sliced score matching from linear projections to arbitrary smooth transformations. This provides greater flexibility in high-dimensional density estimation. We demonstrate these theoretical advances through applications to diffusion on the probability simplex and empirically compare our generalized score matching approach against traditional sliced score matching methods.

Title: Rethinking Emotion Annotations in the Era of Large Language Models

Authors: Minxue Niu, Yara El-Tawil, Amrit Romana, Emily Mower Provost
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.07906
Pdf URL: https://arxiv.org/pdf/2412.07906
Copy Paste: [[2412.07906]] Rethinking Emotion Annotations in the Era of Large Language Models(https://arxiv.org/abs/2412.07906)
Keywords: large language model
Abstract: Modern affective computing systems rely heavily on datasets with human-annotated emotion labels, for training and evaluation. However, human annotations are expensive to obtain, sensitive to study design, and difficult to quality control, because of the subjective nature of emotions. Meanwhile, Large Language Models (LLMs) have shown remarkable performance on many Natural Language Understanding tasks, emerging as a promising tool for text annotation. In this work, we analyze the complexities of emotion annotation in the context of LLMs, focusing on GPT-4 as a leading model. In our experiments, GPT-4 achieves high ratings in a human evaluation study, painting a more positive picture than previous work, in which human labels served as the only ground truth. On the other hand, we observe differences between human and GPT-4 emotion perception, underscoring the importance of human input in annotation studies. To harness GPT-4's strength while preserving human perspective, we explore two ways of integrating GPT-4 into emotion annotation pipelines, showing its potential to flag low-quality labels, reduce the workload of human annotators, and improve downstream model learning performance and efficiency. Together, our findings highlight opportunities for new emotion labeling practices and suggest the use of LLMs as a promising tool to aid human annotation.

Title: Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

Authors: Can Yaras, Siyi Chen, Peng Wang, Qing Qu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.07909
Pdf URL: https://arxiv.org/pdf/2412.07909
Copy Paste: [[2412.07909]] Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning(https://arxiv.org/abs/2412.07909)
Keywords: generative
Abstract: Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive Language-Image Pretraining (CLIP) are designed to bridge different modalities, such as images and text, by learning a shared representation space through contrastive learning. Despite their success, the working mechanisms underlying multimodal learning are not yet well understood. Notably, these models often exhibit a modality gap, where different modalities occupy distinct regions within the shared representation space. In this work, we conduct an in-depth analysis of the emergence of modality gap by characterizing the gradient flow learning dynamics. Specifically, we identify the critical roles of mismatched data pairs and a learnable temperature parameter in causing and perpetuating the modality gap during training. Furthermore, our theoretical insights are validated through experiments on practical CLIP models. These findings provide principled guidance for mitigating the modality gap, including strategies such as appropriate temperature scheduling and modality swapping. Additionally, we demonstrate that closing the modality gap leads to improved performance on tasks such as image-text retrieval.

Title: Robust Multiple Description Neural Video Codec with Masked Transformer for Dynamic and Noisy Networks

Authors: Xinyue Hu, Wei Ye, Jiaxiang Tang, Eman Ramadan, Zhi-Li Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07922
Pdf URL: https://arxiv.org/pdf/2412.07922
Copy Paste: [[2412.07922]] Robust Multiple Description Neural Video Codec with Masked Transformer for Dynamic and Noisy Networks(https://arxiv.org/abs/2412.07922)
Keywords: robust, transformer
Abstract: Multiple Description Coding (MDC) is a promising error-resilient source coding method that is particularly suitable for dynamic networks with multiple (yet noisy and unreliable) paths. However, conventional MDC video codecs suffer from cumbersome architectures, poor scalability, limited loss resilience, and lower compression efficiency. As a result, MDC has never been widely adopted. Inspired by the potential of neural video codecs, this paper rethinks MDC design. We propose a novel MDC video codec, NeuralMDC, demonstrating how bidirectional transformers trained for masked token prediction can vastly simplify the design of MDC video codec. To compress a video, NeuralMDC starts by tokenizing each frame into its latent representation and then splits the latent tokens to create multiple descriptions containing correlated information. Instead of using motion prediction and warping operations, NeuralMDC trains a bidirectional masked transformer to model the spatial-temporal dependencies of latent representations and predict the distribution of the current representation based on the past. The predicted distribution is used to independently entropy code each description and infer any potentially lost tokens. Extensive experiments demonstrate NeuralMDC achieves state-of-the-art loss resilience with minimal sacrifices in compression efficiency, significantly outperforming the best existing residual-coding-based error-resilient neural video codec.

Title: Asking Again and Again: Exploring LLM Robustness to Repeated Questions

Authors: Sagi Shaier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.07923
Pdf URL: https://arxiv.org/pdf/2412.07923
Copy Paste: [[2412.07923]] Asking Again and Again: Exploring LLM Robustness to Repeated Questions(https://arxiv.org/abs/2412.07923)
Keywords: robust, large language model
Abstract: This study examines whether large language models (LLMs), such as ChatGPT, specifically the latest GPT-4o-mini, exhibit sensitivity to repeated prompts and whether repeating a question can improve response accuracy. We hypothesize that reiterating a question within a single prompt might enhance the model's focus on key elements of the query. To test this, we evaluate ChatGPT's performance on a large sample of two reading comprehension datasets under both open-book and closed-book settings, varying the repetition of each question to 1, 3, or 5 times per prompt. Our findings indicate that the model does not demonstrate sensitivity to repeated questions, highlighting its robustness and consistency in this context.

Title: Non-Normal Diffusion Models

Authors: Henry Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07935
Pdf URL: https://arxiv.org/pdf/2412.07935
Copy Paste: [[2412.07935]] Non-Normal Diffusion Models(https://arxiv.org/abs/2412.07935)
Keywords: diffusion, generative
Abstract: Diffusion models generate samples by incrementally reversing a process that turns data into noise. We show that when the step size goes to zero, the reversed process is invariant to the distribution of these increments. This reveals a previously unconsidered parameter in the design of diffusion models: the distribution of the diffusion step $\Delta x_k := x_{k} - x_{k + 1}$. This parameter is implicitly set by default to be normally distributed in most diffusion models. By lifting this assumption, we generalize the framework for designing diffusion models and establish an expanded class of diffusion processes with greater flexibility in the choice of loss function used during training. We demonstrate the effectiveness of these models on density estimation and generative modeling tasks on standard image datasets, and show that different choices of the distribution of $\Delta x_k$ result in qualitatively different generated samples.

Title: PGRID: Power Grid Reconstruction in Informal Developments Using High-Resolution Aerial Imagery

Authors: Simone Fobi Nsutezo, Amrita Gupta, Duncan Kebut, Seema Iyer, Luana Marotti, Rahul Dodhia, Juan M. Lavista Ferres, Anthony Ortiz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.07944
Pdf URL: https://arxiv.org/pdf/2412.07944
Copy Paste: [[2412.07944]] PGRID: Power Grid Reconstruction in Informal Developments Using High-Resolution Aerial Imagery(https://arxiv.org/abs/2412.07944)
Keywords: segmentation
Abstract: As of 2023, a record 117 million people have been displaced worldwide, more than double the number from a decade ago [22]. Of these, 32 million are refugees under the UNHCR mandate, with 8.7 million residing in refugee camps. A critical issue faced by these populations is the lack of access to electricity, with 80% of the 8.7 million refugees and displaced persons in camps globally relying on traditional biomass for cooking and lacking reliable power for essential tasks such as cooking and charging phones. Often, the burden of collecting firewood falls on women and children, who frequently travel up to 20 kilometers into dangerous areas, increasing their vulnerability.[7] Electricity access could significantly alleviate these challenges, but a major obstacle is the lack of accurate power grid infrastructure maps, particularly in resource-constrained environments like refugee camps, needed for energy access planning. Existing power grid maps are often outdated, incomplete, or dependent on costly, complex technologies, limiting their practicality. To address this issue, PGRID is a novel application-based approach, which utilizes high-resolution aerial imagery to detect electrical poles and segment electrical lines, creating precise power grid maps. PGRID was tested in the Turkana region of Kenya, specifically the Kakuma and Kalobeyei Camps, covering 84 km2 and housing over 200,000 residents. Our findings show that PGRID delivers high-fidelity power grid maps especially in unplanned settlements, with F1-scores of 0.71 and 0.82 for pole detection and line segmentation, respectively. This study highlights a practical application for leveraging open data and limited labels to improve power grid mapping in unplanned settlements, where the growing number of displaced persons urgently need sustainable energy infrastructure solutions.

Title: GPT-2 Through the Lens of Vector Symbolic Architectures

Authors: Johannes Knittel, Tushaar Gangavarapu, Hendrik Strobelt, Hanspeter Pfister
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07947
Pdf URL: https://arxiv.org/pdf/2412.07947
Copy Paste: [[2412.07947]] GPT-2 Through the Lens of Vector Symbolic Architectures(https://arxiv.org/abs/2412.07947)
Keywords: transformer
Abstract: Understanding the general priniciples behind transformer models remains a complex endeavor. Experiments with probing and disentangling features using sparse autoencoders (SAE) suggest that these models might manage linear features embedded as directions in the residual stream. This paper explores the resemblance between decoder-only transformer architecture and vector symbolic architectures (VSA) and presents experiments indicating that GPT-2 uses mechanisms involving nearly orthogonal vector bundling and binding operations similar to VSA for computation and communication between layers. It further shows that these principles help explain a significant portion of the actual neural weights.

Title: MOFHEI: Model Optimizing Framework for Fast and Efficient Homomorphically Encrypted Neural Network Inference

Authors: Parsa Ghazvinian, Robert Podschwadt, Prajwal Panzade, Mohammad H. Rafiei, Daniel Takabi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.07954
Pdf URL: https://arxiv.org/pdf/2412.07954
Copy Paste: [[2412.07954]] MOFHEI: Model Optimizing Framework for Fast and Efficient Homomorphically Encrypted Neural Network Inference(https://arxiv.org/abs/2412.07954)
Keywords: privacy
Abstract: Due to the extensive application of machine learning (ML) in a wide range of fields and the necessity of data privacy, privacy-preserving machine learning (PPML) solutions have recently gained significant traction. One group of approaches relies on Homomorphic Encryption (HE), which enables us to perform ML tasks over encrypted data. However, even with state-of-the-art HE schemes, HE operations are still significantly slower compared to their plaintext counterparts and require a considerable amount of memory. Therefore, we propose MOFHEI, a framework that optimizes the model to make HE-based neural network inference, referred to as private inference (PI), fast and efficient. First, our proposed learning-based method automatically transforms a pre-trained ML model into its compatible version with HE operations, called the HE-friendly version. Then, our iterative block pruning method prunes the model's parameters in configurable block shapes in alignment with the data packing method. This allows us to drop a significant number of costly HE operations, thereby reducing the latency and memory consumption while maintaining the model's performance. We evaluate our framework through extensive experiments on different models using various datasets. Our method achieves up to 98% pruning ratio on LeNet, eliminating up to 93% of the required HE operations for performing PI, reducing latency and the required memory by factors of 9.63 and 4.04, respectively, with negligible accuracy loss.

Title: Forking Paths in Neural Text Generation

Authors: Eric Bigelow, Ari Holtzman, Hidenori Tanaka, Tomer Ullman
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.07961
Pdf URL: https://arxiv.org/pdf/2412.07961
Copy Paste: [[2412.07961]] Forking Paths in Neural Text Generation(https://arxiv.org/abs/2412.07961)
Keywords: large language model
Abstract: Estimating uncertainty in Large Language Models (LLMs) is important for properly evaluating LLMs, and ensuring safety for users. However, prior approaches to uncertainty estimation focus on the final answer in generated text, ignoring intermediate steps that might dramatically impact the outcome. We hypothesize that there exist key forking tokens, such that re-sampling the system at those specific tokens, but not others, leads to very different outcomes. To test this empirically, we develop a novel approach to representing uncertainty dynamics across individual tokens of text generation, and applying statistical models to test our hypothesis. Our approach is highly flexible: it can be applied to any dataset and any LLM, without fine tuning or accessing model weights. We use our method to analyze LLM responses on 7 different tasks across 4 domains, spanning a wide range of typical use cases. We find many examples of forking tokens, including surprising ones such as punctuation marks, suggesting that LLMs are often just a single token away from saying something very different.

Title: Mayfly: Private Aggregate Insights from Ephemeral Streams of On-Device User Data

Authors: Christopher Bian, Albert Cheu, Stanislav Chiknavaryan, Zoe Gong, Marco Gruteser, Oliver Guinan, Yannis Guzman, Peter Kairouz, Artem Lagzdin, Ryan McKenna, Grace Ni, Edo Roth, Maya Spivak, Timon Van Overveldt, Ren Yi
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2412.07962
Pdf URL: https://arxiv.org/pdf/2412.07962
Copy Paste: [[2412.07962]] Mayfly: Private Aggregate Insights from Ephemeral Streams of On-Device User Data(https://arxiv.org/abs/2412.07962)
Keywords: privacy, federate
Abstract: This paper introduces Mayfly, a federated analytics approach enabling aggregate queries over ephemeral on-device data streams without central persistence of sensitive user data. Mayfly minimizes data via on-device windowing and contribution bounding through SQL-programmability, anonymizes user data via streaming differential privacy (DP), and mandates immediate in-memory cross-device aggregation on the server -- ensuring only privatized aggregates are revealed to data analysts. Deployed for a sustainability use case estimating transportation carbon emissions from private location data, Mayfly computed over 4 million statistics across more than 500 million devices with a per-device, per-week DP $\varepsilon = 2$ while meeting strict data utility requirements. To achieve this, we designed a new DP mechanism for Group-By-Sum workloads leveraging statistical properties of location data, with potential applicability to other domains.

Title: HalluCana: Fixing LLM Hallucination with A Canary Lookahead

Authors: Tianyi Li, Erenay Dayanik, Shubhi Tyagi, Andrea Pierleoni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.07965
Pdf URL: https://arxiv.org/pdf/2412.07965
Copy Paste: [[2412.07965]] HalluCana: Fixing LLM Hallucination with A Canary Lookahead(https://arxiv.org/abs/2412.07965)
Keywords: large language model
Abstract: In this paper, we present HalluCana, a canary lookahead to detect and correct factuality hallucinations of Large Language Models (LLMs) in long-form generation. HalluCana detects and intervenes as soon as traces of hallucination emerge, during and even before generation. To support timely detection, we exploit the internal factuality representation in the LLM hidden space, where we investigate various proxies to the LLMs' factuality self-assessment, and discuss its relation to the models' context familiarity from their pre-training. On biography generation, our method improves generation quality by up to 2.5x, while consuming over 6 times less compute.

Title: Balancing Shared and Task-Specific Representations: A Hybrid Approach to Depth-Aware Video Panoptic Segmentation

Authors: Kurt H.W. Stolle (Eindhoven University of Technology)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.07966
Pdf URL: https://arxiv.org/pdf/2412.07966
Copy Paste: [[2412.07966]] Balancing Shared and Task-Specific Representations: A Hybrid Approach to Depth-Aware Video Panoptic Segmentation(https://arxiv.org/abs/2412.07966)
Keywords: transformer, segmentation
Abstract: In this work, we present Multiformer, a novel approach to depth-aware video panoptic segmentation (DVPS) based on the mask transformer paradigm. Our method learns object representations that are shared across segmentation, monocular depth estimation, and object tracking subtasks. In contrast to recent unified approaches that progressively refine a common object representation, we propose a hybrid method using task-specific branches within each decoder block, ultimately fusing them into a shared representation at the block interfaces. Extensive experiments on the Cityscapes-DVPS and SemKITTI-DVPS datasets demonstrate that Multiformer achieves state-of-the-art performance across all DVPS metrics, outperforming previous methods by substantial margins. With a ResNet-50 backbone, Multiformer surpasses the previous best result by 3.0 DVPQ points while also improving depth estimation accuracy. Using a Swin-B backbone, Multiformer further improves performance by 4.0 DVPQ points. Multiformer also provides valuable insights into the design of multi-task decoder architectures.

Title: Distributed Gradient Descent with Many Local Steps in Overparameterized Models

Authors: Heng Zhu, Harsh Vardhan, Arya Mazumdar
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2412.07971
Pdf URL: https://arxiv.org/pdf/2412.07971
Copy Paste: [[2412.07971]] Distributed Gradient Descent with Many Local Steps in Overparameterized Models(https://arxiv.org/abs/2412.07971)
Keywords: federate, large language model
Abstract: In distributed training of machine learning models, gradient descent with local iterative steps is a very popular method, variants of which are commonly known as Local-SGD or the Federated Averaging (FedAvg). In this method, gradient steps based on local datasets are taken independently in distributed compute nodes to update the local models, which are then aggregated intermittently. Although the existing convergence analysis suggests that with heterogeneous data, FedAvg encounters quick performance degradation as the number of local steps increases, it is shown to work quite well in practice, especially in the distributed training of large language models. In this work we try to explain this good performance from a viewpoint of implicit bias in Local Gradient Descent (Local-GD) with a large number of local steps. In overparameterized regime, the gradient descent at each compute node would lead the model to a specific direction locally. We characterize the dynamics of the aggregated global model and compare it to the centralized model trained with all of the data in one place. In particular, we analyze the implicit bias of gradient descent on linear models, for both regression and classification tasks. Our analysis shows that the aggregated global model converges exactly to the centralized model for regression tasks, and converges (in direction) to the same feasible set as centralized model for classification tasks. We further propose a Modified Local-GD with a refined aggregation and theoretically show it converges to the centralized model in direction for linear classification. We empirically verified our theoretical findings in linear models and also conducted experiments on distributed fine-tuning of pretrained neural networks to further apply our theory.

Title: Phase-aware Training Schedule Simplifies Learning in Flow-Based Generative Models

Authors: Santiago Aranguri, Francesco Insulla
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.07972
Pdf URL: https://arxiv.org/pdf/2412.07972
Copy Paste: [[2412.07972]] Phase-aware Training Schedule Simplifies Learning in Flow-Based Generative Models(https://arxiv.org/abs/2412.07972)
Keywords: generative
Abstract: We analyze the training of a two-layer autoencoder used to parameterize a flow-based generative model for sampling from a high-dimensional Gaussian mixture. Previous work shows that the phase where the relative probability between the modes is learned disappears as the dimension goes to infinity without an appropriate time schedule. We introduce a time dilation that solves this problem. This enables us to characterize the learned velocity field, finding a first phase where the probability of each mode is learned and a second phase where the variance of each mode is learned. We find that the autoencoder representing the velocity field learns to simplify by estimating only the parameters relevant to each phase. Turning to real data, we propose a method that, for a given feature, finds intervals of time where training improves accuracy the most on that feature. Since practitioners take a uniform distribution over training times, our method enables more efficient training. We provide preliminary experiments validating this approach.

Title: AmCLR: Unified Augmented Learning for Cross-Modal Representations

Authors: Ajay Jagannath, Aayush Upadhyay, Anant Mehta
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.07979
Pdf URL: https://arxiv.org/pdf/2412.07979
Copy Paste: [[2412.07979]] AmCLR: Unified Augmented Learning for Cross-Modal Representations(https://arxiv.org/abs/2412.07979)
Keywords: robust
Abstract: Contrastive learning has emerged as a pivotal framework for representation learning, underpinning advances in both unimodal and bimodal applications like SimCLR and CLIP. To address fundamental limitations like large batch size dependency and bimodality, methods such as SogCLR leverage stochastic optimization for the global contrastive objective. Inspired by SogCLR's efficiency and adaptability, we introduce AmCLR and xAmCLR objective functions tailored for bimodal vision-language models to further enhance the robustness of contrastive learning. AmCLR integrates diverse augmentations, including text paraphrasing and image transformations, to reinforce the alignment of contrastive representations, keeping batch size limited to a few hundred samples unlike CLIP which needs batch size of 32,768 to produce reasonable results. xAmCLR further extends this paradigm by incorporating intra-modal alignments between original and augmented modalities for richer feature learning. These advancements yield a more resilient and generalizable contrastive learning process, aimed at overcoming bottlenecks in scaling and augmentative diversity. Since we have built our framework on the existing SogCLR, we are able to demonstrate improved representation quality with fewer computational resources, establishing a foundation for scalable and robust multi-modal learning.

Title: TTVD: Towards a Geometric Framework for Test-Time Adaptation Based on Voronoi Diagram

Authors: Mingxi Lei, Chunwei Ma, Meng Ding, Yufan Zhou, Ziyun Huang, Jinhui Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.07980
Pdf URL: https://arxiv.org/pdf/2412.07980
Copy Paste: [[2412.07980]] TTVD: Towards a Geometric Framework for Test-Time Adaptation Based on Voronoi Diagram(https://arxiv.org/abs/2412.07980)
Keywords: robust
Abstract: Deep learning models often struggle with generalization when deploying on real-world data, due to the common distributional shift to the training data. Test-time adaptation (TTA) is an emerging scheme used at inference time to address this issue. In TTA, models are adapted online at the same time when making predictions to test data. Neighbor-based approaches have gained attention recently, where prototype embeddings provide location information to alleviate the feature shift between training and testing data. However, due to their inherit limitation of simplicity, they often struggle to learn useful patterns and encounter performance degradation. To confront this challenge, we study the TTA problem from a geometric point of view. We first reveal that the underlying structure of neighbor-based methods aligns with the Voronoi Diagram, a classical computational geometry model for space partitioning. Building on this observation, we propose the Test-Time adjustment by Voronoi Diagram guidance (TTVD), a novel framework that leverages the benefits of this geometric property. Specifically, we explore two key structures: 1) Cluster-induced Voronoi Diagram (CIVD): This integrates the joint contribution of self-supervision and entropy-based methods to provide richer information. 2) Power Diagram (PD): A generalized version of the Voronoi Diagram that refines partitions by assigning weights to each Voronoi cell. Our experiments under rigid, peer-reviewed settings on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and ImageNet-R shows that TTVD achieves remarkable improvements compared to state-of-the-art methods. Moreover, extensive experimental results also explore the effects of batch size and class imbalance, which are two scenarios commonly encountered in real-world applications. These analyses further validate the robustness and adaptability of our proposed framework.

Title: Diffusion-Based Attention Warping for Consistent 3D Scene Editing

Authors: Eyal Gomel, Lior Wolf
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.07984
Pdf URL: https://arxiv.org/pdf/2412.07984
Copy Paste: [[2412.07984]] Diffusion-Based Attention Warping for Consistent 3D Scene Editing(https://arxiv.org/abs/2412.07984)
Keywords: diffusion
Abstract: We present a novel method for 3D scene editing using diffusion models, designed to ensure view consistency and realism across perspectives. Our approach leverages attention features extracted from a single reference image to define the intended edits. These features are warped across multiple views by aligning them with scene geometry derived from Gaussian splatting depth estimates. Injecting these warped features into other viewpoints enables coherent propagation of edits, achieving high fidelity and spatial alignment in 3D space. Extensive evaluations demonstrate the effectiveness of our method in generating versatile edits of 3D scenes, significantly advancing the capabilities of scene manipulation compared to the existing methods. Project page: \url{this https URL}

Title: Concept Bottleneck Large Language Models

Authors: Chung-En Sun, Tuomas Oikarinen, Berk Ustun, Tsui-Wei Weng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.07992
Pdf URL: https://arxiv.org/pdf/2412.07992
Copy Paste: [[2412.07992]] Concept Bottleneck Large Language Models(https://arxiv.org/abs/2412.07992)
Keywords: interpretability, large language model
Abstract: We introduce the Concept Bottleneck Large Language Model (CB-LLM), a pioneering approach to creating inherently interpretable Large Language Models (LLMs). Unlike traditional black-box LLMs that rely on post-hoc interpretation methods with limited neuron function insights, CB-LLM sets a new standard with its built-in interpretability, scalability, and ability to provide clear, accurate explanations. We investigate two essential tasks in the NLP domain: text classification and text generation. In text classification, CB-LLM narrows the performance gap with traditional black-box models and provides clear interpretability. In text generation, we show how interpretable neurons in CB-LLM can be used for concept detection and steering text generation. Our CB-LLMs enable greater interaction between humans and LLMs across a variety of tasks -- a feature notably absent in existing LLMs. Our code is available at this https URL.

Title: Enhancing Remote Adversarial Patch Attacks on Face Detectors with Tiling and Scaling

Authors: Masora Okano, Koichi Ito, Masakatsu Nishigaki, Tetsushi Ohki
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.07996
Pdf URL: https://arxiv.org/pdf/2412.07996
Copy Paste: [[2412.07996]] Enhancing Remote Adversarial Patch Attacks on Face Detectors with Tiling and Scaling(https://arxiv.org/abs/2412.07996)
Keywords: attack, extraction
Abstract: This paper discusses the attack feasibility of Remote Adversarial Patch (RAP) targeting face detectors. The RAP that targets face detectors is similar to the RAP that targets general object detectors, but the former has multiple issues in the attack process the latter does not. (1) It is possible to detect objects of various scales. In particular, the area of small objects that are convolved during feature extraction by CNN is small,so the area that affects the inference results is also small. (2) It is a two-class classification, so there is a large gap in characteristics between the classes. This makes it difficult to attack the inference results by directing them to a different class. In this paper, we propose a new patch placement method and loss function for each problem. The patches targeting the proposed face detector showed superior detection obstruct effects compared to the patches targeting the general object detector.

Title: Accurate Prediction of Temperature Indicators in Eastern China Using a Multi-Scale CNN-LSTM-Attention model

Authors: Jiajiang Shen, Weiyan Wu, Qianyu Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.07997
Pdf URL: https://arxiv.org/pdf/2412.07997
Copy Paste: [[2412.07997]] Accurate Prediction of Temperature Indicators in Eastern China Using a Multi-Scale CNN-LSTM-Attention model(https://arxiv.org/abs/2412.07997)
Keywords: extraction
Abstract: In recent years, the importance of accurate weather forecasting has become increasingly prominent due to the impacts of global climate change and the rapid development of data science. Traditional forecasting methods often struggle to handle the complexity and nonlinearity inherent in climate data. To address these challenges, we propose a weather prediction model based on a multi-scale convolutional CNN-LSTM-Attention architecture, specifically designed for time series forecasting of temperature data in China. The model integrates Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and attention mechanisms to leverage the strengths of spatial feature extraction, temporal sequence modeling, and the ability to focus on important features. The development process of the model includes data collection, preprocessing, feature extraction, and model building. Experimental results show that the model performs excellently in predicting temperature trends with high accuracy. The final computed results indicate that the Mean Squared Error (MSE) is 1.978295 and the Root Mean Squared Error (RMSE) is 0.8106562. This work marks a significant advancement in applying deep learning techniques to meteorological data, offering a valuable tool for improving weather forecasting accuracy and providing essential support for decision-making in areas such as urban planning, agriculture, and energy management.

Title: MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents

Authors: Yun Xing, Nhat Chung, Jie Zhang, Yue Cao, Ivor Tsang, Yang Liu, Lei Ma, Qing Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08014
Pdf URL: https://arxiv.org/pdf/2412.08014
Copy Paste: [[2412.08014]] MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents(https://arxiv.org/abs/2412.08014)
Keywords: attack, generative
Abstract: Physical adversarial attacks in driving scenarios can expose critical vulnerabilities in visual perception models. However, developing such attacks remains challenging due to diverse real-world backgrounds and the requirement for maintaining visual naturality. Building upon this challenge, we reformulate physical adversarial attacks as a one-shot patch-generation problem. Our approach generates adversarial patches through a deep generative model that considers the specific scene context, enabling direct physical deployment in matching environments. The primary challenge lies in simultaneously achieving two objectives: generating adversarial patches that effectively mislead object detection systems while determining contextually appropriate placement within the scene. We propose MAGIC (Mastering Physical Adversarial Generation In Context), a novel framework powered by multi-modal LLM agents to address these challenges. MAGIC automatically understands scene context and orchestrates adversarial patch generation through the synergistic interaction of language and vision capabilities. MAGIC orchestrates three specialized LLM agents: The adv-patch generation agent (GAgent) masters the creation of deceptive patches through strategic prompt engineering for text-to-image models. The adv-patch deployment agent (DAgent) ensures contextual coherence by determining optimal placement strategies based on scene understanding. The self-examination agent (EAgent) completes this trilogy by providing critical oversight and iterative refinement of both processes. We validate our method on both digital and physical level, \ie, nuImage and manually captured real scenes, where both statistical and visual results prove that our MAGIC is powerful and effectively for attacking wide-used object detection systems.

Title: GLL: A Differentiable Graph Learning Layer for Neural Networks

Authors: Jason Brown, Bohan Chen, Harris Hardiman-Mostow, Jeff Calder, Andrea L. Bertozzi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.08016
Pdf URL: https://arxiv.org/pdf/2412.08016
Copy Paste: [[2412.08016]] GLL: A Differentiable Graph Learning Layer for Neural Networks(https://arxiv.org/abs/2412.08016)
Keywords: attack, robust
Abstract: Standard deep learning architectures used for classification generate label predictions with a projection head and softmax activation function. Although successful, these methods fail to leverage the relational information between samples in the batch for generating label predictions. In recent works, graph-based learning techniques, namely Laplace learning, have been heuristically combined with neural networks for both supervised and semi-supervised learning (SSL) tasks. However, prior works approximate the gradient of the loss function with respect to the graph learning algorithm or decouple the processes; end-to-end integration with neural networks is not achieved. In this work, we derive backpropagation equations, via the adjoint method, for inclusion of a general family of graph learning layers into a neural network. This allows us to precisely integrate graph Laplacian-based label propagation into a neural network layer, replacing a projection head and softmax activation function for classification tasks. Using this new framework, our experimental results demonstrate smooth label transitions across data, improved robustness to adversarial attacks, improved generalization, and improved training dynamics compared to the standard softmax-based approach.

Title: TinyThinker: Distilling Reasoning through Coarse-to-Fine Knowledge Internalization with Self-Reflection

Authors: Shengmin Piao, Sanghyun Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08024
Pdf URL: https://arxiv.org/pdf/2412.08024
Copy Paste: [[2412.08024]] TinyThinker: Distilling Reasoning through Coarse-to-Fine Knowledge Internalization with Self-Reflection(https://arxiv.org/abs/2412.08024)
Keywords: large language model
Abstract: Large Language Models exhibit impressive reasoning capabilities across diverse tasks, motivating efforts to distill these capabilities into smaller models through generated reasoning data. However, direct training on such synthesized reasoning data may lead to superficial imitation of reasoning process, rather than fostering a genuine integration of reasoning capabilities with underlying knowledge. To address this, we propose TinyThinker, a framework introducing two novel approaches. First, we introduce a three-stage process that incrementally guides the student model through the reasoning process, progressively refining knowledge from coarse to fine granularity. Second, we develop a two-phase training framework comprising an initial reasoning acquisition phase followed by a self-reflection phase utilizing self-generated data. Experiments on commonsense reasoning benchmarks demonstrate that TinyThinker achieves superior performance compared to baselines. Ablation studies further validate the effectiveness of each component in our framework. TinyThinker is extendable to other knowledge-intensive reasoning tasks, offering an alternative strategy for developing effective reasoning capabilities in smaller language models. Codes are available at this https URL

Title: Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation

Authors: Zhigang Cen, Ningyan Guo, Wenjing Xu, Zhiyong Feng, Danlan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08034
Pdf URL: https://arxiv.org/pdf/2412.08034
Copy Paste: [[2412.08034]] Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation(https://arxiv.org/abs/2412.08034)
Keywords: segmentation
Abstract: Video semantic segmentation(VSS) has been widely employed in lots of fields, such as simultaneous localization and mapping, autonomous driving and surveillance. Its core challenge is how to leverage temporal information to achieve better segmentation. Previous efforts have primarily focused on pixel-level static-dynamic contexts matching, utilizing techniques such as optical flow and attention mechanisms. Instead, this paper rethinks static-dynamic contexts at the class level and proposes a novel static-dynamic class-level perceptual consistency (SD-CPC) framework. In this framework, we propose multivariate class prototype with contrastive learning and a static-dynamic semantic alignment module. The former provides class-level constraints for the model, obtaining personalized inter-class features and diversified intra-class features. The latter first establishes intra-frame spatial multi-scale and multi-level correlations to achieve static semantic alignment. Then, based on cross-frame static perceptual differences, it performs two-stage cross-frame selective aggregation to achieve dynamic semantic alignment. Meanwhile, we propose a window-based attention map calculation method that leverages the sparsity of attention points during cross-frame aggregation to reduce computation cost. Extensive experiments on VSPW and Cityscapes datasets show that the proposed approach outperforms state-of-the-art methods. Our implementation will be open-sourced on GitHub.

Title: Bootstrapping Heterogeneous Graph Representation Learning via Large Language Models: A Generalized Approach

Authors: Hang Gao, Chenhao Zhang, Fengge Wu, Junsuo Zhao, Changwen Zheng, Huaping Liu
Subjects: cs.LG, cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2412.08038
Pdf URL: https://arxiv.org/pdf/2412.08038
Copy Paste: [[2412.08038]] Bootstrapping Heterogeneous Graph Representation Learning via Large Language Models: A Generalized Approach(https://arxiv.org/abs/2412.08038)
Keywords: large language model
Abstract: Graph representation learning methods are highly effective in handling complex non-Euclidean data by capturing intricate relationships and features within graph structures. However, traditional methods face challenges when dealing with heterogeneous graphs that contain various types of nodes and edges due to the diverse sources and complex nature of the data. Existing Heterogeneous Graph Neural Networks (HGNNs) have shown promising results but require prior knowledge of node and edge types and unified node feature formats, which limits their applicability. Recent advancements in graph representation learning using Large Language Models (LLMs) offer new solutions by integrating LLMs' data processing capabilities, enabling the alignment of various graph representations. Nevertheless, these methods often overlook heterogeneous graph data and require extensive preprocessing. To address these limitations, we propose a novel method that leverages the strengths of both LLM and GNN, allowing for the processing of graph data with any format and type of nodes and edges without the need for type information or special preprocessing. Our method employs LLM to automatically summarize and classify different data formats and types, aligns node features, and uses a specialized GNN for targeted learning, thus obtaining effective graph representations for downstream tasks. Theoretical analysis and experimental validation have demonstrated the effectiveness of our method.

Title: Surveying Facial Recognition Models for Diverse Indian Demographics: A Comparative Analysis on LFW and Custom Dataset

Authors: Pranav Pant, Niharika Dadu, Harsh V. Singh, Anshul Thakur
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08048
Pdf URL: https://arxiv.org/pdf/2412.08048
Copy Paste: [[2412.08048]] Surveying Facial Recognition Models for Diverse Indian Demographics: A Comparative Analysis on LFW and Custom Dataset(https://arxiv.org/abs/2412.08048)
Keywords: segmentation
Abstract: Facial recognition technology has made significant advances, yet its effectiveness across diverse ethnic backgrounds, particularly in specific Indian demographics, is less explored. This paper presents a detailed evaluation of both traditional and deep learning-based facial recognition models using the established LFW dataset and our newly developed IITJ Faces of Academia Dataset (JFAD), which comprises images of students from IIT Jodhpur. This unique dataset is designed to reflect the ethnic diversity of India, providing a critical test bed for assessing model performance in a focused academic environment. We analyze models ranging from holistic approaches like Eigenfaces and SIFT to advanced hybrid models that integrate CNNs with Gabor filters, Laplacian transforms, and segmentation techniques. Our findings reveal significant insights into the models' ability to adapt to the ethnic variability within Indian demographics and suggest modifications to enhance accuracy and inclusivity in real-world applications. The JFAD not only serves as a valuable resource for further research but also highlights the need for developing facial recognition systems that perform equitably across diverse populations.

Title: M2SE: A Multistage Multitask Instruction Tuning Strategy for Unified Sentiment and Emotion Analysis

Authors: Ao Li, Longwei Xu, Chen Ling, Jinghui Zhang, Pengwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08049
Pdf URL: https://arxiv.org/pdf/2412.08049
Copy Paste: [[2412.08049]] M2SE: A Multistage Multitask Instruction Tuning Strategy for Unified Sentiment and Emotion Analysis(https://arxiv.org/abs/2412.08049)
Keywords: extraction, large language model
Abstract: Sentiment analysis and emotion recognition are crucial for applications such as human-computer interaction and depression detection. Traditional unimodal methods often fail to capture the complexity of emotional expressions due to conflicting signals from different modalities. Current Multimodal Large Language Models (MLLMs) also face challenges in detecting subtle facial expressions and addressing a wide range of emotion-related tasks. To tackle these issues, we propose M2SE, a Multistage Multitask Sentiment and Emotion Instruction Tuning Strategy for general-purpose MLLMs. It employs a combined approach to train models on tasks such as multimodal sentiment analysis, emotion recognition, facial expression recognition, emotion reason inference, and emotion cause-pair extraction. We also introduce the Emotion Multitask dataset (EMT), a custom dataset that supports these five tasks. Our model, Emotion Universe (EmoVerse), is built on a basic MLLM framework without modifications, yet it achieves substantial improvements across these tasks when trained with the M2SE strategy. Extensive experiments demonstrate that EmoVerse outperforms existing methods, achieving state-of-the-art results in sentiment and emotion tasks. These results highlight the effectiveness of M2SE in enhancing multimodal emotion perception. The dataset and code are available at this https URL.

Title: CANDOR: Counterfactual ANnotated DOubly Robust Off-Policy Evaluation

Authors: Aishwarya Mandyam, Shengpu Tang, Jiayu Yao, Jenna Wiens, Barbara E. Engelhardt
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.08052
Pdf URL: https://arxiv.org/pdf/2412.08052
Copy Paste: [[2412.08052]] CANDOR: Counterfactual ANnotated DOubly Robust Off-Policy Evaluation(https://arxiv.org/abs/2412.08052)
Keywords: robust
Abstract: Off-policy evaluation (OPE) provides safety guarantees by estimating the performance of a policy before deployment. Recent work introduced IS+, an importance sampling (IS) estimator that uses expert-annotated counterfactual samples to improve behavior dataset coverage. However, IS estimators are known to have high variance; furthermore, the performance of IS+ deteriorates when annotations are imperfect. In this work, we propose a family of OPE estimators inspired by the doubly robust (DR) principle. A DR estimator combines IS with a reward model estimate, known as the direct method (DM), and offers favorable statistical guarantees. We propose three strategies for incorporating counterfactual annotations into a DR-inspired estimator and analyze their properties under various realistic settings. We prove that using imperfect annotations in the DM part of the estimator best leverages the annotations, as opposed to using them in the IS part. To support our theoretical findings, we evaluate the proposed estimators in three contextual bandit environments. Our empirical results show that when the reward model is misspecified and the annotations are imperfect, it is most beneficial to use the annotations only in the DM portion of a DR estimator. Based on these theoretical and empirical insights, we provide a practical guide for using counterfactual annotations in different realistic settings.

Title: DynamicPAE: Generating Scene-Aware Physical Adversarial Examples in Real-Time

Authors: Jin Hu, Xianglong Liu, Jiakai Wang, Junkai Zhang, Xianqi Yang, Haotong Qin, Yuqing Ma, Ke Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08053
Pdf URL: https://arxiv.org/pdf/2412.08053
Copy Paste: [[2412.08053]] DynamicPAE: Generating Scene-Aware Physical Adversarial Examples in Real-Time(https://arxiv.org/abs/2412.08053)
Keywords: attack, robust, steal, generative
Abstract: Physical adversarial examples (PAEs) are regarded as "whistle-blowers" of real-world risks in deep-learning applications. However, current PAE generation studies show limited adaptive attacking ability to diverse and varying scenes. The key challenges in generating dynamic PAEs are exploring their patterns under noisy gradient feedback and adapting the attack to agnostic scenario natures. To address the problems, we present DynamicPAE, the first generative framework that enables scene-aware real-time physical attacks beyond static attacks. Specifically, to train the dynamic PAE generator under noisy gradient feedback, we introduce the residual-driven sample trajectory guidance technique, which redefines the training task to break the limited feedback information restriction that leads to the degeneracy problem. Intuitively, it allows the gradient feedback to be passed to the generator through a low-noise auxiliary task, thereby guiding the optimization away from degenerate solutions and facilitating a more comprehensive and stable exploration of feasible PAEs. To adapt the generator to agnostic scenario natures, we introduce the context-aligned scene expectation simulation process, consisting of the conditional-uncertainty-aligned data module and the skewness-aligned objective re-weighting module. The former enhances robustness in the context of incomplete observation by employing a conditional probabilistic model for domain randomization, while the latter facilitates consistent stealth control across different attack targets by automatically reweighting losses based on the skewness indicator. Extensive digital and physical evaluations demonstrate the superior attack performance of DynamicPAE, attaining a 1.95 $\times$ boost (65.55% average AP drop under attack) on representative object detectors (e.g., Yolo-v8) over state-of-the-art static PAE generating methods.

Title: Federated In-Context LLM Agent Learning

Authors: Panlong Wu, Kangshuo Li, Junbao Nan, Fangxin Wang
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.08054
Pdf URL: https://arxiv.org/pdf/2412.08054
Copy Paste: [[2412.08054]] Federated In-Context LLM Agent Learning(https://arxiv.org/abs/2412.08054)
Keywords: privacy, federate, large language model
Abstract: Large Language Models (LLMs) have revolutionized intelligent services by enabling logical reasoning, tool use, and interaction with external systems as agents. The advancement of LLMs is frequently hindered by the scarcity of high-quality data, much of which is inherently sensitive. Federated learning (FL) offers a potential solution by facilitating the collaborative training of distributed LLMs while safeguarding private data. However, FL frameworks face significant bandwidth and computational demands, along with challenges from heterogeneous data distributions. The emerging in-context learning capability of LLMs offers a promising approach by aggregating natural language rather than bulky model parameters. Yet, this method risks privacy leakage, as it necessitates the collection and presentation of data samples from various clients during aggregation. In this paper, we propose a novel privacy-preserving Federated In-Context LLM Agent Learning (FICAL) algorithm, which to our best knowledge for the first work unleashes the power of in-context learning to train diverse LLM agents through FL. In our design, knowledge compendiums generated by a novel LLM-enhanced Knowledge Compendiums Generation (KCG) module are transmitted between clients and the server instead of model parameters in previous FL methods. Apart from that, an incredible Retrieval Augmented Generation (RAG) based Tool Learning and Utilizing (TLU) module is designed and we incorporate the aggregated global knowledge compendium as a teacher to teach LLM agents the usage of tools. We conducted extensive experiments and the results show that FICAL has competitive performance compared to other SOTA baselines with a significant communication cost decrease of $\mathbf{3.33\times10^5}$ times.

Title: Cluster-Enhanced Federated Graph Neural Network for Recommendation

Authors: Haiyan Wang, Ye Yuan
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2412.08066
Pdf URL: https://arxiv.org/pdf/2412.08066
Copy Paste: [[2412.08066]] Cluster-Enhanced Federated Graph Neural Network for Recommendation(https://arxiv.org/abs/2412.08066)
Keywords: security, privacy, federate
Abstract: Personal interaction data can be effectively modeled as individual graphs for each user in recommender this http URL Neural Networks (GNNs)-based recommendation techniques have become extremely popular since they can capture high-order collaborative signals between users and items by aggregating the individual graph into a global interactive this http URL, this centralized approach inherently poses a threat to user privacy and security. Recently, federated GNN-based recommendation techniques have emerged as a promising solution to mitigate privacy concerns. Nevertheless, current implementations either limit on-device training to an unaccompanied individual graphs or necessitate reliance on an extra third-party server to touch other individual graphs, which also increases the risk of privacy leakage. To address this challenge, we propose a Cluster-enhanced Federated Graph Neural Network framework for Recommendation, named CFedGR, which introduces high-order collaborative signals to augment individual graphs in a privacy preserving manner. Specifically, the server clusters the pretrained user representations to identify high-order collaborative signals. In addition, two efficient strategies are devised to reduce communication between devices and the server. Extensive experiments on three benchmark datasets validate the effectiveness of our proposed methods.

Title: EM-Net: Gaze Estimation with Expectation Maximization Algorithm

Authors: Zhang Cheng, Yanxia Wang, Guoyu Xia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08074
Pdf URL: https://arxiv.org/pdf/2412.08074
Copy Paste: [[2412.08074]] EM-Net: Gaze Estimation with Expectation Maximization Algorithm(https://arxiv.org/abs/2412.08074)
Keywords: robust
Abstract: In recent years, the accuracy of gaze estimation techniques has gradually improved, but existing methods often rely on large datasets or large models to improve performance, which leads to high demands on computational resources. In terms of this issue, this paper proposes a lightweight gaze estimation model EM-Net based on deep learning and traditional machine learning algorithms Expectation Maximization algorithm. First, the proposed Global Attention Mechanism(GAM) is added to extract features related to gaze estimation to improve the model's ability to capture global dependencies and thus improve its performance. Second, by learning hierarchical feature representations through the EM module, the model has strong generalization ability, which reduces the need for sample size. Experiments have confirmed that, on the premise of using only 50% of the training data, EM-Net improves the performance of Gaze360, MPIIFaceGaze, and RT-Gene datasets by 2.2%, 2.02%, and 2.03%, respectively, compared with GazeNAS-ETH. It also shows good robustness in the face of Gaussian noise interference.

Title: Statistical Downscaling via High-Dimensional Distribution Matching with Generative Models

Authors: Zhong Yi Wan, Ignacio Lopez-Gomez, Robert Carver, Tapio Schneider, John Anderson, Fei Sha, Leonardo Zepeda-Núñez
Subjects: cs.LG, math.NA, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2412.08079
Pdf URL: https://arxiv.org/pdf/2412.08079
Copy Paste: [[2412.08079]] Statistical Downscaling via High-Dimensional Distribution Matching with Generative Models(https://arxiv.org/abs/2412.08079)
Keywords: generative
Abstract: Statistical downscaling is a technique used in climate modeling to increase the resolution of climate simulations. High-resolution climate information is essential for various high-impact applications, including natural hazard risk assessment. However, simulating climate at high resolution is intractable. Thus, climate simulations are often conducted at a coarse scale and then downscaled to the desired resolution. Existing downscaling techniques are either simulation-based methods with high computational costs, or statistical approaches with limitations in accuracy or application specificity. We introduce Generative Bias Correction and Super-Resolution (GenBCSR), a two-stage probabilistic framework for statistical downscaling that overcomes the limitations of previous methods. GenBCSR employs two transformations to match high-dimensional distributions at different resolutions: (i) the first stage, bias correction, aligns the distributions at coarse scale, (ii) the second stage, statistical super-resolution, lifts the corrected coarse distribution by introducing fine-grained details. Each stage is instantiated by a state-of-the-art generative model, resulting in an efficient and effective computational pipeline for the well-studied distribution matching problem. By framing the downscaling problem as distribution matching, GenBCSR relaxes the constraints of supervised learning, which requires samples to be aligned. Despite not requiring such correspondence, we show that GenBCSR surpasses standard approaches in predictive accuracy of critical impact variables, particularly in predicting the tails (99% percentile) of composite indexes composed of interacting variables, achieving up to 4-5 folds of error reduction.

Title: How to select slices for annotation to train best-performing deep learning segmentation models for cross-sectional medical images?

Authors: Yixin Zhang, Kevin Kramer, Maciej A. Mazurowski
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08081
Pdf URL: https://arxiv.org/pdf/2412.08081
Copy Paste: [[2412.08081]] How to select slices for annotation to train best-performing deep learning segmentation models for cross-sectional medical images?(https://arxiv.org/abs/2412.08081)
Keywords: segmentation
Abstract: Automated segmentation of medical images highly depends on the availability of accurate manual image annotations. Such annotations are very time-consuming and costly to generate, and often require specialized expertise, particularly for cross-sectional images which contain many slices for each patient. It is crucial to ensure the best use of annotation resources. In this paper, we systematically answer the question of how to select slices of cross-sectional medical images in order to maximize performance of the resulting deep learning segmentation models. We conducted experiments on 4 medical imaging segmentation tasks with varying annotation budgets, numbers of annotated cases, numbers of annotated slices per volume, slice selection techniques, and mask interpolations. We found that: 1) It is almost always preferable to annotate fewer slices per volume and more volumes given an annotation budget. 2) Selecting slices for annotation by unsupervised active learning (UAL) is not superior to selecting slices randomly or at fixed intervals, provided that each volume is allocated the same number of annotated slices. 3) Interpolating masks between annotated slices rarely enhances model performance, with exceptions of some specific configuration for 3D models.

Title: FaceTracer: Unveiling Source Identities from Swapped Face Images and Videos for Fraud Prevention

Authors: Zhongyi Zhang, Jie Zhang, Wenbo Zhou, Xinghui Zhou, Qing Guo, Weiming Zhang, Tianwei Zhang, Nenghai Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08082
Pdf URL: https://arxiv.org/pdf/2412.08082
Copy Paste: [[2412.08082]] FaceTracer: Unveiling Source Identities from Swapped Face Images and Videos for Fraud Prevention(https://arxiv.org/abs/2412.08082)
Keywords: attack, robust, watermark
Abstract: Face-swapping techniques have advanced rapidly with the evolution of deep learning, leading to widespread use and growing concerns about potential misuse, especially in cases of fraud. While many efforts have focused on detecting swapped face images or videos, these methods are insufficient for tracing the malicious users behind fraudulent activities. Intrusive watermark-based approaches also fail to trace unmarked identities, limiting their practical utility. To address these challenges, we introduce FaceTracer, the first non-intrusive framework specifically designed to trace the identity of the source person from swapped face images or videos. Specifically, FaceTracer leverages a disentanglement module that effectively suppresses identity information related to the target person while isolating the identity features of the source person. This allows us to extract robust identity information that can directly link the swapped face back to the original individual, aiding in uncovering the actors behind fraudulent activities. Extensive experiments demonstrate FaceTracer's effectiveness across various face-swapping techniques, successfully identifying the source person in swapped content and enabling the tracing of malicious actors involved in fraudulent activities. Additionally, FaceTracer shows strong transferability to unseen face-swapping methods including commercial applications and robustness against transmission distortions and adaptive attacks.

Title: A Systematic Literature Review on the NIS2 Directive

Authors: Jukka Ruohonen
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2412.08084
Pdf URL: https://arxiv.org/pdf/2412.08084
Copy Paste: [[2412.08084]] A Systematic Literature Review on the NIS2 Directive(https://arxiv.org/abs/2412.08084)
Keywords: security
Abstract: A directive known as NIS2 was enacted in the European Union (EU) in late 2022. It deals particularly with European critical infrastructures, enlarging their scope substantially from an older directive that only considered the energy and transport sectors as critical. The directive's focus is on cyber security of critical infrastructures, although together with other new EU laws it expands to other security domains as well. Given the importance of the directive and most of all the importance of critical infrastructures, the paper presents a systematic literature review on academic research addressing the NIS2 directive either explicitly or implicitly. According to the review, existing research has often framed and discussed the directive with the EU's other cyber security laws. In addition, existing research has often operated in numerous contextual areas, including industrial control systems, telecommunications, the energy and water sectors, and infrastructures for information sharing and situational awareness. Despite the large scope of existing research, the review reveals noteworthy research gaps and worthwhile topics to examine in further research.

Title: Multilingual LLMs Inherently Reward In-Language Time-Sensitive Semantic Alignment for Low-Resource Languages

Authors: Ashutosh Bajpai, Tanmoy Chakraborty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08090
Pdf URL: https://arxiv.org/pdf/2412.08090
Copy Paste: [[2412.08090]] Multilingual LLMs Inherently Reward In-Language Time-Sensitive Semantic Alignment for Low-Resource Languages(https://arxiv.org/abs/2412.08090)
Keywords: transformer, large language model
Abstract: The unwavering disparity in labeled resources between resource-rich languages and those considered low-resource remains a significant impediment for Large Language Models (LLMs). Recent strides in cross-lingual in-context learning (X-ICL), mainly through semantically aligned examples retrieved from multilingual pre-trained transformers, have shown promise in mitigating this issue. However, our investigation reveals that LLMs intrinsically reward in-language semantically aligned cross-lingual instances over direct cross-lingual semantic alignments, with a pronounced disparity in handling time-sensitive queries in the X-ICL setup. Such queries demand sound temporal reasoning ability from LLMs, yet the advancements have predominantly focused on English. This study aims to bridge this gap by improving temporal reasoning capabilities in low-resource languages. To this end, we introduce mTEMPREASON a temporal reasoning dataset aimed at the varied degrees of low-resource languages and propose Cross-Lingual Time-Sensitive Semantic Alignment (CLiTSSA), a novel method to improve temporal reasoning in these contexts. To facilitate this, we construct an extension of mTEMPREASON comprising pairs of parallel cross-language temporal queries along with their anticipated in-language semantic similarity scores. Our empirical evidence underscores the superior performance of CLiTSSA compared to established baselines across three languages - Romanian, German, and French, encompassing three temporal tasks and including a diverse set of four contemporaneous LLMs. This marks a significant step forward in addressing resource disparity in the context of temporal reasoning across languages.

Title: Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting

Authors: Fuqiang Liu, Sicong Jiang, Luis Miranda-Moreno, Seongjin Choi, Lijun Sun
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.08099
Pdf URL: https://arxiv.org/pdf/2412.08099
Copy Paste: [[2412.08099]] Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting(https://arxiv.org/abs/2412.08099)
Keywords: defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) have recently demonstrated significant potential in the field of time series forecasting, offering impressive capabilities in handling complex temporal data. However, their robustness and reliability in real-world applications remain under-explored, particularly concerning their susceptibility to adversarial attacks. In this paper, we introduce a targeted adversarial attack framework for LLM-based time series forecasting. By employing both gradient-free and black-box optimization methods, we generate minimal yet highly effective perturbations that significantly degrade the forecasting accuracy across multiple datasets and LLM architectures. Our experiments, which include models like TimeGPT and LLM-Time with GPT-3.5, GPT-4, LLaMa, and Mistral, show that adversarial attacks lead to much more severe performance degradation than random noise, and demonstrate the broad effectiveness of our attacks across different LLMs. The results underscore the critical vulnerabilities of LLMs in time series forecasting, highlighting the need for robust defense mechanisms to ensure their reliable deployment in practical applications.

Title: Generative Zoo

Authors: Tomasz Niewiadomski, Anastasios Yiannakidis, Hanz Cuevas-Velasquez, Soubhik Sanyal, Michael J. Black, Silvia Zuffi, Peter Kulits
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08101
Pdf URL: https://arxiv.org/pdf/2412.08101
Copy Paste: [[2412.08101]] Generative Zoo(https://arxiv.org/abs/2412.08101)
Keywords: generative
Abstract: The model-based estimation of 3D animal pose and shape from images enables computational modeling of animal behavior. Training models for this purpose requires large amounts of labeled image data with precise pose and shape annotations. However, capturing such data requires the use of multi-view or marker-based motion-capture systems, which are impractical to adapt to wild animals in situ and impossible to scale across a comprehensive set of animal species. Some have attempted to address the challenge of procuring training data by pseudo-labeling individual real-world images through manual 2D annotation, followed by 3D-parameter optimization to those labels. While this approach may produce silhouette-aligned samples, the obtained pose and shape parameters are often implausible due to the ill-posed nature of the monocular fitting problem. Sidestepping real-world ambiguity, others have designed complex synthetic-data-generation pipelines leveraging video-game engines and collections of artist-designed 3D assets. Such engines yield perfect ground-truth annotations but are often lacking in visual realism and require considerable manual effort to adapt to new species or environments. Motivated by these shortcomings, we propose an alternative approach to synthetic-data generation: rendering with a conditional image-generation model. We introduce a pipeline that samples a diverse set of poses and shapes for a variety of mammalian quadrupeds and generates realistic images with corresponding ground-truth pose and shape parameters. To demonstrate the scalability of our approach, we introduce GenZoo, a synthetic dataset containing one million images of distinct subjects. We train a 3D pose and shape regressor on GenZoo, which achieves state-of-the-art performance on a real-world animal pose and shape estimation benchmark, despite being trained solely on synthetic data. this https URL

Title: Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation

Authors: Hee-Seon Kim, Minbeom Kim, Changick Kim
Subjects: cs.CV, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.08108
Pdf URL: https://arxiv.org/pdf/2412.08108
Copy Paste: [[2412.08108]] Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation(https://arxiv.org/abs/2412.08108)
Keywords: attack, robust, large language model
Abstract: Large Vision-Language Models (VLMs) have demonstrated remarkable performance across multimodal tasks by integrating vision encoders with large language models (LLMs). However, these models remain vulnerable to adversarial attacks. Among such attacks, Universal Adversarial Perturbations (UAPs) are especially powerful, as a single optimized perturbation can mislead the model across various input images. In this work, we introduce a novel UAP specifically designed for VLMs: the Doubly-Universal Adversarial Perturbation (Doubly-UAP), capable of universally deceiving VLMs across both image and text inputs. To successfully disrupt the vision encoder's fundamental process, we analyze the core components of the attention mechanism. After identifying value vectors in the middle-to-late layers as the most vulnerable, we optimize Doubly-UAP in a label-free manner with a frozen model. Despite being developed as a black-box to the LLM, Doubly-UAP achieves high attack success rates on VLMs, consistently outperforming baseline methods across vision-language tasks. Extensive ablation studies and analyses further demonstrate the robustness of Doubly-UAP and provide insights into how it influences internal attention mechanisms.

Title: Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses

Authors: Jiayun Luo, Mir Rayat Imtiaz Hossain, Boyang Li, Leonid Sigal
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08110
Pdf URL: https://arxiv.org/pdf/2412.08110
Copy Paste: [[2412.08110]] Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses(https://arxiv.org/abs/2412.08110)
Keywords: segmentation
Abstract: Vision-Language Models (VLMs) achieved strong performance on a variety of tasks (e.g., image-text retrieval, visual question answering). However, most VLMs rely on coarse-grained image-caption pairs for alignment, relying on data volume to resolve ambiguities and ground linguistic concepts in images. The richer semantic and syntactic structure within text is largely overlooked. To address this, we propose HIerarchically STructured Learning (HIST) that enhances VLM training without any additional supervision, by hierarchically decomposing captions into the constituent Subject, Noun Phrases, and Composite Phrases. Entailment between these constituent components allows us to formulate additional regularization constraints on the VLM attention maps. Specifically, we introduce two novel loss functions: (1) Subject Loss, which aligns image content with the subject of corresponding phrase, acting as an entailment of standard contrastive/matching losses at the Phrase level; (2) Addition Loss, to balance attention across multiple objects. HIST is general, and can be applied to any VLM for which attention between vision and language can be computed; we illustrate its efficacy on BLIP and ALBEF. HIST outperforms baseline VLMs, achieving up to +9.8% improvement in visual grounding, +6.3% in multi-object referring segmentation, +1.1% in image-text retrieval, and +0.2% in visual question answering, underscoring the value of structuring learning in VLMs.

Title: DAKD: Data Augmentation and Knowledge Distillation using Diffusion Models for SAR Oil Spill Segmentation

Authors: Jaeho Moon, Jeonghwan Yun, Jaehyun Kim, Jaehyup Lee, Munchurl Kim
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08116
Pdf URL: https://arxiv.org/pdf/2412.08116
Copy Paste: [[2412.08116]] DAKD: Data Augmentation and Knowledge Distillation using Diffusion Models for SAR Oil Spill Segmentation(https://arxiv.org/abs/2412.08116)
Keywords: robust, diffusion, segmentation
Abstract: Oil spills in the ocean pose severe environmental risks, making early detection essential. Synthetic aperture radar (SAR) based oil spill segmentation offers robust monitoring under various conditions but faces challenges due to the limited labeled data and inherent speckle noise in SAR imagery. To address these issues, we propose (i) a diffusion-based Data Augmentation and Knowledge Distillation (DAKD) pipeline and (ii) a novel SAR oil spill segmentation network, called SAROSS-Net. In our DAKD pipeline, we present a diffusion-based SAR-JointNet that learns to generate realistic SAR images and their labels for segmentation, by effectively modeling joint distribution with balancing two modalities. The DAKD pipeline augments the training dataset and distills knowledge from SAR-JointNet by utilizing generated soft labels (pixel-wise probability maps) to supervise our SAROSS-Net. The SAROSS-Net is designed to selectively transfer high-frequency features from noisy SAR images, by employing novel Context-Aware Feature Transfer blocks along skip connections. We demonstrate our SAR-JointNet can generate realistic SAR images and well-aligned segmentation labels, providing the augmented data to train SAROSS-Net with enhanced generalizability. Our SAROSS-Net trained with the DAKD pipeline significantly outperforms existing SAR oil spill segmentation methods with large margins.

Title: DiffRaman: A Conditional Latent Denoising Diffusion Probabilistic Model for Bacterial Raman Spectroscopy Identification Under Limited Data Conditions

Authors: Haiming Yao, Wei Luo, Ang Gao, Tao Zhou, Xue Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08131
Pdf URL: https://arxiv.org/pdf/2412.08131
Copy Paste: [[2412.08131]] DiffRaman: A Conditional Latent Denoising Diffusion Probabilistic Model for Bacterial Raman Spectroscopy Identification Under Limited Data Conditions(https://arxiv.org/abs/2412.08131)
Keywords: diffusion, generative
Abstract: Raman spectroscopy has attracted significant attention in various biochemical detection fields, especially in the rapid identification of pathogenic bacteria. The integration of this technology with deep learning to facilitate automated bacterial Raman spectroscopy diagnosis has emerged as a key focus in recent research. However, the diagnostic performance of existing deep learning methods largely depends on a sufficient dataset, and in scenarios where there is a limited availability of Raman spectroscopy data, it is inadequate to fully optimize the numerous parameters of deep neural networks. To address these challenges, this paper proposes a data generation method utilizing deep generative models to expand the data volume and enhance the recognition accuracy of bacterial Raman spectra. Specifically, we introduce DiffRaman, a conditional latent denoising diffusion probability model for Raman spectra generation. Experimental results demonstrate that synthetic bacterial Raman spectra generated by DiffRaman can effectively emulate real experimental spectra, thereby enhancing the performance of diagnostic models, especially under conditions of limited data. Furthermore, compared to existing generative models, the proposed DiffRaman offers improvements in both generation quality and computational efficiency. Our DiffRaman approach offers a well-suited solution for automated bacteria Raman spectroscopy diagnosis in data-scarce scenarios, offering new insights into alleviating the labor of spectroscopic measurements and enhancing rare bacteria identification.

Title: Learn How to Query from Unlabeled Data Streams in Federated Learning

Authors: Yuchang Sun, Xinran Li, Tao Lin, Jun Zhang
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2412.08138
Pdf URL: https://arxiv.org/pdf/2412.08138
Copy Paste: [[2412.08138]] Learn How to Query from Unlabeled Data Streams in Federated Learning(https://arxiv.org/abs/2412.08138)
Keywords: privacy, federate
Abstract: Federated learning (FL) enables collaborative learning among decentralized clients while safeguarding the privacy of their local data. Existing studies on FL typically assume offline labeled data available at each client when the training starts. Nevertheless, the training data in practice often arrive at clients in a streaming fashion without ground-truth labels. Given the expensive annotation cost, it is critical to identify a subset of informative samples for labeling on clients. However, selecting samples locally while accommodating the global training objective presents a challenge unique to FL. In this work, we tackle this conundrum by framing the data querying process in FL as a collaborative decentralized decision-making problem and proposing an effective solution named LeaDQ, which leverages multi-agent reinforcement learning algorithms. In particular, under the implicit guidance from global information, LeaDQ effectively learns the local policies for distributed clients and steers them towards selecting samples that can enhance the global model's accuracy. Extensive simulations on image and text tasks show that LeaDQ advances the model performance in various FL scenarios, outperforming the benchmarking algorithms.

Title: A Survey on Private Transformer Inference

Authors: Yang Li, Xinyu Zhou, Yitong Wang, Liangxin Qian, Jun Zhao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08145
Pdf URL: https://arxiv.org/pdf/2412.08145
Copy Paste: [[2412.08145]] A Survey on Private Transformer Inference(https://arxiv.org/abs/2412.08145)
Keywords: secure, privacy, transformer
Abstract: Transformer models have revolutionized AI, enabling applications like content generation and sentiment analysis. However, their use in Machine Learning as a Service (MLaaS) raises significant privacy concerns, as centralized servers process sensitive user data. Private Transformer Inference (PTI) addresses these issues using cryptographic techniques such as Secure Multi-Party Computation (MPC) and Homomorphic Encryption (HE), enabling secure model inference without exposing inputs or models. This paper reviews recent advancements in PTI, analyzing state-of-the-art solutions, their challenges, and potential improvements. We also propose evaluation guidelines to assess resource efficiency and privacy guarantees, aiming to bridge the gap between high-performance inference and data privacy.

Title: How to Weight Multitask Finetuning? Fast Previews via Bayesian Model-Merging

Authors: Hugo Monzón Maldonado, Thomas Möllenhoff, Nico Daheim, Iryna Gurevych, Mohammad Emtiyaz Khan
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.08147
Pdf URL: https://arxiv.org/pdf/2412.08147
Copy Paste: [[2412.08147]] How to Weight Multitask Finetuning? Fast Previews via Bayesian Model-Merging(https://arxiv.org/abs/2412.08147)
Keywords: transformer
Abstract: When finetuning multiple tasks altogether, it is important to carefully weigh them to get a good performance, but searching for good weights can be difficult and costly. Here, we propose to aid the search with fast previews to quickly get a rough idea of different reweighting options. We use model merging to create previews by simply reusing and averaging parameters of models trained on each task separately (no retraining required). To improve the quality of previews, we propose a Bayesian approach to design new merging strategies by using more flexible posteriors. We validate our findings on vision and natural-language transformers. Our work shows the benefits of model merging via Bayes to improve multitask finetuning.

Title: A Review of Intelligent Device Fault Diagnosis Technologies Based on Machine Vision

Authors: Guiran Liu, Binrong Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08148
Pdf URL: https://arxiv.org/pdf/2412.08148
Copy Paste: [[2412.08148]] A Review of Intelligent Device Fault Diagnosis Technologies Based on Machine Vision(https://arxiv.org/abs/2412.08148)
Keywords: robust, transformer
Abstract: This paper provides a comprehensive review of mechanical equipment fault diagnosis methods, focusing on the advancements brought by Transformer-based models. It details the structure, working principles, and benefits of Transformers, particularly their self-attention mechanism and parallel computation capabilities, which have propelled their widespread application in natural language processing and computer vision. The discussion highlights key Transformer model variants, such as Vision Transformers (ViT) and their extensions, which leverage self-attention to improve accuracy and efficiency in visual tasks. Furthermore, the paper examines the application of Transformer-based approaches in intelligent fault diagnosis for mechanical systems, showcasing their superior ability to extract and recognize patterns from complex sensor data for precise fault identification. Despite these advancements, challenges remain, including the reliance on extensive labeled datasets, significant computational demands, and difficulties in deploying models on resource-limited devices. To address these limitations, the paper proposes future research directions, such as developing lightweight Transformer architectures, integrating multimodal data sources, and enhancing adaptability to diverse operational conditions. These efforts aim to further expand the application of Transformer-based methods in mechanical fault diagnosis, making them more robust, efficient, and suitable for real-world industrial environments.

Title: AsyncDSB: Schedule-Asynchronous Diffusion Schr\"odinger Bridge for Image Inpainting

Authors: Zihao Han, Baoquan Zhang, Lisai Zhang, Shanshan Feng, Kenghong Lin, Guotao Liang, Yunming Ye, Xiaochen Qi, Guangming Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08149
Pdf URL: https://arxiv.org/pdf/2412.08149
Copy Paste: [[2412.08149]] AsyncDSB: Schedule-Asynchronous Diffusion Schr\"odinger Bridge for Image Inpainting(https://arxiv.org/abs/2412.08149)
Keywords: diffusion
Abstract: Image inpainting is an important image generation task, which aims to restore corrupted image from partial visible area. Recently, diffusion Schrödinger bridge methods effectively tackle this task by modeling the translation between corrupted and target images as a diffusion Schrödinger bridge process along a noising schedule path. Although these methods have shown superior performance, in this paper, we find that 1) existing methods suffer from a schedule-restoration mismatching issue, i.e., the theoretical schedule and practical restoration processes usually exist a large discrepancy, which theoretically results in the schedule not fully leveraged for restoring images; and 2) the key reason causing such issue is that the restoration process of all pixels are actually asynchronous but existing methods set a synchronous noise schedule to them, i.e., all pixels shares the same noise schedule. To this end, we propose a schedule-Asynchronous Diffusion Schrödinger Bridge (AsyncDSB) for image inpainting. Our insight is preferentially scheduling pixels with high frequency (i.e., large gradients) and then low frequency (i.e., small gradients). Based on this insight, given a corrupted image, we first train a network to predict its gradient map in corrupted area. Then, we regard the predicted image gradient as prior and design a simple yet effective pixel-asynchronous noise schedule strategy to enhance the diffusion Schrödinger bridge. Thanks to the asynchronous schedule at pixels, the temporal interdependence of restoration process between pixels can be fully characterized for high-quality image inpainting. Experiments on real-world datasets show that our AsyncDSB achieves superior performance, especially on FID with around 3% - 14% improvement over state-of-the-art baseline methods.

Title: Antelope: Potent and Concealed Jailbreak Attack Strategy

Authors: Xin Zhao, Xiaojun Chen, Haoyu Gao
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.08156
Pdf URL: https://arxiv.org/pdf/2412.08156
Copy Paste: [[2412.08156]] Antelope: Potent and Concealed Jailbreak Attack Strategy(https://arxiv.org/abs/2412.08156)
Keywords: security, attack, robust, diffusion, generative
Abstract: Due to the remarkable generative potential of diffusion-based models, numerous researches have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily encompass adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope leverages the confusion of sensitive concepts with similar ones, facilitates searches in the semantically adjacent space of these related concepts and aligns them with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. Besides, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.

Title: DG-Mamba: Robust and Efficient Dynamic Graph Structure Learning with Selective State Space Models

Authors: Haonan Yuan, Qingyun Sun, Zhaonan Wang, Xingcheng Fu, Cheng Ji, Yongjian Wang, Bo Jin, Jianxin Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08160
Pdf URL: https://arxiv.org/pdf/2412.08160
Copy Paste: [[2412.08160]] DG-Mamba: Robust and Efficient Dynamic Graph Structure Learning with Selective State Space Models(https://arxiv.org/abs/2412.08160)
Keywords: attack, robust
Abstract: Dynamic graphs exhibit intertwined spatio-temporal evolutionary patterns, widely existing in the real world. Nevertheless, the structure incompleteness, noise, and redundancy result in poor robustness for Dynamic Graph Neural Networks (DGNNs). Dynamic Graph Structure Learning (DGSL) offers a promising way to optimize graph structures. However, aside from encountering unacceptable quadratic complexity, it overly relies on heuristic priors, making it hard to discover underlying predictive patterns. How to efficiently refine the dynamic structures, capture intrinsic dependencies, and learn robust representations, remains under-explored. In this work, we propose the novel DG-Mamba, a robust and efficient Dynamic Graph structure learning framework with the Selective State Space Models (Mamba). To accelerate the spatio-temporal structure learning, we propose a kernelized dynamic message-passing operator that reduces the quadratic time complexity to linear. To capture global intrinsic dynamics, we establish the dynamic graph as a self-contained system with State Space Model. By discretizing the system states with the cross-snapshot graph adjacency, we enable the long-distance dependencies capturing with the selective snapshot scan. To endow learned dynamic structures more expressive with informativeness, we propose the self-supervised Principle of Relevant Information for DGSL to regularize the most relevant yet least redundant information, enhancing global robustness. Extensive experiments demonstrate the superiority of the robustness and efficiency of our DG-Mamba compared with the state-of-the-art baselines against adversarial attacks.

Title: Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation

Authors: Kexin Li, Zongxin Yang, Yi Yang, Jun Xiao
Subjects: cs.CV, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.08161
Pdf URL: https://arxiv.org/pdf/2412.08161
Copy Paste: [[2412.08161]] Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation(https://arxiv.org/abs/2412.08161)
Keywords: large language model, segmentation
Abstract: Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects that accurately align with the corresponding audio. However, existing methods often face temporal misalignment, where audio cues and segmentation results are not temporally coordinated. Audio provides two critical pieces of information: i) target object-level details and ii) the timing of when objects start and stop producing sounds. Current methods focus more on object-level information but neglect the boundaries of audio semantic changes, leading to temporal misalignment. To address this issue, we propose a Collaborative Hybrid Propagator Framework~(Co-Prop). This framework includes two main steps: Preliminary Audio Boundary Anchoring and Frame-by-Frame Audio-Insert Propagation. To Anchor the audio boundary, we employ retrieval-assist prompts with Qwen large language models to identify control points of audio semantic changes. These control points split the audio into semantically consistent audio portions. After obtaining the control point lists, we propose the Audio Insertion Propagator to process each audio portion using a frame-by-frame audio insertion propagation and matching approach. We curated a compact dataset comprising diverse source conversion cases and devised a metric to assess alignment rates. Compared to traditional simultaneous processing methods, our approach reduces memory requirements and facilitates frame alignment. Experimental results demonstrate the effectiveness of our approach across three datasets and two backbones. Furthermore, our method can be integrated with existing AVVS approaches, offering plug-and-play functionality to enhance their performance.

Title: NLPineers@ NLU of Devanagari Script Languages 2025: Hate Speech Detection using Ensembling of BERT-based models

Authors: Anmol Guragain, Nadika Poudel, Rajesh Piryani, Bishesh Khanal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08163
Pdf URL: https://arxiv.org/pdf/2412.08163
Copy Paste: [[2412.08163]] NLPineers@ NLU of Devanagari Script Languages 2025: Hate Speech Detection using Ensembling of BERT-based models(https://arxiv.org/abs/2412.08163)
Keywords: transformer
Abstract: This paper explores hate speech detection in Devanagari-scripted languages, focusing on Hindi and Nepali, for Subtask B of the CHIPSAL@COLING 2025 Shared Task. Using a range of transformer-based models such as XLM-RoBERTa, MURIL, and IndicBERT, we examine their effectiveness in navigating the nuanced boundary between hate speech and free expression. Our best performing model, implemented as ensemble of multilingual BERT models achieve Recall of 0.7762 (Rank 3/31 in terms of recall) and F1 score of 0.6914 (Rank 17/31). To address class imbalance, we used backtranslation for data augmentation, and cosine similarity to preserve label consistency after augmentation. This work emphasizes the need for hate speech detection in Devanagari-scripted languages and presents a foundation for further research.

Title: Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning Software

Authors: Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Federica Sarro, Yang Liu
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2412.08167
Pdf URL: https://arxiv.org/pdf/2412.08167
Copy Paste: [[2412.08167]] Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning Software(https://arxiv.org/abs/2412.08167)
Keywords: protect, fair
Abstract: Intersectional fairness is a critical requirement for Machine Learning (ML) software, demanding fairness across subgroups defined by multiple protected attributes. This paper introduces FairHOME, a novel ensemble approach using higher order mutation of inputs to enhance intersectional fairness of ML software during the inference phase. Inspired by social science theories highlighting the benefits of diversity, FairHOME generates mutants representing diverse subgroups for each input instance, thus broadening the array of perspectives to foster a fairer decision-making process. Unlike conventional ensemble methods that combine predictions made by different models, FairHOME combines predictions for the original input and its mutants, all generated by the same ML model, to reach a final decision. Notably, FairHOME is even applicable to deployed ML software as it bypasses the need for training new models. We extensively evaluate FairHOME against seven state-of-the-art fairness improvement methods across 24 decision-making tasks using widely adopted metrics. FairHOME consistently outperforms existing methods across all metrics considered. On average, it enhances intersectional fairness by 47.5%, surpassing the currently best-performing method by 9.6 percentage points.

Title: Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions

Authors: Mohammadmostafa Rostamkhani, Baktash Ansari, Hoorieh Sabzevari, Farzan Rahmani, Sauleh Eetemadi
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.08169
Pdf URL: https://arxiv.org/pdf/2412.08169
Copy Paste: [[2412.08169]] Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions(https://arxiv.org/abs/2412.08169)
Keywords: robust
Abstract: In recent years, Visual Question Answering (VQA) has made significant strides, particularly with the advent of multimodal models that integrate vision and language understanding. However, existing VQA datasets often overlook the complexities introduced by image illusions, which pose unique challenges for both human perception and model interpretation. In this study, we introduce a novel task called Illusory VQA, along with four specialized datasets: IllusionMNIST, IllusionFashionMNIST, IllusionAnimals, and IllusionChar. These datasets are designed to evaluate the performance of state-of-the-art multimodal models in recognizing and interpreting visual illusions. We assess the zero-shot performance of various models, fine-tune selected models on our datasets, and propose a simple yet effective solution for illusion detection using Gaussian and blur low-pass filters. We show that this method increases the performance of models significantly and in the case of BLIP-2 on IllusionAnimals without any fine-tuning, it outperforms humans. Our findings highlight the disparity between human and model perception of illusions and demonstrate that fine-tuning and specific preprocessing techniques can significantly enhance model robustness. This work contributes to the development of more human-like visual understanding in multimodal models and suggests future directions for adapting filters using learnable parameters.

Title: Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?

Authors: Zihao Li, Lecheng Zheng, Bowen Jin, Dongqi Fu, Baoyu Jing, Yikun Ban, Jingrui He, Jiawei Han
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2412.08174
Pdf URL: https://arxiv.org/pdf/2412.08174
Copy Paste: [[2412.08174]] Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?(https://arxiv.org/abs/2412.08174)
Keywords: large language model
Abstract: While great success has been achieved in building vision models with Contrastive Language-Image Pre-training (CLIP) over Internet-scale image-text pairs, building transferable Graph Neural Networks (GNNs) with CLIP pipeline is challenging because of three fundamental issues: the scarcity of labeled data and text supervision, different levels of downstream tasks, and the conceptual gaps between domains. In this work, to address these issues, we leverage multi-modal prompt learning to effectively adapt pre-trained GNN to downstream tasks and data, given only a few semantically labeled samples, each with extremely weak text supervision. Our new paradigm embeds the graphs directly in the same space as the Large Language Models (LLMs) by learning both graph prompts and text prompts simultaneously. To accomplish this, we improve state-of-the-art graph prompt method, and then propose the first graph-language multi-modal prompt learning approach for exploiting the knowledge in pre-trained models. Notably, due to the insufficient supervision for fine-tuning, in our paradigm, the pre-trained GNN and the LLM are kept frozen, so the learnable parameters are much fewer than fine-tuning any pre-trained model. Through extensive experiments on real-world datasets, we demonstrate the superior performance of our paradigm in few-shot, multi-task-level, and cross-domain settings. Moreover, we build the first CLIP-style zero-shot classification prototype that can generalize GNNs to unseen classes with extremely weak text supervision.

Title: Analyzing and Improving Model Collapse in Rectified Flow Models

Authors: Huminhao Zhu, Fangyikang Wang, Tianyu Ding, Qing Qu, Zhihui Zhu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08175
Pdf URL: https://arxiv.org/pdf/2412.08175
Copy Paste: [[2412.08175]] Analyzing and Improving Model Collapse in Rectified Flow Models(https://arxiv.org/abs/2412.08175)
Keywords: generative
Abstract: Generative models aim to produce synthetic data indistinguishable from real distributions, but iterative training on self-generated data can lead to \emph{model collapse (MC)}, where performance degrades over time. In this work, we provide the first theoretical analysis of MC in Rectified Flow by framing it within the context of Denoising Autoencoders (DAEs). We show that when DAE models are trained on recursively generated synthetic data with small noise variance, they suffer from MC with progressive diminishing generation quality. To address this MC issue, we propose methods that strategically incorporate real data into the training process, even when direct noise-image pairs are unavailable. Our proposed techniques, including Reverse Collapse-Avoiding (RCA) Reflow and Online Collapse-Avoiding Reflow (OCAR), effectively prevent MC while maintaining the efficiency benefits of Rectified Flow. Extensive experiments on standard image datasets demonstrate that our methods not only mitigate MC but also improve sampling efficiency, leading to higher-quality image generation with fewer sampling steps.

Title: TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning

Authors: Jingjing Xie, Yuxin Zhang, Jun Peng, Zhaohong Huang, Liujuan Cao
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.08176
Pdf URL: https://arxiv.org/pdf/2412.08176
Copy Paste: [[2412.08176]] TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning(https://arxiv.org/abs/2412.08176)
Keywords: large language model
Abstract: Despite the efficiency of prompt learning in transferring vision-language models (VLMs) to downstream tasks, existing methods mainly learn the prompts in a coarse-grained manner where the learned prompt vectors are shared across all categories. Consequently, the tailored prompts often fail to discern class-specific visual concepts, thereby hindering the transferred performance for classes that share similar or complex visual attributes. Recent advances mitigate this challenge by leveraging external knowledge from Large Language Models (LLMs) to furnish class descriptions, yet incurring notable inference costs. In this paper, we introduce TextRefiner, a plug-and-play method to refine the text prompts of existing methods by leveraging the internal knowledge of VLMs. Particularly, TextRefiner builds a novel local cache module to encapsulate fine-grained visual concepts derivedfrom local tokens within the image branch. By aggregating and aligning the cached visual descriptions with the original output of the text branch, TextRefiner can efficiently refine and enrich the learned prompts from existing methods without relying on any external expertise. For example, it improves the performance of CoOp from 71.66 % to 76.94 % on 11 benchmarks, surpassing CoCoOp which introduces instance-wise features for text prompts. Equipped with TextRefiner, PromptKD achieves state-of-the-art performance and is efficient in inference. Our code is relesed at this https URL

Title: SecureNT: A Practical Framework for Efficient Topology Protection and Monitoring

Authors: Chengze Du, Jibin Shi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.08177
Pdf URL: https://arxiv.org/pdf/2412.08177
Copy Paste: [[2412.08177]] SecureNT: A Practical Framework for Efficient Topology Protection and Monitoring(https://arxiv.org/abs/2412.08177)
Keywords: secure, security, privacy, protect
Abstract: Network tomography plays a crucial role in network monitoring and management, where network topology serves as the fundamental basis for various tomography tasks including traffic matrix estimation and link performance inference. The topology information, however, can be inferred through end-to-end measurements using various inference algorithms, posing significant security risks to network infrastructure. While existing protection methods attempt to secure topology information by manipulating end-to-end delay measurements, they often require complex computation and sophisticated modification strategies, making real-time protection challenging. Moreover, these delay-based modifications typically render the measurements unusable for network monitoring, even by trusted users, as the manipulated delays distort the actual network performance characteristics. This paper presents a novel privacy-preserving framework that addresses these limitations. Our approach provides efficient topology protection while maintaining the utility of measurements for authorized network monitoring. Through extensive evaluation on both simulated and real-world networks topology, we demonstrate that our framework achieves superior privacy protection compared to existing methods while enabling trusted users to effectively monitor network performance. Our solution offers a practical approach for organizations to protect sensitive topology information without sacrificing their network monitoring capabilities.

Title: From communities to interpretable network and word embedding: an unified approach

Authors: Thibault Prouteau, Nicolas Dugué, Simon Guillot
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08187
Pdf URL: https://arxiv.org/pdf/2412.08187
Copy Paste: [[2412.08187]] From communities to interpretable network and word embedding: an unified approach(https://arxiv.org/abs/2412.08187)
Keywords: interpretability
Abstract: Modelling information from complex systems such as humans social interaction or words co-occurrences in our languages can help to understand how these systems are organized and function. Such systems can be modelled by networks, and network theory provides a useful set of methods to analyze them. Among these methods, graph embedding is a powerful tool to summarize the interactions and topology of a network in a vectorized feature space. When used in input of machine learning algorithms, embedding vectors help with common graph problems such as link prediction, graph matching, etc. Word embedding has the goal of representing the sense of words, extracting it from large text corpora. Despite differences in the structure of information in input of embedding algorithms, many graph embedding approaches are adapted and inspired from methods in NLP. Limits of these methods are observed in both domains. Most of these methods require long and resource greedy training. Another downside to most methods is that they are black-box, from which understanding how the information is structured is rather complex. Interpretability of a model allows understanding how the vector space is structured without the need for external information, and thus can be audited more easily. With both these limitations in mind, we propose a novel framework to efficiently embed network vertices in an interpretable vector space. Our Lower Dimension Bipartite Framework (LDBGF) leverages the bipartite projection of a network using cliques to reduce dimensionality. Along with LDBGF, we introduce two implementations of this framework that rely on communities instead of cliques: SINr-NR and SINr-MF. We show that SINr-MF can perform well on classical graphs and SINr-NR can produce high-quality graph and word embeddings that are interpretable and stable across runs.

Title: Mixture of Experts Meets Decoupled Message Passing: Towards General and Adaptive Node Classification

Authors: Xuanze Chen, Jiajun Zhou, Shanqing Yu, Qi Xuan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08193
Pdf URL: https://arxiv.org/pdf/2412.08193
Copy Paste: [[2412.08193]] Mixture of Experts Meets Decoupled Message Passing: Towards General and Adaptive Node Classification(https://arxiv.org/abs/2412.08193)
Keywords: robust, transformer
Abstract: Graph neural networks excel at graph representation learning but struggle with heterophilous data and long-range dependencies. And graph transformers address these issues through self-attention, yet face scalability and noise challenges on large-scale graphs. To overcome these limitations, we propose GNNMoE, a universal model architecture for node classification. This architecture flexibly combines fine-grained message-passing operations with a mixture-of-experts mechanism to build feature encoding blocks. Furthermore, by incorporating soft and hard gating layers to assign the most suitable expert networks to each node, we enhance the model's expressive power and adaptability to different graph types. In addition, we introduce adaptive residual connections and an enhanced FFN module into GNNMoE, further improving the expressiveness of node representation. Extensive experimental results demonstrate that GNNMoE performs exceptionally well across various types of graph data, effectively alleviating the over-smoothing issue and global noise, enhancing model robustness and adaptability, while also ensuring computational efficiency on large-scale graphs.

Title: SAFIRE: Segment Any Forged Image Region

Authors: Myung-Joon Kwon, Wonjun Lee, Seung-Hun Nam, Minji Son, Changick Kim
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2412.08197
Pdf URL: https://arxiv.org/pdf/2412.08197
Copy Paste: [[2412.08197]] SAFIRE: Segment Any Forged Image Region(https://arxiv.org/abs/2412.08197)
Keywords: segmentation
Abstract: Most techniques approach the problem of image forgery localization as a binary segmentation task, training neural networks to label original areas as 0 and forged areas as 1. In contrast, we tackle this issue from a more fundamental perspective by partitioning images according to their originating sources. To this end, we propose Segment Any Forged Image Region (SAFIRE), which solves forgery localization using point prompting. Each point on an image is used to segment the source region containing itself. This allows us to partition images into multiple source regions, a capability achieved for the first time. Additionally, rather than memorizing certain forgery traces, SAFIRE naturally focuses on uniform characteristics within each source region. This approach leads to more stable and effective learning, achieving superior performance in both the new task and the traditional binary forgery localization.

Title: Adaptive$^2$: Adaptive Domain Mining for Fine-grained Domain Adaptation Modeling

Authors: Wenxuan Sun, Zixuan Yang, Yunli Wang, Zhen Zhang, Zhiqiang Wang, Yu Li, Jian Yang, Yiming Yang, Shiyang Wen, Peng Jiang, Kun Gai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08198
Pdf URL: https://arxiv.org/pdf/2412.08198
Copy Paste: [[2412.08198]] Adaptive$^2$: Adaptive Domain Mining for Fine-grained Domain Adaptation Modeling(https://arxiv.org/abs/2412.08198)
Keywords: fair
Abstract: Advertising systems often face the multi-domain challenge, where data distributions vary significantly across scenarios. Existing domain adaptation methods primarily focus on building domain-adaptive neural networks but often rely on hand-crafted domain information, e.g., advertising placement, which may be sub-optimal. We think that fine-grained "domain" patterns exist that are difficult to hand-craft in online advertisement. Thus, we propose Adaptive$^2$, a novel framework that first learns domains adaptively using a domain mining module by self-supervision and then employs a shared&specific network to model shared and conflicting information. As a practice, we use VQ-VAE as the domain mining module and conduct extensive experiments on public benchmarks. Results show that traditional domain adaptation methods with hand-crafted domains perform no better than single-domain models under fair FLOPS conditions, highlighting the importance of domain definition. In contrast, Adaptive$^2$ outperforms existing approaches, emphasizing the effectiveness of our method and the significance of domain mining. We also deployed Adaptive$^2$ in the live streaming scenario of Kuaishou Advertising System, demonstrating its commercial value and potential for automatic domain identification. To the best of our knowledge, Adaptive$^2$ is the first approach to automatically learn both domain identification and adaptation in online advertising, opening new research directions for this area.

Title: GN-FR:Generalizable Neural Radiance Fields for Flare Removal

Authors: Gopi Raju Matta, Rahul Siddartha, Rongali Simhachala Venkata Girish, Sumit Sharma, Kaushik Mitra
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.08200
Pdf URL: https://arxiv.org/pdf/2412.08200
Copy Paste: [[2412.08200]] GN-FR:Generalizable Neural Radiance Fields for Flare Removal(https://arxiv.org/abs/2412.08200)
Keywords: transformer
Abstract: Flare, an optical phenomenon resulting from unwanted scattering and reflections within a lens system, presents a significant challenge in imaging. The diverse patterns of flares, such as halos, streaks, color bleeding, and haze, complicate the flare removal process. Existing traditional and learning-based methods have exhibited limited efficacy due to their reliance on single-image approaches, where flare removal is highly ill-posed. We address this by framing flare removal as a multi-view image problem, taking advantage of the view-dependent nature of flare artifacts. This approach leverages information from neighboring views to recover details obscured by flare in individual images. Our proposed framework, GN-FR (Generalizable Neural Radiance Fields for Flare Removal), can render flare-free views from a sparse set of input images affected by lens flare and generalizes across different scenes in an unsupervised manner. GN-FR incorporates several modules within the Generalizable NeRF Transformer (GNT) framework: Flare-occupancy Mask Generation (FMG), View Sampler (VS), and Point Sampler (PS). To overcome the impracticality of capturing both flare-corrupted and flare-free data, we introduce a masking loss function that utilizes mask information in an unsupervised setting. Additionally, we present a 3D multi-view flare dataset, comprising 17 real flare scenes with 782 images, 80 real flare patterns, and their corresponding annotated flare-occupancy masks. To our knowledge, this is the first work to address flare removal within a Neural Radiance Fields (NeRF) framework.

Title: Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Authors: Yuxi Li, Zhibo Zhang, Kailong Wang, Ling Shi, Haoyu Wang
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08201
Pdf URL: https://arxiv.org/pdf/2412.08201
Copy Paste: [[2412.08201]] Model-Editing-Based Jailbreak against Safety-aligned Large Language Models(https://arxiv.org/abs/2412.08201)
Keywords: security, attack, robust, steal, large language model
Abstract: Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions but remain susceptible to critical vulnerabilities, particularly jailbreak attacks. Current jailbreak techniques, while effective, often depend on input modifications, making them detectable and limiting their stealth and scalability. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters by minimally altering internal model structures while preserving the model's intended functionalities. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions without input modifications. By analyzing distinct activation patterns between safe and unsafe queries, TME isolates and approximates SCTs through an optimization process. Implemented in the D-LLM framework, our method achieves an average Attack Success Rate (ASR) of 84.86% on four mainstream open-source LLMs, maintaining high performance. Unlike existing methods, D-LLM eliminates the need for specific triggers or harmful response collections, offering a stealthier and more effective jailbreak strategy. This work reveals a covert and robust threat vector in LLM security and emphasizes the need for stronger safeguards in model safety alignment.

Title: Unicorn: Unified Neural Image Compression with One Number Reconstruction

Authors: Qi Zheng, Haozhi Wang, Zihao Liu, Jiaming Liu, Peiye Liu, Zhijian Hao, Yanheng Lu, Dimin Niu, Jinjia Zhou, Minge Jing, Yibo Fan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.08210
Pdf URL: https://arxiv.org/pdf/2412.08210
Copy Paste: [[2412.08210]] Unicorn: Unified Neural Image Compression with One Number Reconstruction(https://arxiv.org/abs/2412.08210)
Keywords: diffusion
Abstract: Prevalent lossy image compression schemes can be divided into: 1) explicit image compression (EIC), including traditional standards and neural end-to-end algorithms; 2) implicit image compression (IIC) based on implicit neural representations (INR). The former is encountering impasses of either leveling off bitrate reduction at a cost of tremendous complexity while the latter suffers from excessive smoothing quality as well as lengthy decoder models. In this paper, we propose an innovative paradigm, which we dub \textbf{Unicorn} (\textbf{U}nified \textbf{N}eural \textbf{I}mage \textbf{C}ompression with \textbf{O}ne \textbf{N}number \textbf{R}econstruction). By conceptualizing the images as index-image pairs and learning the inherent distribution of pairs in a subtle neural network model, Unicorn can reconstruct a visually pleasing image from a randomly generated noise with only one index number. The neural model serves as the unified decoder of images while the noises and indexes corresponds to explicit representations. As a proof of concept, we propose an effective and efficient prototype of Unicorn based on latent diffusion models with tailored model designs. Quantitive and qualitative experimental results demonstrate that our prototype achieves significant bitrates reduction compared with EIC and IIC algorithms. More impressively, benefitting from the unified decoder, our compression ratio escalates as the quantity of images increases. We envision that more advanced model designs will endow Unicorn with greater potential in image compression. We will release our codes in \url{this https URL}.

Title: Hierarchical Classification for Automated Image Annotation of Coral Reef Benthic Structures

Authors: Célia Blondin, Joris Guérin, Kelly Inagaki, Guilherme Longo, Laure Berti-Équille
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08228
Pdf URL: https://arxiv.org/pdf/2412.08228
Copy Paste: [[2412.08228]] Hierarchical Classification for Automated Image Annotation of Coral Reef Benthic Structures(https://arxiv.org/abs/2412.08228)
Keywords: protect
Abstract: Automated benthic image annotation is crucial to efficiently monitor and protect coral reefs against climate change. Current machine learning approaches fail to capture the hierarchical nature of benthic organisms covering reef substrata, i.e., coral taxonomic levels and health condition. To address this limitation, we propose to annotate benthic images using hierarchical classification. Experiments on a custom dataset from a Northeast Brazilian coral reef show that our approach outperforms flat classifiers, improving both F1 and hierarchical F1 scores by approximately 2\% across varying amounts of training data. In addition, this hierarchical method aligns more closely with ecological objectives.

Title: Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction

Authors: Bohan Li, Xin Jin, Jiajun Deng, Yasheng Sun, Xiaofeng Wang, Wenjun Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08243
Pdf URL: https://arxiv.org/pdf/2412.08243
Copy Paste: [[2412.08243]] Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction(https://arxiv.org/abs/2412.08243)
Keywords: segmentation
Abstract: Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning, alleviating issues like occlusion or ambiguity. However, these solutions often face misalignment issues wherein the corresponding features at the same position across different frames may have different semantic meanings during the aggregation process, which leads to unreliable contextual fusion results and an unstable representation learning process. To address this problem, we introduce a new Hierarchical context alignment paradigm for a more accurate SOP (Hi-SOP). Hi-SOP first disentangles the geometric and temporal context for separate alignment, which two branches are then composed to enhance the reliability of SOP. This parsing of the visual input into a local-global alignment hierarchy includes: (I) disentangled geometric and temporal separate alignment, within each leverages depth confidence and camera pose as prior for relevant feature matching respectively; (II) global alignment and composition of the transformed geometric and temporal volumes based on semantics consistency. Our method outperforms SOTAs for semantic scene completion on the SemanticKITTI & NuScenes-Occupancy datasets and LiDAR semantic segmentation on the NuScenes dataset.

Title: Accurate Medical Named Entity Recognition Through Specialized NLP Models

Authors: Jiacheng Hu, Runyuan Bao, Yang Lin, Hanchao Zhang, Yanlin Xiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08255
Pdf URL: https://arxiv.org/pdf/2412.08255
Copy Paste: [[2412.08255]] Accurate Medical Named Entity Recognition Through Specialized NLP Models(https://arxiv.org/abs/2412.08255)
Keywords: privacy, robust, extraction, interpretability
Abstract: This study evaluated the effect of BioBERT in medical text processing for the task of medical named entity recognition. Through comparative experiments with models such as BERT, ClinicalBERT, SciBERT, and BlueBERT, the results showed that BioBERT achieved the best performance in both precision and F1 score, verifying its applicability and superiority in the medical field. BioBERT enhances its ability to understand professional terms and complex medical texts through pre-training on biomedical data, providing a powerful tool for medical information extraction and clinical decision support. The study also explored the privacy and compliance challenges of BioBERT when processing medical data, and proposed future research directions for combining other medical-specific models to improve generalization and robustness. With the development of deep learning technology, the potential of BioBERT in application fields such as intelligent medicine, personalized treatment, and disease prediction will be further expanded. Future research can focus on the real-time and interpretability of the model to promote its widespread application in the medical field.

Title: Comments on: RIO: Return Instruction Obfuscation for Bare-Metal IoT Devices with Binary Analysis

Authors: Kai Lehniger, Peter Langendörfer
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.08257
Pdf URL: https://arxiv.org/pdf/2412.08257
Copy Paste: [[2412.08257]] Comments on: RIO: Return Instruction Obfuscation for Bare-Metal IoT Devices with Binary Analysis(https://arxiv.org/abs/2412.08257)
Keywords: attack
Abstract: This is a comment on "RIO: Return Instruction Obfuscation for Bare-Metal IoT Devices with Binary Analysis". RIO prevents finding gadgets for Return-Oriented Programming attacks by encrypting return instructions. This paper shows flaws in the design of RIO that allow for the easy retrieval of the plaintext return instructions without decrypting them. Additionally, changes are proposed to improve upon the original idea.

Title: Discrete Subgraph Sampling for Interpretable Graph based Visual Question Answering

Authors: Pascal Tilli, Ngoc Thang Vu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08263
Pdf URL: https://arxiv.org/pdf/2412.08263
Copy Paste: [[2412.08263]] Discrete Subgraph Sampling for Interpretable Graph based Visual Question Answering(https://arxiv.org/abs/2412.08263)
Keywords: interpretability
Abstract: Explainable artificial intelligence (XAI) aims to make machine learning models more transparent. While many approaches focus on generating explanations post-hoc, interpretable approaches, which generate the explanations intrinsically alongside the predictions, are relatively rare. In this work, we integrate different discrete subset sampling methods into a graph-based visual question answering system to compare their effectiveness in generating interpretable explanatory subgraphs intrinsically. We evaluate the methods on the GQA dataset and show that the integrated methods effectively mitigate the performance trade-off between interpretability and answer accuracy, while also achieving strong co-occurrences between answer and question tokens. Furthermore, we conduct a human evaluation to assess the interpretability of the generated subgraphs using a comparative setting with the extended Bradley-Terry model, showing that the answer and question token co-occurrence metrics strongly correlate with human preferences. Our source code is publicly available.

Title: Neural Observation Field Guided Hybrid Optimization of Camera Placement

Authors: Yihan Cao, Jiazhao Zhang, Zhinan Yu, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08266
Pdf URL: https://arxiv.org/pdf/2412.08266
Copy Paste: [[2412.08266]] Neural Observation Field Guided Hybrid Optimization of Camera Placement(https://arxiv.org/abs/2412.08266)
Keywords: robust
Abstract: Camera placement is crutial in multi-camera systems such as virtual reality, autonomous driving, and high-quality reconstruction. The camera placement challenge lies in the nonlinear nature of high-dimensional parameters and the unavailability of gradients for target functions like coverage and visibility. Consequently, most existing methods tackle this challenge by leveraging non-gradient-based optimization this http URL this work, we present a hybrid camera placement optimization approach that incorporates both gradient-based and non-gradient-based optimization methods. This design allows our method to enjoy the advantages of smooth optimization convergence and robustness from gradient-based and non-gradient-based optimization, respectively. To bridge the two disparate optimization methods, we propose a neural observation field, which implicitly encodes the coverage and observation quality. The neural observation field provides the measurements of the camera observations and corresponding gradients without the assumption of target scenes, making our method applicable to diverse scenarios, including 2D planar shapes, 3D objects, and room-scale 3D this http URL experiments on diverse datasets demonstrate that our method achieves state-of-the-art performance, while requiring only a fraction (8x less) of the typical computation time. Furthermore, we conducted a real-world experiment using a custom-built capture system, confirming the resilience of our approach to real-world environmental noise.

Title: LCFO: Long Context and Long Form Output Dataset and Benchmarking

Authors: Marta R. Costa-jussà, Pierre Andrews, Mariano Coria Meglioli, Joy Chen, Joe Chuang, David Dale, Christophe Ropers, Alexandre Mourachko, Eduardo Sánchez, Holger Schwenk, Tuan Tran, Arina Turkatenko, Carleigh Wood
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08268
Pdf URL: https://arxiv.org/pdf/2412.08268
Copy Paste: [[2412.08268]] LCFO: Long Context and Long Form Output Dataset and Benchmarking(https://arxiv.org/abs/2412.08268)
Keywords: generative, large language model
Abstract: This paper presents the Long Context and Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k words average length), each of which comes with three summaries of different lengths (20%, 10%, and 5% of the input text), as well as approximately 15 questions and answers (QA) related to the input content. Notably, LCFO also provides alignments between specific QA pairs and corresponding summaries in 7 domains. The primary motivation behind providing summaries of different lengths is to establish a controllable framework for generating long texts from shorter inputs, i.e. summary expansion. To establish an evaluation metric framework for summarization and summary expansion, we provide human evaluation scores for human-generated outputs, as well as results from various state-of-the-art large language models (LLMs). GPT-4o-mini achieves best human scores among automatic systems in both summarization and summary expansion tasks (~ +10% and +20%, respectively). It even surpasses human output quality in the case of short summaries (~ +7%). Overall automatic metrics achieve low correlations with human evaluation scores (~ 0.4) but moderate correlation on specific evaluation aspects such as fluency and attribution (~ 0.6). The LCFO benchmark offers a standardized platform for evaluating summarization and summary expansion performance, as well as corresponding automatic metrics, thereby providing an important evaluation framework to advance generative AI.

Title: Local Features Meet Stochastic Anonymization: Revolutionizing Privacy-Preserving Face Recognition for Black-Box Models

Authors: Yuanwei Liu, Chengyu Jia, Ruqi Xiao, Xuemai Jia, Hui Wei, Kui Jiang, Zheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08276
Pdf URL: https://arxiv.org/pdf/2412.08276
Copy Paste: [[2412.08276]] Local Features Meet Stochastic Anonymization: Revolutionizing Privacy-Preserving Face Recognition for Black-Box Models(https://arxiv.org/abs/2412.08276)
Keywords: privacy, protect
Abstract: The task of privacy-preserving face recognition (PPFR) currently faces two major unsolved challenges: (1) existing methods are typically effective only on specific face recognition models and struggle to generalize to black-box face recognition models; (2) current methods employ data-driven reversible representation encoding for privacy protection, making them susceptible to adversarial learning and reconstruction of the original image. We observe that face recognition models primarily rely on local features ({e.g., face contour, skin texture, and so on) for identification. Thus, by disrupting global features while enhancing local features, we achieve effective recognition even in black-box environments. Additionally, to prevent adversarial models from learning and reversing the anonymization process, we adopt an adversarial learning-based approach with irreversible stochastic injection to ensure the stochastic nature of the anonymization. Experimental results demonstrate that our method achieves an average recognition accuracy of 94.21\% on black-box models, outperforming existing methods in both privacy protection and anti-reconstruction capabilities.

Title: How Does the Smoothness Approximation Method Facilitate Generalization for Federated Adversarial Learning?

Authors: Wenjun Ding, Ying An, Lixing Chen, Shichao Kan, Fan Wu, Zhe Qu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08282
Pdf URL: https://arxiv.org/pdf/2412.08282
Copy Paste: [[2412.08282]] How Does the Smoothness Approximation Method Facilitate Generalization for Federated Adversarial Learning?(https://arxiv.org/abs/2412.08282)
Keywords: attack, robust, federate
Abstract: Federated Adversarial Learning (FAL) is a robust framework for resisting adversarial attacks on federated learning. Although some FAL studies have developed efficient algorithms, they primarily focus on convergence performance and overlook generalization. Generalization is crucial for evaluating algorithm performance on unseen data. However, generalization analysis is more challenging due to non-smooth adversarial loss functions. A common approach to addressing this issue is to leverage smoothness approximation. In this paper, we develop algorithm stability measures to evaluate the generalization performance of two popular FAL algorithms: \textit{Vanilla FAL (VFAL)} and {\it Slack FAL (SFAL)}, using three different smooth approximation methods: 1) \textit{Surrogate Smoothness Approximation (SSA)}, (2) \textit{Randomized Smoothness Approximation (RSA)}, and (3) \textit{Over-Parameterized Smoothness Approximation (OPSA)}. Based on our in-depth analysis, we answer the question of how to properly set the smoothness approximation method to mitigate generalization error in FAL. Moreover, we identify RSA as the most effective method for reducing generalization error. In highly data-heterogeneous scenarios, we also recommend employing SFAL to mitigate the deterioration of generalization performance caused by heterogeneity. Based on our theoretical results, we provide insights to help develop more efficient FAL algorithms, such as designing new metrics and dynamic aggregation rules to mitigate heterogeneity.

Title: Adaptive Prompting for Continual Relation Extraction: A Within-Task Variance Perspective

Authors: Minh Le, Tien Ngoc Luu, An Nguyen The, Thanh-Thien Le, Trang Nguyen, Thanh Tung Nguyen, Linh Ngo Van, Thien Huu Nguyen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08285
Pdf URL: https://arxiv.org/pdf/2412.08285
Copy Paste: [[2412.08285]] Adaptive Prompting for Continual Relation Extraction: A Within-Task Variance Perspective(https://arxiv.org/abs/2412.08285)
Keywords: extraction, generative
Abstract: To address catastrophic forgetting in Continual Relation Extraction (CRE), many current approaches rely on memory buffers to rehearse previously learned knowledge while acquiring new tasks. Recently, prompt-based methods have emerged as potent alternatives to rehearsal-based strategies, demonstrating strong empirical performance. However, upon analyzing existing prompt-based approaches for CRE, we identified several critical limitations, such as inaccurate prompt selection, inadequate mechanisms for mitigating forgetting in shared parameters, and suboptimal handling of cross-task and within-task variances. To overcome these challenges, we draw inspiration from the relationship between prefix-tuning and mixture of experts, proposing a novel approach that employs a prompt pool for each task, capturing variations within each task while enhancing cross-task variances. Furthermore, we incorporate a generative model to consolidate prior knowledge within shared parameters, eliminating the need for explicit data storage. Extensive experiments validate the efficacy of our approach, demonstrating superior performance over state-of-the-art prompt-based and rehearsal-free methods in continual relation extraction.

Title: k-HyperEdge Medoids for Clustering Ensemble

Authors: Feijiang Li, Jieting Wang, Liuya zhang, Yuhua Qian, Shuai jin, Tao Yan, Liang Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08289
Pdf URL: https://arxiv.org/pdf/2412.08289
Copy Paste: [[2412.08289]] k-HyperEdge Medoids for Clustering Ensemble(https://arxiv.org/abs/2412.08289)
Keywords: robust
Abstract: Clustering ensemble has been a popular research topic in data science due to its ability to improve the robustness of the single clustering method. Many clustering ensemble methods have been proposed, most of which can be categorized into clustering-view and sample-view methods. The clustering-view method is generally efficient, but it could be affected by the unreliability that existed in base clustering results. The sample-view method shows good performance, while the construction of the pairwise sample relation is time-consuming. In this paper, the clustering ensemble is formulated as a k-HyperEdge Medoids discovery problem and a clustering ensemble method based on k-HyperEdge Medoids that considers the characteristics of the above two types of clustering ensemble methods is proposed. In the method, a set of hyperedges is selected from the clustering view efficiently, then the hyperedges are diffused and adjusted from the sample view guided by a hyperedge loss function to construct an effective k-HyperEdge Medoid set. The loss function is mainly reduced by assigning samples to the hyperedge with the highest degree of belonging. Theoretical analyses show that the solution can approximate the optimal, the assignment method can gradually reduce the loss function, and the estimation of the belonging degree is statistically reasonable. Experiments on artificial data show the working mechanism of the proposed method. The convergence of the method is verified by experimental analysis of twenty data sets. The effectiveness and efficiency of the proposed method are also verified on these data, with nine representative clustering ensemble algorithms as reference.

Title: Code LLMs: A Taxonomy-based Survey

Authors: Nishat Raihan, Christian Newman, Marcos Zampieri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08291
Pdf URL: https://arxiv.org/pdf/2412.08291
Copy Paste: [[2412.08291]] Code LLMs: A Taxonomy-based Survey(https://arxiv.org/abs/2412.08291)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various NLP tasks and have recently expanded their impact to coding tasks, bridging the gap between natural languages (NL) and programming languages (PL). This taxonomy-based survey provides a comprehensive analysis of LLMs in the NL-PL domain, investigating how these models are utilized in coding tasks and examining their methodologies, architectures, and training processes. We propose a taxonomy-based framework that categorizes relevant concepts, providing a unified classification system to facilitate a deeper understanding of this rapidly evolving field. This survey offers insights into the current state and future directions of LLMs in coding tasks, including their applications and limitations.

Title: Self-Refining Diffusion Samplers: Enabling Parallelization via Parareal Iterations

Authors: Nikil Roashan Selvam, Amil Merchant, Stefano Ermon
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08292
Pdf URL: https://arxiv.org/pdf/2412.08292
Copy Paste: [[2412.08292]] Self-Refining Diffusion Samplers: Enabling Parallelization via Parareal Iterations(https://arxiv.org/abs/2412.08292)
Keywords: diffusion
Abstract: In diffusion models, samples are generated through an iterative refinement process, requiring hundreds of sequential model evaluations. Several recent methods have introduced approximations (fewer discretization steps or distillation) to trade off speed at the cost of sample quality. In contrast, we introduce Self-Refining Diffusion Samplers (SRDS) that retain sample quality and can improve latency at the cost of additional parallel compute. We take inspiration from the Parareal algorithm, a popular numerical method for parallel-in-time integration of differential equations. In SRDS, a quick but rough estimate of a sample is first created and then iteratively refined in parallel through Parareal iterations. SRDS is not only guaranteed to accurately solve the ODE and converge to the serial solution but also benefits from parallelization across the diffusion trajectory, enabling batched inference and pipelining. As we demonstrate for pre-trained diffusion models, the early convergence of this refinement procedure drastically reduces the number of steps required to produce a sample, speeding up generation for instance by up to 1.7x on a 25-step StableDiffusion-v2 benchmark and up to 4.3x on longer trajectories.

Title: Enhancing Cybersecurity in IoT Networks: A Deep Learning Approach to Anomaly Detection

Authors: Yining Pang, Chenghan Li
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08301
Pdf URL: https://arxiv.org/pdf/2412.08301
Copy Paste: [[2412.08301]] Enhancing Cybersecurity in IoT Networks: A Deep Learning Approach to Anomaly Detection(https://arxiv.org/abs/2412.08301)
Keywords: security, attack, robust
Abstract: With the proliferation of the Internet and smart devices, IoT technology has seen significant advancements and has become an integral component of smart homes, urban security, smart logistics, and other sectors. IoT facilitates real-time monitoring of critical production indicators, enabling businesses to detect potential quality issues, anticipate equipment malfunctions, and refine processes, thereby minimizing losses and reducing costs. Furthermore, IoT enhances real-time asset tracking, optimizing asset utilization and management. However, the expansion of IoT has also led to a rise in cybercrimes, with devices increasingly serving as vectors for malicious attacks. As the number of IoT devices grows, there is an urgent need for robust network security measures to counter these escalating threats. This paper introduces a deep learning model incorporating LSTM and attention mechanisms, a pivotal strategy in combating cybercrime in IoT networks. Our experiments, conducted on datasets including IoT-23, BoT-IoT, IoT network intrusion, MQTT, and MQTTset, demonstrate that our proposed method outperforms existing baselines.

Title: Edge-Splitting MLP: Node Classification on Homophilic and Heterophilic Graphs without Message Passing

Authors: Matthias Kohn, Marcel Hoffmann, Ansgar Scherp
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.08310
Pdf URL: https://arxiv.org/pdf/2412.08310
Copy Paste: [[2412.08310]] Edge-Splitting MLP: Node Classification on Homophilic and Heterophilic Graphs without Message Passing(https://arxiv.org/abs/2412.08310)
Keywords: robust
Abstract: Message Passing Neural Networks (MPNNs) have demonstrated remarkable success in node classification on homophilic graphs. It has been shown that they do not solely rely on homophily but on neighborhood distributions of nodes, i.e., consistency of the neighborhood label distribution within the same class. MLP-based models do not use message passing, \eg Graph-MLP incorporates the neighborhood in a separate loss function. These models are faster and more robust to edge noise. Graph-MLP maps adjacent nodes closer in the embedding space but is unaware of the neighborhood pattern of the labels, i.e., relies solely on homophily. Edge Splitting GNN (ES-GNN) is a model specialized for heterophilic graphs and splits the edges into task-relevant and task-irrelevant, respectively. To mitigate the limitations of Graph-MLP on heterophilic graphs, we propose ES-MLP that combines Graph-MLP with an edge-splitting mechanism from ES-GNN. It incorporates the edge splitting into the loss of Graph-MLP to learn two separate adjacency matrices based on relevant and irrelevant feature pairs. Our experiments on seven datasets with six baselines show that ES-MLP is on par with homophilic and heterophilic models on all datasets without using edges during inference. We show that ES-MLP is robust to multiple types of edge noise during inference and that its inference time is two to five times faster than that of commonly used MPNNs. The source code is available at this https URL.

Title: Post-Hoc MOTS: Exploring the Capabilities of Time-Symmetric Multi-Object Tracking

Authors: Gergely Szabó, Zsófia Molnár, András Horváth
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08313
Pdf URL: https://arxiv.org/pdf/2412.08313
Copy Paste: [[2412.08313]] Post-Hoc MOTS: Exploring the Capabilities of Time-Symmetric Multi-Object Tracking(https://arxiv.org/abs/2412.08313)
Keywords: segmentation
Abstract: Temporal forward-tracking has been the dominant approach for multi-object segmentation and tracking (MOTS). However, a novel time-symmetric tracking methodology has recently been introduced for the detection, segmentation, and tracking of budding yeast cells in pre-recorded samples. Although this architecture has demonstrated a unique perspective on stable and consistent tracking, as well as missed instance re-interpolation, its evaluation has so far been largely confined to settings related to videomicroscopic environments. In this work, we aim to reveal the broader capabilities, advantages, and potential challenges of this architecture across various specifically designed scenarios, including a pedestrian tracking dataset. We also conduct an ablation study comparing the model against its restricted variants and the widely used Kalman filter. Furthermore, we present an attention analysis of the tracking architecture for both pretrained and non-pretrained models

Title: Lightweight Method for Interactive 3D Medical Image Segmentation with Multi-Round Result Fusion

Authors: Bingzhi Shen, Lufan Chang, Siqi Chen, Shuxiang Guo, Hao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08315
Pdf URL: https://arxiv.org/pdf/2412.08315
Copy Paste: [[2412.08315]] Lightweight Method for Interactive 3D Medical Image Segmentation with Multi-Round Result Fusion(https://arxiv.org/abs/2412.08315)
Keywords: segmentation
Abstract: In medical imaging, precise annotation of lesions or organs is often required. However, 3D volumetric images typically consist of hundreds or thousands of slices, making the annotation process extremely time-consuming and laborious. Recently, the Segment Anything Model (SAM) has drawn widespread attention due to its remarkable zero-shot generalization capabilities in interactive segmentation. While researchers have explored adapting SAM for medical applications, such as using SAM adapters or constructing 3D SAM models, a key question remains: Can traditional CNN networks achieve the same strong zero-shot generalization in this task? In this paper, we propose the Lightweight Interactive Network for 3D Medical Image Segmentation (LIM-Net), a novel approach demonstrating the potential of compact CNN-based models. Built upon a 2D CNN backbone, LIM-Net initiates segmentation by generating a 2D prompt mask from user hints. This mask is then propagated through the 3D sequence via the Memory Module. To refine and stabilize results during interaction, the Multi-Round Result Fusion (MRF) Module selects and merges optimal masks from multiple rounds. Our extensive experiments across multiple datasets and modalities demonstrate LIM-Net's competitive performance. It exhibits stronger generalization to unseen data compared to SAM-based models, with competitive accuracy while requiring fewer interactions. Notably, LIM-Net's lightweight design offers significant advantages in deployment and inference efficiency, with low GPU memory consumption suitable for resource-constrained environments. These promising results demonstrate LIM-Net can serve as a strong baseline, complementing and contrasting with popular SAM models to further boost effective interactive medical image segmentation. The code will be released at \url{this https URL}.

Title: Large Language Models Still Face Challenges in Multi-Hop Reasoning with External Knowledge

Authors: Haotong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08317
Pdf URL: https://arxiv.org/pdf/2412.08317
Copy Paste: [[2412.08317]] Large Language Models Still Face Challenges in Multi-Hop Reasoning with External Knowledge(https://arxiv.org/abs/2412.08317)
Keywords: large language model
Abstract: We carry out a series of experiments to test large language models' multi-hop reasoning ability from three aspects: selecting and combining external knowledge, dealing with non-sequential reasoning tasks and generalising to data samples with larger numbers of hops. We test the GPT-3.5 model on four reasoning benchmarks with Chain-of-Thought prompting (and its variations). Our results reveal that despite the amazing performance achieved by large language models on various reasoning tasks, models still suffer from severe drawbacks which shows a large gap with humans.

Title: Digging into Intrinsic Contextual Information for High-fidelity 3D Point Cloud Completion

Authors: Jisheng Chu, Wenrui Li, Xingtao Wang, Kanglin Ning, Yidan Lu, Xiaopeng Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08326
Pdf URL: https://arxiv.org/pdf/2412.08326
Copy Paste: [[2412.08326]] Digging into Intrinsic Contextual Information for High-fidelity 3D Point Cloud Completion(https://arxiv.org/abs/2412.08326)
Keywords: diffusion
Abstract: The common occurrence of occlusion-induced incompleteness in point clouds has made point cloud completion (PCC) a highly-concerned task in the field of geometric processing. Existing PCC methods typically produce complete point clouds from partial point clouds in a coarse-to-fine paradigm, with the coarse stage generating entire shapes and the fine stage improving texture details. Though diffusion models have demonstrated effectiveness in the coarse stage, the fine stage still faces challenges in producing high-fidelity results due to the ill-posed nature of PCC. The intrinsic contextual information for texture details in partial point clouds is the key to solving the challenge. In this paper, we propose a high-fidelity PCC method that digs into both short and long-range contextual information from the partial point cloud in the fine stage. Specifically, after generating the coarse point cloud via a diffusion-based coarse generator, a mixed sampling module introduces short-range contextual information from partial point clouds into the fine stage. A surface freezing modules safeguards points from noise-free partial point clouds against disruption. As for the long-range contextual information, we design a similarity modeling module to derive similarity with rigid transformation invariance between points, conducting effective matching of geometric manifold features globally. In this way, the high-quality components present in the partial point cloud serve as valuable references for refining the coarse point cloud with high fidelity. Extensive experiments have demonstrated the superiority of the proposed method over SOTA competitors. Our code is available at this https URL.

Title: SLGaussian: Fast Language Gaussian Splatting in Sparse Views

Authors: Kangjie Chen, BingQuan Dai, Minghan Qin, Dongbin Zhang, Peihao Li, Yingshuang Zou, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08331
Pdf URL: https://arxiv.org/pdf/2412.08331
Copy Paste: [[2412.08331]] SLGaussian: Fast Language Gaussian Splatting in Sparse Views(https://arxiv.org/abs/2412.08331)
Keywords: robust, segmentation
Abstract: 3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions. In experiments on two-view sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.

Title: SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs

Authors: Sultan Alrashed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08347
Pdf URL: https://arxiv.org/pdf/2412.08347
Copy Paste: [[2412.08347]] SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs(https://arxiv.org/abs/2412.08347)
Keywords: large language model
Abstract: We present SmolTulu-1.7b-Instruct, referenced in this report as SmolTulu-DPO-1130, an instruction-tuned language model that adapts AllenAI's Tulu 3 post-training pipeline to enhance Huggingface's SmolLM2-1.7B base model. Through comprehensive empirical analysis using a 135M parameter model, we demonstrate that the relationship between learning rate and batch size significantly impacts model performance in a task-dependent manner. Our findings reveal a clear split: reasoning tasks like ARC and GSM8K benefit from higher learning rate to batch size ratios, while pattern recognition tasks such as HellaSwag and IFEval show optimal performance with lower ratios. These insights informed the development of SmolTulu, which achieves state-of-the-art performance among sub-2B parameter models on instruction following, scoring 67.7% on IFEval ($\Delta$11%), and mathematical reasoning with 51.6% on GSM8K ($\Delta$3.4%), with an alternate version achieving scoring 57.1% on ARC ($\Delta5.4%$). We release our model, training recipes, and ablation studies to facilitate further research in efficient model alignment, demonstrating that careful adaptation of optimization dynamics can help bridge the capability gap between small and large language models.

Title: Video Summarization using Denoising Diffusion Probabilistic Model

Authors: Zirui Shang, Yubo Zhu, Hongxi Li, Shuo yang, Xinxiao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08357
Pdf URL: https://arxiv.org/pdf/2412.08357
Copy Paste: [[2412.08357]] Video Summarization using Denoising Diffusion Probabilistic Model(https://arxiv.org/abs/2412.08357)
Keywords: diffusion, generative
Abstract: Video summarization aims to eliminate visual redundancy while retaining key parts of video to construct concise and comprehensive synopses. Most existing methods use discriminative models to predict the importance scores of video frames. However, these methods are susceptible to annotation inconsistency caused by the inherent subjectivity of different annotators when annotating the same video. In this paper, we introduce a generative framework for video summarization that learns how to generate summaries from a probability distribution perspective, effectively reducing the interference of subjective annotation noise. Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of training data through noise prediction, and generates summaries by iterative denoising. Our method is more resistant to subjective annotation noise, and is less prone to overfitting the training data than discriminative methods, with strong generalization ability. Moreover, to facilitate training DDPM with limited data, we employ an unsupervised video summarization model to implement the earlier denoising process. Extensive experiments on various datasets (TVSum, SumMe, and FPVSum) demonstrate the effectiveness of our method.

Title: Backdoor attacks on DNN and GBDT -- A Case Study from the insurance domain

Authors: Robin Kühlem (1), Daniel Otten (1), Daniel Ludwig (1), Anselm Hudde (1 and 3), Alexander Rosenbaum (2), Andreas Mauthe (2) ((1) Debeka, Koblenz, Germany, (2) Computer Science, University of Koblenz, Koblenz, Germany, (3) Department of Maths and Technology, Koblenz University of Applied Sciences, Remagen, Germany)
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08366
Pdf URL: https://arxiv.org/pdf/2412.08366
Copy Paste: [[2412.08366]] Backdoor attacks on DNN and GBDT -- A Case Study from the insurance domain(https://arxiv.org/abs/2412.08366)
Keywords: attack, robust
Abstract: Machine learning (ML) will likely play a large role in many processes in the future, also for insurance companies. However, ML models are at risk of being attacked and manipulated. In this work, the robustness of Gradient Boosted Decision Tree (GBDT) models and Deep Neural Networks (DNN) within an insurance context will be evaluated. Therefore, two GBDT models and two DNNs are trained on two different tabular datasets from an insurance context. Past research in this domain mainly used homogenous data and there are comparably few insights regarding heterogenous tabular data. The ML tasks performed on the datasets are claim prediction (regression) and fraud detection (binary classification). For the backdoor attacks different samples containing a specific pattern were crafted and added to the training data. It is shown, that this type of attack can be highly successful, even with a few added samples. The backdoor attacks worked well on the models trained on one dataset but poorly on the models trained on the other. In real-world scenarios the attacker will have to face several obstacles but as attacks can work with very few added samples this risk should be evaluated.

Title: HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models

Authors: Shiding Zhu, Wenhui Dong, Jun Song, Yanan Guo, Bo Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08378
Pdf URL: https://arxiv.org/pdf/2412.08378
Copy Paste: [[2412.08378]] HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models(https://arxiv.org/abs/2412.08378)
Keywords: large language model
Abstract: Recently, there has been growing interest in the capability of multimodal large language models (MLLMs) to process high-resolution images. A common approach currently involves dynamically cropping the original high-resolution image into smaller sub-images, which are then fed into a vision encoder that was pre-trained on lower-resolution images. However, this cropping approach often truncates objects and connected areas in the original image, causing semantic breaks. To address this limitation, we introduce HyViLM, designed to process images of any resolution while retaining the overall context during encoding. Specifically, we: (i) Design a new visual encoder called Hybrid Encoder that not only encodes individual sub-images but also interacts with detailed global visual features, significantly improving the model's ability to encode high-resolution images. (ii) Propose an optimal feature fusion strategy for the dynamic cropping approach, effectively leveraging information from different layers of the vision encoder. Compared with the state-of-the-art MLLMs under the same setting, our HyViLM outperforms existing MLLMs in nine out of ten tasks. Specifically, HyViLM achieves a 9.6% improvement in performance on the TextVQA task and a 6.9% enhancement on the DocVQA task.

Title: NyayaAnumana & INLegalLlama: The Largest Indian Legal Judgment Prediction Dataset and Specialized Language Model for Enhanced Decision Analysis

Authors: Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08385
Pdf URL: https://arxiv.org/pdf/2412.08385
Copy Paste: [[2412.08385]] NyayaAnumana & INLegalLlama: The Largest Indian Legal Judgment Prediction Dataset and Specialized Language Model for Enhanced Decision Analysis(https://arxiv.org/abs/2412.08385)
Keywords: explainability, generative, large language model
Abstract: The integration of artificial intelligence (AI) in legal judgment prediction (LJP) has the potential to transform the legal landscape, particularly in jurisdictions like India, where a significant backlog of cases burdens the legal system. This paper introduces NyayaAnumana, the largest and most diverse corpus of Indian legal cases compiled for LJP, encompassing a total of 7,02,945 preprocessed cases. NyayaAnumana, which combines the words "Nyay" (judgment) and "Anuman" (prediction or inference) respectively for most major Indian languages, includes a wide range of cases from the Supreme Court, High Courts, Tribunal Courts, District Courts, and Daily Orders and, thus, provides unparalleled diversity and coverage. Our dataset surpasses existing datasets like PredEx and ILDC, offering a comprehensive foundation for advanced AI research in the legal domain. In addition to the dataset, we present INLegalLlama, a domain-specific generative large language model (LLM) tailored to the intricacies of the Indian legal system. It is developed through a two-phase training approach over a base LLaMa model. First, Indian legal documents are injected using continual pretraining. Second, task-specific supervised finetuning is done. This method allows the model to achieve a deeper understanding of legal contexts. Our experiments demonstrate that incorporating diverse court data significantly boosts model accuracy, achieving approximately 90% F1-score in prediction tasks. INLegalLlama not only improves prediction accuracy but also offers comprehensible explanations, addressing the need for explainability in AI-assisted legal decisions.

Title: SweetieChat: A Strategy-Enhanced Role-playing Framework for Diverse Scenarios Handling Emotional Support Agent

Authors: Jing Ye, Lu Xiang, Yaping Zhang, Chengqing Zong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08389
Pdf URL: https://arxiv.org/pdf/2412.08389
Copy Paste: [[2412.08389]] SweetieChat: A Strategy-Enhanced Role-playing Framework for Diverse Scenarios Handling Emotional Support Agent(https://arxiv.org/abs/2412.08389)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated promising potential in providing empathetic support during interactions. However, their responses often become verbose or overly formulaic, failing to adequately address the diverse emotional support needs of real-world scenarios. To tackle this challenge, we propose an innovative strategy-enhanced role-playing framework, designed to simulate authentic emotional support conversations. Specifically, our approach unfolds in two steps: (1) Strategy-Enhanced Role-Playing Interactions, which involve three pivotal roles -- Seeker, Strategy Counselor, and Supporter -- engaging in diverse scenarios to emulate real-world interactions and promote a broader range of dialogues; and (2) Emotional Support Agent Training, achieved through fine-tuning LLMs using our specially constructed dataset. Within this framework, we develop the \textbf{ServeForEmo} dataset, comprising an extensive collection of 3.7K+ multi-turn dialogues and 62.8K+ utterances. We further present \textbf{SweetieChat}, an emotional support agent capable of handling diverse open-domain scenarios. Extensive experiments and human evaluations confirm the framework's effectiveness in enhancing emotional support, highlighting its unique ability to provide more nuanced and tailored assistance.

Title: Learning to Reason via Self-Iterative Process Feedback for Small Language Models

Authors: Kaiyuan Chen, Jin Wang, Xuejie Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08393
Pdf URL: https://arxiv.org/pdf/2412.08393
Copy Paste: [[2412.08393]] Learning to Reason via Self-Iterative Process Feedback for Small Language Models(https://arxiv.org/abs/2412.08393)
Keywords: large language model
Abstract: Small language models (SLMs) are more efficient, cost-effective, and customizable than large language models (LLMs), though they often underperform in specific areas like reasoning. Past methods for enhancing SLMs' reasoning, such as supervised fine-tuning and distillation, often depend on costly external signals, resulting in SLMs being overly confident with limited supervision signals, thus limiting their abilities. Therefore, this study enables SLMs to learn to reason from self-iterative feedback. By combining odds ratio preference optimization (ORPO), we fine-tune and align SLMs using positive and negative signals generated by themselves. Additionally, we introduce process supervision for rewards in preference alignment by sampling-based inference simulation and process reward models. Compared to Supervised Fine-Tuning (SFT), our method improves the performance of Gemma-2B by 12.43 (Acc) on GSM8K and 3.95 (Pass@1) on MBPP. Furthermore, the proposed method also demonstrated superior out-of-domain generalization capabilities on MMLU_Math and HumanEval.

Title: Adversarial Purification by Consistency-aware Latent Space Optimization on Data Manifolds

Authors: Shuhai Zhang, Jiahao Yang, Hui Luo, Jie Chen, Li Wang, Feng Liu, Bo Han, Mingkui Tan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08394
Pdf URL: https://arxiv.org/pdf/2412.08394
Copy Paste: [[2412.08394]] Adversarial Purification by Consistency-aware Latent Space Optimization on Data Manifolds(https://arxiv.org/abs/2412.08394)
Keywords: attack, robust, generative
Abstract: Deep neural networks (DNNs) are vulnerable to adversarial samples crafted by adding imperceptible perturbations to clean data, potentially leading to incorrect and dangerous predictions. Adversarial purification has been an effective means to improve DNNs robustness by removing these perturbations before feeding the data into the model. However, it faces significant challenges in preserving key structural and semantic information of data, as the imperceptible nature of adversarial perturbations makes it hard to avoid over-correcting, which can destroy important information and degrade model performance. In this paper, we break away from traditional adversarial purification methods by focusing on the clean data manifold. To this end, we reveal that samples generated by a well-trained generative model are close to clean ones but far from adversarial ones. Leveraging this insight, we propose Consistency Model-based Adversarial Purification (CMAP), which optimizes vectors within the latent space of a pre-trained consistency model to generate samples for restoring clean data. Specifically, 1) we propose a \textit{Perceptual consistency restoration} mechanism by minimizing the discrepancy between generated samples and input samples in both pixel and perceptual spaces. 2) To maintain the optimized latent vectors within the valid data manifold, we introduce a \textit{Latent distribution consistency constraint} strategy to align generated samples with the clean data distribution. 3) We also apply a \textit{Latent vector consistency prediction} scheme via an ensemble approach to enhance prediction reliability. CMAP fundamentally addresses adversarial perturbations at their source, providing a robust purification. Extensive experiments on CIFAR-10 and ImageNet-100 show that our CMAP significantly enhances robustness against strong adversarial attacks while preserving high natural accuracy.

Title: Pysical Informed Driving World Model

Authors: Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, Wei Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08410
Pdf URL: https://arxiv.org/pdf/2412.08410
Copy Paste: [[2412.08410]] Pysical Informed Driving World Model(https://arxiv.org/abs/2412.08410)
Keywords: robust, extraction, segmentation
Abstract: Autonomous driving requires robust perception models trained on high-quality, large-scale multi-view driving videos for tasks like 3D object detection, segmentation and trajectory prediction. While world models provide a cost-effective solution for generating realistic driving videos, challenges remain in ensuring these videos adhere to fundamental physical principles, such as relative and absolute motion, spatial relationship like occlusion and spatial consistency, and temporal consistency. To address these, we propose DrivePhysica, an innovative model designed to generate realistic multi-view driving videos that accurately adhere to essential physical principles through three key advancements: (1) a Coordinate System Aligner module that integrates relative and absolute motion features to enhance motion interpretation, (2) an Instance Flow Guidance module that ensures precise temporal consistency via efficient 3D flow extraction, and (3) a Box Coordinate Guidance module that improves spatial relationship understanding and accurately resolves occlusion hierarchies. Grounded in physical principles, we achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the Nuscenes dataset) and downstream perception tasks. Our project homepage: this https URL

Title: Pragmatist: Multiview Conditional Diffusion Models for High-Fidelity 3D Reconstruction from Unposed Sparse Views

Authors: Songchun Zhang, Chunhui Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08412
Pdf URL: https://arxiv.org/pdf/2412.08412
Copy Paste: [[2412.08412]] Pragmatist: Multiview Conditional Diffusion Models for High-Fidelity 3D Reconstruction from Unposed Sparse Views(https://arxiv.org/abs/2412.08412)
Keywords: diffusion, generative
Abstract: Inferring 3D structures from sparse, unposed observations is challenging due to its unconstrained nature. Recent methods propose to predict implicit representations directly from unposed inputs in a data-driven manner, achieving promising results. However, these methods do not utilize geometric priors and cannot hallucinate the appearance of unseen regions, thus making it challenging to reconstruct fine geometric and textural details. To tackle this challenge, our key idea is to reformulate this ill-posed problem as conditional novel view synthesis, aiming to generate complete observations from limited input views to facilitate reconstruction. With complete observations, the poses of the input views can be easily recovered and further used to optimize the reconstructed object. To this end, we propose a novel pipeline Pragmatist. First, we generate a complete observation of the object via a multiview conditional diffusion model. Then, we use a feed-forward large reconstruction model to obtain the reconstructed mesh. To further improve the reconstruction quality, we recover the poses of input views by inverting the obtained 3D representations and further optimize the texture using detailed input views. Unlike previous approaches, our pipeline improves reconstruction by efficiently leveraging unposed inputs and generative priors, circumventing the direct resolution of highly ill-posed problems. Extensive experiments show that our approach achieves promising performance in several benchmarks.

Title: Detecting Conversational Mental Manipulation with Intent-Aware Prompting

Authors: Jiayuan Ma, Hongbin Na, Zimu Wang, Yining Hua, Yue Liu, Wei Wang, Ling Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08414
Pdf URL: https://arxiv.org/pdf/2412.08414
Copy Paste: [[2412.08414]] Detecting Conversational Mental Manipulation with Intent-Aware Prompting(https://arxiv.org/abs/2412.08414)
Keywords: large language model
Abstract: Mental manipulation severely undermines mental wellness by covertly and negatively distorting decision-making. While there is an increasing interest in mental health care within the natural language processing community, progress in tackling manipulation remains limited due to the complexity of detecting subtle, covert tactics in conversations. In this paper, we propose Intent-Aware Prompting (IAP), a novel approach for detecting mental manipulations using large language models (LLMs), providing a deeper understanding of manipulative tactics by capturing the underlying intents of participants. Experimental results on the MentalManip dataset demonstrate superior effectiveness of IAP against other advanced prompting strategies. Notably, our approach substantially reduces false negatives, helping detect more instances of mental manipulation with minimal misjudgment of positive cases. The code of this paper is available at this https URL.

Title: Robustness of Graph Classification: failure modes, causes, and noise-resistant loss in Graph Neural Networks

Authors: Farooq Ahmad Wani, Maria Sofia Bucarelli, Andrea Giuseppe Di Francesco, Oleksandr Pryymak, Fabrizio Silvestri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08419
Pdf URL: https://arxiv.org/pdf/2412.08419
Copy Paste: [[2412.08419]] Robustness of Graph Classification: failure modes, causes, and noise-resistant loss in Graph Neural Networks(https://arxiv.org/abs/2412.08419)
Keywords: robust
Abstract: Graph Neural Networks (GNNs) are powerful at solving graph classification tasks, yet applied problems often contain noisy labels. In this work, we study GNN robustness to label noise, demonstrate GNN failure modes when models struggle to generalise on low-order graphs, low label coverage, or when a model is over-parameterized. We establish both empirical and theoretical links between GNN robustness and the reduction of the total Dirichlet Energy of learned node representations, which encapsulates the hypothesized GNN smoothness inductive bias. Finally, we introduce two training strategies to enhance GNN robustness: (1) by incorporating a novel inductive bias in the weight matrices through the removal of negative eigenvalues, connected to Dirichlet Energy minimization; (2) by extending to GNNs a loss penalty that promotes learned smoothness. Importantly, neither approach negatively impacts performance in noise-free settings, supporting our hypothesis that the source of GNNs robustness is their smoothness inductive bias.

Title: PointCFormer: a Relation-based Progressive Feature Extraction Network for Point Cloud Completion

Authors: Yi Zhong, Weize Quan, Dong-ming Yan, Jie Jiang, Yingmei Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08421
Pdf URL: https://arxiv.org/pdf/2412.08421
Copy Paste: [[2412.08421]] PointCFormer: a Relation-based Progressive Feature Extraction Network for Point Cloud Completion(https://arxiv.org/abs/2412.08421)
Keywords: robust, extraction, transformer, segmentation
Abstract: Point cloud completion aims to reconstruct the complete 3D shape from incomplete point clouds, and it is crucial for tasks such as 3D object detection and segmentation. Despite the continuous advances in point cloud analysis techniques, feature extraction methods are still confronted with apparent limitations. The sparse sampling of point clouds, used as inputs in most methods, often results in a certain loss of global structure information. Meanwhile, traditional local feature extraction methods usually struggle to capture the intricate geometric details. To overcome these drawbacks, we introduce PointCFormer, a transformer framework optimized for robust global retention and precise local detail capture in point cloud completion. This framework embraces several key advantages. First, we propose a relation-based local feature extraction method to perceive local delicate geometry characteristics. This approach establishes a fine-grained relationship metric between the target point and its k-nearest neighbors, quantifying each neighboring point's contribution to the target point's local features. Secondly, we introduce a progressive feature extractor that integrates our local feature perception method with self-attention. Starting with a denser sampling of points as input, it iteratively queries long-distance global dependencies and local neighborhood relationships. This extractor maintains enhanced global structure and refined local details, without generating substantial computational overhead. Additionally, we develop a correction module after generating point proxies in the latent space to reintroduce denser information from the input points, enhancing the representation capability of the point proxies. PointCFormer demonstrates state-of-the-art performance on several widely used benchmarks.

Title: Assessing Personalized AI Mentoring with Large Language Models in the Computing Field

Authors: Xiao Luo, Sean O'Connell, Shamima Mithun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08430
Pdf URL: https://arxiv.org/pdf/2412.08430
Copy Paste: [[2412.08430]] Assessing Personalized AI Mentoring with Large Language Models in the Computing Field(https://arxiv.org/abs/2412.08430)
Keywords: large language model
Abstract: This paper provides an in-depth evaluation of three state-of-the-art Large Language Models (LLMs) for personalized career mentoring in the computing field, using three distinct student profiles that consider gender, race, and professional levels. We evaluated the performance of GPT-4, LLaMA 3, and Palm 2 using a zero-shot learning approach without human intervention. A quantitative evaluation was conducted through a custom natural language processing analytics pipeline to highlight the uniqueness of the responses and to identify words reflecting each student's profile, including race, gender, or professional level. The analysis of frequently used words in the responses indicates that GPT-4 offers more personalized mentoring compared to the other two LLMs. Additionally, a qualitative evaluation was performed to see if human experts reached similar conclusions. The analysis of survey responses shows that GPT-4 outperformed the other two LLMs in delivering more accurate and useful mentoring while addressing specific challenges with encouragement languages. Our work establishes a foundation for developing personalized mentoring tools based on LLMs, incorporating human mentors in the process to deliver a more impactful and tailored mentoring experience.

Title: Dynamic Disentangled Fusion Network for RGBT Tracking

Authors: Chenglong Li, Tao Wang, Zhaodong Ding, Yun Xiao, Jin Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08441
Pdf URL: https://arxiv.org/pdf/2412.08441
Copy Paste: [[2412.08441]] Dynamic Disentangled Fusion Network for RGBT Tracking(https://arxiv.org/abs/2412.08441)
Keywords: robust
Abstract: RGBT tracking usually suffers from various challenging factors of low resolution, similar appearance, extreme illumination, thermal crossover and occlusion, to name a few. Existing works often study complex fusion models to handle challenging scenarios, but can not well adapt to various challenges, which might limit tracking performance. To handle this problem, we propose a novel Dynamic Disentangled Fusion Network called DDFNet, which disentangles the fusion process into several dynamic fusion models via the challenge attributes to adapt to various challenging scenarios, for robust RGBT tracking. In particular, we design six attribute-based fusion models to integrate RGB and thermal features under the six challenging scenarios this http URL each fusion model is to deal with the corresponding challenges, such disentangled fusion scheme could increase the fusion capacity without the dependence on large-scale training data. Considering that every challenging scenario also has different levels of difficulty, we propose to optimize the combination of multiple fusion units to form each attribute-based fusion model in a dynamic manner, which could well adapt to the difficulty of the corresponding challenging scenario. To address the issue that which fusion models should be activated in the tracking process, we design an adaptive aggregation fusion module to integrate all features from attribute-based fusion models in an adaptive manner with a three-stage training algorithm. In addition, we design an enhancement fusion module to further strengthen the aggregated feature and modality-specific features. Experimental results on benchmark datasets demonstrate the effectiveness of our DDFNet against other state-of-the-art methods.

Title: From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Authors: Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, Alexander Toshev
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08442
Pdf URL: https://arxiv.org/pdf/2412.08442
Copy Paste: [[2412.08442]] From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons(https://arxiv.org/abs/2412.08442)
Keywords: large language model
Abstract: We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.

Title: Federated Learning for Traffic Flow Prediction with Synthetic Data Augmentation

Authors: Fermin Orozco, Pedro Porto Buarque de Gusmão, Hongkai Wen, Johan Wahlström, Man Luo
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2412.08460
Pdf URL: https://arxiv.org/pdf/2412.08460
Copy Paste: [[2412.08460]] Federated Learning for Traffic Flow Prediction with Synthetic Data Augmentation(https://arxiv.org/abs/2412.08460)
Keywords: privacy, federate, diffusion
Abstract: Deep-learning based traffic prediction models require vast amounts of data to learn embedded spatial and temporal dependencies. The inherent privacy and commercial sensitivity of such data has encouraged a shift towards decentralised data-driven methods, such as Federated Learning (FL). Under a traditional Machine Learning paradigm, traffic flow prediction models can capture spatial and temporal relationships within centralised data. In reality, traffic data is likely distributed across separate data silos owned by multiple stakeholders. In this work, a cross-silo FL setting is motivated to facilitate stakeholder collaboration for optimal traffic flow prediction applications. This work introduces an FL framework, referred to as FedTPS, to generate synthetic data to augment each client's local dataset by training a diffusion-based trajectory generation model through FL. The proposed framework is evaluated on a large-scale real world ride-sharing dataset using various FL methods and Traffic Flow Prediction models, including a novel prediction model we introduce, which leverages Temporal and Graph Attention mechanisms to learn the Spatio-Temporal dependencies embedded within regional traffic flow data. Experimental results show that FedTPS outperforms multiple other FL baselines with respect to global model performance.

Title: CC-Diff: Enhancing Contextual Coherence in Remote Sensing Image Synthesis

Authors: Mu Zhang, Yunfan Liu, Yue Liu, Hongtian Yu, Qixiang Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08464
Pdf URL: https://arxiv.org/pdf/2412.08464
Copy Paste: [[2412.08464]] CC-Diff: Enhancing Contextual Coherence in Remote Sensing Image Synthesis(https://arxiv.org/abs/2412.08464)
Keywords: diffusion
Abstract: Accurately depicting real-world landscapes in remote sensing (RS) images requires precise alignment between objects and their environment. However, most existing synthesis methods for natural images prioritize foreground control, often reducing the background to plain textures. This neglects the interaction between foreground and background, which can lead to incoherence in RS scenarios. In this paper, we introduce CC-Diff, a Diffusion Model-based approach for RS image generation with enhanced Context Coherence. To capture spatial interdependence, we propose a sequential pipeline where background generation is conditioned on synthesized foreground instances. Distinct learnable queries are also employed to model both the complex background texture and its semantic relation to the foreground. Extensive experiments demonstrate that CC-Diff outperforms state-of-the-art methods in visual fidelity, semantic accuracy, and positional precision, excelling in both RS and natural image domains. CC-Diff also shows strong trainability, improving detection accuracy by 2.04 mAP on DOTA and 2.25 mAP on the COCO benchmark.

Title: Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

Authors: Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.08467
Pdf URL: https://arxiv.org/pdf/2412.08467
Copy Paste: [[2412.08467]] Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel(https://arxiv.org/abs/2412.08467)
Keywords: robust
Abstract: Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation. Specifically, SRDF starts with using a base generator to create an initial data pool for training a base navigator, followed by applying the trained navigator to filter the data pool. This leads to higher-fidelity data to train a better generator, which can, in turn, produce higher-quality data for training the next-round navigator. Such a flywheel establishes a data self-refining process, yielding a continuously improved and highly effective dataset for large-scale language-guided navigation learning. Our experiments demonstrate that after several flywheel rounds, the navigator elevates the performance boundary from 70% to 78% SPL on the classic R2R test set, surpassing human performance (76%) for the first time. Meanwhile, this process results in a superior generator, evidenced by a SPICE increase from 23.5 to 26.2, better than all previous VLN instruction generation methods. Finally, we demonstrate the scalability of our method through increasing environment and instruction diversity, and the generalization ability of our pre-trained navigator across various downstream navigation tasks, surpassing state-of-the-art methods by a large margin in all cases.

Title: CAT: Class Aware Adaptive Thresholding for Semi-Supervised Domain Generalization

Authors: Sumaiya Zoha, Jeong-Gun Lee, Young-Woong Ko
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08479
Pdf URL: https://arxiv.org/pdf/2412.08479
Copy Paste: [[2412.08479]] CAT: Class Aware Adaptive Thresholding for Semi-Supervised Domain Generalization(https://arxiv.org/abs/2412.08479)
Keywords: robust
Abstract: Domain Generalization (DG) seeks to transfer knowledge from multiple source domains to unseen target domains, even in the presence of domain shifts. Achieving effective generalization typically requires a large and diverse set of labeled source data to learn robust representations that can generalize to new, unseen domains. However, obtaining such high-quality labeled data is often costly and labor-intensive, limiting the practical applicability of DG. To address this, we investigate a more practical and challenging problem: semi-supervised domain generalization (SSDG) under a label-efficient paradigm. In this paper, we propose a novel method, CAT, which leverages semi-supervised learning with limited labeled data to achieve competitive generalization performance under domain shifts. Our method addresses key limitations of previous approaches, such as reliance on fixed thresholds and sensitivity to noisy pseudo-labels. CAT combines adaptive thresholding with noisy label refinement techniques, creating a straightforward yet highly effective solution for SSDG tasks. Specifically, our approach uses flexible thresholding to generate high-quality pseudo-labels with higher class diversity while refining noisy pseudo-labels to improve their reliability. Extensive experiments across multiple benchmark datasets demonstrate the superior performance of our method, highlighting its effectiveness in achieving robust generalization under domain shift.

Title: InvDiff: Invariant Guidance for Bias Mitigation in Diffusion Models

Authors: Min Hou, Yueying Wu, Chang Xu, Yu-Hao Huang, Chenxi Bai, Le Wu, Jiang Bian
Subjects: cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08480
Pdf URL: https://arxiv.org/pdf/2412.08480
Copy Paste: [[2412.08480]] InvDiff: Invariant Guidance for Bias Mitigation in Diffusion Models(https://arxiv.org/abs/2412.08480)
Keywords: diffusion, generative
Abstract: As one of the most successful generative models, diffusion models have demonstrated remarkable efficacy in synthesizing high-quality images. These models learn the underlying high-dimensional data distribution in an unsupervised manner. Despite their success, diffusion models are highly data-driven and prone to inheriting the imbalances and biases present in real-world data. Some studies have attempted to address these issues by designing text prompts for known biases or using bias labels to construct unbiased data. While these methods have shown improved results, real-world scenarios often contain various unknown biases, and obtaining bias labels is particularly challenging. In this paper, we emphasize the necessity of mitigating bias in pre-trained diffusion models without relying on auxiliary bias annotations. To tackle this problem, we propose a framework, InvDiff, which aims to learn invariant semantic information for diffusion guidance. Specifically, we propose identifying underlying biases in the training data and designing a novel debiasing training objective. Then, we employ a lightweight trainable module that automatically preserves invariant semantic information and uses it to guide the diffusion model's sampling process toward unbiased outcomes simultaneously. Notably, we only need to learn a small number of parameters in the lightweight learnable module without altering the pre-trained diffusion model. Furthermore, we provide a theoretical guarantee that the implementation of InvDiff is equivalent to reducing the error upper bound of generalization. Extensive experimental results on three publicly available benchmarks demonstrate that InvDiff effectively reduces biases while maintaining the quality of image generation. Our code is available at this https URL.

Title: SAM-Mamba: Mamba Guided SAM Architecture for Generalized Zero-Shot Polyp Segmentation

Authors: Tapas Kumar Dutta, Snehashis Majhi, Deepak Ranjan Nayak, Debesh Jha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08482
Pdf URL: https://arxiv.org/pdf/2412.08482
Copy Paste: [[2412.08482]] SAM-Mamba: Mamba Guided SAM Architecture for Generalized Zero-Shot Polyp Segmentation(https://arxiv.org/abs/2412.08482)
Keywords: transformer, segmentation
Abstract: Polyp segmentation in colonoscopy is crucial for detecting colorectal cancer. However, it is challenging due to variations in the structure, color, and size of polyps, as well as the lack of clear boundaries with surrounding tissues. Traditional segmentation models based on Convolutional Neural Networks (CNNs) struggle to capture detailed patterns and global context, limiting their performance. Vision Transformer (ViT)-based models address some of these issues but have difficulties in capturing local context and lack strong zero-shot generalization. To this end, we propose the Mamba-guided Segment Anything Model (SAM-Mamba) for efficient polyp segmentation. Our approach introduces a Mamba-Prior module in the encoder to bridge the gap between the general pre-trained representation of SAM and polyp-relevant trivial clues. It injects salient cues of polyp images into the SAM image encoder as a domain prior while capturing global dependencies at various scales, leading to more accurate segmentation results. Extensive experiments on five benchmark datasets show that SAM-Mamba outperforms traditional CNN, ViT, and Adapter-based models in both quantitative and qualitative measures. Additionally, SAM-Mamba demonstrates excellent adaptability to unseen datasets, making it highly suitable for real-time clinical use.

Title: Learning Flow Fields in Attention for Controllable Person Image Generation

Authors: Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08486
Pdf URL: https://arxiv.org/pdf/2412.08486
Copy Paste: [[2412.08486]] Learning Flow Fields in Attention for Controllable Person Image Generation(https://arxiv.org/abs/2412.08486)
Keywords: diffusion
Abstract: Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we thereby propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.

Title: GradStop: Exploring Training Dynamics in Unsupervised Outlier Detection through Gradient Cohesion

Authors: Yuang Zhang, Liping Wang, Yihong Huang, Yuanxing Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08501
Pdf URL: https://arxiv.org/pdf/2412.08501
Copy Paste: [[2412.08501]] GradStop: Exploring Training Dynamics in Unsupervised Outlier Detection through Gradient Cohesion(https://arxiv.org/abs/2412.08501)
Keywords: robust
Abstract: Unsupervised Outlier Detection (UOD) is a critical task in data mining and machine learning, aiming to identify instances that significantly deviate from the majority. Without any label, deep UOD methods struggle with the misalignment between the model's direct optimization goal and the final performance goal of Outlier Detection (OD) task. Through the perspective of training dynamics, this paper proposes an early stopping algorithm to optimize the training of deep UOD models, ensuring they perform optimally in OD rather than overfitting the entire contaminated dataset. Inspired by UOD mechanism and inlier priority phenomenon, where intuitively models fit inliers more quickly than outliers, we propose GradStop, a sampling-based label-free algorithm to estimate model's real-time performance during training. First, a sampling method generates two sets: one likely containing more outliers and the other more inliers, then a metric based on gradient cohesion is applied to probe into current training dynamics, which reflects model's performance on OD task. Experimental results on 4 deep UOD algorithms and 47 real-world datasets and theoretical proofs demonstrate the effectiveness of our proposed early stopping algorithm in enhancing the performance of deep UOD models. Auto Encoder (AE) enhanced by GradStop achieves better performance than itself, other SOTA UOD methods, and even ensemble AEs. Our method provides a robust and effective solution to the problem of performance degradation during training, enabling deep UOD models to achieve better potential in anomaly detection tasks.

Title: Orchestrating the Symphony of Prompt Distribution Learning for Human-Object Interaction Detection

Authors: Mingda Jia, Liming Zhao, Ge Li, Yun Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08506
Pdf URL: https://arxiv.org/pdf/2412.08506
Copy Paste: [[2412.08506]] Orchestrating the Symphony of Prompt Distribution Learning for Human-Object Interaction Detection(https://arxiv.org/abs/2412.08506)
Keywords: transformer
Abstract: Human-object interaction (HOI) detectors with popular query-transformer architecture have achieved promising performance. However, accurately identifying uncommon visual patterns and distinguishing between ambiguous HOIs continue to be difficult for them. We observe that these difficulties may arise from the limited capacity of traditional detector queries in representing diverse intra-category patterns and inter-category dependencies. To address this, we introduce the Interaction Prompt Distribution Learning (InterProDa) approach. InterProDa learns multiple sets of soft prompts and estimates category distributions from various prompts. It then incorporates HOI queries with category distributions, making them capable of representing near-infinite intra-category dynamics and universal cross-category relationships. Our InterProDa detector demonstrates competitive performance on HICO-DET and vcoco benchmarks. Additionally, our method can be integrated into most transformer-based HOI detectors, significantly enhancing their performance with minimal additional parameters.

Title: Comparative Opinion Mining in Product Reviews: Multi-perspective Prompt-based Learning

Authors: Hai-Yen Thi Nguyen, Cam-Van Thi Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08508
Pdf URL: https://arxiv.org/pdf/2412.08508
Copy Paste: [[2412.08508]] Comparative Opinion Mining in Product Reviews: Multi-perspective Prompt-based Learning(https://arxiv.org/abs/2412.08508)
Keywords: extraction, generative
Abstract: Comparative reviews are pivotal in understanding consumer preferences and influencing purchasing decisions. Comparative Quintuple Extraction (COQE) aims to identify five key components in text: the target entity, compared entities, compared aspects, opinions on these aspects, and polarity. Extracting precise comparative information from product reviews is challenging due to nuanced language and sequential task errors in traditional methods. To mitigate these problems, we propose MTP-COQE, an end-to-end model designed for COQE. Leveraging multi-perspective prompt-based learning, MTP-COQE effectively guides the generative model in comparative opinion mining tasks. Evaluation on the Camera-COQE (English) and VCOM (Vietnamese) datasets demonstrates MTP-COQE's efficacy in automating COQE, achieving superior performance with a 1.41% higher F1 score than the previous baseline models on the English dataset. Additionally, we designed a strategy to limit the generative model's creativity to ensure the output meets expectations. We also performed data augmentation to address data imbalance and to prevent the model from becoming biased towards dominant samples.

Title: REPEAT: Improving Uncertainty Estimation in Representation Learning Explainability

Authors: Kristoffer K. Wickstrøm, Thea Brüsch, Michael C. Kampffmeyer, Robert Jenssen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08513
Pdf URL: https://arxiv.org/pdf/2412.08513
Copy Paste: [[2412.08513]] REPEAT: Improving Uncertainty Estimation in Representation Learning Explainability(https://arxiv.org/abs/2412.08513)
Keywords: explainability
Abstract: Incorporating uncertainty is crucial to provide trustworthy explanations of deep learning models. Recent works have demonstrated how uncertainty modeling can be particularly important in the unsupervised field of representation learning explainable artificial intelligence (R-XAI). Current R-XAI methods provide uncertainty by measuring variability in the importance score. However, they fail to provide meaningful estimates of whether a pixel is certainly important or not. In this work, we propose a new R-XAI method called REPEAT that addresses the key question of whether or not a pixel is \textit{certainly} important. REPEAT leverages the stochasticity of current R-XAI methods to produce multiple estimates of importance, thus considering each pixel in an image as a Bernoulli random variable that is either important or unimportant. From these Bernoulli random variables we can directly estimate the importance of a pixel and its associated certainty, thus enabling users to determine certainty in pixel importance. Our extensive evaluation shows that REPEAT gives certainty estimates that are more intuitive, better at detecting out-of-distribution data, and more concise.

Title: Enhancing Interpretability Through Loss-Defined Classification Objective in Structured Latent Spaces

Authors: Daniel Geissler, Bo Zhou, Mengxi Liu, Paul Lukowicz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08515
Pdf URL: https://arxiv.org/pdf/2412.08515
Copy Paste: [[2412.08515]] Enhancing Interpretability Through Loss-Defined Classification Objective in Structured Latent Spaces(https://arxiv.org/abs/2412.08515)
Keywords: interpretability
Abstract: Supervised machine learning often operates on the data-driven paradigm, wherein internal model parameters are autonomously optimized to converge predicted outputs with the ground truth, devoid of explicitly programming rules or a priori assumptions. Although data-driven methods have yielded notable successes across various benchmark datasets, they inherently treat models as opaque entities, thereby limiting their interpretability and yielding a lack of explanatory insights into their decision-making processes. In this work, we introduce Latent Boost, a novel approach that integrates advanced distance metric learning into supervised classification tasks, enhancing both interpretability and training efficiency. Thus during training, the model is not only optimized for classification metrics of the discrete data points but also adheres to the rule that the collective representation zones of each class should be sharply clustered. By leveraging the rich structural insights of intermediate model layer latent representations, Latent Boost improves classification interpretability, as demonstrated by higher Silhouette scores, while accelerating training convergence. These performance and latent structural benefits are achieved with minimum additional cost, making it broadly applicable across various datasets without requiring data-specific adjustments. Furthermore, Latent Boost introduces a new paradigm for aligning classification performance with improved model transparency to address the challenges of black-box models.

Title: Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation

Authors: Pengyue Jia, Derong Xu, Xiaopeng Li, Zhaocheng Du, Xiangyang Li, Xiangyu Zhao, Yichao Wang, Yuhao Wang, Huifeng Guo, Ruiming Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08519
Pdf URL: https://arxiv.org/pdf/2412.08519
Copy Paste: [[2412.08519]] Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation(https://arxiv.org/abs/2412.08519)
Keywords: extraction, large language model
Abstract: The reranker and generator are two critical components in the Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking relevant documents and generating responses. However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query. To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn. Specifically, We first propose a rationale extraction method that leverages the reasoning capabilities of Large Language Models (LLMs) to extract the rationales necessary for answering the query. Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales, and fine-tune the reranker to align the preferences. We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. Our code is released online to ease reproduction.

Title: GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek

Authors: Lefteris Loukas, Nikolaos Smyrnioudis, Chrysa Dikonomaki, Spyros Barbakos, Anastasios Toumazatos, John Koutsikakis, Manolis Kyriakakis, Mary Georgiou, Stavros Vassos, John Pavlopoulos, Ion Androutsopoulos
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2412.08520
Pdf URL: https://arxiv.org/pdf/2412.08520
Copy Paste: [[2412.08520]] GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek(https://arxiv.org/abs/2412.08520)
Keywords: transformer
Abstract: We present GR-NLP-TOOLKIT, an open-source natural language processing (NLP) toolkit developed specifically for modern Greek. The toolkit provides state-of-the-art performance in five core NLP tasks, namely part-of-speech tagging, morphological tagging, dependency parsing, named entity recognition, and Greeklishto-Greek transliteration. The toolkit is based on pre-trained Transformers, it is freely available, and can be easily installed in Python (pip install gr-nlp-toolkit). It is also accessible through a demonstration platform on HuggingFace, along with a publicly available API for non-commercial use. We discuss the functionality provided for each task, the underlying methods, experiments against comparable open-source toolkits, and future possible enhancements. The toolkit is available at: this https URL

Title: EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance

Authors: Yingxin Li, Ye Li, Yuan Meng, Xinzhu Ma, Zihan Geng, Shutao Xia, Zhi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08521
Pdf URL: https://arxiv.org/pdf/2412.08521
Copy Paste: [[2412.08521]] EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance(https://arxiv.org/abs/2412.08521)
Keywords: large language model
Abstract: As large language models (LLMs) continue to advance, the demand for higher quality and faster processing of long contexts across various applications is growing. KV cache is widely adopted as it stores previously generated key and value tokens, effectively reducing redundant computations during inference. However, as memory overhead becomes a significant concern, efficient compression of KV cache has gained increasing attention. Most existing methods perform compression from two perspectives: identifying important tokens and designing compression strategies. However, these approaches often produce biased distributions of important tokens due to the influence of accumulated attention scores or positional encoding. Furthermore, they overlook the sparsity and redundancy across different heads, which leads to difficulties in preserving the most effective information at the head level. To this end, we propose EMS to overcome these limitations, while achieving better KV cache compression under extreme compression ratios. Specifically, we introduce a Global-Local score that combines accumulated attention scores from both global and local KV tokens to better identify the token importance. For the compression strategy, we design an adaptive and unified Evict-then-Merge framework that accounts for the sparsity and redundancy of KV tokens across different heads. Additionally, we implement the head-wise parallel compression through a zero-class mechanism to enhance efficiency. Extensive experiments demonstrate our SOTA performance even under extreme compression ratios. EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget less than 2% of the context length in the Needle-in-a-Haystack task.

Title: TECO: Improving Multimodal Intent Recognition with Text Enhancement through Commonsense Knowledge Extraction

Authors: Quynh-Mai Thi Nguyen, Lan-Nhi Thi Nguyen, Cam-Van Thi Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08529
Pdf URL: https://arxiv.org/pdf/2412.08529
Copy Paste: [[2412.08529]] TECO: Improving Multimodal Intent Recognition with Text Enhancement through Commonsense Knowledge Extraction(https://arxiv.org/abs/2412.08529)
Keywords: robust, extraction
Abstract: The objective of multimodal intent recognition (MIR) is to leverage various modalities-such as text, video, and audio-to detect user intentions, which is crucial for understanding human language and context in dialogue systems. Despite advances in this field, two main challenges persist: (1) effectively extracting and utilizing semantic information from robust textual features; (2) aligning and fusing non-verbal modalities with verbal ones effectively. This paper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO) to address these challenges. We begin by extracting relations from both generated and retrieved knowledge to enrich the contextual information in the text modality. Subsequently, we align and integrate visual and acoustic representations with these enhanced text features to form a cohesive multimodal representation. Our experimental results show substantial improvements over existing baseline methods.

Title: Training Data Reconstruction: Privacy due to Uncertainty?

Authors: Christina Runkel, Kanchana Vaishnavi Gandikota, Jonas Geiping, Carola-Bibiane Schönlieb, Michael Moeller
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.08544
Pdf URL: https://arxiv.org/pdf/2412.08544
Copy Paste: [[2412.08544]] Training Data Reconstruction: Privacy due to Uncertainty?(https://arxiv.org/abs/2412.08544)
Keywords: privacy
Abstract: Being able to reconstruct training data from the parameters of a neural network is a major privacy concern. Previous works have shown that reconstructing training data, under certain circumstances, is possible. In this work, we analyse such reconstructions empirically and propose a new formulation of the reconstruction as a solution to a bilevel optimisation problem. We demonstrate that our formulation as well as previous approaches highly depend on the initialisation of the training images $x$ to reconstruct. In particular, we show that a random initialisation of $x$ can lead to reconstructions that resemble valid training samples while not being part of the actual training dataset. Thus, our experiments on affine and one-hidden layer networks suggest that when reconstructing natural images, yet an adversary cannot identify whether reconstructed images have indeed been part of the set of training samples.

Title: Improving Satellite Imagery Masking using Multi-task and Transfer Learning

Authors: Rangel Daroya, Luisa Vieira Lucchese, Travis Simmons, Punwath Prum, Tamlin Pavelsky, John Gardner, Colin J. Gleason, Subhransu Maji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08545
Pdf URL: https://arxiv.org/pdf/2412.08545
Copy Paste: [[2412.08545]] Improving Satellite Imagery Masking using Multi-task and Transfer Learning(https://arxiv.org/abs/2412.08545)
Keywords: transformer
Abstract: Many remote sensing applications employ masking of pixels in satellite imagery for subsequent measurements. For example, estimating water quality variables, such as Suspended Sediment Concentration (SSC) requires isolating pixels depicting water bodies unaffected by clouds, their shadows, terrain shadows, and snow and ice formation. A significant bottleneck is the reliance on a variety of data products (e.g., satellite imagery, elevation maps), and a lack of precision in individual steps affecting estimation accuracy. We propose to improve both the accuracy and computational efficiency of masking by developing a system that predicts all required masks from Harmonized Landsat and Sentinel (HLS) imagery. Our model employs multi-tasking to share computation and enable higher accuracy across tasks. We experiment with recent advances in deep network architectures and show that masking models can benefit from these, especially when combined with pre-training on large satellite imagery datasets. We present a collection of models offering different speed/accuracy trade-offs for masking. MobileNet variants are the fastest, and perform competitively with larger architectures. Transformer-based architectures are the slowest, but benefit the most from pre-training on large satellite imagery datasets. Our models provide a 9% F1 score improvement compared to previous work on water pixel identification. When integrated with an SSC estimation system, our models result in a 30x speedup while reducing estimation error by 2.64 mg/L, allowing for global-scale analysis. We also evaluate our model on a recently proposed cloud and cloud shadow estimation benchmark, where we outperform the current state-of-the-art model by at least 6% in F1 score.

Title: Watermarking Training Data of Music Generation Models

Authors: Pascal Epple, Igor Shilov, Bozhidar Stevanovski, Yves-Alexandre de Montjoye
Subjects: cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.08549
Pdf URL: https://arxiv.org/pdf/2412.08549
Copy Paste: [[2412.08549]] Watermarking Training Data of Music Generation Models(https://arxiv.org/abs/2412.08549)
Keywords: robust, watermark, generative
Abstract: Generative Artificial Intelligence (Gen-AI) models are increasingly used to produce content across domains, including text, images, and audio. While these models represent a major technical breakthrough, they gain their generative capabilities from being trained on enormous amounts of human-generated content, which often includes copyrighted material. In this work, we investigate whether audio watermarking techniques can be used to detect an unauthorized usage of content to train a music generation model. We compare outputs generated by a model trained on watermarked data to a model trained on non-watermarked data. We study factors that impact the model's generation behaviour: the watermarking technique, the proportion of watermarked samples in the training set, and the robustness of the watermarking technique against the model's tokenizer. Our results show that audio watermarking techniques, including some that are imperceptible to humans, can lead to noticeable shifts in the model's outputs. We also study the robustness of a state-of-the-art watermarking technique to removal techniques.

Title: Grimm: A Plug-and-Play Perturbation Rectifier for Graph Neural Networks Defending against Poisoning Attacks

Authors: Ao Liu, Wenshan Li, Beibei Li, Wengang Ma, Tao Li, Pan Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08555
Pdf URL: https://arxiv.org/pdf/2412.08555
Copy Paste: [[2412.08555]] Grimm: A Plug-and-Play Perturbation Rectifier for Graph Neural Networks Defending against Poisoning Attacks(https://arxiv.org/abs/2412.08555)
Keywords: defense, attack
Abstract: End-to-end training with global optimization have popularized graph neural networks (GNNs) for node classification, yet inadvertently introduced vulnerabilities to adversarial edge-perturbing attacks. Adversaries can exploit the inherent opened interfaces of GNNs' input and output, perturbing critical edges and thus manipulating the classification results. Current defenses, due to their persistent utilization of global-optimization-based end-to-end training schemes, inherently encapsulate the vulnerabilities of GNNs. This is specifically evidenced in their inability to defend against targeted secondary attacks. In this paper, we propose the Graph Agent Network (GAgN) to address the aforementioned vulnerabilities of GNNs. GAgN is a graph-structured agent network in which each node is designed as an 1-hop-view agent. Through the decentralized interactions between agents, they can learn to infer global perceptions to perform tasks including inferring embeddings, degrees and neighbor relationships for given nodes. This empowers nodes to filtering adversarial edges while carrying out classification tasks. Furthermore, agents' limited view prevents malicious messages from propagating globally in GAgN, thereby resisting global-optimization-based secondary attacks. We prove that single-hidden-layer multilayer perceptrons (MLPs) are theoretically sufficient to achieve these functionalities. Experimental results show that GAgN effectively implements all its intended capabilities and, compared to state-of-the-art defenses, achieves optimal classification accuracy on the perturbed datasets.

Title: Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning

Authors: Rongzhe Wei, Mufei Li, Mohsen Ghassemi, Eleonora Kreačić, Yifan Li, Xiang Yue, Bo Li, Vamsi K. Potluru, Pan Li, Eli Chien
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.08559
Pdf URL: https://arxiv.org/pdf/2412.08559
Copy Paste: [[2412.08559]] Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning(https://arxiv.org/abs/2412.08559)
Keywords: privacy, attack, membership infer, large language model
Abstract: Large Language Models are trained on extensive datasets that often contain sensitive, human-generated information, raising significant concerns about privacy breaches. While certified unlearning approaches offer strong privacy guarantees, they rely on restrictive model assumptions that are not applicable to LLMs. As a result, various unlearning heuristics have been proposed, with the associated privacy risks assessed only empirically. The standard evaluation pipelines typically randomly select data for removal from the training set, apply unlearning techniques, and use membership inference attacks to compare the unlearned models against models retrained without the to-be-unlearned data. However, since every data point is subject to the right to be forgotten, unlearning should be considered in the worst-case scenario from the privacy perspective. Prior work shows that data outliers may exhibit higher memorization effects. Intuitively, they are harder to be unlearn and thus the privacy risk of unlearning them is underestimated in the current evaluation. In this paper, we leverage minority data to identify such a critical flaw in previously widely adopted evaluations. We substantiate this claim through carefully designed experiments, including unlearning canaries related to minority groups, inspired by privacy auditing literature. Using personally identifiable information as a representative minority identifier, we demonstrate that minority groups experience at least 20% more privacy leakage in most cases across six unlearning approaches, three MIAs, three benchmark datasets, and two LLMs of different scales. Given that the right to be forgotten should be upheld for every individual, we advocate for a more rigorous evaluation of LLM unlearning methods. Our minority-aware evaluation framework represents an initial step toward ensuring more equitable assessments of LLM unlearning efficacy.

Title: GenPlan: Generative sequence models as adaptive planners

Authors: Akash Karthikeyan, Yash Vardhan Pant
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08565
Pdf URL: https://arxiv.org/pdf/2412.08565
Copy Paste: [[2412.08565]] GenPlan: Generative sequence models as adaptive planners(https://arxiv.org/abs/2412.08565)
Keywords: generative
Abstract: Offline reinforcement learning has shown tremendous success in behavioral planning by learning from previously collected demonstrations. However, decision-making in multitask missions still presents significant challenges. For instance, a mission might require an agent to explore an unknown environment, discover goals, and navigate to them, even if it involves interacting with obstacles along the way. Such behavioral planning problems are difficult to solve due to: a) agents failing to adapt beyond the single task learned through their reward function, and b) the inability to generalize to new environments not covered in the training demonstrations, e.g., environments where all doors were unlocked in the demonstrations. Consequently, state-of-the-art decision making methods are limited to missions where the required tasks are well-represented in the training demonstrations and can be solved within a short (temporal) planning horizon. To address this, we propose GenPlan: a stochastic and adaptive planner that leverages discrete-flow models for generative sequence modeling, enabling sample-efficient exploration and exploitation. This framework relies on an iterative denoising procedure to generate a sequence of goals and actions. This approach captures multi-modal action distributions and facilitates goal and task discovery, thereby enhancing generalization to out-of-distribution tasks and environments, i.e., missions not part of the training data. We demonstrate the effectiveness of our method through multiple simulation environments. Notably, GenPlan outperforms the state-of-the-art methods by over 10% on adaptive planning tasks, where the agent adapts to multi-task missions while leveraging demonstrations on single-goal-reaching tasks.

Title: TryOffAnyone: Tiled Cloth Generation from a Dressed Person

Authors: Ioannis Xarchakos, Theodoros Koukopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08573
Pdf URL: https://arxiv.org/pdf/2412.08573
Copy Paste: [[2412.08573]] TryOffAnyone: Tiled Cloth Generation from a Dressed Person(https://arxiv.org/abs/2412.08573)
Keywords: diffusion, transformer
Abstract: The fashion industry is increasingly leveraging computer vision and deep learning technologies to enhance online shopping experiences and operational efficiencies. In this paper, we address the challenge of generating high-fidelity tiled garment images essential for personalized recommendations, outfit composition, and virtual try-on systems from photos of garments worn by models. Inspired by the success of Latent Diffusion Models (LDMs) in image-to-image translation, we propose a novel approach utilizing a fine-tuned StableDiffusion model. Our method features a streamlined single-stage network design, which integrates garmentspecific masks to isolate and process target clothing items effectively. By simplifying the network architecture through selective training of transformer blocks and removing unnecessary crossattention layers, we significantly reduce computational complexity while achieving state-of-the-art performance on benchmark datasets like VITON-HD. Experimental results demonstrate the effectiveness of our approach in producing high-quality tiled garment images for both full-body and half-body inputs. Code and model are available at: this https URL

Title: Annotation-Efficient Task Guidance for Medical Segment Anything

Authors: Tyler Ward, Abdullah-Al-Zubaer Imran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08575
Pdf URL: https://arxiv.org/pdf/2412.08575
Copy Paste: [[2412.08575]] Annotation-Efficient Task Guidance for Medical Segment Anything(https://arxiv.org/abs/2412.08575)
Keywords: segmentation
Abstract: Medical image segmentation is a key task in the imaging workflow, influencing many image-based decisions. Traditional, fully-supervised segmentation models rely on large amounts of labeled training data, typically obtained through manual annotation, which can be an expensive, time-consuming, and error-prone process. This signals a need for accurate, automatic, and annotation-efficient methods of training these models. We propose SAM-Mix, a novel multitask learning framework for medical image segmentation that uses class activation maps produced by an auxiliary classifier to guide the predictions of the semi-supervised segmentation branch, which is based on the SAM framework. Experimental evaluations on the public LiTS dataset confirm the effectiveness of SAM-Mix for simultaneous classification and segmentation of the liver from abdominal computed tomography (CT) scans. When trained for 90% fewer epochs on only 50 labeled 2D slices, representing just 0.04% of the available labeled training data, SAM-Mix achieves a Dice improvement of 5.1% over the best baseline model. The generalization results for SAM-Mix are even more impressive, with the same model configuration yielding a 25.4% Dice improvement on a cross-domain segmentation task. Our code is available at this https URL.

Title: Machine Learning Information Retrieval and Summarisation to Support Systematic Review on Outcomes Based Contracting

Authors: Iman Munire Bilal, Zheng Fang, Miguel Arana-Catania, Felix-Anselm van Lier, Juliana Outes Velarde, Harry Bregazzi, Eleanor Carter, Mara Airoldi, Rob Procter
Subjects: cs.CL, cs.CY, cs.DL, cs.HC
Abstract URL: https://arxiv.org/abs/2412.08578
Pdf URL: https://arxiv.org/pdf/2412.08578
Copy Paste: [[2412.08578]] Machine Learning Information Retrieval and Summarisation to Support Systematic Review on Outcomes Based Contracting(https://arxiv.org/abs/2412.08578)
Keywords: explainability
Abstract: As academic literature proliferates, traditional review methods are increasingly challenged by the sheer volume and diversity of available research. This article presents a study that aims to address these challenges by enhancing the efficiency and scope of systematic reviews in the social sciences through advanced machine learning (ML) and natural language processing (NLP) tools. In particular, we focus on automating stages within the systematic reviewing process that are time-intensive and repetitive for human annotators and which lend themselves to immediate scalability through tools such as information retrieval and summarisation guided by expert advice. The article concludes with a summary of lessons learnt regarding the integrated approach towards systematic reviews and future directions for improvement, including explainability.

Title: Utilizing Multi-step Loss for Single Image Reflection Removal

Authors: Abdelrahman Elnenaey, Marwan Torki
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.08582
Pdf URL: https://arxiv.org/pdf/2412.08582
Copy Paste: [[2412.08582]] Utilizing Multi-step Loss for Single Image Reflection Removal(https://arxiv.org/abs/2412.08582)
Keywords: segmentation
Abstract: Image reflection removal is crucial for restoring image quality. Distorted images can negatively impact tasks like object detection and image segmentation. In this paper, we present a novel approach for image reflection removal using a single image. Instead of focusing on model architecture, we introduce a new training technique that can be generalized to image-to-image problems, with input and output being similar in nature. This technique is embodied in our multi-step loss mechanism, which has proven effective in the reflection removal task. Additionally, we address the scarcity of reflection removal training data by synthesizing a high-quality, non-linear synthetic dataset called RefGAN using Pix2Pix GAN. This dataset significantly enhances the model's ability to learn better patterns for reflection removal. We also utilize a ranged depth map, extracted from the depth estimation of the ambient image, as an auxiliary feature, leveraging its property of lacking depth estimations for reflections. Our approach demonstrates superior performance on the SIR^2 benchmark and other real-world datasets, proving its effectiveness by outperforming other state-of-the-art models.

Title: TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs

Authors: Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2412.08585
Pdf URL: https://arxiv.org/pdf/2412.08585
Copy Paste: [[2412.08585]] TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs(https://arxiv.org/abs/2412.08585)
Keywords: large language model
Abstract: Large language model (LLM) inference demands significant amount of computation and memory, especially in the key attention mechanism. While techniques, such as quantization and acceleration algorithms, like FlashAttention, have improved efficiency of the overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent Key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for attention operation. We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves 1.2-1.8x speedup in attention, reduces the KV cache size by over 4.4x, and enables up to 2.37x maximum throughput over the FP16 baseline while outperforming state-of-the-art quantization and compression techniques across various datasets and models.

Title: Advancing Single- and Multi-task Text Classification through Large Language Model Fine-tuning

Authors: Hang Zhao, Qile P. Chen, Yijing Barry Zhang, Gang Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08587
Pdf URL: https://arxiv.org/pdf/2412.08587
Copy Paste: [[2412.08587]] Advancing Single- and Multi-task Text Classification through Large Language Model Fine-tuning(https://arxiv.org/abs/2412.08587)
Keywords: large language model
Abstract: Both encoder-only models (e.g., BERT, RoBERTa) and large language models (LLMs, e.g., Llama3) have been widely used for text classification tasks. However, there is a lack of systematic studies comparing the performance of encoder-based models and LLMs in text classification, particularly when fine-tuning is involved. This study employed a diverse range of models and methods, varying in size and architecture, and including both fine-tuned and pre-trained approaches. We first assessed the performances of these LLMs on the 20 Newsgroups (20NG) and MASSIVE datasets, comparing them to encoder-only RoBERTa models. Additionally, we explored the multi-task capabilities of both model types by combining multiple classification tasks, including intent detection and slot-filling, into a single model using data from both datasets. Our results indicate that fully fine-tuned Llama3-70B models outperform RoBERTa-large and other decoder LLMs across various classification tasks and datasets. Moreover, the consolidated multi-task fine-tuned LLMs matched the performance of dual-model setups in both tasks across both datasets. Overall, our study provides a comprehensive benchmark of encoder-only and LLM models on text classification tasks and demonstrates a method to combine two or more fully fine-tuned decoder LLMs for reduced latency and equivalent performance.

Title: ASDnB: Merging Face with Body Cues For Robust Active Speaker Detection

Authors: Tiago Roxo, Joana C. Costa, Pedro Inácio, Hugo Proença
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08594
Pdf URL: https://arxiv.org/pdf/2412.08594
Copy Paste: [[2412.08594]] ASDnB: Merging Face with Body Cues For Robust Active Speaker Detection(https://arxiv.org/abs/2412.08594)
Keywords: robust, extraction
Abstract: State-of-the-art Active Speaker Detection (ASD) approaches mainly use audio and facial features as input. However, the main hypothesis in this paper is that body dynamics is also highly correlated to "speaking" (and "listening") actions and should be particularly useful in wild conditions (e.g., surveillance settings), where face cannot be reliably accessed. We propose ASDnB, a model that singularly integrates face with body information by merging the inputs at different steps of feature extraction. Our approach splits 3D convolution into 2D and 1D to reduce computation cost without loss of performance, and is trained with adaptive weight feature importance for improved complement of face with body data. Our experiments show that ASDnB achieves state-of-the-art results in the benchmark dataset (AVA-ActiveSpeaker), in the challenging data of WASD, and in cross-domain settings using Columbia. This way, ASDnB can perform in multiple settings, which is positively regarded as a strong baseline for robust ASD models (code available at this https URL).

Title: Fair Primal Dual Splitting Method for Image Inverse Problems

Authors: Yunfei Qu, Deren Han
Subjects: cs.CV, math.OC
Abstract URL: https://arxiv.org/abs/2412.08613
Pdf URL: https://arxiv.org/pdf/2412.08613
Copy Paste: [[2412.08613]] Fair Primal Dual Splitting Method for Image Inverse Problems(https://arxiv.org/abs/2412.08613)
Keywords: fair
Abstract: Image inverse problems have numerous applications, including image processing, super-resolution, and computer vision, which are important areas in image science. These application models can be seen as a three-function composite optimization problem solvable by a variety of primal dual-type methods. We propose a fair primal dual algorithmic framework that incorporates the smooth term not only into the primal subproblem but also into the dual subproblem. We unify the global convergence and establish the convergence rates of our proposed fair primal dual method. Experiments on image denoising and super-resolution reconstruction demonstrate the superiority of the proposed method over the current state-of-the-art.

Title: Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

Authors: Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08614
Pdf URL: https://arxiv.org/pdf/2412.08614
Copy Paste: [[2412.08614]] Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning(https://arxiv.org/abs/2412.08614)
Keywords: segmentation
Abstract: Generating detailed captions comprehending text-rich visual content in images has received growing attention for Large Vision-Language Models (LVLMs). However, few studies have developed benchmarks specifically tailored for detailed captions to measure their accuracy and comprehensiveness. In this paper, we introduce a detailed caption benchmark, termed as CompreCap, to evaluate the visual context from a directed scene graph view. Concretely, we first manually segment the image into semantically meaningful regions (i.e., semantic segmentation mask) according to common-object vocabulary, while also distinguishing attributes of objects within all those regions. Then directional relation labels of these objects are annotated to compose a directed scene graph that can well encode rich compositional information of the image. Based on our directed scene graph, we develop a pipeline to assess the generated detailed captions from LVLMs on multiple levels, including the object-level coverage, the accuracy of attribute descriptions, the score of key relationships, etc. Experimental results on the CompreCap dataset confirm that our evaluation method aligns closely with human evaluation scores across LVLMs.

Title: Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models

Authors: Jiahui Li, Yongchang Hao, Haoyu Xu, Xing Wang, Yu Hong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.08615
Pdf URL: https://arxiv.org/pdf/2412.08615
Copy Paste: [[2412.08615]] Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models(https://arxiv.org/abs/2412.08615)
Keywords: security, attack, large language model
Abstract: Despite the advancements in training Large Language Models (LLMs) with alignment techniques to enhance the safety of generated content, these models remain susceptible to jailbreak, an adversarial attack method that exposes security vulnerabilities in LLMs. Notably, the Greedy Coordinate Gradient (GCG) method has demonstrated the ability to automatically generate adversarial suffixes that jailbreak state-of-the-art LLMs. However, the optimization process involved in GCG is highly time-consuming, rendering the jailbreaking pipeline inefficient. In this paper, we investigate the process of GCG and identify an issue of Indirect Effect, the key bottleneck of the GCG optimization. To this end, we propose the Model Attack Gradient Index GCG (MAGIC), that addresses the Indirect Effect by exploiting the gradient information of the suffix tokens, thereby accelerating the procedure by having less computation and fewer iterations. Our experiments on AdvBench show that MAGIC achieves up to a 1.5x speedup, while maintaining Attack Success Rates (ASR) on par or even higher than other baselines. Our MAGIC achieved an ASR of 74% on the Llama-2 and an ASR of 54% when conducting transfer attacks on GPT-3.5. Code is available at this https URL.

Title: Image Retrieval Methods in the Dissimilarity Space

Authors: Madhu Kiran, Kartikey Vishnu, Rafael M. O. Cruz, Eric Granger
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08618
Pdf URL: https://arxiv.org/pdf/2412.08618
Copy Paste: [[2412.08618]] Image Retrieval Methods in the Dissimilarity Space(https://arxiv.org/abs/2412.08618)
Keywords: extraction
Abstract: Image retrieval methods rely on metric learning to train backbone feature extraction models that can extract discriminant queries and reference (gallery) feature representations for similarity matching. Although state-of-the-art accuracy has improved considerably with the advent of deep learning (DL) models trained on large datasets, image retrieval remains challenging in many real-world video analytics and surveillance applications, e.g., person re-identification. Using the Euclidean space for matching limits the performance in real-world applications due to the curse of dimensionality, overfitting, and sensitivity to noisy data. We argue that the feature dissimilarity space is more suitable for similarity matching, and propose a dichotomy transformation to project query and reference embeddings into a single embedding in the dissimilarity space. We also advocate for end-to-end training of a backbone and binary classification models for pair-wise matching. As opposed to comparing the distance between queries and reference embeddings, we show the benefits of classifying the single dissimilarity space embedding (as similar or dissimilar), especially when trained end-to-end. We propose a method to train the max-margin classifier together with the backbone feature extractor by applying constraints to the L2 norm of the classifier weights along with the hinge loss. Our extensive experiments on challenging image retrieval datasets and using diverse feature extraction backbones highlight the benefits of similarity matching in the dissimilarity space. In particular, when jointly training the feature extraction backbone and regularised classifier for matching, the dissimilarity space provides a higher level of accuracy.

Title: Synthetic Vision: Training Vision-Language Models to Understand Physics

Authors: Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, Rahul G. Krishnan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08619
Pdf URL: https://arxiv.org/pdf/2412.08619
Copy Paste: [[2412.08619]] Synthetic Vision: Training Vision-Language Models to Understand Physics(https://arxiv.org/abs/2412.08619)
Keywords: robust, large language model
Abstract: Physical reasoning, which involves the interpretation, understanding, and prediction of object behavior in dynamic environments, remains a significant challenge for current Vision-Language Models (VLMs). In this work, we propose two methods to enhance VLMs' physical reasoning capabilities using simulated data. First, we fine-tune a pre-trained VLM using question-answer (QA) pairs generated from simulations relevant to physical reasoning tasks. Second, we introduce Physics Context Builders (PCBs), specialized VLMs fine-tuned to create scene descriptions enriched with physical properties and processes. During physical reasoning tasks, these PCBs can be leveraged as context to assist a Large Language Model (LLM) to improve its performance. We evaluate both of our approaches using multiple benchmarks, including a new stability detection QA dataset called Falling Tower, which includes both simulated and real-world scenes, and CLEVRER. We demonstrate that a small QA fine-tuned VLM can significantly outperform larger state-of-the-art foundational models. We also show that integrating PCBs boosts the performance of foundational LLMs on physical reasoning tasks. Using the real-world scenes from the Falling Tower dataset, we also validate the robustness of both approaches in Sim2Real transfer. Our results highlight the utility that simulated data can have in the creation of learning systems capable of advanced physical reasoning.

Title: EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation

Authors: Hongwei Niu, Jie Hu, Jianghang Lin, Shengchuan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08628
Pdf URL: https://arxiv.org/pdf/2412.08628
Copy Paste: [[2412.08628]] EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation(https://arxiv.org/abs/2412.08628)
Keywords: extraction, transformer, segmentation
Abstract: Open-vocabulary panoptic segmentation aims to segment and classify everything in diverse scenes across an unbounded vocabulary. Existing methods typically employ two-stage or single-stage framework. The two-stage framework involves cropping the image multiple times using masks generated by a mask generator, followed by feature extraction, while the single-stage framework relies on a heavyweight mask decoder to make up for the lack of spatial position information through self-attention and cross-attention in multiple stacked Transformer blocks. Both methods incur substantial computational overhead, thereby hindering the efficiency of model inference. To fill the gap in efficiency, we propose EOV-Seg, a novel single-stage, shared, efficient, and spatial-aware framework designed for open-vocabulary panoptic segmentation. Specifically, EOV-Seg innovates in two aspects. First, a Vocabulary-Aware Selection (VAS) module is proposed to improve the semantic comprehension of visual aggregated features and alleviate the feature interaction burden on the mask decoder. Second, we introduce a Two-way Dynamic Embedding Experts (TDEE), which efficiently utilizes the spatial awareness capabilities of ViT-based CLIP backbone. To the best of our knowledge, EOV-Seg is the first open-vocabulary panoptic segmentation framework towards efficiency, which runs faster and achieves competitive performance compared with state-of-the-art methods. Specifically, with COCO training only, EOV-Seg achieves 24.2 PQ, 31.6 mIoU, and 12.7 FPS on the ADE20K dataset for panoptic and semantic segmentation tasks and the inference time of EOV-Seg is 4-21 times faster than state-of-the-art methods. Especially, equipped with ResNet-50 backbone, EOV-Seg runs 25 FPS with only 71M parameters on a single RTX 3090 GPU. Code is available at \url{this https URL}.

Title: FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

Authors: Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, Tomer Michaeli
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08629
Pdf URL: https://arxiv.org/pdf/2412.08629
Copy Paste: [[2412.08629]] FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models(https://arxiv.org/abs/2412.08629)
Keywords: diffusion
Abstract: Editing real images using a pre-trained text-to-image (T2I) diffusion/flow model often involves inverting the image into its corresponding noise map. However, inversion by itself is typically insufficient for obtaining satisfactory results, and therefore many methods additionally intervene in the sampling process. Such methods achieve improved results but are not seamlessly transferable between model architectures. Here, we introduce FlowEdit, a text-based editing method for pre-trained T2I flow models, which is inversion-free, optimization-free and model agnostic. Our method constructs an ODE that directly maps between the source and target distributions (corresponding to the source and target text prompts) and achieves a lower transport cost than the inversion approach. This leads to state-of-the-art results, as we illustrate with Stable Diffusion 3 and FLUX. Code and examples are available on the project's webpage.

Title: Multimodal Latent Language Modeling with Next-Token Diffusion

Authors: Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08635
Pdf URL: https://arxiv.org/pdf/2412.08635
Copy Paste: [[2412.08635]] Multimodal Latent Language Modeling with Next-Token Diffusion(https://arxiv.org/abs/2412.08635)
Keywords: robust, diffusion, transformer, generative, large language model
Abstract: Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop $\sigma$-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.

Title: DMin: Scalable Training Data Influence Estimation for Diffusion Models

Authors: Huawei Lin, Yingjie Lao, Weijie Zhao
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08637
Pdf URL: https://arxiv.org/pdf/2412.08637
Copy Paste: [[2412.08637]] DMin: Scalable Training Data Influence Estimation for Diffusion Models(https://arxiv.org/abs/2412.08637)
Keywords: diffusion
Abstract: Identifying the training data samples that most influence a generated image is a critical task in understanding diffusion models, yet existing influence estimation methods are constrained to small-scale or LoRA-tuned models due to computational limitations. As diffusion models scale up, these methods become impractical. To address this challenge, we propose DMin (Diffusion Model influence), a scalable framework for estimating the influence of each training data sample on a given generated image. By leveraging efficient gradient compression and retrieval techniques, DMin reduces storage requirements from 339.39 TB to only 726 MB and retrieves the top-k most influential training samples in under 1 second, all while maintaining performance. Our empirical results demonstrate DMin is both effective in identifying influential training samples and efficient in terms of computational and storage requirements.

Title: Fast Prompt Alignment for Text-to-Image Generation

Authors: Khalil Mrini, Hanlin Lu, Linjie Yang, Weilin Huang, Heng Wang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2412.08639
Pdf URL: https://arxiv.org/pdf/2412.08639
Copy Paste: [[2412.08639]] Fast Prompt Alignment for Text-to-Image Generation(https://arxiv.org/abs/2412.08639)
Keywords: robust, large language model
Abstract: Text-to-image generation has advanced rapidly, yet aligning complex textual prompts with generated visuals remains challenging, especially with intricate object relationships and fine-grained details. This paper introduces Fast Prompt Alignment (FPA), a prompt optimization framework that leverages a one-pass approach, enhancing text-to-image alignment efficiency without the iterative overhead typical of current methods like OPT2I. FPA uses large language models (LLMs) for single-iteration prompt paraphrasing, followed by fine-tuning or in-context learning with optimized prompts to enable real-time inference, reducing computational demands while preserving alignment fidelity. Extensive evaluations on the COCO Captions and PartiPrompts datasets demonstrate that FPA achieves competitive text-image alignment scores at a fraction of the processing time, as validated through both automated metrics (TIFA, VQA) and human evaluation. A human study with expert annotators further reveals a strong correlation between human alignment judgments and automated scores, underscoring the robustness of FPA's improvements. The proposed method showcases a scalable, efficient alternative to iterative prompt optimization, enabling broader applicability in real-time, high-demand settings. The codebase is provided to facilitate further research: this https URL

Title: GPD-1: Generative Pre-training for Driving

Authors: Zixun Xie, Sicheng Zuo, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, Jie Zhou, Jiwen Lu, Shanghang Zhang
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.08643
Pdf URL: https://arxiv.org/pdf/2412.08643
Copy Paste: [[2412.08643]] GPD-1: Generative Pre-training for Driving(https://arxiv.org/abs/2412.08643)
Keywords: transformer, generative
Abstract: Modeling the evolutions of driving scenarios is important for the evaluation and decision-making of autonomous driving systems. Most existing methods focus on one aspect of scene evolution such as map generation, motion prediction, and trajectory planning. In this paper, we propose a unified Generative Pre-training for Driving (GPD-1) model to accomplish all these tasks altogether without additional fine-tuning. We represent each scene with ego, agent, and map tokens and formulate autonomous driving as a unified token generation problem. We adopt the autoregressive transformer architecture and use a scene-level attention mask to enable intra-scene bi-directional interactions. For the ego and agent tokens, we propose a hierarchical positional tokenizer to effectively encode both 2D positions and headings. For the map tokens, we train a map vector-quantized autoencoder to efficiently compress ego-centric semantic maps into discrete tokens. We pre-train our GPD-1 on the large-scale nuPlan dataset and conduct extensive experiments to evaluate its effectiveness. With different prompts, our GPD-1 successfully generalizes to various tasks without finetuning, including scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning. Code: this https URL.

Title: ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

Authors: Daniel Winter, Asaf Shul, Matan Cohen, Dana Berman, Yael Pritch, Alex Rav-Acha, Yedid Hoshen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08645
Pdf URL: https://arxiv.org/pdf/2412.08645
Copy Paste: [[2412.08645]] ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation(https://arxiv.org/abs/2412.08645)
Keywords: diffusion
Abstract: This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large scale supervision, but manually collecting sufficient data is simply too expensive. The key observation in this paper is that many mass-produced objects recur across multiple images of large unlabeled datasets, in different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This powerful paired dataset enables us to train a straightforward text-to-image diffusion architecture to map the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using a single or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Differently from many other multi-reference methods, ObjectMate does not require slow test-time tuning.

Title: SegFace: Face Segmentation of Long-Tail Classes

Authors: Kartik Narayan, Vibashan VS, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08647
Pdf URL: https://arxiv.org/pdf/2412.08647
Copy Paste: [[2412.08647]] SegFace: Face Segmentation of Long-Tail Classes(https://arxiv.org/abs/2412.08647)
Keywords: transformer, segmentation
Abstract: Face parsing refers to the semantic segmentation of human faces into key facial regions such as eyes, nose, hair, etc. It serves as a prerequisite for various advanced applications, including face editing, face swapping, and facial makeup, which often require segmentation masks for classes like eyeglasses, hats, earrings, and necklaces. These infrequently occurring classes are called long-tail classes, which are overshadowed by more frequently occurring classes known as head classes. Existing methods, primarily CNN-based, tend to be dominated by head classes during training, resulting in suboptimal representation for long-tail classes. Previous works have largely overlooked the problem of poor segmentation performance of long-tail classes. To address this issue, we propose SegFace, a simple and efficient approach that uses a lightweight transformer-based model which utilizes learnable class-specific tokens. The transformer decoder leverages class-specific tokens, allowing each token to focus on its corresponding class, thereby enabling independent modeling of each class. The proposed approach improves the performance of long-tail classes, thereby boosting overall performance. To the best of our knowledge, SegFace is the first work to employ transformer models for face parsing. Moreover, our approach can be adapted for low-compute edge devices, achieving 95.96 FPS. We conduct extensive experiments demonstrating that SegFace significantly outperforms previous state-of-the-art models, achieving a mean F1 score of 88.96 (+2.82) on the CelebAMask-HQ dataset and 93.03 (+0.65) on the LaPa dataset. Code: this https URL