2024-11-19

Title: Large Language Models for Constructing and Optimizing Machine Learning Workflows: A Survey

Authors: Yang Gu, Hengyu You, Jian Cao, Muran Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10478
Pdf URL: https://arxiv.org/pdf/2411.10478
Copy Paste: [[2411.10478]] Large Language Models for Constructing and Optimizing Machine Learning Workflows: A Survey(https://arxiv.org/abs/2411.10478)
Keywords: large language model
Abstract: Building effective machine learning (ML) workflows to address complex tasks is a primary focus of the Automatic ML (AutoML) community and a critical step toward achieving artificial general intelligence (AGI). Recently, the integration of Large Language Models (LLMs) into ML workflows has shown great potential for automating and enhancing various stages of the ML pipeline. This survey provides a comprehensive and up-to-date review of recent advancements in using LLMs to construct and optimize ML workflows, focusing on key components encompassing data and feature engineering, model selection and hyperparameter optimization, and workflow evaluation. We discuss both the advantages and limitations of LLM-driven approaches, emphasizing their capacity to streamline and enhance the ML workflow modeling process through language understanding, reasoning, interaction, and generation. Finally, we highlight open challenges and propose future research directions to advance the effective application of LLMs in ML workflows.

Title: Challenges in the Differential Classification of Individual Diagnoses from Co-Occurring Autism and ADHD Using Survey Data

Authors: Aditi Jaiswal, Dennis P. Wall, Peter Washington
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.10479
Pdf URL: https://arxiv.org/pdf/2411.10479
Copy Paste: [[2411.10479]] Challenges in the Differential Classification of Individual Diagnoses from Co-Occurring Autism and ADHD Using Survey Data(https://arxiv.org/abs/2411.10479)
Keywords: extraction
Abstract: Autism and Attention-Deficit Hyperactivity Disorder (ADHD) are two of the most commonly observed neurodevelopmental conditions in childhood. Providing a specific computational assessment to distinguish between the two can prove difficult and time intensive. Given the high prevalence of their co-occurrence, there is a need for scalable and accessible methods for distinguishing the co-occurrence of autism and ADHD from individual diagnoses. The first step is to identify a core set of features that can serve as the basis for behavioral feature extraction. We trained machine learning models on data from the National Survey of Children's Health to identify behaviors to target as features in automated clinical decision support systems. A model trained on the binary task of distinguishing either developmental delay (autism or ADHD) vs. neither achieved sensitivity >92% and specificity >94%, while a model trained on the 4-way classification task of autism vs. ADHD vs. both vs. none demonstrated >65% sensitivity and >66% specificity. While the performance of the binary model was respectable, the relatively low performance in the differential classification of autism and ADHD highlights the challenges that persist in achieving specificity within clinical decision support tools for developmental delays. Nevertheless, this study demonstrates the potential of applying behavioral questionnaires not traditionally used for clinical purposes towards supporting digital screening assessments for pediatric developmental delays.

Title: Biometrics in Extended Reality: A Review

Authors: Ayush Agarwal, Raghavendra Ramachandra, Sushma Venkatesh, S. R. Mahadeva Prasanna
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.10489
Pdf URL: https://arxiv.org/pdf/2411.10489
Copy Paste: [[2411.10489]] Biometrics in Extended Reality: A Review(https://arxiv.org/abs/2411.10489)
Keywords: security, privacy, robust, biometric
Abstract: In the domain of Extended Reality (XR), particularly Virtual Reality (VR), extensive research has been devoted to harnessing this transformative technology in various real-world applications. However, a critical challenge that must be addressed before unleashing the full potential of XR in practical scenarios is to ensure robust security and safeguard user privacy. This paper presents a systematic survey of the utility of biometric characteristics applied in the XR environment. To this end, we present a comprehensive overview of the different types of biometric modalities used for authentication and representation of users in a virtual environment. We discuss different biometric vulnerability gateways in general XR systems for the first time in the literature along with taxonomy. A comprehensive discussion on generating and authenticating biometric-based photorealistic avatars in XR environments is presented with a stringent taxonomy. We also discuss the availability of different datasets that are widely employed in evaluating biometric authentication in XR environments together with performance evaluation metrics. Finally, we discuss the open challenges and potential future work that need to be addressed in the field of biometrics in XR.

Title: MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds

Authors: Jinge Ma, Xiaoyan Zhang, Gautham Vinod, Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.10492
Pdf URL: https://arxiv.org/pdf/2411.10492
Copy Paste: [[2411.10492]] MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds(https://arxiv.org/abs/2411.10492)
Keywords: extraction
Abstract: Food portion estimation is crucial for monitoring health and tracking dietary intake. Image-based dietary assessment, which involves analyzing eating occasion images using computer vision techniques, is increasingly replacing traditional methods such as 24-hour recalls. However, accurately estimating the nutritional content from images remains challenging due to the loss of 3D information when projecting to the 2D image plane. Existing portion estimation methods are challenging to deploy in real-world scenarios due to their reliance on specific requirements, such as physical reference objects, high-quality depth information, or multi-view images and videos. In this paper, we introduce MFP3D, a new framework for accurate food portion estimation using only a single monocular image. Specifically, MFP3D consists of three key modules: (1) a 3D Reconstruction Module that generates a 3D point cloud representation of the food from the 2D image, (2) a Feature Extraction Module that extracts and concatenates features from both the 3D point cloud and the 2D RGB image, and (3) a Portion Regression Module that employs a deep regression model to estimate the food's volume and energy content based on the extracted features. Our MFP3D is evaluated on MetaFood3D dataset, demonstrating its significant improvement in accurate portion estimation over existing methods.

Title: Boundary Attention Constrained Zero-Shot Layout-To-Image Generation

Authors: Huancheng Chen, Jingtao Li, Weiming Zhuang, Haris Vikalo, Lingjuan Lyu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10495
Pdf URL: https://arxiv.org/pdf/2411.10495
Copy Paste: [[2411.10495]] Boundary Attention Constrained Zero-Shot Layout-To-Image Generation(https://arxiv.org/abs/2411.10495)
Keywords: diffusion
Abstract: Recent text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting. To address these challenges, several studies developed layout-to-image (L2I) approaches that incorporate layout instructions into text-to-image models. However, existing L2I methods typically require either fine-tuning pretrained parameters or training additional control modules for the diffusion models. In this work, we propose a novel zero-shot L2I approach, BACON (Boundary Attention Constrained generation), which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions, and then compute loss functions to optimize latent features during the diffusion reverse process. To enhance spatial controllability and mitigate semantic failures in complex layout instructions, we leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features. Comprehensive experimental results on both L2I and non-L2I pretrained diffusion models demonstrate that our method outperforms existing zero-shot L2I techniques both quantitatively and qualitatively in terms of image composition on the DrawBench and HRS benchmarks.

Title: Structure Tensor Representation for Robust Oriented Object Detection

Authors: Xavier Bou, Gabriele Facciolo, Rafael Grompone von Gioi, Jean-Michel Morel, Thibaud Ehret
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10497
Pdf URL: https://arxiv.org/pdf/2411.10497
Copy Paste: [[2411.10497]] Structure Tensor Representation for Robust Oriented Object Detection(https://arxiv.org/abs/2411.10497)
Keywords: robust
Abstract: Oriented object detection predicts orientation in addition to object location and bounding box. Precisely predicting orientation remains challenging due to angular periodicity, which introduces boundary discontinuity issues and symmetry ambiguities. Inspired by classical works on edge and corner detection, this paper proposes to represent orientation in oriented bounding boxes as a structure tensor. This representation combines the strengths of Gaussian-based methods and angle-coder solutions, providing a simple yet efficient approach that is robust to angular periodicity issues without additional hyperparameters. Extensive evaluations across five datasets demonstrate that the proposed structure tensor representation outperforms previous methods in both fully-supervised and weakly supervised tasks, achieving high precision in angular prediction with minimal computational overhead. Thus, this work establishes structure tensors as a robust and modular alternative for encoding orientation in oriented object detection. We make our code publicly available, allowing for seamless integration into existing object detectors.
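The periodicity argument in the abstract above can be illustrated with the classical structure tensor of a unit direction vector. This is a minimal sketch under that textbook definition, not the paper's exact formulation; the function names `angle_to_tensor` and `tensor_to_angle` are hypothetical. It shows that an angle and the same angle shifted by pi map to the same tensor, so the representation has no boundary discontinuity.

```python
import math

def angle_to_tensor(theta):
    """Encode an orientation (radians) as the symmetric 2x2 tensor v v^T,
    returned as its three unique entries (t_xx, t_xy, t_yy)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * c, c * s, s * s)

def tensor_to_angle(t_xx, t_xy, t_yy):
    """Recover the orientation in [0, pi) from the tensor entries."""
    return (0.5 * math.atan2(2.0 * t_xy, t_xx - t_yy)) % math.pi

# theta and theta + pi describe the same box orientation and give the same tensor.
same = angle_to_tensor(0.3)
flipped = angle_to_tensor(0.3 + math.pi)
```

A regression head could predict the three tensor entries instead of a raw angle and decode with `tensor_to_angle` at inference time; how the paper couples this with box size is not covered here.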

Title: Prompt-Guided Environmentally Consistent Adversarial Patch

Authors: Chaoqun Li, Huanqian Yan, Lifeng Zhou, Tairan Chen, Zhuodong Liu, Hang Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10498
Pdf URL: https://arxiv.org/pdf/2411.10498
Copy Paste: [[2411.10498]] Prompt-Guided Environmentally Consistent Adversarial Patch(https://arxiv.org/abs/2411.10498)
Keywords: security, attack, diffusion
Abstract: Adversarial attacks in the physical world pose a significant threat to the security of vision-based systems, such as facial recognition and autonomous driving. Existing adversarial patch methods primarily focus on improving attack performance, but they often produce patches that are easily detectable by humans and struggle to achieve environmental consistency, i.e., blending patches into the environment. This paper introduces a novel approach for generating adversarial patches, which addresses both the visual naturalness and environmental consistency of the patches. We propose Prompt-Guided Environmentally Consistent Adversarial Patch (PG-ECAP), a method that aligns the patch with the environment to ensure seamless integration into the environment. The approach leverages diffusion models to generate patches that are both environmentally consistent and effective in evading detection. To further enhance the naturalness and consistency, we introduce two alignment losses: Prompt Alignment Loss and Latent Space Alignment Loss, ensuring that the generated patch maintains its adversarial properties while fitting naturally within its environment. Extensive experiments in both digital and physical domains demonstrate that PG-ECAP outperforms existing methods in attack success rate and environmental consistency.

Title: FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on

Authors: Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Chengming Xu, Jinlong Peng, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10499
Pdf URL: https://arxiv.org/pdf/2411.10499
Copy Paste: [[2411.10499]] FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on(https://arxiv.org/abs/2411.10499)
Keywords: robust, diffusion, transformer
Abstract: Although image-based virtual try-on has made considerable progress, emerging approaches still encounter challenges in producing high-fidelity and robust fitting images across diverse scenarios. These methods often struggle with issues such as texture-aware maintenance and size-aware fitting, which hinder their overall effectiveness. To address these limitations, we propose a novel garment perception enhancement technique, termed FitDiT, designed for high-fidelity virtual try-on using Diffusion Transformers (DiT) allocating more parameters and attention to high-resolution features. First, to further improve texture-aware maintenance, we introduce a garment texture extractor that incorporates garment priors evolution to fine-tune garment features, making it easier to capture rich details such as stripes, patterns, and text. Additionally, we introduce frequency-domain learning by customizing a frequency distance loss to enhance high-frequency garment details. To tackle the size-aware fitting issue, we employ a dilated-relaxed mask strategy that adapts to the correct length of garments, preventing the generation of garments that fill the entire mask area during cross-category try-on. Equipped with the above design, FitDiT surpasses all baselines in both qualitative and quantitative evaluations. It excels in producing well-fitting garments with photorealistic and intricate details, while also achieving competitive inference times of 4.57 seconds for a single 1024x768 image after DiT structure slimming, outperforming existing methods.

Title: Edge-Only Universal Adversarial Attacks in Distributed Learning

Authors: Giulio Rossolini, Tommaso Baldi, Alessandro Biondi, Giorgio Buttazzo
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.10500
Pdf URL: https://arxiv.org/pdf/2411.10500
Copy Paste: [[2411.10500]] Edge-Only Universal Adversarial Attacks in Distributed Learning(https://arxiv.org/abs/2411.10500)
Keywords: attack, robust
Abstract: Distributed learning frameworks, which partition neural network models across multiple computing nodes, enhance efficiency in collaborative edge-cloud systems but may also introduce new vulnerabilities. In this work, we explore the feasibility of generating universal adversarial attacks when an attacker has access to the edge part of the model only, which consists of the first network layers. Unlike traditional universal adversarial perturbations (UAPs) that require full model knowledge, our approach shows that adversaries can induce effective mispredictions in the unknown cloud part by leveraging key features on the edge side. Specifically, we train lightweight classifiers from intermediate features available at the edge, i.e., before the split point, and use them in a novel targeted optimization to craft effective UAPs. Our results on ImageNet demonstrate strong attack transferability to the unknown cloud part. Additionally, we analyze the capability of an attacker to achieve a targeted adversarial effect with edge-only knowledge, revealing intriguing behaviors. By introducing the first adversarial attacks with edge-only knowledge in split inference, this work underscores the importance of addressing partial model access in adversarial robustness, encouraging further research in this area.

Title: OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models

Authors: Mathis Koroglu, Hugo Caselles-Dupré, Guillaume Jeanneret Sanmiguel, Matthieu Cord
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10501
Pdf URL: https://arxiv.org/pdf/2411.10501
Copy Paste: [[2411.10501]] OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models(https://arxiv.org/abs/2411.10501)
Keywords: diffusion
Abstract: We consider the problem of text-to-video generation tasks with precise control for various applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on providing user-defined controls, such as binary masks or camera movement embeddings. We propose OnlyFlow, an approach that leverages optical flow first extracted from an input video to condition the motion of generated videos. Using a text prompt and an input video, OnlyFlow allows the user to generate videos that respect the motion of the input video as well as the text prompt. This is implemented through an optical flow estimation model applied on the input video, which is then fed to a trainable optical flow encoder. The output feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies to show that OnlyFlow compares favorably to state-of-the-art methods on a wide range of tasks, even though OnlyFlow was not specifically trained for such tasks. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.

Title: USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting

Authors: Kang Chen, Jiyuan Zhang, Zecheng Hao, Yajing Zheng, Tiejun Huang, Zhaofei Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10504
Pdf URL: https://arxiv.org/pdf/2411.10504
Copy Paste: [[2411.10504]] USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting(https://arxiv.org/abs/2411.10504)
Keywords: robust
Abstract: Spike cameras, as an innovative neuromorphic camera that captures scenes with the 0-1 bit stream at 40 kHz, are increasingly employed for the 3D reconstruction task via Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS). Previous spike-based 3D reconstruction approaches often employ a cascaded pipeline: starting with high-quality image reconstruction from spike streams based on established spike-to-image reconstruction algorithms, then progressing to camera pose estimation and 3D reconstruction. However, this cascaded approach suffers from substantial cumulative errors, where quality limitations of initial image reconstructions negatively impact pose estimation, ultimately degrading the fidelity of the 3D reconstruction. To address these issues, we propose a synergistic optimization framework, USP-Gaussian, that unifies spike-based image reconstruction, pose correction, and Gaussian splatting into an end-to-end framework. Leveraging the multi-view consistency afforded by 3DGS and the motion capture capability of the spike camera, our framework enables a joint iterative optimization that seamlessly integrates information between the spike-to-image network and 3DGS. Experiments on synthetic datasets with accurate poses demonstrate that our method surpasses previous approaches by effectively eliminating cascading errors. Moreover, we integrate pose optimization to achieve robust 3D reconstruction in real-world scenarios with inaccurate initial poses, outperforming alternative methods by effectively reducing noise and preserving fine texture details. Our code, data and trained models will be available at this https URL.

Title: DR-BFR: Degradation Representation with Diffusion Models for Blind Face Restoration

Authors: Xinmin Qiu, Bonan Li, Zicheng Zhang, Congying Han, Tiande Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10508
Pdf URL: https://arxiv.org/pdf/2411.10508
Copy Paste: [[2411.10508]] DR-BFR: Degradation Representation with Diffusion Models for Blind Face Restoration(https://arxiv.org/abs/2411.10508)
Keywords: diffusion
Abstract: Blind face restoration (BFR) is fundamentally challenged by the extensive range of degradation types and degrees that impact model generalization. Recent advancements in diffusion models have made considerable progress in this field. Nevertheless, a critical limitation is their lack of awareness of specific degradation, leading to potential issues such as unnatural details and inaccurate textures. In this paper, we equip diffusion models with the capability to decouple various degradation as a degradation prompt from low-quality (LQ) face images via unsupervised contrastive learning with reconstruction loss, and demonstrate that this capability significantly improves performance, particularly in terms of the naturalness of the restored images. Our novel restoration scheme, named DR-BFR, guides the denoising of Latent Diffusion Models (LDM) by incorporating Degradation Representation (DR) and content features from LQ images. DR-BFR comprises two modules: 1) Degradation Representation Module (DRM): This module extracts degradation representation with content-irrelevant features from LQ faces and estimates a reasonable distribution in the degradation space through contrastive learning and a specially designed LQ reconstruction. 2) Latent Diffusion Restoration Module (LDRM): This module perceives both degradation features and content features in the latent space, enabling the restoration of high-quality images from LQ inputs. Our experiments demonstrate that the proposed DR-BFR significantly outperforms state-of-the-art methods quantitatively and qualitatively across various datasets. The DR effectively distinguishes between various degradations in blind face inverse problems and provides a reasonably powerful prompt to LDM.

Title: TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding

Authors: Quang P. M. Pham, Khoi T. N. Nguyen, Lan C. Ngo, Dezhen Song, Truong Do, Truong Son Hy
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10509
Pdf URL: https://arxiv.org/pdf/2411.10509
Copy Paste: [[2411.10509]] TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding(https://arxiv.org/abs/2411.10509)
Keywords: robust
Abstract: Scene graphs have proven to be highly effective for various scene understanding tasks due to their compact and explicit representation of relational information. However, current methods often overlook the critical importance of preserving symmetry when generating scene graphs from 3D point clouds, which can lead to reduced accuracy and robustness, particularly when dealing with noisy, multi-view data. This work, to the best of our knowledge, presents the first implementation of an Equivariant Scene Graph Neural Network (ESGNN) to generate semantic scene graphs from 3D point clouds, specifically for enhanced scene understanding. Furthermore, a significant limitation of prior methods is the absence of temporal modeling to capture time-dependent relationships among dynamically evolving entities within a scene. To address this gap, we introduce a novel temporal layer that leverages the symmetry-preserving properties of ESGNN to fuse scene graphs across multiple sequences into a unified global representation by an approximate graph-matching algorithm. Our combined architecture, termed the Temporal Equivariant Scene Graph Neural Network (TESGNN), not only surpasses existing state-of-the-art methods in scene estimation accuracy but also achieves faster convergence. Importantly, TESGNN is computationally efficient and straightforward to implement using existing frameworks, making it well-suited for real-time applications in robotics and computer vision. This approach paves the way for more robust and scalable solutions to complex multi-view scene understanding challenges. Our source code is publicly available at: this https URL

Title: SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Authors: Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.10510
Pdf URL: https://arxiv.org/pdf/2411.10510
Copy Paste: [[2411.10510]] SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers(https://arxiv.org/abs/2411.10510)
Keywords: diffusion, transformer, generative
Abstract: Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.
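The caching idea described in the abstract above (reuse a layer's output when adjacent timesteps are similar enough on a calibration set) can be sketched as a simple schedule builder. This is a toy illustration, not SmoothCache's actual algorithm: the function name `build_cache_schedule`, the fixed `threshold`, and the `max_reuse` cap are all assumptions introduced here.

```python
def build_cache_schedule(step_errors, threshold, max_reuse=2):
    """Return one boolean per timestep: True = recompute, False = reuse cache.

    step_errors[t] is a hypothetical calibration measurement of how much the
    layer output at step t differs from step t - 1 (step_errors[0] is unused).
    """
    schedule = []
    reused = 0  # consecutive steps served from the cache so far
    for t, err in enumerate(step_errors):
        # Always compute the first step, any step whose output changed too
        # much, and any step that follows max_reuse cached steps in a row.
        if t == 0 or err >= threshold or reused >= max_reuse:
            schedule.append(True)
            reused = 0
        else:
            schedule.append(False)
            reused += 1
    return schedule

schedule = build_cache_schedule([0.0, 0.01, 0.02, 0.01, 0.5, 0.01], threshold=0.1)
```

In this sketch, a sampler would consult `schedule[t]` for each layer and skip the attention or feed-forward call whenever the entry is `False`.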

Title: On the Privacy Risk of In-context Learning

Authors: Haonan Duan, Adam Dziedzic, Mohammad Yaghini, Nicolas Papernot, Franziska Boenisch
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2411.10512
Pdf URL: https://arxiv.org/pdf/2411.10512
Copy Paste: [[2411.10512]] On the Privacy Risk of In-context Learning(https://arxiv.org/abs/2411.10512)
Keywords: privacy, attack, membership infer, large language model
Abstract: Large language models (LLMs) are excellent few-shot learners. They can perform a wide variety of tasks purely based on natural language prompts provided to them. These prompts contain data of a specific downstream task -- often the private dataset of a party, e.g., a company that wants to leverage the LLM for their purposes. We show that deploying prompted models presents a significant privacy risk for the data used within the prompt by instantiating a highly effective membership inference attack. We also observe that the privacy risk of prompted models exceeds fine-tuned models at the same utility levels. After identifying the model's sensitivity to their prompts -- in the form of a significantly higher prediction confidence on the prompted data -- as a cause for the increased risk, we propose ensembling as a mitigation strategy. By aggregating over multiple different versions of a prompted model, membership inference risk can be decreased.
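The link between prediction confidence and membership inference, and the effect of ensembling, can be sketched in a few lines. The abstract above only says that predictions are aggregated over several prompted models; plain averaging, the confidence-threshold test, and every name and number below are assumptions made for illustration.

```python
def ensemble_probs(per_model_probs):
    """Average class-probability vectors from several prompted models."""
    n = len(per_model_probs)
    num_classes = len(per_model_probs[0])
    return [sum(p[k] for p in per_model_probs) / n for k in range(num_classes)]

def flagged_as_member(probs, threshold=0.95):
    """Toy confidence-threshold attack: very confident outputs look like members."""
    return max(probs) >= threshold

# Hypothetical outputs for one example that appears only in the first model's prompt.
single = [0.99, 0.01]
ensemble = ensemble_probs([[0.99, 0.01], [0.70, 0.30], [0.65, 0.35]])
```

Here the single prompted model is confident enough to be flagged, while the averaged prediction is not, which is the intuition behind ensembling as a mitigation.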

Title: Any2Any: Incomplete Multimodal Retrieval with Conformal Prediction

Authors: Po-han Li, Yunhao Yang, Mohammad Omama, Sandeep Chinchali, Ufuk Topcu
Subjects: cs.CV, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2411.10513
Pdf URL: https://arxiv.org/pdf/2411.10513
Copy Paste: [[2411.10513]] Any2Any: Incomplete Multimodal Retrieval with Conformal Prediction(https://arxiv.org/abs/2411.10513)
Keywords: generative
Abstract: Autonomous agents perceive and interpret their surroundings by integrating multimodal inputs, such as vision, audio, and LiDAR. These perceptual modalities support retrieval tasks, such as place recognition in robotics. However, current multimodal retrieval systems encounter difficulties when parts of the data are missing due to sensor failures or inaccessibility, such as silent videos or LiDAR scans lacking RGB information. We propose Any2Any, a novel retrieval framework that addresses scenarios where both query and reference instances have incomplete modalities. Unlike previous methods limited to the imputation of two modalities, Any2Any handles any number of modalities without training generative models. It calculates pairwise similarities with cross-modal encoders and employs a two-stage calibration process with conformal prediction to align the similarities. Any2Any enables effective retrieval across multimodal datasets, e.g., text-LiDAR and text-time series. It achieves a Recall@5 of 35% on the KITTI dataset, which is on par with baseline models with complete modalities.

Title: "On the goals of linguistic theory": Revisiting Chomskyan theories in the era of AI

Authors: Eva Portelance, Masoud Jasbi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10533
Pdf URL: https://arxiv.org/pdf/2411.10533
Copy Paste: [[2411.10533]] "On the goals of linguistic theory": Revisiting Chomskyan theories in the era of AI(https://arxiv.org/abs/2411.10533)
Keywords: generative, large language model
Abstract: Theoretical linguistics seeks to explain what human language is, and why. Linguists and cognitive scientists have proposed different theoretical models of what language is, as well as cognitive factors that shape it, and allow humans to 'produce', 'understand', and 'acquire' natural languages. However, humans may no longer be the only ones learning to 'generate', 'parse', and 'learn' natural language: artificial intelligence (AI) models such as large language models are proving to have impressive linguistic capabilities. Many are thus questioning what role, if any, such models should play in helping theoretical linguistics reach its ultimate research goals. In this paper, we propose to answer this question, by reiterating the tenets of generative linguistics, a leading school of thought in the field, and by considering how AI models as theories of language relate to each of these important concepts. Specifically, we consider three foundational principles, finding roots in the early works of Noam Chomsky: (1) levels of theoretical adequacy; (2) procedures for linguistic theory development; (3) language learnability and Universal Grammar. In our discussions of each principle, we give special attention to two types of AI models: neural language models and neural grammar induction models. We will argue that such models, in particular neural grammar induction models, do have a role to play, but that this role is largely modulated by the stance one takes regarding each of these three guiding principles.

Title: Does Prompt Formatting Have Any Impact on LLM Performance?

Authors: Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X Wang, Sadid Hasan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10541
Pdf URL: https://arxiv.org/pdf/2411.10541
Copy Paste: [[2411.10541]] Does Prompt Formatting Have Any Impact on LLM Performance?(https://arxiv.org/abs/2411.10541)
Keywords: robust, large language model
Abstract: In the realm of Large Language Models (LLMs), prompt optimization is crucial for model performance. Although previous research has explored aspects like rephrasing prompt contexts, using various prompting techniques (like in-context learning and chain-of-thought), and ordering few-shot examples, our understanding of LLM sensitivity to prompt templates remains limited. Therefore, this paper examines the impact of different prompt templates on LLM performance. We formatted the same contexts into various human-readable templates, including plain text, Markdown, JSON, and YAML, and evaluated their impact across tasks like natural language reasoning, code generation, and translation using OpenAI's GPT models. Experiments show that GPT-3.5-turbo's performance varies by up to 40% in a code translation task depending on the prompt template, while larger models like GPT-4 are more robust to these variations. Our analysis highlights the need to reconsider the use of fixed prompt templates, as different formats can significantly affect model performance.
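To make the notion of a prompt template concrete, the sketch below renders one set of prompt fields in three of the formats the abstract above mentions. The field names and helper functions are hypothetical and are not taken from the paper; YAML is omitted only to keep the example within the standard library.

```python
import json

def to_plain(fields):
    """Plain text: one 'key: value' line per field."""
    return "\n".join(f"{key}: {value}" for key, value in fields.items())

def to_markdown(fields):
    """Markdown: each field becomes a second-level heading with its body."""
    return "\n\n".join(f"## {key}\n{value}" for key, value in fields.items())

def to_json(fields):
    """JSON: the same fields serialized as an indented object."""
    return json.dumps(fields, indent=2)

# Hypothetical prompt content; only the formatting differs between variants.
fields = {"task": "Translate to Java", "input": "print(1)"}
variants = {
    "plain": to_plain(fields),
    "markdown": to_markdown(fields),
    "json": to_json(fields),
}
```

Each entry in `variants` carries identical content, so any difference in model output across them can be attributed to formatting alone.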

Title: SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism

Authors: Priyansh Bhatnagar, Linfeng Wen, Mingu Kang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.10543
Pdf URL: https://arxiv.org/pdf/2411.10543
Copy Paste: [[2411.10543]] SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism(https://arxiv.org/abs/2411.10543)
Keywords: transformer, generative
Abstract: Extensive efforts have been made to boost the performance in the domain of language models by introducing various attention-based transformers. However, the inclusion of linear layers with large dimensions contributes to significant computational and memory overheads. The escalating computational demands of these models necessitate the development of various compression techniques to ensure their deployment on devices, particularly in resource-constrained environments. In this paper, we propose a novel compression methodology that dynamically determines the rank of each layer using a soft thresholding mechanism, which clips the singular values with a small magnitude in a differentiable form. This approach automates the decision-making process to identify the optimal degree of compression for each layer. We have successfully applied the proposed technique to attention-based architectures, including BERT for discriminative tasks and GPT2 and TinyLlama for generative tasks. Additionally, we have validated our method on Mamba, a recently proposed state-space model. Our experiments demonstrate that the proposed technique achieves a speed-up of 1.33X to 1.72X in the encoder/decoder with a 50% reduction in total parameters.
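
Illustration (not from the paper): a minimal sketch of soft thresholding applied to a list of singular values, plus the parameter count of a rank-r factorization. The fixed threshold `tau`, the function names, and the example sizes are assumptions; the abstract describes the threshold as part of a differentiable mechanism, and its exact form is not given there.

```python
def soft_threshold(singular_values, tau):
    """Shrink each singular value by tau and clip at zero: max(s - tau, 0)."""
    return [max(s - tau, 0.0) for s in singular_values]


def effective_rank(singular_values, tau):
    """Number of singular values that remain non-zero after thresholding."""
    return sum(1 for s in soft_threshold(singular_values, tau) if s > 0.0)


def low_rank_params(rows, cols, rank):
    """Parameters needed to store a rows x cols matrix as two rank-r factors."""
    return rank * (rows + cols)


if __name__ == "__main__":
    sigma = [5.0, 2.0, 0.5, 0.1]            # hypothetical singular values
    print(soft_threshold(sigma, tau=1.0))   # small values are clipped to zero
    print(effective_rank(sigma, tau=1.0))   # how many directions survive

    # Hypothetical layer size: a dense 768 x 768 weight versus rank-192 factors.
    dense = 768 * 768
    factored = low_rank_params(768, 768, 192)
    print(dense, factored, factored / dense)
```

The surviving count sets the rank of the layer, and the storage cost then falls from rows × cols to rank × (rows + cols).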

Title: Debias-CLR: A Contrastive Learning Based Debiasing Method for Algorithmic Fairness in Healthcare Applications

Authors: Ankita Agarwal, Tanvi Banerjee, William Romine, Mia Cajita
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2411.10544
Pdf URL: https://arxiv.org/pdf/2411.10544
Copy Paste: [[2411.10544]] Debias-CLR: A Contrastive Learning Based Debiasing Method for Algorithmic Fairness in Healthcare Applications(https://arxiv.org/abs/2411.10544)
Keywords: fair
Abstract: Artificial intelligence based predictive models trained on clinical notes can be demographically biased. This could lead to adverse healthcare disparities in predicting outcomes like length of stay of the patients. Thus, it is necessary to mitigate the demographic biases within these models. We proposed an implicit in-processing debiasing method to combat disparate treatment, which occurs when the machine learning model predicts different outcomes for individuals based on sensitive attributes like gender, ethnicity, and race. For this purpose, we used clinical notes of heart failure patients and used diagnostic codes, procedure reports and physiological vitals of the patients. We used Clinical BERT to obtain feature embeddings within the diagnostic codes and procedure reports, and LSTM autoencoders to obtain feature embeddings within the physiological vitals. Then, we trained two separate deep learning contrastive learning frameworks, one for gender and the other for ethnicity, to obtain debiased representations within those demographic traits. We called this debiasing framework Debias-CLR. We leveraged clinical phenotypes of the patients identified in the diagnostic codes and procedure reports in the previous study to measure fairness statistically. We found that Debias-CLR was able to reduce the Single-Category Word Embedding Association Test (SC-WEAT) effect size score when debiasing for gender and ethnicity. We further found that obtaining fair representations in the embedding space using Debias-CLR did not reduce the accuracy of the predictive models on downstream tasks, such as predicting length of stay of the patients, compared to using the un-debiased counterparts for training the predictive models. Hence, we conclude that our proposed approach, Debias-CLR, is fair and representative in mitigating demographic biases and can reduce health disparities.

Title: Efficient Alignment of Large Language Models via Data Sampling

Authors: Amrit Khera, Rajat Ghosh, Debojyoti Dutta
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.10545
Pdf URL: https://arxiv.org/pdf/2411.10545
Copy Paste: [[2411.10545]] Efficient Alignment of Large Language Models via Data Sampling(https://arxiv.org/abs/2411.10545)
Keywords: large language model
Abstract: LLM alignment ensures that large language models behave safely and effectively by aligning their outputs with human values, goals, and intentions. Aligning LLMs requires huge amounts of data, computation, and time. Moreover, curating data with human feedback is expensive and takes time. Recent research shows the benefit of data engineering in the fine-tuning and pre-training paradigms to bring down such costs. However, alignment differs from the aforementioned paradigms, and it is unclear whether data-efficient alignment is feasible. In this work, we first aim to understand how the performance of LLM alignment scales with data. We find that LLM alignment performance follows an exponential plateau pattern, which tapers off after a rapid initial increase. Based on this, we identify data subsampling as a viable method to reduce the resources required for alignment. Further, we propose an information theory-based methodology for efficient alignment by identifying a small, high-quality subset, thereby reducing the computation and time required by alignment. We evaluate the proposed methodology over multiple datasets and compare the results. We find that the model aligned using our proposed methodology outperforms other sampling methods and performs comparably to the model aligned with the full dataset while using less than 10% of the data, leading to greater than 90% savings in costs and resources, and faster LLM alignment.

Title: Low-Rank Optimal Transport through Factor Relaxation with Latent Coupling

Authors: Peter Halmos, Xinhao Liu, Julian Gold, Benjamin J Raphael
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2411.10555
Pdf URL: https://arxiv.org/pdf/2411.10555
Copy Paste: [[2411.10555]] Low-Rank Optimal Transport through Factor Relaxation with Latent Coupling(https://arxiv.org/abs/2411.10555)
Keywords: interpretability
Abstract: Optimal transport (OT) is a general framework for finding a minimum-cost transport plan, or coupling, between probability distributions, and has many applications in machine learning. A key challenge in applying OT to massive datasets is the quadratic scaling of the coupling matrix with the size of the dataset. [Forrow et al. 2019] introduced a factored coupling for the k-Wasserstein barycenter problem, which [Scetbon et al. 2021] adapted to solve the primal low-rank OT problem. We derive an alternative parameterization of the low-rank problem based on the $\textit{latent coupling}$ (LC) factorization previously introduced by [Lin et al. 2021] generalizing [Forrow et al. 2019]. The LC factorization has multiple advantages for low-rank OT including decoupling the problem into three OT problems and greater flexibility and interpretability. We leverage these advantages to derive a new algorithm $\textit{Factor Relaxation with Latent Coupling}$ (FRLC), which uses $\textit{coordinate}$ mirror descent to compute the LC factorization. FRLC handles multiple OT objectives (Wasserstein, Gromov-Wasserstein, Fused Gromov-Wasserstein), and marginal constraints (balanced, unbalanced, and semi-relaxed) with linear space complexity. We provide theoretical results on FRLC, and demonstrate superior performance on diverse applications -- including graph clustering and spatial transcriptomics -- while demonstrating its interpretability.

Title: mlan: language-based instruction tuning improves zero-shot generalization of multimodal large language models

Authors: Jianhong Tu, Zhuohao Ni, Nicholas Crispino, Zihao Yu, Michael Bendersky, Beliz Gunel, Ruoxi Jia, Xin Liu, Lingjuan Lyu, Dawn Song, Chenguang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10557
Pdf URL: https://arxiv.org/pdf/2411.10557
Copy Paste: [[2411.10557]] mlan: language-based instruction tuning improves zero-shot generalization of multimodal large language models(https://arxiv.org/abs/2411.10557)
Keywords: large language model
Abstract: We present a novel instruction tuning recipe to improve the zero-shot task generalization of multimodal large language models. In contrast to existing instruction tuning mechanisms that heavily rely on visual instructions, our approach focuses on language-based instruction tuning, offering a distinct and more training efficient path for multimodal instruction tuning. We evaluate the performance of the proposed approach on 9 unseen datasets across both language and vision modalities. Our results show that our language-only instruction tuning is able to significantly improve the performance of two pretrained multimodal models based on Llama 2 and Vicuna on those unseen datasets. Interestingly, the language instruction following ability also helps unlock the models to follow vision instructions without explicit training. Compared to the state of the art multimodal instruction tuning approaches that are mainly based on visual instructions, our language-based method not only achieves superior performance but also significantly enhances training efficiency. For instance, the language-only instruction tuning produces competitive average performance across the evaluated datasets (with even better performance on language datasets) with significant training efficiency improvements (on average 4x), thanks to the striking reduction in the need for vision data. With a small number of visual instructions, this emerging language instruction following ability transfers well to the unseen vision datasets, outperforming the state of the art with greater training efficiency.

Title: Vision Eagle Attention: A New Lens for Advancing Image Classification

Authors: Mahmudul Hasan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10564
Pdf URL: https://arxiv.org/pdf/2411.10564
Copy Paste: [[2411.10564]] Vision Eagle Attention: A New Lens for Advancing Image Classification(https://arxiv.org/abs/2411.10564)
Keywords: extraction, segmentation
Abstract: In computer vision tasks, the ability to focus on relevant regions within an image is crucial for improving model performance, particularly when key features are small, subtle, or spatially dispersed. Convolutional neural networks (CNNs) typically treat all regions of an image equally, which can lead to inefficient feature extraction. To address this challenge, I have introduced Vision Eagle Attention, a novel attention mechanism that enhances visual feature extraction using convolutional spatial attention. The model applies convolution to capture local spatial features and generates an attention map that selectively emphasizes the most informative regions of the image. This attention mechanism enables the model to focus on discriminative features while suppressing irrelevant background information. I have integrated Vision Eagle Attention into a lightweight ResNet-18 architecture, demonstrating that this combination results in an efficient and powerful model. I have evaluated the performance of the proposed model on three widely used benchmark datasets: FashionMNIST, Intel Image Classification, and OracleMNIST, with a primary focus on image classification. Experimental results show that the proposed approach improves classification accuracy. Additionally, this method has the potential to be extended to other vision tasks, such as object detection, segmentation, and visual tracking, offering a computationally efficient solution for a wide range of vision-based applications. Code is available at: this https URL

Title: Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera

Authors: Jaewoo Heo, Kuan-Chieh Wang, Karen Liu, Serena Yeung-Levy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10582
Pdf URL: https://arxiv.org/pdf/2411.10582
Copy Paste: [[2411.10582]] Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera(https://arxiv.org/abs/2411.10582)
Keywords: diffusion
Abstract: Motion capture technologies have transformed numerous fields, from the film and gaming industries to sports science and healthcare, by providing a tool to capture and analyze human movement in great detail. The holy grail in the topic of monocular global human mesh and motion reconstruction (GHMR) is to achieve accuracy on par with traditional multi-view capture on any monocular videos captured with a dynamic camera, in-the-wild. This is a challenging task as the monocular input has inherent depth ambiguity, and the moving camera adds additional complexity as the rendered human motion is now a product of both human and camera movement. Not accounting for this confusion, existing GHMR methods often output motions that are unrealistic, e.g. unaccounted root translation of the human causes foot sliding. We present DiffOpt, a novel 3D global HMR method using Diffusion Optimization. Our key insight is that recent advances in human motion generation, such as the motion diffusion model (MDM), contain a strong prior of coherent human motion. The core of our method is to optimize the initial motion reconstruction using the MDM prior. This step can lead to more globally coherent human motion. Our optimization jointly optimizes the motion prior loss and reprojection loss to correctly disentangle the human and camera motions. We validate DiffOpt with video sequences from the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild (EMDB) and Egobody, and demonstrate superior global human motion recovery capability over other state-of-the-art global HMR methods most prominently in long video settings.

Title: Creation and Evaluation of a Food Product Image Dataset for Product Property Extraction

Authors: Christoph Brosch, Alexander Bouwens, Sebastian Bast, Swen Haab, Rolf Krieger
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10591
Pdf URL: https://arxiv.org/pdf/2411.10591
Copy Paste: [[2411.10591]] Creation and Evaluation of a Food Product Image Dataset for Product Property Extraction(https://arxiv.org/abs/2411.10591)
Keywords: extraction
Abstract: Price forecasting for used construction equipment is a challenging task due to spatial and temporal price fluctuations. It is thus of high interest to automate the forecasting process based on current market data. Even though applying machine learning (ML) to these data represents a promising approach to predict the residual value of certain tools, it is hard to implement for small and medium-sized enterprises due to their insufficient ML expertise. To this end, we demonstrate the possibility of substituting manually created ML pipelines with automated machine learning (AutoML) solutions, which automatically generate the underlying pipelines. We combine AutoML methods with the domain knowledge of the companies. Based on the CRISP-DM process, we split the manual ML pipeline into a machine learning and non-machine learning part. To take all complex industrial requirements into account and to demonstrate the applicability of our new approach, we designed a novel metric named method evaluation score, which incorporates the most important technical and non-technical metrics for quality and usability. Based on this metric, we show in a case study for the industrial use case of price forecasting, that domain knowledge combined with AutoML can weaken the dependence on ML experts for innovative small and medium-sized enterprises which are interested in conducting such solutions.

Title: FedAli: Personalized Federated Learning with Aligned Prototypes through Optimal Transport

Authors: Sannara Ek, Kaile Wang, François Portet, Philippe Lalanda, Jiannong Cao
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2411.10595
Pdf URL: https://arxiv.org/pdf/2411.10595
Copy Paste: [[2411.10595]] FedAli: Personalized Federated Learning with Aligned Prototypes through Optimal Transport(https://arxiv.org/abs/2411.10595)
Keywords: federate
Abstract: Federated Learning (FL) enables collaborative, personalized model training across multiple devices without sharing raw data, making it ideal for pervasive computing applications that optimize user-centric performances in diverse environments. However, data heterogeneity among clients poses a significant challenge, leading to inconsistencies among trained client models and reduced performance. To address this, we introduce the Alignment with Prototypes (ALP) layers, which align incoming embeddings closer to learnable prototypes through an optimal transport plan. During local training, the ALP layer updates local prototypes and aligns embeddings toward global prototypes aggregated from all clients using our novel FL framework, Federated Alignment (FedAli). For model inferences, embeddings are guided toward local prototypes to better reflect the client's local data distribution. We evaluate FedAli on heterogeneous sensor-based human activity recognition and vision benchmark datasets, demonstrating that it outperforms existing FL strategies. We publicly release our source code to facilitate reproducibility and further research.

Title: AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment

Authors: Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan Celine Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10606
Pdf URL: https://arxiv.org/pdf/2411.10606
Copy Paste: [[2411.10606]] AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment(https://arxiv.org/abs/2411.10606)
Keywords: large language model
Abstract: Motivated by the transformative capabilities of large language models (LLMs) across various natural language tasks, there has been a growing demand to deploy these models effectively across diverse real-world applications and platforms. However, the challenge of efficiently deploying LLMs has become increasingly pronounced due to the varying application-specific performance requirements and the rapid evolution of computational platforms, which feature diverse resource constraints and deployment flows. These varying requirements necessitate LLMs that can adapt their structures (depth and width) for optimal efficiency across different platforms and application specifications. To address this critical gap, we propose AmoebaLLM, a novel framework designed to enable the instant derivation of LLM subnets of arbitrary shapes, which achieve the accuracy-efficiency frontier and can be extracted immediately after a one-time fine-tuning. In this way, AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications. Specifically, AmoebaLLM integrates three innovative components: (1) a knowledge-preserving subnet selection strategy that features a dynamic-programming approach for depth shrinking and an importance-driven method for width shrinking; (2) a shape-aware mixture of LoRAs to mitigate gradient conflicts among subnets during fine-tuning; and (3) an in-place distillation scheme with loss-magnitude balancing as the fine-tuning objective. Extensive experiments validate that AmoebaLLM not only sets new standards in LLM adaptability but also successfully delivers subnets that achieve state-of-the-art trade-offs between accuracy and efficiency.

Title: Contextualizing Security and Privacy of Software-Defined Vehicles: State of the Art and Industry Perspectives

Authors: Marco De Vincenzi, Mert D. Pesé, Chiara Bodei, Ilaria Matteucci, Richard R. Brooks, Monowar Hasan, Andrea Saracino, Mohammad Hamad, Sebastian Steinhorst
Subjects: cs.CR, cs.OS
Abstract URL: https://arxiv.org/abs/2411.10612
Pdf URL: https://arxiv.org/pdf/2411.10612
Copy Paste: [[2411.10612]] Contextualizing Security and Privacy of Software-Defined Vehicles: State of the Art and Industry Perspectives(https://arxiv.org/abs/2411.10612)
Keywords: security, privacy, defense, attack, robust
Abstract: The growing reliance on software in vehicles has given rise to the concept of Software-Defined Vehicles (SDVs), fundamentally reshaping the vehicles and the automotive industry. This survey explores the cybersecurity and privacy challenges posed by SDVs, which increasingly integrate features like Over-the-Air (OTA) updates and Vehicle-to-Everything (V2X) communication. While these advancements enhance vehicle capabilities and flexibility, they also come with a flip side: increased exposure to security risks including API vulnerabilities, third-party software risks, and supply-chain threats. The transition to SDVs also raises significant privacy concerns, with vehicles collecting vast amounts of sensitive data, such as location and driver behavior, that could be exploited using inference attacks. This work aims to provide a detailed overview of security threats, mitigation strategies, and privacy risks in SDVs, primarily through a literature review, enriched with insights from a targeted questionnaire with industry experts. Key topics include defining SDVs, comparing them to Connected Vehicles (CVs) and Autonomous Vehicles (AVs), discussing the security challenges associated with OTA updates and the impact of SDV features on data privacy. Our findings highlight the need for robust security frameworks, standardized communication protocols, and privacy-preserving techniques to address the issues of SDVs. This work ultimately emphasizes the importance of a multi-layered defense strategy, integrating both in-vehicle and cloud-based security solutions, to safeguard future SDVs and increase user trust.

Title: To Shuffle or not to Shuffle: Auditing DP-SGD with Shuffling

Authors: Meenatchi Sundaram Muthu Selva Annamalai, Borja Balle, Emiliano De Cristofaro, Jamie Hayes
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10614
Pdf URL: https://arxiv.org/pdf/2411.10614
Copy Paste: [[2411.10614]] To Shuffle or not to Shuffle: Auditing DP-SGD with Shuffling(https://arxiv.org/abs/2411.10614)
Keywords: privacy
Abstract: Differentially Private Stochastic Gradient Descent (DP-SGD) is a popular method for training machine learning models with formal Differential Privacy (DP) guarantees. As DP-SGD processes the training data in batches, it uses Poisson sub-sampling to select batches at each step. However, due to computational and compatibility benefits, replacing sub-sampling with shuffling has become common practice. Yet, since tight theoretical guarantees for shuffling are currently unknown, prior work using shuffling reports DP guarantees as though Poisson sub-sampling was used. This prompts the need to verify whether this discrepancy is reflected in a gap between the theoretical guarantees from state-of-the-art models and the actual privacy leakage. To do so, we introduce a novel DP auditing procedure to analyze DP-SGD with shuffling. We show that state-of-the-art DP models trained with shuffling appreciably overestimated privacy guarantees (up to 4x). In the process, we assess the impact of several parameters, such as batch size, privacy budget, and threat model, on privacy leakage. Finally, we study two variations of the shuffling procedure found in the wild, which result in further privacy leakage. Overall, our work empirically attests to the risk of using shuffling instead of Poisson sub-sampling vis-à-vis the actual privacy leakage of DP-SGD.
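
Illustration (not from the paper): a minimal sketch contrasting the two batch-selection schemes named in the abstract. The function names, the sampling rate `q`, and the dataset size are assumptions for the example; gradient clipping and noise addition in DP-SGD are deliberately left out.

```python
import random


def poisson_batches(n, q, steps, rng):
    """Poisson sub-sampling: at every step, each of the n examples is
    included independently with probability q, so batch sizes vary."""
    return [[i for i in range(n) if rng.random() < q] for _ in range(steps)]


def shuffled_batches(n, batch_size, rng):
    """Shuffling: permute the data once per epoch, then cut the permutation
    into consecutive fixed-size batches, so each example appears exactly once."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n, batch_size)]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print("poisson sizes :", [len(b) for b in poisson_batches(100, 0.1, 10, rng)])
    print("shuffled sizes:", [len(b) for b in shuffled_batches(100, 10, rng)])
```

Under Poisson sub-sampling an example may appear in several batches or in none, while under shuffling it appears exactly once per epoch. That structural difference is why the abstract cautions against reporting one scheme's guarantee for the other.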

Title: Electrical Load Forecasting in Smart Grid: A Personalized Federated Learning Approach

Authors: Ratun Rahman, Neeraj Kumar, Dinh C. Nguyen
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.10619
Pdf URL: https://arxiv.org/pdf/2411.10619
Copy Paste: [[2411.10619]] Electrical Load Forecasting in Smart Grid: A Personalized Federated Learning Approach(https://arxiv.org/abs/2411.10619)
Keywords: privacy, federate
Abstract: Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) are used to record household energy consumption. Traditional machine learning (ML) methods are often employed for load forecasting but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for load prediction under non-independent and identically distributed (non-IID) metering data settings. Specifically, we introduce meta-learning, where the learning rates are manipulated using the meta-learning idea to maximize the gradient for each client in each global round. Clients with varying processing capacities, data sizes, and batch sizes can participate in global model aggregation and improve their local load forecasting via personalized learning. Simulation results show that our approach outperforms state-of-the-art ML and FL methods in terms of load forecasting accuracy.

Title: Is thermography a viable solution for detecting pressure injuries in dark skin patients?

Authors: Miriam Asare-Baiden, Kathleen Jordan, Andrew Chung, Sharon Eve Sonenblum, Joyce C. Ho
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10627
Pdf URL: https://arxiv.org/pdf/2411.10627
Copy Paste: [[2411.10627]] Is thermography a viable solution for detecting pressure injuries in dark skin patients?(https://arxiv.org/abs/2411.10627)
Keywords: robust
Abstract: Pressure injury (PI) detection is challenging, especially in dark skin tones, due to the unreliability of visual inspection. Thermography has been suggested as a viable alternative as temperature differences in the skin can indicate impending tissue damage. Although deep learning models have demonstrated considerable promise toward reliably detecting PI, the existing work fails to evaluate the performance on darker skin tones and varying data collection protocols. In this paper, we introduce a new thermal and optical imaging dataset of 35 participants focused on darker skin tones where temperature differences are induced through cooling and cupping protocols. We vary the image collection process to include different cameras, lighting, patient pose, and camera distance. We compare the performance of a small convolutional neural network (CNN) trained on either the thermal or the optical images on all skin tones. Our preliminary results suggest that thermography-based CNN is robust to data collection protocols for all skin tones.

Title: Leveraging large language models for efficient representation learning for entity resolution

Authors: Xiaowei Xu, Bi T. Foua, Xingqiao Wang, Vivek Gunasekaran, John R. Talburt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10629
Pdf URL: https://arxiv.org/pdf/2411.10629
Copy Paste: [[2411.10629]] Leveraging large language models for efficient representation learning for entity resolution(https://arxiv.org/abs/2411.10629)
Keywords: robust, transformer, large language model
Abstract: In this paper, the authors propose TriBERTa, a supervised entity resolution system that utilizes a pre-trained large language model and a triplet loss function to learn representations for entity matching. The system consists of two steps: first, name entity records are fed into a Sentence Bidirectional Encoder Representations from Transformers (SBERT) model to generate vector representations, which are then fine-tuned using contrastive learning based on a triplet loss function. Fine-tuned representations are used as input for entity matching tasks, and the results show that the proposed approach outperforms state-of-the-art representations, including SBERT without fine-tuning and conventional Term Frequency-Inverse Document Frequency (TF-IDF), by a margin of 3 - 19%. Additionally, the representations generated by TriBERTa demonstrated increased robustness, maintaining consistently higher performance across a range of datasets. The authors also discussed the importance of entity resolution in today's data-driven landscape and the challenges that arise when identifying and reconciling duplicate data across different sources. They also described the ER process, which involves several crucial steps, including blocking, entity matching, and clustering.
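
Illustration (not from the paper): a minimal sketch of a triplet loss of the kind the abstract refers to. The Euclidean distance, the margin value, and the toy vectors are assumptions; the abstract does not state which distance or margin TriBERTa uses.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]       # embedding of a record
    match = [0.0, 1.0]        # embedding of a record for the same entity
    non_match = [3.0, 4.0]    # embedding of a record for a different entity
    print(triplet_loss(anchor, match, non_match))   # well separated
    print(triplet_loss(anchor, non_match, match))   # roles swapped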

Title: MTA: Multimodal Task Alignment for BEV Perception and Captioning

Authors: Yunsheng Ma, Burhaneddin Yaman, Xin Ye, Feng Tao, Abhirup Mallik, Ziran Wang, Liu Ren
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10639
Pdf URL: https://arxiv.org/pdf/2411.10639
Copy Paste: [[2411.10639]] MTA: Multimodal Task Alignment for BEV Perception and Captioning(https://arxiv.org/abs/2411.10639)
Keywords: large language model
Abstract: Bird's eye view (BEV)-based 3D perception plays a crucial role in autonomous driving applications. The rise of large language models has spurred interest in BEV-based captioning to understand object behavior in the surrounding environment. However, existing approaches treat perception and captioning as separate tasks, focusing on the performance of only one of the tasks and overlooking the potential benefits of multimodal alignment. To bridge this gap between modalities, we introduce MTA, a novel multimodal task alignment framework that boosts both BEV perception and captioning. MTA consists of two key components: (1) BEV-Language Alignment (BLA), a contextual learning mechanism that aligns the BEV scene representations with ground-truth language representations, and (2) Detection-Captioning Alignment (DCA), a cross-modal prompting mechanism that aligns detection and captioning outputs. MTA integrates into state-of-the-art baselines during training, adding no extra computational complexity at runtime. Extensive experiments on the nuScenes and TOD3Cap datasets show that MTA significantly outperforms state-of-the-art baselines, achieving a 4.9% improvement in perception and a 9.2% improvement in captioning. These results underscore the effectiveness of unified alignment in reconciling BEV-based perception and captioning.

Title: BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Authors: Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, Yi Zeng, Lei Wu, Liuyang Bian, Zhaoxiong Wang, Long Liu, Yanzhou Yang, Han Xiao, Aojun Zhou, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.10640
Pdf URL: https://arxiv.org/pdf/2411.10640
Copy Paste: [[2411.10640]] BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices(https://arxiv.org/abs/2411.10640)
Keywords: large language model
Abstract: The emergence and growing popularity of multimodal large language models (MLLMs) have significant potential to enhance various aspects of daily life, from improving communication to facilitating learning and problem-solving. Mobile phones, as essential daily companions, represent the most effective and accessible deployment platform for MLLMs, enabling seamless integration into everyday tasks. However, deploying MLLMs on mobile phones presents challenges due to limitations in memory size and computational capability, making it difficult to achieve smooth and real-time processing without extensive optimization. In this paper, we present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. To be specific, we redesign the dynamic resolution scheme adopted by mainstream MLLMs and implement system optimization for hardware-aware deployment to optimize model inference on mobile phones. BlueLM-V-3B boasts the following key highlights: (1) Small Size: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters. (2) Fast Speed: BlueLM-V-3B achieves a generation speed of 24.4 token/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. (3) Strong Performance: BlueLM-V-3B has attained the highest average score of 66.1 on the OpenCompass benchmark among models with $\leq$ 4B parameters and surpassed a series of models with much larger parameter sizes (e.g., MiniCPM-V-2.6, InternVL2-8B).

Title: Enhancing PTSD Outcome Prediction with Ensemble Models in Disaster Contexts

Authors: Ayesha Siddiqua, Atib Mohammad Oni, Abu Saleh Musa Miah, Jungpil Shin
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.10661
Pdf URL: https://arxiv.org/pdf/2411.10661
Copy Paste: [[2411.10661]] Enhancing PTSD Outcome Prediction with Ensemble Models in Disaster Contexts(https://arxiv.org/abs/2411.10661)
Keywords: robust
Abstract: Post-traumatic stress disorder (PTSD) is a significant mental health challenge that affects individuals exposed to traumatic events. Early detection and effective intervention for PTSD are crucial, as it can lead to long-term psychological distress if untreated. Accurate detection of PTSD is essential for timely and targeted mental health interventions, especially in disaster-affected populations. Existing research has explored machine learning approaches for classifying PTSD, but many face limitations in terms of model performance and generalizability. To address these issues, we implemented a comprehensive preprocessing pipeline. This included data cleaning, missing value treatment using the SimpleImputer, label encoding of categorical variables, data augmentation using SMOTE to balance the dataset, and feature scaling with StandardScaler. The dataset was split into 80% training and 20% testing. We developed an ensemble model using a majority voting technique among several classifiers, including Logistic Regression, Support Vector Machines (SVM), Random Forest, XGBoost, LightGBM, and a customized Artificial Neural Network (ANN). The ensemble model achieved an accuracy of 96.76% with a benchmark dataset, significantly outperforming individual models. The proposed method's advantages include improved robustness through the combination of multiple models, enhanced ability to generalize across diverse data points, and increased accuracy in detecting PTSD. Additionally, the use of SMOTE for data augmentation ensured better handling of imbalanced datasets, leading to more reliable predictions. The proposed approach offers valuable insights for policymakers and healthcare providers by leveraging predictive analytics to address mental health issues in vulnerable populations, particularly those affected by disasters.
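
Illustration (not from the paper): the majority voting step named in the abstract can be sketched in a few lines. This is a minimal sketch, assuming each classifier has already produced one hard label per sample; the function name `majority_vote`, the toy labels, and the tie-breaking rule are assumptions made for this example, not details reported by the authors.

```python
from collections import Counter


def majority_vote(predictions):
    """Hard-voting ensemble over per-model label lists.

    `predictions` holds one list of labels per model, all of equal length.
    (Hypothetical helper for illustration only.)
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every model must predict the same number of samples")

    voted = []
    for sample_votes in zip(*predictions):
        # most_common keeps first-seen order among equal counts, so a tie
        # falls back to the label of the earliest-listed model.
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Toy outputs from three hypothetical classifiers on four samples.
model_outputs = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(majority_vote(model_outputs))  # [1, 0, 1, 1]
```

With an even number of voters, as in the six-classifier ensemble described above, ties are possible, so whatever tie-breaking rule is used should be stated explicitly.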

Title: AutoIoT: Automated IoT Platform Using Large Language Models

Authors: Ye Cheng, Minghui Xu, Yue Zhang, Kun Li, Ruoxi Wang, Lian Yang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.10665
Pdf URL: https://arxiv.org/pdf/2411.10665
Copy Paste: [[2411.10665]] AutoIoT: Automated IoT Platform Using Large Language Models(https://arxiv.org/abs/2411.10665)
Keywords: extraction, large language model
Abstract: IoT platforms, particularly smart home platforms providing significant convenience to people's lives such as Apple HomeKit and Samsung SmartThings, allow users to create automation rules through trigger-action programming. However, some users may lack the necessary knowledge to formulate automation rules, thus preventing them from fully benefiting from the conveniences offered by smart home technology. To address this, smart home platforms provide pre-defined automation policies based on the smart home devices registered by the user. Nevertheless, these policies, being pre-generated and relatively simple, fail to adequately cover the diverse needs of users. Furthermore, conflicts may arise between automation rules, and integrating conflict detection into the IoT platform increases the burden on developers. In this paper, we propose AutoIoT, an automated IoT platform based on Large Language Models (LLMs) and formal verification techniques, designed to achieve end-to-end automation through device information extraction, LLM-based rule generation, conflict detection, and avoidance. AutoIoT can help users generate conflict-free automation rules and assist developers in generating code for conflict detection, thereby enhancing their experience. A code adapter has been designed to separate logical reasoning from the syntactic details of code generation, enabling LLMs to generate code for programming languages beyond their training data. Finally, we evaluated the performance of AutoIoT and presented a case study demonstrating how AutoIoT can integrate with existing IoT platforms.

Title: SAM Decoding: Speculative Decoding via Suffix Automaton

Authors: Yuxuan Hu, Ke Wang, Jing Zhang, Cuiping Li, Hong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10666
Pdf URL: https://arxiv.org/pdf/2411.10666
Copy Paste: [[2411.10666]] SAM Decoding: Speculative Decoding via Suffix Automaton(https://arxiv.org/abs/2411.10666)
Keywords: large language model
Abstract: Large Language Models (LLMs) have revolutionized natural language processing by unifying tasks into text generation, yet their large parameter sizes and autoregressive nature limit inference speed. SAM-Decoding addresses this by introducing a novel retrieval-based speculative decoding method that uses a suffix automaton for efficient and accurate draft generation. Unlike the n-gram matching used by existing methods, SAM-Decoding finds the longest suffix match in the generated text and the text corpus, achieving an average time complexity of $O(1)$ per generation step. SAM-Decoding constructs static and dynamic suffix automatons for the text corpus and input prompts, respectively, enabling fast and precise draft generation. Meanwhile, it is designed as an approach that can be combined with existing methods, allowing SAM-Decoding to adaptively select a draft generation strategy based on the matching length, thus increasing the inference speed of the LLM. When combined with Token Recycling, evaluations show SAM-Decoding outperforms existing model-free methods, achieving a speedup of $2.27\times$ over autoregressive decoding on Spec-Bench. When combined with EAGLE2, it reaches a speedup of $2.49\times$, surpassing all current approaches. Our code is available at this https URL.
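
Illustration (not from the paper): the longest-suffix-match idea can be sketched with a textbook suffix automaton. This is a generic construction, not the authors' implementation; the class and function names are assumptions for this example, characters stand in for tokens, and draft generation from the matched position is omitted.

```python
class SuffixAutomaton:
    """Textbook online suffix automaton over a sequence of hashable tokens."""

    def __init__(self):
        self.next = [{}]     # outgoing transitions of each state
        self.link = [-1]     # suffix link of each state
        self.length = [0]    # length of the longest string in each state
        self.last = 0        # state that represents the whole sequence so far

    def extend(self, token):
        cur = len(self.next)
        self.next.append({})
        self.length.append(self.length[self.last] + 1)
        self.link.append(0)
        p = self.last
        while p != -1 and token not in self.next[p]:
            self.next[p][token] = cur
            p = self.link[p]
        if p != -1:
            q = self.next[p][token]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split state q by cloning it with the shorter length.
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.length.append(self.length[p] + 1)
                self.link.append(self.link[q])
                while p != -1 and self.next[p].get(token) == q:
                    self.next[p][token] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur


def longest_suffix_match(sam, query):
    """Length of the longest suffix of `query` that occurs in the indexed text."""
    state, matched = 0, 0
    for token in query:
        # Drop to shorter suffixes until one can be extended by `token`.
        while state != 0 and token not in sam.next[state]:
            state = sam.link[state]
            matched = sam.length[state]
        if token in sam.next[state]:
            state = sam.next[state][token]
            matched += 1
        else:
            state, matched = 0, 0
    return matched


sam = SuffixAutomaton()
for ch in "abcbc":          # toy "corpus"; characters stand in for tokens
    sam.extend(ch)
print(longest_suffix_match(sam, "xxbcb"))
```

Because the match state is carried from one token to the next, each new token costs amortized constant work, which is the property the abstract attributes to the approach.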

Title: Segmentation of Ink and Parchment in Dead Sea Scroll Fragments

Authors: Berat Kurar-Barakat, Nachum Dershowitz
Subjects: cs.CV, cs.DL
Abstract URL: https://arxiv.org/abs/2411.10668
Pdf URL: https://arxiv.org/pdf/2411.10668
Copy Paste: [[2411.10668]] Segmentation of Ink and Parchment in Dead Sea Scroll Fragments(https://arxiv.org/abs/2411.10668)
Keywords: segmentation
Abstract: The discovery of the Dead Sea Scrolls over 60 years ago is widely regarded as one of the greatest archaeological breakthroughs in modern history. Recent study of the scrolls presents ongoing computational challenges, including determining the provenance of fragments, clustering fragments based on their degree of similarity, and pairing fragments that originate from the same manuscript -- all tasks that require focusing on individual letter and fragment shapes. This paper presents a computational method for segmenting ink and parchment regions in multispectral images of Dead Sea Scroll fragments. Using the newly developed Qumran Segmentation Dataset (QSD) consisting of 20 fragments, we apply multispectral thresholding to isolate ink and parchment regions based on their unique spectral signatures. To refine segmentation accuracy, we introduce an energy minimization technique that leverages ink contours, which are more distinguishable from the background and less noisy than inner ink regions. Experimental results demonstrate that this Multispectral Thresholding and Energy Minimization (MTEM) method achieves significant improvements over traditional binarization approaches like Otsu and Sauvola in parchment segmentation and is successful at delineating ink borders, in distinction from holes and background regions.

Title: Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Authors: Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10669
Pdf URL: https://arxiv.org/pdf/2411.10669
Copy Paste: [[2411.10669]] Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts(https://arxiv.org/abs/2411.10669)
Keywords: large language model
Abstract: As research on Multimodal Large Language Models (MLLMs) becomes popular, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across various tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLM, which acquires the multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple latest benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released in our Project Page: this https URL.

Title: IntentGPT: Few-shot Intent Discovery with Large Language Models

Authors: Juan A. Rodriguez, Nicholas Botzer, David Vazquez, Christopher Pal, Marco Pedersoli, Issam Laradji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10670
Pdf URL: https://arxiv.org/pdf/2411.10670
Copy Paste: [[2411.10670]] IntentGPT: Few-shot Intent Discovery with Large Language Models(https://arxiv.org/abs/2411.10670)
Keywords: large language model
Abstract: In today's digitally driven world, dialogue systems play a pivotal role in enhancing user interactions, from customer service to virtual assistants. In these dialogues, it is important to identify users' goals automatically to resolve their needs promptly. This has necessitated the integration of models that perform Intent Detection. However, users' intents are diverse and dynamic, making it challenging to maintain a fixed set of predefined intents. As a result, a more practical approach is to develop a model capable of identifying new intents as they emerge. We address the challenge of Intent Discovery, an area that has drawn significant attention in recent research efforts. Existing methods need to train on a substantial amount of data for correctly identifying new intents, demanding significant human effort. To overcome this, we introduce IntentGPT, a novel training-free method that effectively prompts Large Language Models (LLMs) such as GPT-4 to discover new intents with minimal labeled data. IntentGPT comprises an In-Context Prompt Generator, which generates informative prompts for In-Context Learning, an Intent Predictor for classifying and discovering user intents from utterances, and a Semantic Few-Shot Sampler that selects relevant few-shot examples and a set of known intents to be injected into the prompt. Our experiments show that IntentGPT outperforms previous methods that require extensive domain-specific data and fine-tuning, in popular benchmarks, including CLINC and BANKING, among others.

Title: How to Defend Against Large-scale Model Poisoning Attacks in Federated Learning: A Vertical Solution

Authors: Jinbo Wang, Ruijin Wang, Fengli Zhang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2411.10673
Pdf URL: https://arxiv.org/pdf/2411.10673
Copy Paste: [[2411.10673]] How to Defend Against Large-scale Model Poisoning Attacks in Federated Learning: A Vertical Solution(https://arxiv.org/abs/2411.10673)
Keywords: defense, attack, federate
Abstract: Federated learning (FL) is vulnerable to model poisoning attacks due to its distributed nature. The current defenses start from all user gradients (model updates) in each communication round and solve for the optimal aggregation gradients (horizontal solution). This horizontal solution will completely fail when facing large-scale (>50%) model poisoning attacks. In this work, based on the key insight that the convergence process of the model is a highly predictable process, we break away from the traditional horizontal solution of defense and innovatively transform the problem of solving the optimal aggregation gradients into a vertical solution problem. We propose VERT, which uses global communication rounds as the vertical axis, trains a predictor using historical gradient information to predict user gradients, and compares the similarity with actual user gradients to precisely and efficiently select the optimal aggregation gradients. In order to reduce the computational complexity of VERT, we design a low dimensional vector projector to project the user gradients to a computationally acceptable length, and then perform subsequent predictor training and prediction tasks. Exhaustive experiments show that VERT is efficient and scalable, exhibiting excellent large-scale (>=80%) model poisoning defense effects under different FL scenarios. In addition, we can design projectors with different structures for different model structures to adapt to aggregation servers with different computing power.
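
Illustration (not from the paper): the "compare predicted and actual gradients, then select" step can be sketched as below. This is a minimal sketch, assuming the predicted gradients are already available as plain vectors; the predictor and projector are not reproduced, and the names `cosine_similarity`, `select_updates`, the use of cosine similarity, and the 0.5 threshold are all assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_updates(predicted, actual, threshold=0.5):
    """Return ids of clients whose submitted update agrees with its prediction.

    `predicted` and `actual` map client id -> gradient vector.
    The threshold is a hypothetical tuning knob, not a value from the paper.
    """
    return [
        cid
        for cid in actual
        if cosine_similarity(predicted[cid], actual[cid]) >= threshold
    ]


# Toy round: client "b" submits an update pointing opposite to its prediction.
predicted = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
actual = {"a": [0.9, 0.1], "b": [-1.0, 0.0], "c": [0.1, 1.0]}
print(select_updates(predicted, actual))  # ['a', 'c']
```

Note that each client is judged against its own history rather than against the other clients in the round, which is why this kind of check does not depend on honest clients being in the majority.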

Title: Two-layer consensus based on master-slave consortium chain data sharing for Internet of Vehicles

Authors: Feng Zhao, Benchang Yang, Chunhai Li, Chuan Zhang, Liehuang Zhu, Guoling Liang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.10680
Pdf URL: https://arxiv.org/pdf/2411.10680
Copy Paste: [[2411.10680]] Two-layer consensus based on master-slave consortium chain data sharing for Internet of Vehicles(https://arxiv.org/abs/2411.10680)
Keywords: security
Abstract: Due to insufficient scalability, the existing consortium chain cannot meet the requirements of low latency, high throughput, and high security when applied to Internet of Vehicles (IoV) data sharing. Therefore, we propose a two-layer consensus algorithm based on the master-slave consortium chain - Weighted Raft and Byzantine Fault Tolerance (WRBFT). The intra-group consensus of the WRBFT algorithm adopts weighted Raft, and the best node is selected as the master node to lead the intra-group consensus by comprehensively evaluating the signal-to-noise ratio (SNR), data processing capacity and storage capacity of the nodes. The inter-group consensus adopts practical Byzantine fault tolerance (PBFT) based on BLS aggregate signature with nonlinear coefficients to ensure that the inter-group consensus can tolerate 1/3 of Byzantine nodes. At the same time, the verifiable random function (VRF) is used to select the master node of the inter-group consensus to ensure the randomness of the master node. A large number of experimental results show that the proposed WRBFT algorithm reduces delay, and improves throughput and system security.
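
Illustration (not from the paper): the "tolerate 1/3 of Byzantine nodes" statement corresponds to the standard PBFT bound n >= 3f + 1. The sketch below computes that bound and the matching quorum size for a given group size; it reflects generic PBFT arithmetic only, and the function names are assumptions made for this example rather than anything specific to WRBFT.

```python
def max_byzantine_faults(n):
    """Largest f such that n >= 3f + 1 (standard PBFT fault bound)."""
    if n < 1:
        raise ValueError("need at least one node")
    return (n - 1) // 3


def quorum_size(n):
    """Smallest quorum such that any two quorums share an honest node.

    Equals 2f + 1 when n = 3f + 1.
    """
    f = max_byzantine_faults(n)
    return (n + f) // 2 + 1


for n in (4, 7, 10):
    print(n, max_byzantine_faults(n), quorum_size(n))
# 4 1 3
# 7 2 5
# 10 3 7
```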

Title: Structured Dialogue System for Mental Health: An LLM Chatbot Leveraging the PM+ Guidelines

Authors: Yixiang Chen, Xinyu Zhang, Jinran Wang, Xurong Xie, Nan Yan, Hui Chen, Lan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10681
Pdf URL: https://arxiv.org/pdf/2411.10681
Copy Paste: [[2411.10681]] Structured Dialogue System for Mental Health: An LLM Chatbot Leveraging the PM+ Guidelines(https://arxiv.org/abs/2411.10681)
Keywords: large language model
Abstract: The Structured Dialogue System, referred to as SuDoSys, is an innovative Large Language Model (LLM)-based chatbot designed to provide psychological counseling. SuDoSys leverages the World Health Organization (WHO)'s Problem Management Plus (PM+) guidelines to deliver stage-aware multi-turn dialogues. Existing methods for employing an LLM in multi-turn psychological counseling typically involve direct fine-tuning using generated dialogues, often neglecting the dynamic stage shifts of counseling sessions. Unlike previous approaches, SuDoSys considers the different stages of counseling and stores essential information throughout the counseling process, ensuring coherent and directed conversations. The system employs an LLM, a stage-aware instruction generator, a response unpacker, a topic database, and a stage controller to maintain dialogue flow. In addition, we propose a novel technique that simulates counseling clients to interact with the evaluated system and evaluate its performance automatically. When assessed using both objective and subjective evaluations, SuDoSys demonstrates its effectiveness in generating logically coherent responses. The system's code and program scripts for evaluation are open-sourced.

Title: I'm Spartacus, No, I'm Spartacus: Measuring and Understanding LLM Identity Confusion

Authors: Kun Li, Shichao Zhuang, Yue Zhang, Minghui Xu, Ruoxi Wang, Kaidi Xu, Xinwen Fu, Xiuzhen Cheng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.10683
Pdf URL: https://arxiv.org/pdf/2411.10683
Copy Paste: [[2411.10683]] I'm Spartacus, No, I'm Spartacus: Measuring and Understanding LLM Identity Confusion(https://arxiv.org/abs/2411.10683)
Keywords: security, large language model
Abstract: Large Language Models (LLMs) excel in diverse tasks such as text generation, data analysis, and software development, making them indispensable across domains like education, business, and creative industries. However, the rapid proliferation of LLMs (with over 560 companies developing or deploying them as of 2024) has raised concerns about their originality and trustworthiness. A notable issue, termed identity confusion, has emerged, where LLMs misrepresent their origins or identities. This study systematically examines identity confusion through three research questions: (1) How prevalent is identity confusion among LLMs? (2) Does it arise from model reuse, plagiarism, or hallucination? (3) What are the security and trust-related impacts of identity confusion? To address these, we developed an automated tool combining documentation analysis, self-identity recognition testing, and output similarity comparisons--established methods for LLM fingerprinting--and conducted a structured survey via Credamo to assess its impact on user trust. Our analysis of 27 LLMs revealed that 25.93% exhibit identity confusion. Output similarity analysis confirmed that these issues stem from hallucinations rather than replication or reuse. Survey results further highlighted that identity confusion significantly erodes trust, particularly in critical tasks like education and professional use, with declines exceeding those caused by logical errors or inconsistencies. Users attributed these failures to design flaws, incorrect training data, and perceived plagiarism, underscoring the systemic risks posed by identity confusion to LLM reliability and trustworthiness.

Title: MaskMedPaint: Masked Medical Image Inpainting with Diffusion Models for Mitigation of Spurious Correlations

Authors: Qixuan Jin, Walter Gerych, Marzyeh Ghassemi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10686
Pdf URL: https://arxiv.org/pdf/2411.10686
Copy Paste: [[2411.10686]] MaskMedPaint: Masked Medical Image Inpainting with Diffusion Models for Mitigation of Spurious Correlations(https://arxiv.org/abs/2411.10686)
Keywords: diffusion
Abstract: Spurious features associated with class labels can lead image classifiers to rely on shortcuts that don't generalize well to new domains. This is especially problematic in medical settings, where biased models fail when applied to different hospitals or systems. In such cases, data-driven methods to reduce spurious correlations are preferred, as clinicians can directly validate the modified images. While Denoising Diffusion Probabilistic Models (Diffusion Models) show promise for natural images, they are impractical for medical use due to the difficulty of describing spurious medical features. To address this, we propose Masked Medical Image Inpainting (MaskMedPaint), which uses text-to-image diffusion models to augment training images by inpainting areas outside key classification regions to match the target domain. We demonstrate that MaskMedPaint enhances generalization to target domains across both natural (Waterbirds, iWildCam) and medical (ISIC 2018, Chest X-ray) datasets, given limited unlabeled target images.

Title: DEBUG-HD: Debugging TinyML models on-device using Hyper-Dimensional computing

Authors: Nikhil P Ghanathe, Steven J E Wilton
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2411.10692
Pdf URL: https://arxiv.org/pdf/2411.10692
Copy Paste: [[2411.10692]] DEBUG-HD: Debugging TinyML models on-device using Hyper-Dimensional computing(https://arxiv.org/abs/2411.10692)
Keywords: privacy
Abstract: TinyML models often operate in remote, dynamic environments without cloud connectivity, making them prone to failures. Ensuring reliability in such scenarios requires not only detecting model failures but also identifying their root causes. However, transient failures, privacy concerns, and the safety-critical nature of many applications, where systems cannot be interrupted for debugging, complicate the use of raw sensor data for offline analysis. We propose DEBUG-HD, a novel, resource-efficient on-device debugging approach optimized for KB-sized tinyML devices that utilizes hyper-dimensional computing (HDC). Our method introduces a new HDC encoding technique that leverages conventional neural networks, allowing DEBUG-HD to outperform prior binary HDC methods by 27% on average in detecting input corruptions across various image and audio datasets.

Title: HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization

Authors: Huaqin Zhao, Jiaxi Li, Yi Pan, Shizhe Liang, Xiaofeng Yang, Wei Liu, Xiang Li, Fei Dou, Tianming Liu, Jin Lu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10696
Pdf URL: https://arxiv.org/pdf/2411.10696
Copy Paste: [[2411.10696]] HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization(https://arxiv.org/abs/2411.10696)
Keywords: large language model
Abstract: Fine-tuning large language models (LLMs) poses significant memory challenges, as the back-propagation process demands extensive resources, especially with growing model sizes. Recent work, MeZO, addresses this issue using a zeroth-order (ZO) optimization method, which reduces memory consumption by matching the usage to the inference phase. However, MeZO experiences slow convergence due to varying curvatures across model parameters. To overcome this limitation, we introduce HELENE, a novel scalable and memory-efficient optimizer that integrates annealed A-GNB gradients with a diagonal Hessian estimation and layer-wise clipping, serving as a second-order pre-conditioner. This combination allows for faster and more stable convergence. Our theoretical analysis demonstrates that HELENE improves convergence rates, particularly for models with heterogeneous layer dimensions, by reducing the dependency on the total parameter space dimension. Instead, the method scales with the largest layer dimension, making it highly suitable for modern LLM architectures. Experimental results on RoBERTa-large and OPT-1.3B across multiple tasks show that HELENE achieves up to a 20x speedup compared to MeZO, with average accuracy improvements of 1.5%. Furthermore, HELENE remains compatible with both full parameter tuning and parameter-efficient fine-tuning (PEFT), outperforming several state-of-the-art optimizers. The codes will be released after reviewing.
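
Illustration (not from the paper): the zeroth-order idea that HELENE builds on can be sketched with a plain two-point (SPSA-style) gradient estimate, which needs only forward evaluations of the loss. This is a minimal sketch of generic zeroth-order SGD on a toy quadratic; it does not include HELENE's Hessian preconditioning, annealing, or layer-wise clipping, and the function names, step size, and toy loss are assumptions made for this example.

```python
import random


def zo_gradient(loss, theta, eps=1e-3, rng=random):
    """Two-point zeroth-order gradient estimate along one random direction."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    loss_plus = loss([t + eps * zi for t, zi in zip(theta, z)])
    loss_minus = loss([t - eps * zi for t, zi in zip(theta, z)])
    # Directional derivative estimate, projected back onto the direction z.
    scale = (loss_plus - loss_minus) / (2.0 * eps)
    return [scale * zi for zi in z]


def zo_sgd(loss, theta, lr=0.05, steps=500, seed=0):
    """Plain SGD driven by the estimate above (no backpropagation needed)."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = zo_gradient(loss, theta, rng=rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def toy_loss(theta, target=(1.0, -2.0)):
    """Hypothetical stand-in for a model loss: squared distance to a target."""
    return sum((t - c) ** 2 for t, c in zip(theta, target))


theta = zo_sgd(toy_loss, [0.0, 0.0])
print(theta, toy_loss(theta))
```

Each step uses two loss evaluations and stores only the parameters plus one random direction, which is the source of the memory savings the abstract describes. The variance of this estimate grows with the number of parameters, which is the slow-convergence problem that curvature-aware preconditioning is meant to address.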

Title: Diffusion-based Layer-wise Semantic Reconstruction for Unsupervised Out-of-Distribution Detection

Authors: Ying Yang, De Cheng, Chaowei Fang, Yubiao Wang, Changzhe Jiao, Lechao Cheng, Nannan Wang
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.10701
Pdf URL: https://arxiv.org/pdf/2411.10701
Copy Paste: [[2411.10701]] Diffusion-based Layer-wise Semantic Reconstruction for Unsupervised Out-of-Distribution Detection(https://arxiv.org/abs/2411.10701)
Keywords: extraction, diffusion, generative
Abstract: Unsupervised out-of-distribution (OOD) detection aims to identify out-of-domain data by learning only from unlabeled In-Distribution (ID) training samples, which is crucial for developing a safe real-world machine learning system. Current reconstruction-based methods provide a good alternative approach by measuring the reconstruction error between the input and its corresponding generative counterpart in the pixel/feature space. However, such generative methods face a key dilemma: improving the reconstruction power of the generative model while keeping a compact representation of the ID data. To address this issue, we propose the diffusion-based layer-wise semantic reconstruction approach for unsupervised OOD detection. The innovation of our approach is that we leverage the diffusion model's intrinsic data reconstruction ability to distinguish ID samples from OOD samples in the latent feature space. Moreover, to set up a comprehensive and discriminative feature representation, we devise a multi-layer semantic feature extraction strategy. By distorting the extracted features with Gaussian noise and applying the diffusion model for feature reconstruction, the separation of ID and OOD samples is implemented according to the reconstruction errors. Extensive experimental results on multiple benchmarks built upon various datasets demonstrate that our method achieves state-of-the-art performance in terms of detection accuracy and speed. Code is available at

Title: Hybrid Attention Model Using Feature Decomposition and Knowledge Distillation for Glucose Forecasting

Authors: Ebrahim Farahmand, Shovito Barua Soumma, Nooshin Taheri Chatrudi, Hassan Ghasemzadeh
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.10703
Pdf URL: https://arxiv.org/pdf/2411.10703
Copy Paste: [[2411.10703]] Hybrid Attention Model Using Feature Decomposition and Knowledge Distillation for Glucose Forecasting(https://arxiv.org/abs/2411.10703)
Keywords: robust, transformer
Abstract: The availability of continuous glucose monitors as over-the-counter commodities has created a unique opportunity to monitor a person's blood glucose levels, forecast blood glucose trajectories and provide automated interventions to prevent devastating chronic complications that arise from poor glucose control. However, forecasting blood glucose levels is challenging because blood glucose changes consistently in response to food intake, medication intake, physical activity, sleep, and stress. It is particularly difficult to accurately predict blood glucose levels (BGL) from multimodal and irregularly sampled data and over long prediction horizons. Furthermore, these forecasting models must operate in real-time on edge devices to provide in-the-moment interventions. To address these challenges, we propose GlucoNet, an AI-powered sensor system for continuously monitoring behavioral and physiological health and robust forecasting of blood glucose patterns. GlucoNet devises a feature decomposition-based transformer model that incorporates patients' behavioral and physiological data and transforms sparse and irregular patient data (e.g., diet and medication intake data) into continuous features using a mathematical model, facilitating better integration with the BGL data. Given the non-linear and non-stationary nature of BG signals, we propose a decomposition method to extract both low and high-frequency components from the BGL signals, thus providing accurate forecasting. To reduce the computational complexity, we also propose to employ knowledge distillation to compress the transformer model. GlucoNet achieves a 60% improvement in RMSE and a 21% reduction in the number of parameters, using data obtained involving 12 participants with T1-Diabetes. These results underscore GlucoNet's potential as a compact and reliable tool for real-world diabetes prevention and management.

Title: AllRestorer: All-in-One Transformer for Image Restoration under Composite Degradations

Authors: Jiawei Mao, Yu Yang, Xuesong Yin, Ling Shao, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10708
Pdf URL: https://arxiv.org/pdf/2411.10708
Copy Paste: [[2411.10708]] AllRestorer: All-in-One Transformer for Image Restoration under Composite Degradations(https://arxiv.org/abs/2411.10708)
Keywords: transformer
Abstract: Image restoration models often face the simultaneous interaction of multiple degradations in real-world scenarios. Existing approaches typically handle single or composite degradations based on scene descriptors derived from text or image embeddings. However, due to the varying proportions of different degradations within an image, these scene descriptors may not accurately differentiate between degradations, leading to suboptimal restoration in practical applications. To address this issue, we propose a novel Transformer-based restoration framework, AllRestorer. In AllRestorer, we enable the model to adaptively consider all image impairments, thereby avoiding errors from scene descriptor misdirection. Specifically, we introduce an All-in-One Transformer Block (AiOTB), which adaptively removes all degradations present in a given image by modeling the relationships between all degradations and the image embedding in latent space. To accurately address different variations potentially present within the same type of degradation and minimize ambiguity, AiOTB utilizes a composite scene descriptor consisting of both image and text embeddings to define the degradation. Furthermore, AiOTB includes an adaptive weight for each degradation, allowing for precise control of the restoration intensity. By leveraging AiOTB, AllRestorer avoids misdirection caused by inaccurate scene descriptors, achieving a 5.00 dB increase in PSNR compared to the baseline on the CDD-11 dataset.

Title: A Regularized LSTM Method for Detecting Fake News Articles

Authors: Tanjina Sultana Camelia, Faizur Rahman Fahim, Md. Musfique Anwar
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2411.10713
Pdf URL: https://arxiv.org/pdf/2411.10713
Copy Paste: [[2411.10713]] A Regularized LSTM Method for Detecting Fake News Articles(https://arxiv.org/abs/2411.10713)
Keywords: robust, diffusion
Abstract: Nowadays, the rapid diffusion of fake news poses a significant problem, as it can spread misinformation and confusion. This paper aims to develop an advanced machine learning solution for detecting fake news articles. Leveraging a comprehensive dataset of news articles, including 23,502 fake news articles and 21,417 accurate news articles, we implemented and evaluated three machine-learning models. Our dataset, curated from diverse sources, provides rich textual content categorized into title, text, subject, and Date features. These features are essential for training robust classification models to distinguish between fake and authentic news articles. The initial model employed a Long Short-Term Memory (LSTM) network, achieving an accuracy of 94%. The second model improved upon this by incorporating additional regularization techniques and fine-tuning hyperparameters, resulting in a 97% accuracy. The final model combined the strengths of previous architectures with advanced optimization strategies, achieving a peak accuracy of 98%. These results demonstrate the effectiveness of our approach in identifying fake news with high precision. Implementing these models showcases significant advancements in natural language processing and machine learning techniques, contributing valuable tools for combating misinformation. Our work highlights the potential for deploying such models in real-world applications, providing a reliable method for automated fake news detection and enhancing the credibility of news dissemination.

Title: EVT: Efficient View Transformation for Multi-Modal 3D Object Detection

Authors: Yongjin Lee, Hyeon-Mun Jeong, Yurim Jeon, Sanghyun Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10715
Pdf URL: https://arxiv.org/pdf/2411.10715
Copy Paste: [[2411.10715]] EVT: Efficient View Transformation for Multi-Modal 3D Object Detection(https://arxiv.org/abs/2411.10715)
Keywords: transformer
Abstract: Multi-modal sensor fusion in bird's-eye-view (BEV) representation has become the leading approach in 3D object detection. However, existing methods often rely on depth estimators or transformer encoders for view transformation, incurring substantial computational overhead. Furthermore, the lack of precise geometric correspondence between 2D and 3D spaces leads to spatial and ray-directional misalignments, restricting the effectiveness of BEV representations. To address these challenges, we propose a novel 3D object detector via efficient view transformation (EVT), which leverages a well-structured BEV representation to enhance accuracy and efficiency. EVT focuses on two main areas. First, it employs Adaptive Sampling and Adaptive Projection (ASAP), using LiDAR guidance to generate 3D sampling points and adaptive kernels. The generated points and kernels are then used to facilitate the transformation of image features into BEV space and refine the BEV features. Second, EVT includes an improved transformer-based detection framework, which contains a group-wise query initialization method and an enhanced query update framework. It is designed to effectively utilize the obtained multi-modal BEV features within the transformer decoder. By leveraging the geometric properties of object queries, this framework significantly enhances detection performance, especially in a multi-layer transformer decoder structure. EVT achieves state-of-the-art performance on the nuScenes test set with real-time inference speed.

Title: FlowScope: Enhancing Decision Making by Time Series Forecasting based on Prediction Optimization using HybridFlow Forecast Framework

Authors: Nitin Sagar Boyeena, Begari Susheel Kumar
Subjects: cs.LG, cs.CE, eess.SP
Abstract URL: https://arxiv.org/abs/2411.10716
Pdf URL: https://arxiv.org/pdf/2411.10716
Copy Paste: [[2411.10716]] FlowScope: Enhancing Decision Making by Time Series Forecasting based on Prediction Optimization using HybridFlow Forecast Framework(https://arxiv.org/abs/2411.10716)
Keywords: robust
Abstract: Time series forecasting is crucial in several sectors, such as meteorology, retail, healthcare, and finance. Accurately forecasting future trends and patterns is crucial for strategic planning and making well-informed decisions. In this case, it is crucial to include many forecasting methodologies. The strengths of Auto-regressive Integrated Moving Average (ARIMA) for linear time series, Seasonal ARIMA models (SARIMA) for seasonal time series, Exponential Smoothing State Space Models (ETS) for handling errors and trends, and Long Short-Term Memory (LSTM) Neural Network model for complex pattern recognition have been combined to create a comprehensive framework called FlowScope. SARIMA excels in capturing seasonal variations, whereas ARIMA ensures effective handling of linear time series. ETS models excel in capturing trends and correcting errors, whereas LSTM networks excel in reflecting intricate temporal connections. By combining these methods from both machine learning and deep learning, we propose a deep-hybrid learning approach FlowScope which offers a versatile and robust platform for predicting time series data. This empowers enterprises to make informed decisions and optimize long-term strategies for maximum performance. Keywords: Time Series Forecasting, HybridFlow Forecast Framework, Deep-Hybrid Learning, Informed Decisions.

Title: Multi Scale Graph Neural Network for Alzheimer's Disease

Authors: Anya Chauhan, Ayush Noori, Zhaozhi Li, Yingnan He, Michelle M Li, Marinka Zitnik, Sudeshna Das
Subjects: cs.LG, q-bio.NC, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.10720
Pdf URL: https://arxiv.org/pdf/2411.10720
Copy Paste: [[2411.10720]] Multi Scale Graph Neural Network for Alzheimer's Disease(https://arxiv.org/abs/2411.10720)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a complex, progressive neurodegenerative disorder characterized by extracellular Aβ plaques, neurofibrillary tau tangles, glial activation, and neuronal degeneration, involving multiple cell types and pathways. Current models often overlook the cellular context of these pathways. To address this, we developed a multiscale graph neural network (GNN) model, ALZ PINNACLE, using brain omics data from donors spanning the entire aging to AD spectrum. ALZ PINNACLE is based on the PINNACLE GNN framework, which learns context-aware protein, cell type, and tissue representations within a unified latent space. ALZ PINNACLE was trained on 14,951 proteins, 206,850 protein interactions, 7 cell types, and 48 cell subtypes or states. After pretraining, we investigated the learned embedding of APOE, the largest genetic risk factor for AD, across different cell types. Notably, APOE embeddings showed high similarity in microglial, neuronal, and CD8 cells, suggesting a similar role of APOE in these cell types. Fine tuning the model on AD risk genes revealed cell type contexts predictive of the role of APOE in AD. Our results suggest that ALZ PINNACLE may provide a valuable framework for uncovering novel insights into AD neurobiology.

Title: HJ-Ky-0.1: an Evaluation Dataset for Kyrgyz Word Embeddings

Authors: Anton Alekseev, Gulnara Kabaeva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10724
Pdf URL: https://arxiv.org/pdf/2411.10724
Copy Paste: [[2411.10724]] HJ-Ky-0.1: an Evaluation Dataset for Kyrgyz Word Embeddings(https://arxiv.org/abs/2411.10724)
Keywords: extraction
Abstract: One of the key tasks in modern applied computational linguistics is constructing word vector representations (word embeddings), which are widely used to address natural language processing tasks such as sentiment analysis, information extraction, and more. To choose an appropriate method for generating these word embeddings, quality assessment techniques are often necessary. A standard approach involves calculating distances between vectors for words with expert-assessed 'similarity'. This work introduces the first 'silver standard' dataset for such tasks in the Kyrgyz language, alongside training corresponding models and validating the dataset's suitability through quality evaluation metrics.
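
The evaluation approach this abstract describes (comparing vector distances with expert-assessed similarity) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the toy vectors, word pairs, and human scores are invented, and the use of cosine similarity with Spearman rank correlation is a common convention assumed here rather than something the abstract confirms.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy embeddings and human similarity judgments (0-10 scale).
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "bus": [0.2, 0.8],
}
human_scores = {
    ("cat", "dog"): 9.0,
    ("car", "bus"): 8.5,
    ("cat", "car"): 1.0,
    ("dog", "bus"): 2.0,
}

pairs = list(human_scores)
model_sims = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
gold = [human_scores[p] for p in pairs]
rho = spearman(model_sims, gold)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```

A higher correlation means the embedding model orders word pairs more like the human annotators do, which is why a rank-based statistic is a natural fit for this kind of dataset.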

Title: On-device Anomaly Detection in Conveyor Belt Operations

Authors: Luciano S. Martinez-Rau, Yuxuan Zhang, Bengt Oelmann, Sebastian Bader
Subjects: cs.LG, cs.CE, eess.SP
Abstract URL: https://arxiv.org/abs/2411.10729
Pdf URL: https://arxiv.org/pdf/2411.10729
Copy Paste: [[2411.10729]] On-device Anomaly Detection in Conveyor Belt Operations(https://arxiv.org/abs/2411.10729)
Keywords: robust, extraction
Abstract: Mining 4.0 leverages advancements in automation, digitalization, and interconnected technologies from Industry 4.0 to address the unique challenges of the mining sector, enhancing efficiency, safety, and sustainability. Conveyor belts are crucial in mining operations by enabling the continuous and efficient movement of bulk materials over long distances, which directly impacts productivity. While detecting anomalies in specific conveyor belt components, such as idlers, pulleys, and belt surfaces, has been widely studied, identifying the root causes of these failures remains critical due to factors like changing production conditions and operator errors. Continuous monitoring of mining conveyor belt work cycles for anomaly detection is still at an early stage and requires robust solutions. This study proposes two distinctive pattern recognition approaches for real-time anomaly detection in the operational cycles of mining conveyor belts, combining feature extraction, threshold-based cycle detection, and tiny machine-learning classification. Both approaches outperformed a state-of-the-art technique on two datasets for duty cycle classification in terms of F1-scores. The first approach, with 97.3% and 80.2% for normal and abnormal cycles, respectively, reaches the highest performance in the first dataset while the second approach excels on the second dataset, scoring 91.3% and 67.9%. Implemented on two low-power microcontrollers, the methods demonstrated efficient, real-time operation with energy consumption of 13.3 and 20.6 µJ during inference. These results offer valuable insights for detecting mechanical failure sources, supporting targeted preventive maintenance, and optimizing production cycles.
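
To make the idea of threshold-based cycle detection followed by feature extraction concrete, here is a minimal sketch. It is not the authors' method: the signal, the threshold value, and the chosen features (duration, mean, peak) are all hypothetical, and no classifier is implemented; in a pipeline like the one described, features of this kind would presumably be passed to a small classifier.

```python
def detect_cycles(signal, threshold):
    """Return (start, end) index pairs of runs where signal > threshold.

    `end` is exclusive, so signal[start:end] is the whole cycle.
    """
    cycles = []
    start = None
    for i, value in enumerate(signal):
        if value > threshold and start is None:
            start = i                      # rising edge: a cycle begins
        elif value <= threshold and start is not None:
            cycles.append((start, i))      # falling edge: the cycle ends
            start = None
    if start is not None:                  # signal ended mid-cycle
        cycles.append((start, len(signal)))
    return cycles


def cycle_features(signal, cycle):
    """Simple per-cycle features (hypothetical choice of features)."""
    start, end = cycle
    segment = signal[start:end]
    return {
        "duration": end - start,
        "mean": sum(segment) / len(segment),
        "peak": max(segment),
    }


# Hypothetical load signal sampled at a fixed rate.
signal = [0, 0, 5, 6, 5, 0, 0, 7, 7, 0, 4, 5, 6, 5, 4]
cycles = detect_cycles(signal, threshold=1)
features = [cycle_features(signal, c) for c in cycles]
for cycle, feats in zip(cycles, features):
    print(cycle, feats)
```

A single pass with constant memory, as above, is the kind of computation that suits a low-power microcontroller, though the actual on-device implementation in the paper may differ.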

Title: MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map

Authors: Yuhong Chou, Man Yao, Kexin Wang, Yuqi Pan, Ruijie Zhu, Yiran Zhong, Yu Qiao, Jibin Wu, Bo Xu, Guoqi Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10741
Pdf URL: https://arxiv.org/pdf/2411.10741
Copy Paste: [[2411.10741]] MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map(https://arxiv.org/abs/2411.10741)
Keywords: transformer
Abstract: Various linear complexity models, such as Linear Transformer (LinFormer), State Space Model (SSM), and Linear RNN (LinRNN), have been proposed to replace the conventional softmax attention in Transformer structures. However, the optimal design of these linear models is still an open question. In this work, we attempt to answer this question by finding the best linear approximation to softmax attention from a theoretical perspective. We start by unifying existing linear complexity models as the linear attention form and then identify three conditions for the optimal linear attention design: 1) Dynamic memory ability; 2) Static approximation ability; 3) Least parameter approximation. We find that none of the current linear models meet all three conditions, resulting in suboptimal performance. Instead, we propose Meta Linear Attention (MetaLA) as a solution that satisfies these conditions. Our experiments on Multi-Query Associative Recall (MQAR) task, language modeling, image classification, and Long-Range Arena (LRA) benchmark demonstrate that MetaLA is more effective than the existing linear models.
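
For readers unfamiliar with the "linear attention form" the abstract refers to, the sketch below shows generic causal linear attention in two equivalent ways: a parallel form that sums over all past positions, and a recurrent form that carries a fixed-size state. This is not MetaLA: it has no decay, gating, feature map, or normalization, and the function names and toy inputs are assumptions made for illustration.

```python
def linear_attention_parallel(queries, keys, values):
    """o_t = sum over s <= t of (q_t . k_s) * v_s, computed directly."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q, keys[s]))
            for j, v in enumerate(values[s]):
                out[j] += score * v
        outputs.append(out)
    return outputs


def linear_attention_recurrent(queries, keys, values):
    """Same outputs via a running state S_t = S_{t-1} + k_t v_t^T."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # d_k x d_v memory matrix
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]     # accumulate outer product
        # Read out: o_t = q_t^T S_t
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


# Hypothetical toy sequence of length 3 with d_k = 2 and d_v = 1.
queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
keys = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
values = [[1.0], [2.0], [3.0]]

parallel_out = linear_attention_parallel(queries, keys, values)
recurrent_out = linear_attention_recurrent(queries, keys, values)
print(parallel_out)
print(recurrent_out)
```

The recurrent form does constant work per token regardless of sequence length, which is the source of the linear complexity; the cost is that all history must be compressed into the fixed-size state.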

Title: It Takes Two: Accurate Gait Recognition in the Wild via Cross-granularity Alignment

Authors: Jinkai Zheng, Xinchen Liu, Boyue Zhang, Chenggang Yan, Jiyong Zhang, Wu Liu, Yongdong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10742
Pdf URL: https://arxiv.org/pdf/2411.10742
Copy Paste: [[2411.10742]] It Takes Two: Accurate Gait Recognition in the Wild via Cross-granularity Alignment(https://arxiv.org/abs/2411.10742)
Keywords: robust, segmentation
Abstract: Existing studies on gait recognition primarily use sequences of either binary silhouettes or human parsing to encode the shapes and dynamics of persons during walking. Silhouettes exhibit accurate segmentation quality and robustness to environmental variations, but their low information entropy may result in sub-optimal performance. In contrast, human parsing provides fine-grained part segmentation with higher information entropy, but the segmentation quality may deteriorate in complex environments. To exploit the advantages of silhouette and parsing and overcome their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. To achieve this goal, XGait first uses two branches of backbone encoders to map the silhouette sequences and the parsing sequences into two latent spaces, respectively. Moreover, to explore the complementary knowledge across the features of the two representations, we design the Global Cross-granularity Module (GCM) and the Part Cross-granularity Module (PCM) after the two encoders. In particular, the GCM aims to enhance the quality of parsing features by leveraging global features from silhouettes, while the PCM aligns the dynamics of human parts between silhouette and parsing features using the high information entropy in parsing sequences. In addition, to effectively guide the alignment of two representations with different granularity at the part level, an elaborately designed learnable division mechanism is proposed for the parsing features. Comprehensive experiments on two large-scale gait datasets not only show the superior performance of XGait, with Rank-1 accuracy of 80.5% on Gait3D and 88.3% on CCPG, but also reflect the robustness of the learned features even under challenging conditions like occlusions and clothing changes.

Title: TDSM:Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition

Authors: Jeonghyeok Do, Munchurl Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10745
Pdf URL: https://arxiv.org/pdf/2411.10745
Copy Paste: [[2411.10745]] TDSM:Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition(https://arxiv.org/abs/2411.10745)
Keywords: robust, diffusion, generative
Abstract: We present the first diffusion-based action recognition method with zero-shot learning for skeleton inputs. In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. Previous methods focus on direct alignment between skeleton and text latent spaces, but the modality gaps between these spaces hinder robust generalization learning. Motivated by the remarkable performance of text-to-image diffusion models, we leverage their alignment capabilities between different modalities, mostly by focusing on the training process during reverse diffusion rather than using their generative power. Based on this, our framework is designed as a Triplet Diffusion for Skeleton-Text Matching (TDSM) method which aligns skeleton features with text prompts through reverse diffusion, embedding the prompts into the unified skeleton-text latent space to achieve robust matching. To enhance discriminative power, we introduce a novel triplet diffusion (TD) loss that encourages our TDSM to pull together correct skeleton-text matches while pushing apart incorrect ones. Our TDSM significantly outperforms very recent state-of-the-art methods by large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.

Title: LTCXNet: Advancing Chest X-Ray Analysis with Solutions for Long-Tailed Multi-Label Classification and Fairness Challenges

Authors: Chin-Wei Huang, Mu-Yi Shen, Kuan-Chang Shih, Shih-Chih Lin, Chi-Yu Chen, Po-Chih Kuo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10746
Pdf URL: https://arxiv.org/pdf/2411.10746
Copy Paste: [[2411.10746]] LTCXNet: Advancing Chest X-Ray Analysis with Solutions for Long-Tailed Multi-Label Classification and Fairness Challenges(https://arxiv.org/abs/2411.10746)
Keywords: fair
Abstract: Chest X-rays (CXRs) often display various diseases with disparate class frequencies, leading to a long-tailed, multi-label data distribution. In response to this challenge, we explore the Pruned MIMIC-CXR-LT dataset, a curated collection derived from the MIMIC-CXR dataset, specifically designed to represent a long-tailed and multi-label data scenario. We introduce LTCXNet, a novel framework that integrates the ConvNeXt model, ML-Decoder, and strategic data augmentation, further enhanced by an ensemble approach. We demonstrate that LTCXNet improves the performance of CXR interpretation across all classes, especially enhancing detection in rarer classes like 'Pneumoperitoneum' and 'Pneumomediastinum' by 79% and 48%, respectively. Beyond performance metrics, our research extends into evaluating fairness, highlighting that some methods, while improving model accuracy, could inadvertently affect fairness across different demographic groups negatively. This work contributes to advancing the understanding and management of long-tailed, multi-label data distributions in medical imaging, paving the way for more equitable and effective diagnostic tools.

Title: Can Generic LLMs Help Analyze Child-adult Interactions Involving Children with Autism in Clinical Observation?

Authors: Tiantian Feng, Anfeng Xu, Rimita Lahiri, Helen Tager-Flusberg, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10761
Pdf URL: https://arxiv.org/pdf/2411.10761
Copy Paste: [[2411.10761]] Can Generic LLMs Help Analyze Child-adult Interactions Involving Children with Autism in Clinical Observation?(https://arxiv.org/abs/2411.10761)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown significant potential in understanding human communication and interaction. However, their performance in the domain of child-inclusive interactions, including in clinical settings, remains less explored. In this work, we evaluate generic LLMs' ability to analyze child-adult dyadic interactions in a clinically relevant context involving children with ASD. Specifically, we explore LLMs in performing four tasks: classifying child-adult utterances, predicting engaged activities, recognizing language skills, and understanding traits that are clinically relevant. Our evaluation shows that generic LLMs are highly capable of analyzing long and complex conversations in clinical observation sessions, often surpassing the performance of non-expert human evaluators. The results show their potential to segment interactions of interest, assist in language skills evaluation, identify engaged activities, and offer clinically relevant context for assessments.

Title: Steam Turbine Anomaly Detection: An Unsupervised Learning Approach Using Enhanced Long Short-Term Memory Variational Autoencoder

Authors: Weiming Xu, Peng Zhang
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.10765
Pdf URL: https://arxiv.org/pdf/2411.10765
Copy Paste: [[2411.10765]] Steam Turbine Anomaly Detection: An Unsupervised Learning Approach Using Enhanced Long Short-Term Memory Variational Autoencoder(https://arxiv.org/abs/2411.10765)
Keywords: robust
Abstract: As core thermal power generation equipment, steam turbines incur significant expenses and adverse effects on operation when facing interruptions like downtime, maintenance, and damage. Accurate anomaly detection is the prerequisite for ensuring the safe and stable operation of steam turbines. However, challenges in steam turbine anomaly detection, including inherent anomalies, lack of temporal information analysis, and high-dimensional data complexity, limit the effectiveness of existing methods. To address these challenges, we proposed an Enhanced Long Short-Term Memory Variational Autoencoder using Deep Advanced Features and Gaussian Mixture Model (ELSTMVAE-DAF-GMM) for precise unsupervised anomaly detection in unlabeled datasets. Specifically, LSTMVAE, integrating LSTM with VAE, was used to project high-dimensional time-series data to a low-dimensional phase space. The Deep Autoencoder-Local Outlier Factor (DAE-LOF) sample selection mechanism was used to eliminate inherent anomalies during training, further improving the model's precision and reliability. The novel deep advanced features (DAF) hybridize latent embeddings and reconstruction discrepancies from the LSTMVAE model and provide a more comprehensive data representation within a continuous and structured phase space, significantly enhancing anomaly detection by synergizing temporal dynamics with data pattern variations. These DAF were incorporated into GMM to ensure robust and effective unsupervised anomaly detection. We utilized real operating data from industry steam turbines and conducted both comparison and ablation experiments, demonstrating superior anomaly detection outcomes characterized by high accuracy and minimal false alarm rates compared with existing methods.

Title: Task Offloading for Vehicular Edge Computing Based on Improved Hotstuff under Parking Assistance

Authors: Guoling Liang, Chunhai Li, Feng Zhao, Chuan Zhang, Liehuang Zhu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.10770
Pdf URL: https://arxiv.org/pdf/2411.10770
Copy Paste: [[2411.10770]] Task Offloading for Vehicular Edge Computing Based on Improved Hotstuff under Parking Assistance(https://arxiv.org/abs/2411.10770)
Keywords: secure, security
Abstract: Parked-assisted vehicular edge computing (PVEC) fully leverages the communication and computing resources of parked vehicles, thereby significantly alleviating the pressure on edge servers. However, resource sharing and trading for vehicular task offloading in the PVEC environment usually occur between untrustworthy entities, which compromises the security of data sharing and transactions by vehicles and edge devices. To address these concerns, blockchain is introduced to provide a secure and trustworthy environment for offloading and transactions in PVEC. Nevertheless, due to the mobility of the vehicles, the processes of computing offloading and blockchain transactions are interrupted, which greatly reduces the reliability of the blockchain in the edge computing process. In this paper, we propose a blockchain-based PVEC (BPVEC) offloading framework to enhance the security and reliability of task offloading and transactions. Specifically, a consensus node selection algorithm based on the connected dominating set (CDS) is designed to improve the Hotstuff consensus according to parking time, computing capability, and communication quality, which enhances blockchain reliability in computing offloading and transactions. Meanwhile, a Stackelberg game model, establishing the roadside units (RSUs) and parking vehicles (PVs) as leaders and the requesting vehicles (RVs) as followers, is utilized to optimize the offloading strategy and pricing. Subsequently, a BPVEC offloading strategy algorithm with a gradient descent method is designed to maximize system revenue. Simulation results show that the proposed BPVEC offloading scheme is secure and reliable while ensuring maximum benefits.

Title: Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer

Authors: Shitong Shao, Zikai Zhou, Tian Ye, Lichen Bai, Zhiqiang Xu, Zeke Xie
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10781
Pdf URL: https://arxiv.org/pdf/2411.10781
Copy Paste: [[2411.10781]] Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer(https://arxiv.org/abs/2411.10781)
Keywords: diffusion, transformer, generative
Abstract: Text-to-image diffusion models (DMs) develop at an unprecedented pace, supported by thorough theoretical exploration and empirical analysis. Unfortunately, the discrepancy between DMs and autoregressive models (ARMs) complicates the path toward achieving the goal of unified vision and language generation. Recently, the masked generative Transformer (MGT) serves as a promising intermediary between DM and ARM by predicting randomly masked image tokens (i.e., masked image modeling), combining the efficiency of DM with the discrete token nature of ARM. However, we find that the comprehensive analyses regarding the inference for MGT are virtually non-existent, and thus we aim to present positive design choices to fill this gap. We modify and re-design a set of DM-based inference techniques for MGT and further elucidate their performance on MGT. We also discuss the approach to correcting token's distribution to enhance inference. Extensive experiments and empirical analyses lead to concrete and effective design choices, and these design choices can be merged to achieve further performance gains. For instance, in terms of enhanced inference, we achieve winning rates of approximately 70% compared to vanilla sampling on HPS v2 with the recent SOTA MGT Meissonic. Our contributions have the potential to further enhance the capabilities and future development of MGTs.

Title: C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation

Authors: Jeonghyeok Do, Jaehyup Lee, Munchurl Kim
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.10788
Pdf URL: https://arxiv.org/pdf/2411.10788
Copy Paste: [[2411.10788]] C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation(https://arxiv.org/abs/2411.10788)
Keywords: robust, diffusion
Abstract: Synthetic Aperture Radar (SAR) imagery provides robust environmental and temporal coverage (e.g., during clouds, seasons, day-night cycles), yet its noise and unique structural patterns pose interpretation challenges, especially for non-experts. SAR-to-EO (Electro-Optical) image translation (SET) has emerged to make SAR images more perceptually interpretable. However, traditional approaches trained from scratch on limited SAR-EO datasets are prone to overfitting. To address these challenges, we introduce Confidence Diffusion for SAR-to-EO Translation, called C-DiffSET, a framework leveraging pretrained Latent Diffusion Model (LDM) extensively trained on natural images, thus enabling effective adaptation to the EO domain. Remarkably, we find that the pretrained VAE encoder aligns SAR and EO images in the same latent space, even with varying noise levels in SAR inputs. To further improve pixel-wise fidelity for SET, we propose a confidence-guided diffusion (C-Diff) loss that mitigates artifacts from temporal discrepancies, such as appearing or disappearing objects, thereby enhancing structural accuracy. C-DiffSET achieves state-of-the-art (SOTA) results on multiple datasets, significantly outperforming the very recent image-to-image translation methods and SET methods with large margins.

Title: Anatomy-Guided Radiology Report Generation with Pathology-Aware Regional Prompts

Authors: Yijian Gao, Dominic Marshall, Xiaodan Xing, Junzhi Ning, Giorgos Papanastasiou, Guang Yang, Matthieu Komorowski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10789
Pdf URL: https://arxiv.org/pdf/2411.10789
Copy Paste: [[2411.10789]] Anatomy-Guided Radiology Report Generation with Pathology-Aware Regional Prompts(https://arxiv.org/abs/2411.10789)
Keywords: generative
Abstract: Radiology reporting generative AI holds significant potential to alleviate clinical workloads and streamline medical care. However, achieving high clinical accuracy is challenging, as radiological images often feature subtle lesions and intricate structures. Existing systems often fall short, largely due to their reliance on fixed size, patch-level image features and insufficient incorporation of pathological information. This can result in the neglect of such subtle patterns and inconsistent descriptions of crucial pathologies. To address these challenges, we propose an innovative approach that leverages pathology-aware regional prompts to explicitly integrate anatomical and pathological information of various scales, significantly enhancing the precision and clinical relevance of generated reports. We develop an anatomical region detector that extracts features from distinct anatomical areas, coupled with a novel multi-label lesion detector that identifies global pathologies. Our approach emulates the diagnostic process of radiologists, producing clinically accurate reports with comprehensive diagnostic capabilities. Experimental results show that our model outperforms previous state-of-the-art methods on most natural language generation and clinical efficacy metrics, with formal expert evaluations affirming its potential to enhance radiology practice.

Title: Test-time Conditional Text-to-Image Synthesis Using Diffusion Models

Authors: Tripti Shukla, Srikrishna Karanam, Balaji Vasan Srinivasan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10800
Pdf URL: https://arxiv.org/pdf/2411.10800
Copy Paste: [[2411.10800]] Test-time Conditional Text-to-Image Synthesis Using Diffusion Models(https://arxiv.org/abs/2411.10800)
Keywords: diffusion
Abstract: We consider the problem of conditional text-to-image synthesis with diffusion models. Most recent works need to either finetune specific parts of the base diffusion model or introduce new trainable parameters, leading to deployment inflexibility due to the need for training. To address this gap in the current literature, we propose our method called TINTIN: Test-time Conditional Text-to-Image Synthesis using Diffusion Models which is a new training-free test-time only algorithm to condition text-to-image diffusion model outputs on conditioning factors such as color palettes and edge maps. In particular, we propose to interpret noise predictions during denoising as gradients of an energy-based model, leading to a flexible approach to manipulate the noise by matching predictions inferred from them to the ground truth conditioning input. This results in, to the best of our knowledge, the first approach to control model outputs with input color palettes, which we realize using a novel color distribution matching loss. We also show this test-time noise manipulation can be easily extensible to other types of conditioning, e.g., edge maps. We conduct extensive experiments using a variety of text prompts, color palettes, and edge maps and demonstrate significant improvement over the current state-of-the-art, both qualitatively and quantitatively.

Title: Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model

Authors: Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10803
Pdf URL: https://arxiv.org/pdf/2411.10803
Copy Paste: [[2411.10803]] Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model(https://arxiv.org/abs/2411.10803)
Keywords: large language model
Abstract: The vision tokens in multimodal large language models usually exhibit significant spatial and temporal redundancy and take up most of the input tokens, which harms their inference efficiency. To solve this problem, some recent works were introduced to drop the unimportant tokens during inference where the importance of each token is decided only by the information in either the vision encoding stage or the prefilling stage. In this paper, we propose Multi-stage Token Dropping (MustDrop) to measure the importance of each token from the whole lifecycle, including the vision encoding stage, prefilling stage, and decoding stage. Concretely, in the visual encoding stage, MustDrop merges spatially adjacent tokens with high similarity, and establishes a key token set to retain the most vision-critical tokens, preventing them from being discarded in later stages. In the prefilling stage, MustDrop further compresses vision tokens by the guidance of text semantics, with a dual-attention filtering strategy. In the decoding stage, an output-aware cache policy is proposed to further reduce the size of the KV cache. By leveraging tailored strategies in the multi-stage process, MustDrop can more precisely recognize the important and redundant tokens, thus achieving an optimal balance between performance and efficiency. For instance, MustDrop reduces about 88.5% FLOPs on LLaVA with a compression ratio of 92.2% while maintaining comparable accuracy. Our codes are available at this https URL.
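
The first stage described above, merging adjacent tokens with high similarity, can be illustrated with a simplified sketch. This is not MustDrop's implementation: it treats tokens as a 1-D sequence rather than a 2-D image grid, compares each token only with its immediate predecessor, and merges by averaging; the threshold value and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def merge_adjacent_tokens(tokens, threshold):
    """Merge runs of neighbouring tokens whose similarity >= threshold.

    Each token is compared with the token just before it; a run of
    similar neighbours is replaced by the run's mean vector.
    """
    if not tokens:
        return []
    merged = []
    group = [tokens[0]]
    for previous, current in zip(tokens, tokens[1:]):
        if cosine(previous, current) >= threshold:
            group.append(current)          # extend the current run
        else:
            merged.append(mean_vector(group))
            group = [current]              # start a new run
    merged.append(mean_vector(group))
    return merged


# Hypothetical 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
merged = merge_adjacent_tokens(tokens, threshold=0.95)
print(len(tokens), "tokens ->", len(merged), "tokens")
```

Raising the threshold keeps more tokens and lowers the compression ratio; lowering it merges more aggressively at the risk of blending dissimilar content.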

Title: Stable Continual Reinforcement Learning via Diffusion-based Trajectory Replay

Authors: Feng Chen, Fuguang Han, Cong Guan, Lei Yuan, Zhilong Zhang, Yang Yu, Zongzhang Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.10809
Pdf URL: https://arxiv.org/pdf/2411.10809
Copy Paste: [[2411.10809]] Stable Continual Reinforcement Learning via Diffusion-based Trajectory Replay(https://arxiv.org/abs/2411.10809)
Keywords: privacy, diffusion, generative
Abstract: Given the inherent non-stationarity prevalent in real-world applications, continual Reinforcement Learning (RL) aims to equip the agent with the capability to address a series of sequentially presented decision-making tasks. Within this problem setting, a pivotal challenge is catastrophic forgetting, wherein the agent is prone to eroding the decisional knowledge associated with previously encountered tasks when learning a new one. In recent work, generative replay methods have shown substantial potential by employing generative models to replay the data distribution of past tasks. Compared to storing the data from past tasks directly, this category of methods circumvents the growing storage overhead and possible data privacy concerns. However, constrained by the expressive capacity of generative models, existing generative replay methods face challenges in faithfully reconstructing the data distribution of past tasks, particularly in scenarios with a myriad of tasks or high-dimensional data. Inspired by the success of diffusion models in various generative tasks, this paper introduces a novel continual RL algorithm, DISTR (Diffusion-based Trajectory Replay), that employs a diffusion model to memorize the high-return trajectory distribution of each encountered task and wakes up these distributions during policy learning on new tasks. In addition, considering the impracticality of replaying all past data each time, a prioritization mechanism is proposed to prioritize the trajectory replay of pivotal tasks in our method. Empirical experiments on the popular continual RL benchmark Continual World demonstrate that our proposed method obtains a favorable balance between stability and plasticity, surpassing various existing continual RL baselines in average success rate.

Title: Information Anxiety in Large Language Models

Authors: Prasoon Bajpai, Sarah Masud, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10813
Pdf URL: https://arxiv.org/pdf/2411.10813
Copy Paste: [[2411.10813]] Information Anxiety in Large Language Models(https://arxiv.org/abs/2411.10813)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated strong performance as knowledge repositories, enabling models to understand user queries and generate accurate and context-aware responses. Extensive evaluation setups have corroborated the positive correlation between the retrieval capability of LLMs and the frequency of entities in their pretraining corpus. We take the investigation further by conducting a comprehensive analysis of the internal reasoning and retrieval mechanisms of LLMs. Our work focuses on three critical dimensions - the impact of entity popularity, the models' sensitivity to lexical variations in query formulation, and the progression of hidden state representations across LLM layers. Our preliminary findings reveal that popular questions facilitate early convergence of internal states toward the correct answer. However, as the popularity of a query increases, retrieved attributes across lexical variations become increasingly dissimilar and less accurate. Interestingly, we find that LLMs struggle to disentangle facts, grounded in distinct relations, from their parametric memory when dealing with highly popular subjects. Through a case study, we explore these latent strains within LLMs when processing highly popular queries, a phenomenon we term information anxiety. The emergence of information anxiety in LLMs underscores the adversarial injection in the form of linguistic variations and calls for a more holistic evaluation of frequently occurring entities.

Title: DEAL: Decoupled Classifier with Adaptive Linear Modulation for Group Robust Early Diagnosis of MCI to AD Conversion

Authors: Donggyu Lee, Juhyeon Park, Taesup Moon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10814
Pdf URL: https://arxiv.org/pdf/2411.10814
Copy Paste: [[2411.10814]] DEAL: Decoupled Classifier with Adaptive Linear Modulation for Group Robust Early Diagnosis of MCI to AD Conversion(https://arxiv.org/abs/2411.10814)
Keywords: robust
Abstract: While deep learning-based Alzheimer's disease (AD) diagnosis has recently made significant advancements, particularly in predicting the conversion of mild cognitive impairment (MCI) to AD based on MRI images, there remains a critical gap in research regarding the group robustness of the diagnosis. Although numerous studies pointed out that deep learning-based classifiers may exhibit poor performance in certain groups by relying on unimportant attributes, this issue has been largely overlooked in the early diagnosis of MCI to AD conversion. In this paper, we present the first comprehensive investigation of the group robustness in the early diagnosis of MCI to AD conversion using MRI images, focusing on disparities in accuracy between groups, specifically sMCI and pMCI individuals divided by age. Our experiments reveal that standard classifiers consistently underperform for certain groups across different architectures, highlighting the need for more tailored approaches. To address this, we propose a novel method, dubbed DEAL (DEcoupled classifier with Adaptive Linear modulation), comprising two key components: (1) a linear modulation of features from the penultimate layer, incorporating easily obtainable age and cognitive indicative tabular features, and (2) a decoupled classifier that provides more tailored decision boundaries for each group, further improving performance. Through extensive experiments and evaluations across different architectures, we demonstrate the efficacy of DEAL in improving the group robustness of the MCI to AD conversion prediction.
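
The two components named in the abstract above can be illustrated with a minimal sketch. This is not the DEAL implementation: the function names, the affine form `gamma * feature + beta`, the weight matrices, and the group labels are all assumptions chosen for illustration, following the general idea of feature-wise linear modulation followed by a per-group linear head.

```python
def linear_modulation(features, tabular, w_gamma, w_beta):
    """Scale and shift each feature with coefficients computed linearly
    from tabular inputs (hypothetical parameterization)."""
    modulated = []
    for f, wg, wb in zip(features, w_gamma, w_beta):
        gamma = 1.0 + sum(w * t for w, t in zip(wg, tabular))  # scale, identity when weights are 0
        beta = sum(w * t for w, t in zip(wb, tabular))         # shift
        modulated.append(gamma * f + beta)
    return modulated


def decoupled_predict(features, group, heads):
    """Score the features with the linear head assigned to the sample's group."""
    weights, bias = heads[group]
    return sum(w * f for w, f in zip(weights, features)) + bias


# Toy values, all hypothetical.
penultimate = [2.0, -1.0]            # features from the penultimate layer
tabular = [0.5, 1.0]                 # e.g. normalized age and a cognitive score
w_gamma = [[0.2, 0.0], [0.0, 0.1]]
w_beta = [[0.0, 0.0], [1.0, 0.0]]
heads = {"younger": ([1.0, 1.0], 0.0), "older": ([1.0, -1.0], 0.5)}

modulated = linear_modulation(penultimate, tabular, w_gamma, w_beta)
score = decoupled_predict(modulated, "older", heads)
```

In this sketch each group gets its own decision boundary because `heads` holds a separate weight vector and bias per group, while the modulation step is shared across groups.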

Title: Conformation Generation using Transformer Flows

Authors: Sohil Atul Shah, Vladlen Koltun
Subjects: cs.LG, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2411.10817
Pdf URL: https://arxiv.org/pdf/2411.10817
Copy Paste: [[2411.10817]] Conformation Generation using Transformer Flows(https://arxiv.org/abs/2411.10817)
Keywords: transformer, generative
Abstract: Estimating three-dimensional conformations of a molecular graph allows insight into the molecule's biological and chemical functions. Fast generation of valid conformations is thus central to molecular modeling. Recent advances in graph-based deep networks have accelerated conformation generation from hours to seconds. However, current network architectures do not scale well to large molecules. Here we present ConfFlow, a flow-based model for conformation generation based on transformer networks. In contrast with existing approaches, ConfFlow directly samples in the coordinate space without enforcing any explicit physical constraints. The generative procedure is highly interpretable and is akin to force field updates in molecular dynamics simulation. When applied to the generation of large molecule conformations, ConfFlow improves accuracy by up to 40% relative to state-of-the-art learning-based methods. The source code is made available at this https URL.

Title: An Oversampling-enhanced Multi-class Imbalanced Classification Framework for Patient Health Status Prediction Using Patient-reported Outcomes

Authors: Yang Yan, Zhong Chen, Cai Xu, Xinglei Shen, Jay Shiao, John Einck, Ronald C Chen, Hao Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.10819
Pdf URL: https://arxiv.org/pdf/2411.10819
Copy Paste: [[2411.10819]] An Oversampling-enhanced Multi-class Imbalanced Classification Framework for Patient Health Status Prediction Using Patient-reported Outcomes(https://arxiv.org/abs/2411.10819)
Keywords: robust
Abstract: Patient-reported outcomes (PROs) directly collected from cancer patients being treated with radiation therapy play a vital role in assisting clinicians in counseling patients regarding likely toxicities. Precise prediction and evaluation of symptoms or health status associated with PROs are fundamental to enhancing decision-making and planning for the required services and support as patients transition into survivorship. However, the raw PRO data collected from hospitals exhibits some intrinsic challenges such as incomplete item reports and imbalanced patient toxicities. To this end, in this study, we explore various machine learning techniques to predict patient outcomes related to health status such as pain levels and sleep discomfort using PRO datasets from a cancer photon/proton therapy center. Specifically, we deploy six advanced machine learning classifiers -- Random Forest (RF), XGBoost, Gradient Boosting (GB), Support Vector Machine (SVM), Multi-Layer Perceptron with Bagging (MLP-Bagging), and Logistic Regression (LR) -- to tackle a multi-class imbalance classification problem across three prevalent cancer types: head and neck, prostate, and breast cancers. To address the class imbalance issue, we employ an oversampling strategy, adjusting the training set sample sizes through interpolations of in-class neighboring samples, thereby augmenting minority classes without deviating from the original skewed class distribution. Our experimental findings across multiple PRO datasets indicate that the RF and XGB methods achieve robust generalization performance, evidenced by weighted AUC and detailed confusion matrices, in categorizing outcomes as mild, intermediate, and severe post-radiation therapy. These results underscore the models' effectiveness and potential utility in clinical settings.
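
Oversampling "through interpolations of in-class neighboring samples" can be sketched roughly as follows. This is a SMOTE-style illustration, not the authors' code: the function name, the neighbour count `k`, the fixed seed, and the toy data are assumptions.

```python
import random


def interpolate_oversample(samples, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    sample and one of its k nearest same-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other samples of this class by squared Euclidean distance.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature vectors.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = interpolate_oversample(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real samples from the same class, it stays inside the region already covered by that class. The function should be applied to the training split only, so that no synthetic point leaks into evaluation data.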

Title: A Data-Efficient Sequential Learning Framework for Melt Pool Defect Classification in Laser Powder Bed Fusion

Authors: Ahmed Shoyeb Raihan, Austin Harper, Israt Zarin Era, Omar Al-Shebeeb, Thorsten Wuest, Srinjoy Das, Imtiaz Ahmed
Subjects: cs.LG, cond-mat.mtrl-sci, cs.CE
Abstract URL: https://arxiv.org/abs/2411.10822
Pdf URL: https://arxiv.org/pdf/2411.10822
Copy Paste: [[2411.10822]] A Data-Efficient Sequential Learning Framework for Melt Pool Defect Classification in Laser Powder Bed Fusion(https://arxiv.org/abs/2411.10822)
Keywords: robust
Abstract: Ensuring the quality and reliability of Metal Additive Manufacturing (MAM) components is crucial, especially in the Laser Powder Bed Fusion (L-PBF) process, where melt pool defects such as keyhole, balling, and lack of fusion can significantly compromise structural integrity. This study presents SL-RF+ (Sequentially Learned Random Forest with Enhanced Sampling), a novel Sequential Learning (SL) framework for melt pool defect classification designed to maximize data efficiency and model accuracy in data-scarce environments. SL-RF+ utilizes an RF classifier combined with Least Confidence Sampling (LCS) and Sobol sequence-based synthetic sampling to iteratively select the most informative samples to learn from, thereby refining the model's decision boundaries with minimal labeled data. Results show that SL-RF+ outperformed traditional machine learning models across key performance metrics, including accuracy, precision, recall, and F1 score, demonstrating significant robustness in identifying melt pool defects with limited data. This framework efficiently captures complex defect patterns by focusing on high-uncertainty regions in the process parameter space, ultimately achieving superior classification performance without the need for extensive labeled datasets. While this study utilizes pre-existing experimental data, SL-RF+ shows strong potential for real-world applications in pure sequential learning settings, where data is acquired and labeled incrementally, mitigating the high costs and time constraints of sample acquisition.
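
Least Confidence Sampling, as named in the abstract above, picks the unlabeled samples whose most likely class has the lowest predicted probability. The sketch below shows only that selection step, under assumptions: the function name and the toy probabilities are hypothetical, and the classifier and the Sobol-based sampling are omitted.

```python
def least_confidence_select(probabilities, n_select):
    """Return indices of the n_select samples whose top predicted class
    probability is lowest, i.e. where the classifier is least confident."""
    # Uncertainty score: 1 - max class probability.
    scores = [(1.0 - max(p), i) for i, p in enumerate(probabilities)]
    # Most uncertain first; ties broken by original index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scores[:n_select]]


# Hypothetical class probabilities for four unlabeled candidates.
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.34, 0.33, 0.33],
]
picked = least_confidence_select(pool_probs, n_select=2)  # [3, 1]
```

In a sequential learning loop, the picked samples would be labeled, added to the training set, and the classifier refit before the next selection round.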

Title: ARM: Appearance Reconstruction Model for Relightable 3D Generation

Authors: Xiang Feng, Chang Yu, Zoubin Bi, Yintong Shang, Feng Gao, Hongzhi Wu, Kun Zhou, Chenfanfu Jiang, Yin Yang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.10825
Pdf URL: https://arxiv.org/pdf/2411.10825
Copy Paste: [[2411.10825]] ARM: Appearance Reconstruction Model for Relightable 3D Generation(https://arxiv.org/abs/2411.10825)
Keywords: robust
Abstract: Recent image-to-3D reconstruction models have greatly advanced geometry generation, but they still struggle to faithfully generate realistic appearance. To address this, we introduce ARM, a novel method that reconstructs high-quality 3D meshes and realistic appearance from sparse-view images. The core of ARM lies in decoupling geometry from appearance, processing appearance within the UV texture space. Unlike previous methods, ARM improves texture quality by explicitly back-projecting measurements onto the texture map and processing them in a UV space module with a global receptive field. To resolve ambiguities between material and illumination in input images, ARM introduces a material prior that encodes semantic appearance information, enhancing the robustness of appearance decomposition. Trained on just 8 H100 GPUs, ARM outperforms existing methods both quantitatively and qualitatively.

Title: One-Layer Transformer Provably Learns One-Nearest Neighbor In Context

Authors: Zihao Li, Yuan Cao, Cheng Gao, Yihan He, Han Liu, Jason M. Klusowski, Jianqing Fan, Mengdi Wang
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2411.10830
Pdf URL: https://arxiv.org/pdf/2411.10830
Copy Paste: [[2411.10830]] One-Layer Transformer Provably Learns One-Nearest Neighbor In Context(https://arxiv.org/abs/2411.10830)
Keywords: transformer
Abstract: Transformers have achieved great success in recent years. Interestingly, transformers have shown particularly strong in-context learning capability -- even without fine-tuning, they are still able to solve unseen tasks well purely based on task-specific prompts. In this paper, we study the capability of one-layer transformers in learning one of the most classical nonparametric estimators, the one-nearest neighbor prediction rule. Under a theoretical framework where the prompt contains a sequence of labeled training data and unlabeled test data, we show that, although the loss function is nonconvex when trained with gradient descent, a single softmax attention layer can successfully learn to behave like a one-nearest neighbor classifier. Our result gives a concrete example of how transformers can be trained to implement nonparametric machine learning algorithms, and sheds light on the role of softmax attention in transformer models.
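
The link between softmax attention and the one-nearest-neighbor rule can be illustrated numerically. The sketch below is not the paper's construction: it assumes attention logits equal to a negative scaled squared distance, with a hypothetical scale `beta`, and shows that the prediction approaches the nearest neighbor's label as `beta` grows.

```python
import math


def softmax_attention_predict(train_x, train_y, query, beta):
    """Weighted label average with weights softmax(-beta * squared distance)."""
    logits = [-beta * sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Toy prompt: three labeled points and one query (all values hypothetical).
train_x = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
train_y = [1.0, -1.0, 1.0]
query = [0.9, 0.8]  # nearest training point is [1.0, 1.0], label -1.0

soft = softmax_attention_predict(train_x, train_y, query, beta=0.1)
sharp = softmax_attention_predict(train_x, train_y, query, beta=100.0)
```

With a small `beta` the output averages over all labels, while with a large `beta` nearly all attention mass falls on the closest training point and the output is close to its label.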

Title: Automatic Discovery and Assessment of Interpretable Systematic Errors in Semantic Segmentation

Authors: Jaisidh Singh, Sonam Singh, Amit Arvind Kale, Harsh K Gandhi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10845
Pdf URL: https://arxiv.org/pdf/2411.10845
Copy Paste: [[2411.10845]] Automatic Discovery and Assessment of Interpretable Systematic Errors in Semantic Segmentation(https://arxiv.org/abs/2411.10845)
Keywords: segmentation
Abstract: This paper presents a novel method for discovering systematic errors in segmentation models. For instance, a systematic error can be a sufficiently large number of cases in which the model misclassifies the target class of pedestrians as a parking meter. With the rapid deployment of these models in critical applications such as autonomous driving, it is vital to detect and interpret these systematic errors. However, the key challenge is automatically discovering such failures on unlabelled data and forming interpretable semantic sub-groups for intervention. For this, we leverage multimodal foundation models to retrieve errors and use conceptual linkage along with erroneous nature to study the systematic nature of these errors. We demonstrate that such errors are present in SOTA segmentation models (UperNet ConvNeXt and UperNet Swin) trained on the Berkeley Deep Drive dataset, and we benchmark the approach qualitatively and quantitatively, showing its effectiveness by discovering coherent systematic errors for these models. Our work opens up the avenue to model analysis and intervention that have so far been underexplored in semantic segmentation.

Title: NeuroNURBS: Learning Efficient Surface Representations for 3D Solids

Authors: Jiajie Fan, Babak Gholami, Thomas Bäck, Hao Wang
Subjects: cs.CV, cs.CE
Abstract URL: https://arxiv.org/abs/2411.10848
Pdf URL: https://arxiv.org/pdf/2411.10848
Copy Paste: [[2411.10848]] NeuroNURBS: Learning Efficient Surface Representations for 3D Solids(https://arxiv.org/abs/2411.10848)
Keywords: segmentation
Abstract: Boundary Representation (B-Rep) is the de facto representation of 3D solids in Computer-Aided Design (CAD). B-Rep solids are defined with a set of NURBS (Non-Uniform Rational B-Splines) surfaces forming a closed volume. To represent a surface, current works often employ the UV-grid approximation, i.e., sample points uniformly on the surface. However, the UV-grid method is not efficient in surface representation and sometimes lacks precision and regularity. In this work, we propose NeuroNURBS, a representation learning method to directly encode the parameters of NURBS surfaces. Our evaluation in solid generation and segmentation tasks indicates that NeuroNURBS performs comparably to, and in some cases better than, UV-grids, but with significantly improved efficiency: for training the surface autoencoder, GPU consumption is reduced by 86.7%; memory requirement drops by 79.9% for storing 3D solids. Moreover, adapting BrepGen for solid generation with our NeuroNURBS improves the FID from 30.04 to 27.24, and resolves the undulating issue in generated surfaces.

Title: On the Verification of Control Flow Attestation Evidence

Authors: Adam Caulfield, Norrathep Rattanavipanon, Ivan De Oliveira Nunes
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.10855
Pdf URL: https://arxiv.org/pdf/2411.10855
Copy Paste: [[2411.10855]] On the Verification of Control Flow Attestation Evidence(https://arxiv.org/abs/2411.10855)
Keywords: secure, security, attack
Abstract: Remote run-time attestation methods, including Control Flow Attestation (CFA) and Data Flow Attestation (DFA), have been proposed to generate precise evidence of execution's control flow path (in CFA) and optionally execution data inputs (in DFA) on a remote and potentially compromised embedded device, hereby referred to as a Prover (Prv). Recent advances in run-time attestation architectures are also able to guarantee that a remote Verifier (Vrf) reliably receives this evidence from Prv, even when Prv's software state is fully compromised. This, in theory, enables secure "run-time auditing" in addition to best-effort attestation, i.e., it guarantees that Vrf can examine execution evidence to identify previously unknown compromises as soon as they are exploited, pinpoint their root cause(s), and remediate them. However, prior work has for the most part focused on securely implementing Prv's root of trust (responsible for generating authentic run-time evidence), leaving Vrf's perspective in this security service unexplored. In this work, we argue that run-time attestation and auditing are only truly useful if Vrf can effectively analyze received evidence. From this premise, we characterize different types of evidence produced by existing run-time attestation/auditing architectures in terms of Vrf's ability to detect and remediate (previously unknown) vulnerabilities. As a case study for practical uses of run-time evidence by Vrf, we propose SABRE: a Security Analysis and Binary Repair Engine. SABRE showcases how Vrf can systematically leverage run-time evidence to detect control flow attacks, pinpoint corrupted control data and specific instructions used to corrupt them, and leverage this evidence to automatically generate binary patches to buffer overflow and use-after-free vulnerabilities without source code knowledge.

Title: Large Vision-Language Models for Remote Sensing Visual Question Answering

Authors: Surasakdi Siripong, Apirak Chaiyapan, Thanakorn Phonchai
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.10857
Pdf URL: https://arxiv.org/pdf/2411.10857
Copy Paste: [[2411.10857]] Large Vision-Language Models for Remote Sensing Visual Question Answering(https://arxiv.org/abs/2411.10857)
Keywords: generative
Abstract: Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions. Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions. In this paper, we propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process. Our approach consists of a two-step training strategy: domain-adaptive pretraining and prompt-based finetuning. This method enables the LVLM to generate natural language answers by conditioning on both visual and textual inputs, without the need for predefined answer categories. We evaluate our model on the RSVQAxBEN dataset, demonstrating superior performance compared to state-of-the-art baselines. Additionally, a human evaluation study shows that our method produces answers that are more accurate, relevant, and fluent. The results highlight the potential of generative LVLMs in advancing the field of remote sensing analysis.

Title: See-Saw Generative Mechanism for Scalable Recursive Code Generation with Generative AI

Authors: Ruslan Idelfonso Magaña Vsevolodovna
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2411.10861
Pdf URL: https://arxiv.org/pdf/2411.10861
Copy Paste: [[2411.10861]] See-Saw Generative Mechanism for Scalable Recursive Code Generation with Generative AI(https://arxiv.org/abs/2411.10861)
Keywords: generative
Abstract: The generation of complex, large-scale code projects using generative AI models presents challenges due to token limitations, dependency management, and iterative refinement requirements. This paper introduces the See-Saw generative mechanism, a novel methodology for dynamic and recursive code generation. The proposed approach alternates between main code updates and dependency generation to ensure alignment and functionality. By dynamically optimizing token usage and incorporating key elements of the main code into the generation of dependencies, the method enables efficient and scalable code generation for projects requiring hundreds of interdependent files. The mechanism ensures that all code components are synchronized and functional, enabling scalable and efficient project generation. Experimental validation demonstrates the method's capability to manage dependencies effectively while maintaining coherence and minimizing computational overhead.

Title: Improvement in Facial Emotion Recognition using Synthetic Data Generated by Diffusion Model

Authors: Arnab Kumar Roy, Hemant Kumar Kathania, Adhitiya Sharma
Subjects: cs.CV, cs.HC, eess.IV
Abstract URL: https://arxiv.org/abs/2411.10863
Pdf URL: https://arxiv.org/pdf/2411.10863
Copy Paste: [[2411.10863]] Improvement in Facial Emotion Recognition using Synthetic Data Generated by Diffusion Model(https://arxiv.org/abs/2411.10863)
Keywords: diffusion, generative
Abstract: Facial Emotion Recognition (FER) plays a crucial role in computer vision, with significant applications in human-computer interaction, affective computing, and areas such as mental health monitoring and personalized learning environments. However, a major challenge in the FER task is the class imbalance commonly found in available datasets, which can hinder both model performance and generalization. In this paper, we tackle the issue of data imbalance by incorporating synthetic data augmentation and leveraging the ResEmoteNet model to enhance the overall performance on the facial emotion recognition task. We employed Stable Diffusion 2 and Stable Diffusion 3 Medium models to generate synthetic facial emotion data, augmenting the training sets of the FER2013 and RAF-DB benchmark datasets. Training ResEmoteNet with these augmented datasets resulted in substantial performance improvements, achieving accuracies of 96.47% on FER2013 and 99.23% on RAF-DB. These findings show an absolute improvement of 16.68% on FER2013 and 4.47% on RAF-DB, highlight the efficacy of synthetic data augmentation in strengthening FER models, and underscore the potential of advanced generative models in FER research and applications. The source code for ResEmoteNet is available at this https URL

Title: ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

Authors: Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10867
Pdf URL: https://arxiv.org/pdf/2411.10867
Copy Paste: [[2411.10867]] ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models(https://arxiv.org/abs/2411.10867)
Keywords: robust
Abstract: Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.

Title: Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm

Authors: Sari Masri, Huthaifa I. Ashqar, Mohammed Elhenawy
Subjects: cs.CL, cs.CE, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2411.10869
Pdf URL: https://arxiv.org/pdf/2411.10869
Copy Paste: [[2411.10869]] Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm(https://arxiv.org/abs/2411.10869)
Keywords: large language model
Abstract: This study introduces a novel approach for traffic control systems by using Large Language Models (LLMs) as traffic controllers. The study utilizes their logical reasoning, scene understanding, and decision-making capabilities to optimize throughput and provide feedback based on traffic conditions in real-time. LLMs centralize traditionally disconnected traffic control processes and can integrate traffic data from diverse sources to provide context-aware decisions. LLMs can also deliver tailored outputs using various means such as wireless signals and visuals to drivers, infrastructures, and autonomous vehicles. To evaluate LLMs' ability as traffic controllers, this study proposed a four-stage methodology. The methodology includes data creation and environment initialization, prompt engineering, conflict identification, and fine-tuning. We simulated multi-lane four-leg intersection scenarios and generated detailed datasets to enable conflict detection using LLMs and Python simulation as a ground truth. We used chain-of-thought prompts to lead LLMs in understanding the context, detecting conflicts, resolving them using traffic rules, and delivering context-sensitive traffic management solutions. We evaluated the performance of GPT-mini, Gemini, and Llama as traffic controllers. Results showed that the fine-tuned GPT-mini achieved 83% accuracy and an F1-score of 0.84. The GPT-mini model exhibited promising performance in generating actionable traffic management insights, with high ROUGE-L scores across conflict identification of 0.95, decision-making of 0.91, priority assignment of 0.94, and waiting time optimization of 0.92. We demonstrated that LLMs can offer precise recommendations to drivers in real-time including yielding, slowing, or stopping based on vehicle dynamics.

Title: Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis

Authors: Jawad Ibn Ahad, Rafeed Mohammad Sultan, Abraham Kaikobad, Fuad Rahman, Mohammad Ruhul Amin, Nabeel Mohammed, Shafin Rahman
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2411.10878
Pdf URL: https://arxiv.org/pdf/2411.10878
Copy Paste: [[2411.10878]] Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis(https://arxiv.org/abs/2411.10878)
Keywords: robust, extraction, large language model
Abstract: This study investigates the automation of meta-analysis in scientific documents using large language models (LLMs). Meta-analysis is a robust statistical method that synthesizes the findings of multiple studies (support articles) to provide a comprehensive understanding. We know that a meta-article provides a structured analysis of several articles. However, conducting meta-analysis by hand is labor-intensive, time-consuming, and susceptible to human error, highlighting the need for automated pipelines to streamline the process. Our research introduces a novel approach that fine-tunes the LLM on extensive scientific datasets to address challenges in big data handling and structured data extraction. We automate and optimize the meta-analysis process by integrating Retrieval Augmented Generation (RAG). Tailored through prompt engineering and a new loss metric, Inverse Cosine Distance (ICD), designed for fine-tuning on large contextual datasets, LLMs efficiently generate structured meta-analysis content. Human evaluation then assesses relevance and provides information on model performance in key metrics. This research demonstrates that fine-tuned models outperform non-fine-tuned models, with fine-tuned LLMs generating 87.6% relevant meta-analysis abstracts. The relevance of the context, based on human evaluation, shows a reduction in irrelevancy from 4.56% to 1.9%. These experiments were conducted in a low-resource environment, highlighting the study's contribution to enhancing the efficiency and reliability of meta-analysis automation.

Title: BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization

Authors: Md. Nazmus Sadat Samin, Jawad Ibn Ahad, Tanjila Ahmed Medha, Fuad Rahman, Mohammad Ruhul Amin, Nabeel Mohammed, Shafin Rahman
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.10879
Pdf URL: https://arxiv.org/pdf/2411.10879
Copy Paste: [[2411.10879]] BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization(https://arxiv.org/abs/2411.10879)
Keywords: large language model
Abstract: This study focuses on recognizing Bangladeshi dialects and converting diverse Bengali accents into standardized formal Bengali speech. Dialects, often referred to as regional languages, are distinctive variations of a language spoken in a particular location and are identified by their phonetics, pronunciations, and lexicon. Subtle changes in pronunciation and intonation are also influenced by geographic location, educational attainment, and socioeconomic status. Dialect standardization is needed to ensure effective communication, educational consistency, access to technology, economic opportunities, and the preservation of linguistic resources while respecting cultural diversity. Because Bangla is the fifth most spoken language, with around 55 distinct dialects spoken by 160 million people, addressing Bangla dialects is crucial for developing inclusive communication tools. However, limited research exists due to a lack of comprehensive datasets and the challenges of handling diverse dialects. With the advancement in multilingual Large Language Models (mLLMs), emerging possibilities have been created to address the challenges of dialectal Automated Speech Recognition (ASR) and Machine Translation (MT). This study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech. This investigation includes constructing a large-scale diverse dataset with dialectal speech signals that tailored the fine-tuning process in ASR and LLM for transcribing the dialect speech to dialect text and translating the dialect text to standard Bangla text. Our experiments demonstrated that fine-tuning the Whisper ASR model achieved a CER of 0.8% and WER of 1.5%, while the BanglaT5 model attained a BLEU score of 41.6% for dialect-to-standard text translation.
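
The CER and WER figures reported above are, in their usual definition, edit distances normalized by reference length, counted over characters and words respectively. The sketch below shows that standard computation; it assumes plain Levenshtein distance with unit costs and whitespace tokenization, and the example strings are hypothetical rather than taken from the paper's data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, with unit cost for
    substitution, insertion, and deletion."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


ref = "the cat sat on the mat"
hyp = "the cat sat on a mat"
word_error = wer(ref, hyp)  # one substituted word out of six
char_error = cer(ref, hyp)  # three character edits out of 22
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is why they are error rates rather than accuracies.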

Title: FIAS: Feature Imbalance-Aware Medical Image Segmentation with Dynamic Fusion and Mixing Attention

Authors: Xiwei Liu, Min Xu, Qirong Ho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10881
Pdf URL: https://arxiv.org/pdf/2411.10881
Copy Paste: [[2411.10881]] FIAS: Feature Imbalance-Aware Medical Image Segmentation with Dynamic Fusion and Mixing Attention(https://arxiv.org/abs/2411.10881)
Keywords: extraction, transformer, segmentation
Abstract: With the growing application of transformers in computer vision, hybrid architectures that combine convolutional neural networks (CNNs) and transformers demonstrate competitive ability in medical image segmentation. However, direct fusion of features from CNNs and transformers often leads to feature imbalance and redundant information. To address these issues, we propose a Feature Imbalance-Aware Segmentation (FIAS) network, which incorporates a dual-path encoder and a novel Mixing Attention (MixAtt) decoder. The dual-branch encoder integrates a DilateFormer for long-range global feature extraction and a Depthwise Multi-Kernel (DMK) convolution for capturing fine-grained local details. A Context-Aware Fusion (CAF) block dynamically balances the contribution of these global and local features, preventing feature imbalance. The MixAtt decoder further enhances segmentation accuracy by combining self-attention and Monte Carlo attention, enabling the model to capture both small details and large-scale dependencies. Experimental results on the Synapse multi-organ and ACDC datasets demonstrate the strong competitiveness of our approach in medical image segmentation tasks.

Title: I Know What You Sync: Covert and Side Channel Attacks on File Systems via syncfs

Authors: Cheng Gu, Yicheng Zhang, Nael Abu-Ghazaleh
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.10883
Pdf URL: https://arxiv.org/pdf/2411.10883
Copy Paste: [[2411.10883]] I Know What You Sync: Covert and Side Channel Attacks on File Systems via syncfs(https://arxiv.org/abs/2411.10883)
Keywords: protect, attack
Abstract: Operating Systems enforce logical isolation using abstractions such as processes, containers, and isolation technologies to protect a system from malicious or buggy code. In this paper, we show new types of side channels through the file system that break this logical isolation. The file system plays a critical role in the operating system, managing all I/O activities between the application layer and the physical storage device. We observe that the file system implementation is shared, leading to timing leakage when using common I/O system calls. Specifically, we found that modern operating systems take advantage of any flush operation (which saves cached blocks in memory to the SSD or disk) to flush all of the I/O buffers, even those used by other isolation domains. Thus, by measuring the delay of syncfs, the attacker can infer the I/O behavior of victim programs. We then demonstrate a syncfs covert channel attack on multiple file systems, including both Linux native file systems and the Windows file system, achieving a maximum bandwidth of 5 Kbps with an error rate of 0.15% on Linux and 7.6 Kbps with an error rate of 1.9% on Windows. In addition, we construct three side-channel attacks targeting both Linux and Android devices. On Linux devices, we implement a website fingerprinting attack and a video fingerprinting attack by tracking the write patterns of temporary buffering files. On Android devices, we design an application fingerprinting attack that leaks application write patterns during boot-up. The attacks achieve over 90% F1 score, precision, and recall. Finally, we demonstrate that these attacks can be exploited across containers by implementing a container detection technique and a cross-container covert channel attack.

Title: MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation

Authors: Ansh Shah, K Madhava Krishna
Subjects: cs.CV, cs.AI, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2411.10886
Pdf URL: https://arxiv.org/pdf/2411.10886
Copy Paste: [[2411.10886]] MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation(https://arxiv.org/abs/2411.10886)
Keywords: robust, diffusion, generative
Abstract: Recovering metric depth from a single image remains a fundamental challenge in computer vision, requiring both scene understanding and accurate scaling. While deep learning has advanced monocular depth estimation, current models often struggle with unfamiliar scenes and layouts, particularly in zero-shot scenarios and when predicting scale-ergodic metric depth. We present MetricGold, a novel approach that harnesses the rich priors of generative diffusion models to improve metric depth estimation. Building upon recent advances in MariGold, DDVM, and Depth Anything V2, our method combines latent diffusion, log-scaled metric depth representation, and synthetic data training, respectively. MetricGold achieves efficient training on a single RTX 3090 within two days using photo-realistic synthetic data from HyperSIM, VirtualKitti, and TartanAir. Our experiments demonstrate robust generalization across diverse datasets, producing sharper and higher quality metric depth estimates compared to existing approaches.
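
Illustrative note: a log-scaled metric depth representation can be pictured as mapping depth in metres onto a bounded code and back. The sketch below is one plausible form, not the authors' implementation. The depth range (`D_MIN`, `D_MAX`), the target interval [-1, 1], and the function names are all assumptions.

```python
import math

# Assumed working range in metres; the abstract does not specify one.
D_MIN, D_MAX = 0.1, 80.0


def encode_depth(depth_m: float, d_min: float = D_MIN, d_max: float = D_MAX) -> float:
    """Map metric depth to [-1, 1] on a logarithmic scale (clamped to the range)."""
    d = min(max(depth_m, d_min), d_max)
    t = (math.log(d) - math.log(d_min)) / (math.log(d_max) - math.log(d_min))
    return 2.0 * t - 1.0


def decode_depth(code: float, d_min: float = D_MIN, d_max: float = D_MAX) -> float:
    """Invert encode_depth: map a code in [-1, 1] back to metres."""
    t = (code + 1.0) / 2.0
    return math.exp(math.log(d_min) + t * (math.log(d_max) - math.log(d_min)))
```

A log scale spends more of the code range on nearby depths, where a fixed absolute error matters more in relative terms.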

Title: Practitioner Paper: Decoding Intellectual Property: Acoustic and Magnetic Side-channel Attack on a 3D Printer

Authors: Amirhossein Jamarani, Yazhou Tu, Xiali Hei
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.10887
Pdf URL: https://arxiv.org/pdf/2411.10887
Copy Paste: [[2411.10887]] Practitioner Paper: Decoding Intellectual Property: Acoustic and Magnetic Side-channel Attack on a 3D Printer(https://arxiv.org/abs/2411.10887)
Keywords: protect, attack
Abstract: The widespread accessibility and ease of use of additive manufacturing (AM), widely recognized as 3D printing, has put Intellectual Property (IP) at great risk of theft. As 3D printers emit acoustic and magnetic signals while printing, the signals can be captured and analyzed using a smartphone for the purpose of IP attack. This is an instance of physical-to-cyber exploitation, as there is no direct contact with the 3D printer. Although cyber vulnerabilities in 3D printers are becoming more apparent, the methods for protecting IPs are yet to be fully investigated. The threat scenarios in previous works have mainly rested on advanced recording devices for data collection and entailed placing the device very close to the 3D printer. However, our work demonstrates the feasibility of reconstructing G-codes by performing side-channel attacks on a 3D printer using a smartphone from greater distances. By training models using Gradient Boosted Decision Trees, our prediction results for each axial movement, stepper, nozzle, and rotor speed achieve high accuracy, with a mean of 98.80%, without any intrusiveness. We effectively deploy the model in a real-world examination, achieving a Mean Tendency Error (MTE) of 4.47% on a plain G-code design.

Title: Watermarking Generative Categorical Data

Authors: Bochao Gu, Hengzhi He, Guang Cheng
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10898
Pdf URL: https://arxiv.org/pdf/2411.10898
Copy Paste: [[2411.10898]] Watermarking Generative Categorical Data(https://arxiv.org/abs/2411.10898)
Keywords: watermark, generative
Abstract: In this paper, we propose a novel statistical framework for watermarking generative categorical data. Our method systematically embeds pre-agreed secret signals by splitting the data distribution into two components and modifying one distribution based on a deterministic relationship with the other, ensuring the watermark is embedded at the distribution-level. To verify the watermark, we introduce an insertion inverse algorithm and detect its presence by measuring the total variation distance between the inverse-decoded data and the original distribution. Unlike previous categorical watermarking methods, which primarily focus on embedding watermarks into a given dataset, our approach operates at the distribution-level, allowing for verification from a statistical distributional perspective. This makes it particularly well-suited for the modern paradigm of synthetic data generation, where the underlying data distribution, rather than specific data points, is of primary importance. The effectiveness of our method is demonstrated through both theoretical analysis and empirical validation.
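
Illustrative note: the verification step above relies on the total variation distance between two categorical distributions. The sketch below shows only that distance, computed from samples. The insertion inverse algorithm and any decision threshold are not described in the abstract and are not reproduced here. Function names are hypothetical.

```python
from collections import Counter


def empirical_distribution(samples):
    """Relative frequency of each category in a list of samples."""
    counts = Counter(samples)
    n = len(samples)
    return {category: count / n for category, count in counts.items()}


def total_variation(p, q):
    """Total variation distance: half the L1 distance between two pmfs."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, so a small value after decoding would be consistent with the decoded data matching the reference distribution.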

Title: Attention-based U-Net Method for Autonomous Lane Detection

Authors: Mohammadhamed Tangestanizadeh, Mohammad Dehghani Tezerjani, Saba Yousefian Jazi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10902
Pdf URL: https://arxiv.org/pdf/2411.10902
Copy Paste: [[2411.10902]] Attention-based U-Net Method for Autonomous Lane Detection(https://arxiv.org/abs/2411.10902)
Keywords: segmentation
Abstract: Lane detection involves identifying lanes on the road and accurately determining their location and shape. This is a crucial technique for modern assisted and autonomous driving systems. However, several unique properties of lanes pose challenges for detection methods. The lack of distinctive features can cause lane detection algorithms to be confused by other objects with similar appearances. Additionally, the varying number of lanes and the diversity in lane line patterns, such as solid, broken, single, double, merging, and splitting lines, further complicate the task. To address these challenges, Deep Learning (DL) approaches can be employed in various ways. Merging DL models with an attention mechanism has recently surfaced as a new approach. In this context, two deep learning-based lane recognition methods are proposed in this study. The first method employs the Feature Pyramid Network (FPN) model, delivering an impressive 87.59% accuracy in detecting road lanes. The second method, which incorporates attention layers into the U-Net model, significantly boosts the performance of semantic segmentation tasks. The advanced model, achieving an extraordinary 98.98% accuracy and far surpassing the basic U-Net model, clearly showcases its superiority over existing methods in a comparative analysis. The groundbreaking findings of this research pave the way for the development of more effective and reliable road lane detection methods, significantly advancing the capabilities of modern assisted and autonomous driving systems.

Title: SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment

Authors: Quan Ze Chen, K.J. Kevin Feng, Chan Young Park, Amy X. Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10912
Pdf URL: https://arxiv.org/pdf/2411.10912
Copy Paste: [[2411.10912]] SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment(https://arxiv.org/abs/2411.10912)
Keywords: large language model
Abstract: Alignment of large language models (LLMs) to societal values should account for pluralistic values from diverse groups. One technique uses in-context learning for inference-time alignment, but only considers similarity when drawing few-shot examples, not accounting for cross-group differences in value prioritization. We propose SPICA, a framework for pluralistic alignment that accounts for group-level differences during in-context example retrieval. SPICA introduces three designs to facilitate pluralistic alignment: scenario banks, group-informed metrics, and in-context alignment prompts. From an evaluation of SPICA on an alignment task collecting inputs from four demographic groups ($n = 544$), our metrics retrieve in-context examples that more closely match observed preferences, with the best prompt configuration using multiple contrastive responses to demonstrate examples. In an end-to-end evaluation ($n = 80$), we observe that SPICA-aligned models are higher rated than a baseline similarity-only retrieval approach, with groups seeing up to a +0.16 point improvement on a 5 point scale. Additionally, gains from SPICA were more uniform, with all groups benefiting from alignment rather than only some. Finally, we find that while a group-agnostic approach can effectively align to aggregated values, it is not most suited for aligning to divergent groups.

Title: Generating Compositional Scenes via Text-to-image RGBA Instance Generation

Authors: Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10913
Pdf URL: https://arxiv.org/pdf/2411.10913
Copy Paste: [[2411.10913]] Generating Compositional Scenes via Text-to-image RGBA Instance Generation(https://arxiv.org/abs/2411.10913)
Keywords: diffusion, generative
Abstract: Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space, and scene manipulation abilities. In this work, we propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity. To ensure control over instance attributes, we devise a novel training paradigm to adapt a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components in realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach allows building and manipulating images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
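
Illustrative note: RGBA instances carry a transparency channel, which is what makes layered assembly possible. The sketch below is the classic alpha "over" operator on single pixels with straight (non-premultiplied) alpha. It is not the paper's multi-layer composite generation process, which is diffusion-based, and the function names are hypothetical.

```python
def over(top, bottom):
    """Composite one straight-alpha RGBA pixel over another; channels in [0, 1]."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)  # fully transparent result

    def blend(t, b):
        # Weight each colour by its effective coverage, then un-premultiply.
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def composite(layers):
    """Flatten pixels ordered back to front into a single RGBA pixel."""
    result = (0.0, 0.0, 0.0, 0.0)
    for layer in layers:
        result = over(layer, result)
    return result
```

Because each instance keeps its own alpha, layers can be reordered or moved and re-flattened without regenerating the instances themselves.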

Title: BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment

Authors: Sizhe Wang, Yongqi Tong, Hengyuan Zhang, Dawei Li, Xin Zhang, Tianlong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10914
Pdf URL: https://arxiv.org/pdf/2411.10914
Copy Paste: [[2411.10914]] BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment(https://arxiv.org/abs/2411.10914)
Keywords: large language model
Abstract: Reinforcement Learning with Human Feedback (RLHF) is the key to the success of large language models (LLMs) in recent years. In this work, we first introduce the concepts of knowledge breadth and knowledge depth, which measure the comprehensiveness and depth of an LLM or knowledge source respectively. We reveal that the imbalance in the number of prompts and responses can lead to a potential disparity in breadth and depth learning within alignment tuning datasets by showing that even a simple uniform method for balancing the number of instructions and responses can lead to significant improvements. Building on this, we further propose Balanced Preference Optimization (BPO), designed to dynamically augment the knowledge depth of each sample. BPO is motivated by the observation that the usefulness of knowledge varies across samples, necessitating tailored learning of knowledge depth. To achieve this, we introduce gradient-based clustering, estimating the knowledge informativeness and usefulness of each augmented sample based on the model's optimization direction. Our experimental results across various benchmarks demonstrate that BPO outperforms other baseline methods in alignment tuning while maintaining training efficiency. Furthermore, we conduct a detailed analysis of each component of BPO, providing guidelines for future research in preference data optimization.

Title: Bias in Large Language Models: Origin, Evaluation, and Mitigation

Authors: Yufei Guo, Muzhe Guo, Juntao Su, Zhou Yang, Mengqiu Zhu, Hongfei Li, Mengyang Qiu, Shuo Shuo Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.10915
Pdf URL: https://arxiv.org/pdf/2411.10915
Copy Paste: [[2411.10915]] Bias in Large Language Models: Origin, Evaluation, and Mitigation(https://arxiv.org/abs/2411.10915)
Keywords: robust, fair, large language model
Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their susceptibility to biases poses significant challenges. This comprehensive review examines the landscape of bias in LLMs, from its origins to current mitigation strategies. We categorize biases as intrinsic and extrinsic, analyzing their manifestations in various NLP tasks. The review critically assesses a range of bias evaluation methods, including data-level, model-level, and output-level approaches, providing researchers with a robust toolkit for bias detection. We further explore mitigation strategies, categorizing them into pre-model, intra-model, and post-model techniques, highlighting their effectiveness and limitations. Ethical and legal implications of biased LLMs are discussed, emphasizing potential harms in real-world applications such as healthcare and criminal justice. By synthesizing current knowledge on bias in LLMs, this review contributes to the ongoing effort to develop fair and responsible AI systems. Our work serves as a comprehensive resource for researchers and practitioners working towards understanding, evaluating, and mitigating bias in LLMs, fostering the development of more equitable AI technologies.

Title: LLM-assisted Physical Invariant Extraction for Cyber-Physical Systems Anomaly Detection

Authors: Danial Abshari, Chenglong Fu, Meera Sridhar
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10918
Pdf URL: https://arxiv.org/pdf/2411.10918
Copy Paste: [[2411.10918]] LLM-assisted Physical Invariant Extraction for Cyber-Physical Systems Anomaly Detection(https://arxiv.org/abs/2411.10918)
Keywords: security, attack, extraction, generative
Abstract: Modern industrial infrastructures rely heavily on Cyber-Physical Systems (CPS), but these are vulnerable to cyber-attacks with potentially catastrophic effects. To reduce these risks, anomaly detection methods based on physical invariants have been developed. However, these methods often require domain-specific expertise to manually define invariants, making them costly and difficult to scale. To address this limitation, we propose a novel approach to extract physical invariants from CPS testbeds for anomaly detection. Our insight is that CPS design documentation often contains semantically rich descriptions of physical procedures, which can profile inter-correlated dynamics among system components. Leveraging the built-in physics and engineering knowledge of recent generative AI models, we aim to automate this traditionally manual process, improving scalability and reducing costs. This work focuses on designing and optimizing a Retrieval-Augmented-Generation (RAG) workflow with a customized prompting system tailored for CPS documentation, enabling accurate extraction of semantic information and inference of physical invariants from complex, multimodal content. Then, rather than directly applying the inferred invariants for anomaly detection, we introduce an innovative statistics-based learning approach that integrates these invariants into the training dataset. This method addresses limitations such as hallucination and concept drift, enhancing the reliability of the model. We evaluate our approach on a real-world public CPS security dataset, which contains 86 data points and 58 attacking cases. The results show that our approach achieves a high precision of 0.923, accurately detecting anomalies while minimizing false alarms.

Title: Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

Authors: Wentao Bao, Kai Li, Yuxiao Chen, Deep Patel, Martin Renqiang Min, Yu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10922
Pdf URL: https://arxiv.org/pdf/2411.10922
Copy Paste: [[2411.10922]] Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection(https://arxiv.org/abs/2411.10922)
Keywords: transformer
Abstract: Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world where test videos inevitably go beyond the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem. It aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method, OpenMixer, that exploits the inherent semantics and localizability of large vision-language models (VLM) within the family of query-based detection transformers (DETR). Specifically, OpenMixer is built from spatial and temporal OpenMixer blocks (S-OMB and T-OMB) and a dynamically fused alignment (DFA) module. The three components collectively enjoy the merits of strong generalization from pre-trained VLMs and end-to-end learning from DETR design. Moreover, we established OVAD benchmarks under various settings, and the experimental results show that OpenMixer performs the best over baselines for detecting seen and unseen actions. We release the code, models, and dataset splits at this https URL.

Title: Hyperspectral Imaging-Based Grain Quality Assessment With Limited Labelled Data

Authors: Priyabrata Karmakar, Manzur Murshed, Shyh Wei Teng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10924
Pdf URL: https://arxiv.org/pdf/2411.10924
Copy Paste: [[2411.10924]] Hyperspectral Imaging-Based Grain Quality Assessment With Limited Labelled Data(https://arxiv.org/abs/2411.10924)
Keywords: robust
Abstract: Recently, hyperspectral imaging (HSI)-based grain quality assessment has gained research attention. However, unlike other imaging modalities, HSI data lacks sufficient labelled samples required to effectively train deep convolutional neural network (DCNN)-based classifiers. In this paper, we present a novel approach to grain quality assessment using HSI combined with few-shot learning (FSL) techniques. Traditional methods for grain quality evaluation, while reliable, are invasive, time-consuming, and costly. HSI offers a non-invasive, real-time alternative by capturing both spatial and spectral information. However, a significant challenge in applying DCNNs for HSI-based grain classification is the need for large labelled databases, which are often difficult to obtain. To address this, we explore the use of FSL, which enables models to perform well with limited labelled data, making it a practical solution for real-world applications where rapid deployment is required. We also explored the application of FSL for the classification of hyperspectral images of bulk grains to enable rapid quality assessment at various receival points in the grain supply chain. We evaluated the performance of few-shot classifiers in two scenarios: first, classification of grain types seen during training, and second, generalisation to unseen grain types, a crucial feature for real-world applications. In the first scenario, we introduce a novel approach using pre-computed collective class prototypes (CCPs) to enhance inference efficiency and robustness. In the second scenario, we assess the model's ability to classify novel grain types using limited support examples. Our experimental results show that, despite using very limited labelled data for training, the accuracy of our FSL classifiers is comparable to that of a fully trained classifier trained using a significantly larger labelled database.
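
Illustrative note: prototype-based few-shot classifiers typically average the support embeddings of each class once and then assign a query to the nearest prototype. The sketch below follows that general pattern. Whether the paper's collective class prototypes (CCPs) are computed as a plain mean is an assumption, and the function names and toy 2-D embeddings are hypothetical.

```python
import math


def class_prototypes(embeddings_by_class):
    """Pre-compute one prototype per class as the mean of its support embeddings."""
    prototypes = {}
    for label, vectors in embeddings_by_class.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return prototypes


def classify(query, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))
```

Pre-computing the prototypes means inference needs only one distance per class, with no pass over the support set.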

Title: Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning

Authors: Wenke Huang, Jian Liang, Zekun Shi, Didi Zhu, Guancheng Wan, He Li, Bo Du, Dacheng Tao, Mang Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10928
Pdf URL: https://arxiv.org/pdf/2411.10928
Copy Paste: [[2411.10928]] Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning(https://arxiv.org/abs/2411.10928)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong generalization capabilities across diverse distributions and tasks, largely due to extensive pre-training datasets. Fine-tuning MLLMs has become a common practice to improve performance on specific downstream tasks. However, during fine-tuning, MLLMs often face the risk of forgetting knowledge acquired during pre-training, which can result in a decline in generalization abilities. To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions, based on frozen pre-trained weight magnitude and accumulated fine-tuning gradient values. We further apply an importance-aware weight allocation strategy, selectively updating relatively important parameters for downstream tasks. We conduct empirical evaluations on both image captioning and visual question-answering tasks using various MLLM architectures. The comprehensive experimental analysis demonstrates the effectiveness of the proposed solution, highlighting the efficiency of the crucial modules in enhancing downstream specialization performance while mitigating generalization degradation in MLLM fine-tuning.
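
Illustrative note: the selective-update idea above can be pictured as comparing two per-parameter importance scores and updating only where the downstream score dominates. The sketch below is a toy reading of the abstract, not the authors' algorithm. The max-normalization, the strict "greater than" comparison rule, and all function names are assumptions.

```python
def normalize(values):
    """Scale non-negative scores into [0, 1] by dividing by the largest one."""
    peak = max(values)
    return [v / peak if peak > 0 else 0.0 for v in values]


def update_mask(pretrained_weights, accumulated_grads):
    """True where downstream importance exceeds pre-trained importance (assumed rule)."""
    pre_importance = normalize([abs(w) for w in pretrained_weights])
    down_importance = normalize([abs(g) for g in accumulated_grads])
    return [d > p for p, d in zip(pre_importance, down_importance)]


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Plain SGD step applied only to the selected parameters."""
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]
```

Parameters left out of the mask keep their pre-trained values, which is the mechanism by which such a scheme would limit forgetting.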

Title: Constrained Diffusion with Trust Sampling

Authors: William Huang, Yifeng Jiang, Tom Van Wouwe, C. Karen Liu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.10932
Pdf URL: https://arxiv.org/pdf/2411.10932
Copy Paste: [[2411.10932]] Constrained Diffusion with Trust Sampling(https://arxiv.org/abs/2411.10932)
Keywords: diffusion, generative
Abstract: Diffusion models have demonstrated significant promise in various generative tasks; however, they often struggle to satisfy challenging constraints. Our approach addresses this limitation by rethinking training-free loss-guided diffusion from an optimization perspective. We formulate a series of constrained optimizations throughout the inference process of a diffusion model. In each optimization, we allow the sample to take multiple steps along the gradient of the proxy constraint function until we can no longer trust the proxy, according to the variance at each diffusion level. Additionally, we estimate the state manifold of the diffusion model to allow for early termination when the sample starts to wander away from the state manifold at each diffusion step. Trust sampling effectively balances following the unconditional diffusion model against adhering to the loss guidance, enabling more flexible and accurate constrained generation. We demonstrate the efficacy of our method through extensive experiments on complex tasks, and in drastically different domains of images and 3D motion generation, showing significant improvements over existing methods in terms of generation quality. Our implementation is available at this https URL.

Title: Analyzing Pokémon and Mario Streamers' Twitch Chat with LLM-based User Embeddings

Authors: Mika Hämäläinen, Jack Rueter, Khalid Alnajjar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10934
Pdf URL: https://arxiv.org/pdf/2411.10934
Copy Paste: [[2411.10934]] Analyzing Pokémon and Mario Streamers' Twitch Chat with LLM-based User Embeddings(https://arxiv.org/abs/2411.10934)
Keywords: large language model
Abstract: We present a novel digital humanities method for representing our Twitch chatters as user embeddings created by a large language model (LLM). We cluster these embeddings automatically using affinity propagation and further narrow this clustering down through manual analysis. We analyze the chat of one stream by each Twitch streamer: SmallAnt, DougDoug and PointCrow. Our findings suggest that each streamer has their own type of chatters; however, two categories emerge for all of the streamers: supportive viewers, and emoji and reaction senders. Repetitive message spammers are a shared chatter category for two of the streamers.

Title: Iterative Camera-LiDAR Extrinsic Optimization via Surrogate Diffusion

Authors: Ni Ou, Zhuo Chen, Xinru Zhang, Junzheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10936
Pdf URL: https://arxiv.org/pdf/2411.10936
Copy Paste: [[2411.10936]] Iterative Camera-LiDAR Extrinsic Optimization via Surrogate Diffusion(https://arxiv.org/abs/2411.10936)
Keywords: extraction, diffusion
Abstract: Cameras and LiDAR are essential sensors for autonomous vehicles. Camera-LiDAR data fusion compensates for deficiencies of stand-alone sensors but relies on precise extrinsic calibration. Many learning-based calibration methods predict extrinsic parameters in a single step. Driven by the growing demand for higher accuracy, a few approaches utilize multi-range models or integrate multiple methods to improve extrinsic parameter predictions, but these strategies incur extended training times and require additional storage for separate models. To address these issues, we propose a single-model iterative approach based on surrogate diffusion to significantly enhance the capacity of individual calibration methods. By applying our proposed buffering technique, the inference time of our surrogate diffusion is 43.7% less than that of multi-range models. Additionally, we create a calibration network as our denoiser, featuring both projection-first and encoding-first branches for effective point feature extraction. Extensive experiments demonstrate that our diffusion model outperforms other single-model iterative methods and delivers competitive results compared to multi-range models. Our denoiser exceeds state-of-the-art calibration methods, reducing the rotation error by 24.5% compared to the second-best method. Furthermore, with the proposed diffusion applied, it achieves 20.4% less rotation error and 9.6% less translation error.

Title: Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry

Authors: Wenjun Hou, Yi Cheng, Kaishuai Xu, Yan Hu, Wenjie Li, Jiang Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.10937
Pdf URL: https://arxiv.org/pdf/2411.10937
Copy Paste: [[2411.10937]] Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry(https://arxiv.org/abs/2411.10937)
Keywords: robust
Abstract: Comprehensively understanding surgical scenes in Surgical Visual Question Answering (Surgical VQA) requires reasoning over multiple objects. Previous approaches address this task using cross-modal fusion strategies to enhance reasoning ability. However, these methods often struggle with limited scene understanding and question comprehension, and some rely on external resources (e.g., pre-extracted object features), which can introduce errors and generalize poorly across diverse surgical environments. To address these challenges, we propose SCAN, a simple yet effective memory-augmented framework that leverages Multimodal LLMs to improve surgical context comprehension via Self-Contained Inquiry. SCAN operates autonomously, generating two types of memory for context augmentation: Direct Memory (DM), which provides multiple candidates (or hints) to the final answer, and Indirect Memory (IM), which consists of self-contained question-hint pairs to capture broader scene context. DM directly assists in answering the question, while IM enhances understanding of the surgical scene beyond the immediate query. Reasoning over these object-aware memories enables the model to accurately interpret images and respond to questions. Extensive experiments on three publicly available Surgical VQA datasets demonstrate that SCAN achieves state-of-the-art performance, offering improved accuracy and robustness across various surgical scenarios.

Title: Anomaly Detection for People with Visual Impairments Using an Egocentric 360-Degree Camera

Authors: Inpyo Song, Sanghyeon Lee, Minjun Joo, Jangwon Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10945
Pdf URL: https://arxiv.org/pdf/2411.10945
Copy Paste: [[2411.10945]] Anomaly Detection for People with Visual Impairments Using an Egocentric 360-Degree Camera(https://arxiv.org/abs/2411.10945)
Keywords: security
Abstract: Recent advancements in computer vision have led to a renewed interest in developing assistive technologies for individuals with visual impairments. Although extensive research has been conducted in the field of computer vision-based assistive technologies, most of the focus has been on understanding contexts in images, rather than addressing their physical safety and security concerns. To address this challenge, we propose the first step towards detecting anomalous situations for visually impaired people by observing their entire surroundings using an egocentric 360-degree camera. We first introduce a novel egocentric 360-degree video dataset called VIEW360 (Visually Impaired Equipped with Wearable 360-degree camera), which contains abnormal activities that visually impaired individuals may encounter, such as shoulder surfing and pickpocketing. Furthermore, we propose a new architecture called the FDPN (Frame and Direction Prediction Network), which facilitates frame-level prediction of abnormal events and identification of their directions. Finally, we evaluate our approach on our VIEW360 dataset and the publicly available UCF-Crime and ShanghaiTech datasets, demonstrating state-of-the-art performance.

Title: Direct and Explicit 3D Generation from a Single Image

Authors: Haoyu Wu, Meher Gitika Karumuri, Chuhang Zou, Seungbae Bang, Yuelong Li, Dimitris Samaras, Sunil Hadap
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10947
Pdf URL: https://arxiv.org/pdf/2411.10947
Copy Paste: [[2411.10947]] Direct and Explicit 3D Generation from a Single Image(https://arxiv.org/abs/2411.10947)
Keywords: diffusion
Abstract: Current image-to-3D approaches suffer from high computational costs and lack scalability for high-resolution outputs. In contrast, we introduce a novel framework to directly generate explicit surface geometry and texture using multi-view 2D depth and RGB images along with 3D Gaussian features using a repurposed Stable Diffusion model. We introduce a depth branch into U-Net for efficient and high quality multi-view, cross-domain generation and incorporate epipolar attention into the latent-to-pixel decoder for pixel-level multi-view consistency. By back-projecting the generated depth pixels into 3D space, we create a structured 3D representation that can be either rendered via Gaussian splatting or extracted to high-quality meshes, thereby leveraging additional novel view synthesis loss to further improve our performance. Extensive experiments demonstrate that our method surpasses existing baselines in geometry and texture quality while achieving significantly faster generation time.
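
The back-projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard pinhole camera model in which each depth value is the z-distance along the optical axis; the function name `backproject` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the paper.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to 3D camera-space points.

    Assumes a pinhole camera: pixel (u, v) with depth d maps to
    X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue  # no valid depth at this pixel
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


# Tiny 2x2 depth map with one invalid pixel (illustrative values only).
depth = [[2.0, 2.0],
         [0.0, 4.0]]
pts = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Each resulting point could then serve as the center of a 3D Gaussian or as a vertex for meshing; how the paper does this is not specified in the abstract.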

Title: Towards Accurate and Efficient Sub-8-Bit Integer Training

Authors: Wenjin Guo, Donglai Liu, Weiying Xie, Yunsong Li, Xuefei Ning, Zihan Meng, Shulin Zeng, Jie Lei, Zhenman Fang, Yu Wang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.10948
Pdf URL: https://arxiv.org/pdf/2411.10948
Copy Paste: [[2411.10948]] Towards Accurate and Efficient Sub-8-Bit Integer Training(https://arxiv.org/abs/2411.10948)
Keywords: transformer
Abstract: Neural network training is a memory- and compute-intensive task. Quantization, which enables low-bitwidth formats in training, can significantly mitigate the workload. To reduce quantization error, recent methods have developed new data formats and additional pre-processing operations on quantizers. However, it remains quite challenging to achieve high accuracy and efficiency simultaneously. In this paper, we explore sub-8-bit integer training from its essence of gradient descent optimization. Our integer training framework includes two components: ShiftQuant to realize accurate gradient estimation, and L1 normalization to smoothen the loss landscape. ShiftQuant attains performance that approaches the theoretical upper bound of group quantization. Furthermore, it liberates group quantization from inefficient memory rearrangement. The L1 normalization facilitates the implementation of fully quantized normalization layers with impressive convergence accuracy. Our method frees sub-8-bit integer training from pre-processing and supports general devices. This framework achieves negligible accuracy loss across various neural networks and tasks ($0.92\%$ on 4-bit ResNets, $0.61\%$ on 6-bit Transformers). The prototypical implementation of ShiftQuant achieves more than $1.85\times/15.3\%$ performance improvement on CPU/GPU compared to its FP16 counterparts, and a $33.9\%$ reduction in resource consumption on FPGA relative to the FP16 counterparts. The proposed fully-quantized L1 normalization layers achieve more than $35.54\%$ improvement in throughput on CPU compared to traditional L2 normalization layers. Moreover, theoretical analysis verifies the advancement of our method.

Title: Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

Authors: Zeping Yu, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.10950
Pdf URL: https://arxiv.org/pdf/2411.10950
Copy Paste: [[2411.10950]] Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering(https://arxiv.org/abs/2411.10950)
Keywords: interpretability, large language model
Abstract: Understanding the mechanisms behind Large Language Models (LLMs) is crucial for designing improved models and strategies. While recent studies have yielded valuable insights into the mechanisms of textual LLMs, the mechanisms of Multi-modal Large Language Models (MLLMs) remain underexplored. In this paper, we apply mechanistic interpretability methods to analyze the visual question answering (VQA) mechanisms in the first MLLM, Llava. We compare the mechanisms between VQA and textual QA (TQA) in color answering tasks and find that: a) VQA exhibits a mechanism similar to the in-context learning mechanism observed in TQA; b) the visual features exhibit significant interpretability when projecting the visual embeddings into the embedding space; and c) Llava enhances the existing capabilities of the corresponding textual LLM Vicuna during visual instruction tuning. Based on these findings, we develop an interpretability tool to help users and researchers identify important visual locations for final predictions, aiding in the understanding of visual hallucination. Our method demonstrates faster and more effective results compared to existing interpretability approaches. Code: this https URL

Title: TSFormer: A Robust Framework for Efficient UHD Image Restoration

Authors: Xin Su, Chen Wu, Zhuoran Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10951
Pdf URL: https://arxiv.org/pdf/2411.10951
Copy Paste: [[2411.10951]] TSFormer: A Robust Framework for Efficient UHD Image Restoration(https://arxiv.org/abs/2411.10951)
Keywords: robust
Abstract: Ultra-high-definition (UHD) image restoration is vital for applications demanding exceptional visual fidelity, yet existing methods often face a trade-off between restoration quality and efficiency, limiting their practical deployment. In this paper, we propose TSFormer, an all-in-one framework that integrates Trusted learning with Sparsification to boost both generalization capability and computational efficiency in UHD image restoration. The key is that only a small amount of token movement is allowed within the model. To efficiently filter tokens, we use Min-$p$ with random matrix theory to quantify the uncertainty of tokens, thereby improving the robustness of the model. Our model can run a 4K image in real time (40fps) with 3.38 M parameters. Extensive experiments demonstrate that TSFormer achieves state-of-the-art restoration quality while enhancing generalization and reducing computational demands. In addition, our token filtering method can be applied to other image restoration models to effectively accelerate inference and maintain performance.

Title: V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception

Authors: Lei Yang, Xinyu Zhang, Jun Li, Chen Wang, Zhiying Song, Tong Zhao, Ziying Song, Li Wang, Mo Zhou, Yang Shen, Kai Wu, Chen Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10962
Pdf URL: https://arxiv.org/pdf/2411.10962
Copy Paste: [[2411.10962]] V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception(https://arxiv.org/abs/2411.10962)
Keywords: robust
Abstract: Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby improving the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged. However, these datasets only focus on camera and LiDAR, overlooking 4D Radar, a sensor employed in single-vehicle autonomous driving for robust perception in adverse weather conditions. In this paper, to bridge the gap of missing 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large real-world multi-modal dataset featuring 4D Radar. Our V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data includes sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as typical challenging scenarios. The dataset comprises 20K LiDAR frames, 40K camera images, and 20K 4D Radar data, with 350K annotated bounding boxes across five categories. To facilitate diverse research domains, we establish V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. We further provide comprehensive benchmarks of recent perception algorithms on the above three sub-datasets. The dataset and benchmark codebase will be available at this http URL.

Title: VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Authors: Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.10979
Pdf URL: https://arxiv.org/pdf/2411.10979
Copy Paste: [[2411.10979]] VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?(https://arxiv.org/abs/2411.10979)
Keywords: large language model
Abstract: The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. The leaderboard and evaluation code are available at this https URL.

Title: Towards a framework on tabular synthetic data generation: a minimalist approach: theory, use cases, and limitations

Authors: Agus Sudjianto, Yueyang Shen, Arun Prakash R, Anwesha Bhattacharyya, Maorong Rao, Yaqun Wang, Joel Vaughan, Nengfeng Zhou
Subjects: cs.LG, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2411.10982
Pdf URL: https://arxiv.org/pdf/2411.10982
Copy Paste: [[2411.10982]] Towards a framework on tabular synthetic data generation: a minimalist approach: theory, use cases, and limitations(https://arxiv.org/abs/2411.10982)
Keywords: robust, interpretability
Abstract: We propose and study a minimalist approach towards synthetic tabular data generation. The model consists of a minimalistic unsupervised SparsePCA encoder (with a contingent clustering step or log transformation to handle nonlinearity) and an XGBoost decoder, which is SOTA for structured data regression and classification tasks. We study and contrast the methodologies with (variational) autoencoders in several toy low dimensional scenarios to derive necessary intuitions. The framework is applied to high dimensional simulated credit scoring data which parallels real-life financial applications. We applied the method to robustness testing to demonstrate practical use cases. The case study result suggests that the method provides an alternative to raw and quantile perturbation for model robustness testing. We show that the method is simplistic, guarantees interpretability all the way through, does not require extra tuning, and provides unique benefits.

Title: Framework for developing and evaluating ethical collaboration between expert and machine

Authors: Ayan Banerjee, Payal Kamboj, Sandeep Gupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.10983
Pdf URL: https://arxiv.org/pdf/2411.10983
Copy Paste: [[2411.10983]] Framework for developing and evaluating ethical collaboration between expert and machine(https://arxiv.org/abs/2411.10983)
Keywords: explainability
Abstract: Precision medicine is a promising approach for accessible disease diagnosis and personalized intervention planning in high-mortality diseases such as coronary artery disease (CAD), drug-resistant epilepsy (DRE), and chronic illnesses like Type 1 diabetes (T1D). By leveraging artificial intelligence (AI), precision medicine tailors diagnosis and treatment solutions to individual patients by explicitly modeling variance in pathophysiology. However, the adoption of AI in medical applications faces significant challenges, including poor generalizability across centers, demographics, and comorbidities, limited explainability in clinical terms, and a lack of trust in ethical decision-making. This paper proposes a framework to develop and ethically evaluate expert-guided multi-modal AI, addressing these challenges in AI integration within precision medicine. We illustrate this framework with a case study on insulin management for T1D. To ensure ethical considerations and clinician engagement, we adopt a co-design approach where AI serves an assistive role, with final diagnoses or treatment plans emerging from collaboration between clinicians and AI.

Title: EROAM: Event-based Camera Rotational Odometry and Mapping in Real-time

Authors: Wanli Xing, Shijie Lin, Linhan Yang, Zeqing Zhang, Yanjun Du, Maolin Lei, Yipeng Pan, Jia Pan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2411.11004
Pdf URL: https://arxiv.org/pdf/2411.11004
Copy Paste: [[2411.11004]] EROAM: Event-based Camera Rotational Odometry and Mapping in Real-time(https://arxiv.org/abs/2411.11004)
Keywords: robust
Abstract: This paper presents EROAM, a novel event-based rotational odometry and mapping system that achieves real-time, accurate camera rotation estimation. Unlike existing approaches that rely on event generation models or contrast maximization, EROAM employs a spherical event representation by projecting events onto a unit sphere and introduces Event Spherical Iterative Closest Point (ES-ICP), a novel geometric optimization framework designed specifically for event camera data. The spherical representation simplifies rotational motion formulation while enabling continuous mapping for enhanced spatial resolution. Combined with parallel point-to-line optimization, EROAM achieves efficient computation without compromising accuracy. Extensive experiments on both synthetic and real-world datasets show that EROAM significantly outperforms state-of-the-art methods in terms of accuracy, robustness, and computational efficiency. Our method maintains consistent performance under challenging conditions, including high angular velocities and extended sequences, where other methods often fail or show significant drift. Additionally, EROAM produces high-quality panoramic reconstructions with preserved fine structural details.

Title: BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation

Authors: Haiyang Yu, Tian Xie, Jiaping Gui, Pengyang Wang, Ping Yi, Yue Wu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11006
Pdf URL: https://arxiv.org/pdf/2411.11006
Copy Paste: [[2411.11006]] BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation(https://arxiv.org/abs/2411.11006)
Keywords: defense
Abstract: We introduce BackdoorMBTI, the first backdoor learning toolkit and benchmark designed for multimodal evaluation across three representative modalities from eleven commonly used datasets. BackdoorMBTI provides a systematic backdoor learning pipeline, encompassing data processing, data poisoning, backdoor training, and evaluation. The generated poison datasets and backdoor models enable detailed evaluation of backdoor defense methods. Given the diversity of modalities, BackdoorMBTI facilitates systematic evaluation across different data types. Furthermore, BackdoorMBTI offers a standardized approach to handling practical factors in backdoor learning, such as issues related to data quality and erroneous labels. We anticipate that BackdoorMBTI will expedite future research in backdoor defense methods within a multimodal context. Code is available at this https URL.

Title: CCi-YOLOv8n: Enhanced Fire Detection with CARAFE and Context-Guided Modules

Authors: Kunwei Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11011
Pdf URL: https://arxiv.org/pdf/2411.11011
Copy Paste: [[2411.11011]] CCi-YOLOv8n: Enhanced Fire Detection with CARAFE and Context-Guided Modules(https://arxiv.org/abs/2411.11011)
Keywords: robust
Abstract: Fire incidents in urban and forested areas pose serious threats, underscoring the need for more effective detection technologies. To address these challenges, we present CCi-YOLOv8n, an enhanced YOLOv8 model with targeted improvements for detecting small fires and smoke. The model integrates the CARAFE up-sampling operator and a context-guided module to reduce information loss during up-sampling and down-sampling, thereby retaining richer feature representations. Additionally, an inverted residual mobile block enhanced C2f module captures small targets and fine smoke patterns, a critical improvement over the original model's detection. For validation, we introduce Web-Fire, a dataset curated for fire and smoke detection across diverse real-world scenarios. Experimental results indicate that CCi-YOLOv8n outperforms YOLOv8n in detection precision, confirming its effectiveness for robust fire detection tasks.

Title: Time Step Generating: A Universal Synthesized Deepfake Image Detector

Authors: Ziyue Zeng, Haoyuan Liu, Dingjie Peng, Luoxu Jing, Hiroshi Watanabe
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11016
Pdf URL: https://arxiv.org/pdf/2411.11016
Copy Paste: [[2411.11016]] Time Step Generating: A Universal Synthesized Deepfake Image Detector(https://arxiv.org/abs/2411.11016)
Keywords: security, privacy, diffusion, generative
Abstract: Currently, high-fidelity text-to-image models are being developed at an accelerating pace. Among them, Diffusion Models have led to a remarkable improvement in the quality of image generation, making it very challenging to distinguish between real and synthesized images. This simultaneously raises serious concerns regarding privacy and security. Some methods have been proposed to distinguish diffusion-model-generated images through reconstruction. However, the inversion and denoising processes are time-consuming and heavily reliant on the pre-trained generative model. Consequently, if the pre-trained generative model encounters out-of-domain data, the detection performance declines. To address this issue, we propose a universal synthetic image detector, Time Step Generating (TSG), which does not rely on pre-trained models' reconstructing ability, specific datasets, or sampling algorithms. Our method utilizes a pre-trained diffusion model's network as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step t of the network input, we can effectively extract these distinguishing detail features. Then, those features can be passed through a classifier (i.e., ResNet), which efficiently detects whether an image is synthetic or real. We test the proposed TSG on the large-scale GenImage benchmark and it achieves significant improvements in both accuracy and generalizability.

Title: A Study of Malware Prevention in Linux Distributions

Authors: Duc-Ly Vu, Trevor Dunlap, Karla Obermeier-Velazquez, Paul Gilbert, John Speed Meyers, Santiago Torres-Arias
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2411.11017
Pdf URL: https://arxiv.org/pdf/2411.11017
Copy Paste: [[2411.11017]] A Study of Malware Prevention in Linux Distributions(https://arxiv.org/abs/2411.11017)
Keywords: attack
Abstract: Malicious attacks on open source software packages are a growing concern. This concern morphed into a panic-inducing crisis after the revelation of the XZ Utils backdoor, which would have provided the attacker with, according to one observer, a "skeleton key" to the internet. This study therefore explores the challenges of preventing and detecting malware in Linux distribution package repositories. To do so, we ask two research questions: (1) What measures have Linux distributions implemented to counter malware, and how have maintainers experienced these efforts? (2) How effective are current malware detection tools at identifying malicious Linux packages? To answer these questions, we conduct interviews with maintainers at several major Linux distributions and introduce a Linux package malware benchmark dataset. Using this dataset, we evaluate the performance of six open source malware detection scanners. Distribution maintainers, according to the interviews, have mostly focused on reproducible builds to date. Our interviews identified only a single Linux distribution, Wolfi OS, that performs active malware scanning. Using this new benchmark dataset, the evaluation found that the performance of existing open-source malware scanners is underwhelming. Most studied tools excel at producing false positives but only infrequently detect true malware. Those that avoid high false positive rates often do so at the expense of a satisfactory true positive. Our findings provide insights into Linux distribution package repositories' current practices for malware detection and demonstrate the current inadequacy of open-source tools designed to detect malicious Linux packages.

Title: Training a Label-Noise-Resistant GNN with Reduced Complexity

Authors: Rui Zhao, Bin Shi, Zhiming Liang, Jianfei Ruan, Bo Dong, Lu Lin
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2411.11020
Pdf URL: https://arxiv.org/pdf/2411.11020
Copy Paste: [[2411.11020]] Training a Label-Noise-Resistant GNN with Reduced Complexity(https://arxiv.org/abs/2411.11020)
Keywords: robust
Abstract: Graph Neural Networks (GNNs) have been widely employed for semi-supervised node classification tasks on graphs. However, the performance of GNNs is significantly affected by label noise, that is, a small amount of incorrectly labeled nodes can substantially misguide model training. Mainstream solutions define node classification with label noise (NCLN) as a reliable labeling task, often introducing node similarity with quadratic computational complexity to more accurately assess label reliability. To this end, in this paper, we introduce the Label Ensemble Graph Neural Network (LEGNN), a lower complexity method for robust GNN training against label noise. LEGNN reframes NCLN as a label ensemble task, gathering informative multiple labels instead of constructing a single reliable label, avoiding high-complexity computations for reliability assessment. Specifically, LEGNN conducts a two-step process: bootstrapping neighboring contexts and robust learning with gathered multiple labels. In the former step, we apply random neighbor masks for each node and gather the predicted labels as a high-probability label set. This mitigates the impact of inaccurately labeled neighbors and diversifies the label set. In the latter step, we utilize a partial label learning based strategy to aggregate the high-probability label information for model training. Additionally, we symmetrically gather a low-probability label set to counteract potential noise from the bootstrapped high-probability label set. Extensive experiments on six datasets demonstrate that LEGNN achieves outstanding performance while ensuring efficiency. Moreover, it exhibits good scalability on datasets with over one hundred thousand nodes and one million edges.

Title: BianCang: A Traditional Chinese Medicine Large Language Model

Authors: Sibo Wei, Xueping Peng, Yi-fei Wang, Jiasheng Si, Weiyu Zhang, Wenpeng Lu, Xiaoming Wu, Yinglong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11027
Pdf URL: https://arxiv.org/pdf/2411.11027
Copy Paste: [[2411.11027]] BianCang: A Traditional Chinese Medicine Large Language Model(https://arxiv.org/abs/2411.11027)
Keywords: large language model
Abstract: The rise of large language models (LLMs) has driven significant progress in medical applications, including traditional Chinese medicine (TCM). However, current medical LLMs struggle with TCM diagnosis and syndrome differentiation due to substantial differences between TCM and modern medical theory, and the scarcity of specialized, high-quality corpora. This paper addresses these challenges by proposing BianCang, a TCM-specific LLM, using a two-stage training process that first injects domain-specific knowledge and then aligns it through targeted stimulation. To enhance diagnostic and differentiation capabilities, we constructed pre-training corpora, instruction-aligned datasets based on real hospital records, and the ChP-TCM dataset derived from the Pharmacopoeia of the People's Republic of China. We compiled extensive TCM and medical corpora for continuous pre-training and supervised fine-tuning, building a comprehensive dataset to refine the model's understanding of TCM. Evaluations across 11 test sets involving 29 models and 4 tasks demonstrate the effectiveness of BianCang, offering valuable insights for future research. Code, datasets, and models are available at this https URL.

Title: Wafer Map Defect Classification Using Autoencoder-Based Data Augmentation and Convolutional Neural Network

Authors: Yin-Yin Bao, Er-Chao Li, Hong-Qiang Yang, Bin-Bin Jia
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2411.11029
Pdf URL: https://arxiv.org/pdf/2411.11029
Copy Paste: [[2411.11029]] Wafer Map Defect Classification Using Autoencoder-Based Data Augmentation and Convolutional Neural Network(https://arxiv.org/abs/2411.11029)
Keywords: robust
Abstract: In semiconductor manufacturing, wafer defect maps (WDMs) play a crucial role in diagnosing issues and enhancing process yields by revealing critical defect patterns. However, accurately categorizing WDM defects presents significant challenges due to noisy data, unbalanced defect classes, and the complexity of failure modes. To address these challenges, this study proposes a novel method combining an autoencoder-based data augmentation technique with a convolutional neural network (CNN). By introducing noise into the latent space, the autoencoder enhances data diversity and mitigates class imbalance, thereby improving the model's generalization capabilities. The augmented dataset is subsequently used to train the CNN, enabling it to deliver precise classification of both common and rare defect patterns. Experimental results on the WM-811K dataset demonstrate that the proposed method achieves a classification accuracy of 98.56%, surpassing Random Forest, SVM, and Logistic Regression by 19%, 21%, and 27%, respectively. These findings highlight the robustness and effectiveness of the proposed approach, offering a reliable solution for wafer defect detection and classification.

Title: EfQAT: An Efficient Framework for Quantization-Aware Training

Authors: Saleh Ashkboos, Bram Verhoef, Torsten Hoefler, Evangelos Eleftheriou, Martino Dazzi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11038
Pdf URL: https://arxiv.org/pdf/2411.11038
Copy Paste: [[2411.11038]] EfQAT: An Efficient Framework for Quantization-Aware Training(https://arxiv.org/abs/2411.11038)
Keywords: transformer
Abstract: Quantization-aware training (QAT) schemes have been shown to achieve near-full precision accuracy. They accomplish this by training a quantized model for multiple epochs. This is computationally expensive, mainly because of the full precision backward pass. On the other hand, post-training quantization (PTQ) schemes do not involve training and are therefore computationally cheap, but they usually result in a significant accuracy drop. We address these challenges by proposing EfQAT, which generalizes both schemes by optimizing only a subset of the parameters of a quantized model. EfQAT starts by applying a PTQ scheme to a pre-trained model and only updates the most critical network parameters while freezing the rest, accelerating the backward pass. We demonstrate the effectiveness of EfQAT on various CNNs and Transformer-based models using different GPUs. Specifically, we show that EfQAT is significantly more accurate than PTQ with little extra compute. Furthermore, EfQAT can accelerate the QAT backward pass by 1.44-1.64x while retaining most accuracy.

Title: FedUHB: Accelerating Federated Unlearning via Polyak Heavy Ball Method

Authors: Yu Jiang, Chee Wei Tan, Kwok-Yan Lam
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2411.11039
Pdf URL: https://arxiv.org/pdf/2411.11039
Copy Paste: [[2411.11039]] FedUHB: Accelerating Federated Unlearning via Polyak Heavy Ball Method(https://arxiv.org/abs/2411.11039)
Keywords: privacy, robust, federate
Abstract: Federated learning facilitates collaborative machine learning, enabling multiple participants to collectively develop a shared model while preserving the privacy of individual data. The growing importance of the "right to be forgotten" calls for effective mechanisms to facilitate data removal upon request. In response, federated unlearning (FU) has been developed to efficiently eliminate the influence of specific data from the model. Current FU methods primarily rely on approximate unlearning strategies, which seek to balance data removal efficacy with computational and communication costs, but often fail to completely erase data influence. To address these limitations, we propose FedUHB, a novel exact unlearning approach that leverages the Polyak heavy ball optimization technique, a first-order method, to achieve rapid retraining. In addition, we introduce a dynamic stopping mechanism to optimize the termination of the unlearning process. Our extensive experiments show that FedUHB not only enhances unlearning efficiency but also preserves robust model performance after unlearning. Furthermore, the dynamic stopping mechanism effectively reduces the number of unlearning iterations, conserving both computational and communication resources. FedUHB thus proves to be an effective and efficient solution for exact data removal in federated learning settings.
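
For readers unfamiliar with the Polyak heavy ball technique named in the abstract, a minimal sketch of the textbook update x_{k+1} = x_k - lr * grad(x_k) + momentum * (x_k - x_{k-1}) may help. This is not FedUHB itself: the function name `heavy_ball`, the toy quadratic objective, and the step size and momentum values are all assumptions chosen for illustration, and no federated or unlearning logic is included.

```python
def heavy_ball(grad, x0, lr=0.1, momentum=0.9, steps=200):
    """Textbook Polyak heavy ball iteration on a list-of-floats parameter vector.

    Each step combines a gradient step with a momentum term that reuses
    the previous displacement (x_k - x_{k-1}).
    """
    x_prev = list(x0)
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x_next = [xi - lr * gi + momentum * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


def quad_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2),
    # whose unique minimizer is the origin.
    return [x[0], 10.0 * x[1]]


x_star = heavy_ball(quad_grad, [5.0, -3.0], lr=0.05, momentum=0.8, steps=500)
```

On this ill-conditioned quadratic the iterates approach the minimizer at the origin; the momentum term is what typically lets heavy ball outpace plain gradient descent on such problems.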

Title: Efficient Federated Unlearning with Adaptive Differential Privacy Preservation

Authors: Yu Jiang, Xindi Tong, Ziyao Liu, Huanyi Ye, Chee Wei Tan, Kwok-Yan Lam
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.11044
Pdf URL: https://arxiv.org/pdf/2411.11044
Copy Paste: [[2411.11044]] Efficient Federated Unlearning with Adaptive Differential Privacy Preservation(https://arxiv.org/abs/2411.11044)
Keywords: privacy, protect, federate
Abstract: Federated unlearning (FU) offers a promising solution to effectively address the need to erase the impact of specific clients' data on the global model in federated learning (FL), thereby granting individuals the "Right to be Forgotten". The most straightforward approach to achieve unlearning is to train the model from scratch, excluding clients who request data removal, but it is resource-intensive. Current state-of-the-art FU methods extend traditional FL frameworks by leveraging stored historical updates, enabling more efficient unlearning than training from scratch. However, the use of stored updates introduces significant privacy risks. Adversaries with access to these updates can potentially reconstruct clients' local data, a well-known vulnerability in the privacy domain. While privacy-enhanced techniques exist, their applications to FU scenarios that balance unlearning efficiency with privacy protection remain underexplored. To address this gap, we propose FedADP, a method designed to achieve both efficiency and privacy preservation in FU. Our approach incorporates an adaptive differential privacy (DP) mechanism, carefully balancing privacy and unlearning performance through a novel budget allocation strategy tailored for FU. FedADP also employs a dual-layered selection process, focusing on global models with significant changes and client updates closely aligned with the global model, reducing storage and communication costs. Additionally, a novel calibration method is introduced to facilitate effective unlearning. Extensive experimental results demonstrate that FedADP effectively manages the trade-off between unlearning efficiency and privacy protection.
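
As background for the differential privacy mechanism mentioned in the abstract, the sketch below shows the generic clip-and-add-Gaussian-noise step commonly applied to client updates. It is an assumption-laden illustration, not FedADP: the names `clip_update` and `privatize` and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and the paper's adaptive budget allocation is not reproduced.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale the update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def privatize(update, clip_norm, noise_multiplier, rng):
    """Clip the update, then add zero-mean Gaussian noise to each coordinate.

    The noise standard deviation is noise_multiplier * clip_norm, the usual
    calibration for the Gaussian mechanism with L2 sensitivity clip_norm.
    """
    clipped = clip_update(update, clip_norm)
    sigma = noise_multiplier * clip_norm
    return [x + rng.gauss(0.0, sigma) for x in clipped]


rng = random.Random(0)  # seeded only to make the example repeatable
clipped = clip_update([3.0, 4.0], clip_norm=1.0)
noisy = privatize([3.0, 4.0], clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

A larger `noise_multiplier` gives stronger privacy at the cost of noisier stored updates, which is the kind of trade-off the abstract describes balancing.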

Title: StableV2V: Stablizing Shape Consistency in Video-to-Video Editing

Authors: Chang Liu, Rui Li, Kaidong Zhang, Yunwei Lan, Dong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11045
Pdf URL: https://arxiv.org/pdf/2411.11045
Copy Paste: [[2411.11045]] StableV2V: Stablizing Shape Consistency in Video-to-Video Editing(https://arxiv.org/abs/2411.11045)
Keywords: generative
Abstract: Recent advancements in generative AI have significantly promoted content creation and editing, where prevailing studies further extend this exciting progress to video editing. In doing so, these studies mainly transfer the inherent motion patterns from the source videos to the edited ones, where results with inferior consistency to user prompts are often observed, due to the lack of particular alignments between the delivered motions and edited contents. To address this limitation, we present a shape-consistent video editing method, namely StableV2V, in this paper. Our method decomposes the entire editing pipeline into several sequential procedures, where it edits the first video frame, then establishes an alignment between the delivered motions and user prompts, and eventually propagates the edited contents to all other frames based on such alignment. Furthermore, we curate a testing benchmark, namely DAVIS-Edit, for a comprehensive evaluation of video editing, considering various types of prompts and difficulties. Experimental results and analyses illustrate the superior performance, visual consistency, and inference efficiency of our method compared to existing state-of-the-art studies.

Title: Knowledge-enhanced Transformer for Multivariate Long Sequence Time-series Forecasting

Authors: Shubham Tanaji Kakde, Rony Mitra, Jasashwi Mandal, Manoj Kumar Tiwari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11046
Pdf URL: https://arxiv.org/pdf/2411.11046
Copy Paste: [[2411.11046]] Knowledge-enhanced Transformer for Multivariate Long Sequence Time-series Forecasting(https://arxiv.org/abs/2411.11046)
Keywords: transformer
Abstract: Multivariate Long Sequence Time-series Forecasting (LSTF) has been a critical task across various real-world applications. Recent advancements focus on the application of transformer architectures attributable to their ability to capture temporal patterns effectively over extended periods. However, these approaches often overlook the inherent relationships and interactions between the input variables that could be drawn from their characteristic properties. In this paper, we aim to bridge this gap by integrating information-rich Knowledge Graph Embeddings (KGE) with state-of-the-art transformer-based architectures. We introduce a novel approach that encapsulates conceptual relationships among variables within a well-defined knowledge graph, forming dynamic and learnable KGEs for seamless integration into the transformer architecture. We investigate the influence of this integration into seminal architectures such as PatchTST, Autoformer, Informer, and Vanilla Transformer. Furthermore, we thoroughly investigate the performance of these knowledge-enhanced architectures along with their original implementations for long forecasting horizons and demonstrate significant improvement in the benchmark results. This enhancement empowers transformer-based architectures to address the inherent structural relation between variables. Our knowledge-enhanced approach improves the accuracy of multivariate LSTF by capturing complex temporal and relational dynamics across multiple domains. To substantiate the validity of our model, we conduct comprehensive experiments using Weather and Electric Transformer Temperature (ETT) datasets.

Title: SRA-MCTS: Self-driven Reasoning Aurmentation with Monte Carlo Tree Search for Enhanced Code Generation

Authors: Bin Xu, Yiguan Lin, Yinghao Li, Yang Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11053
Pdf URL: https://arxiv.org/pdf/2411.11053
Copy Paste: [[2411.11053]] SRA-MCTS: Self-driven Reasoning Aurmentation with Monte Carlo Tree Search for Enhanced Code Generation(https://arxiv.org/abs/2411.11053)
Keywords: robust, large language model
Abstract: Large language models demonstrate exceptional performance in simple code generation tasks but still face challenges in tackling complex problems. These challenges may stem from insufficient reasoning and problem decomposition capabilities. To address this issue, we propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths. This creates a positive feedback loop, enabling continuous improvement. Our method operates entirely through the model itself without requiring additional supervision. By synthesizing natural language reasoning paths and translating them into executable code, the approach ensures analytical accuracy and enhances the success rate in solving complex tasks. Experimental results show that, even without additional supervisory signals, our method achieves performance improvements across different model scales, demonstrating the significant potential of self-improvement in small models. Furthermore, the method remains robust when traditional Chain-of-Thought (CoT) approaches exhibit performance degradation, with notable improvements observed in diversity metrics such as pass@10. We encourage further exploration of reasoning processes within training data to enhance the ability of language models to address complex problems.

Title: FastDraft: How to Train Your Draft

Authors: Ofir Zafrir, Igor Margulis, Dorin Shteyman, Guy Boudoukh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11055
Pdf URL: https://arxiv.org/pdf/2411.11055
Copy Paste: [[2411.11055]] FastDraft: How to Train Your Draft(https://arxiv.org/abs/2411.11055)
Keywords: large language model
Abstract: Speculative Decoding has gained popularity as an effective technique for accelerating the auto-regressive inference process of Large Language Models (LLMs). However, Speculative Decoding entirely relies on the availability of efficient draft models, which are often lacking for many existing language models due to a stringent constraint of vocabulary incompatibility. In this work we introduce FastDraft, a novel and efficient approach for pre-training and aligning a draft model to any large language model by incorporating efficient pre-training, followed by fine-tuning over synthetic datasets generated by the target model. We demonstrate FastDraft by training two highly parameter efficient drafts for the popular Phi-3-mini and Llama-3.1-8B models. Using FastDraft, we were able to produce a draft with approximately 10 billion tokens on a single server with 8 Intel® Gaudi® 2 accelerators in under 24 hours. Our results show that the draft model achieves impressive results in key metrics of acceptance rate, block efficiency and up to 3x memory bound speed up when evaluated on code completion and up to 2x in summarization, text completion and instruction tasks. We validate our theoretical findings through benchmarking on the latest Intel® Core™ Ultra, achieving a wall-clock time speedup of up to 2x, indicating a significant reduction in runtime. Due to its high quality, FastDraft unlocks large language model inference on AI-PC and other edge-devices.
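Illustrative sketch (not from the paper): to make "acceptance rate" concrete, the snippet below implements one step of the greedy variant of speculative decoding with toy next-token functions standing in for the draft and target models. The function name `speculative_step`, the toy models, and the choice of the greedy (exact-match) acceptance rule are assumptions; FastDraft's training recipe itself is not shown.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Run one greedy speculative-decoding step.

    draft_next / target_next map a token list to the next token (toy
    stand-ins for real models). Returns (new_tokens, num_accepted), where
    num_accepted counts draft tokens the target agreed with.
    """
    # 1) The cheap draft model proposes k tokens autoregressively.
    context = list(prefix)
    proposal = []
    for _ in range(k):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # 2) The target checks each proposed token. A real system scores all
    #    k positions in a single forward pass; here we loop for clarity.
    context = list(prefix)
    new_tokens = []
    for token in proposal:
        expected = target_next(context)
        if token != expected:
            new_tokens.append(expected)  # target's correction ends the step
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(token)
        context.append(token)

    # 3) Every draft token was accepted: the target adds one bonus token.
    new_tokens.append(target_next(context))
    return new_tokens, k


def toy_target(context):
    return (context[-1] + 1) % 10


def toy_draft(context):
    # Agrees with the target except right after token 3.
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


if __name__ == "__main__":
    tokens, accepted = speculative_step([1], toy_draft, toy_target, k=4)
    print(tokens, accepted)
```

With the greedy rule the emitted tokens are exactly what the target alone would have produced, so a better-aligned draft changes only how many tokens are accepted per step, not the output.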

Title: Patching FPGAs: The Security Implications of Bitstream Modifications

Authors: Endres Puschner, Maik Ender, Steffen Becker, Christof Paar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11060
Pdf URL: https://arxiv.org/pdf/2411.11060
Copy Paste: [[2411.11060]] Patching FPGAs: The Security Implications of Bitstream Modifications(https://arxiv.org/abs/2411.11060)
Keywords: security, protect, attack
Abstract: Field Programmable Gate Arrays (FPGAs) are known for their reprogrammability that allows for post-manufacture circuitry changes. Nowadays, they are integral to a variety of systems including high-security applications such as aerospace and military systems. However, this reprogrammability also introduces significant security challenges, as bitstream manipulation can directly alter hardware circuits. Malicious manipulations may lead to leakage of secret data and the implementation of hardware Trojans. In this paper, we present a comprehensive framework for manipulating bitstreams with minimal reverse engineering, thereby exposing the potential risks associated with inadequate bitstream protection. Our methodology does not require a complete understanding of proprietary bitstream formats or a fully reverse-engineered target design. Instead, it enables precise modifications by inserting pre-synthesized circuits into existing bitstreams. This novel approach is demonstrated through a semi-automated framework consisting of five steps: (1) partial bitstream reverse engineering, (2) designing the modification, (3) placing and (4) routing the modification into the existing circuit, and (5) merging of the modification with the original bitstream. We validate our framework through four practical case studies on the OpenTitan design synthesized for Xilinx 7-Series FPGAs. While current protections such as bitstream authentication and encryption often fall short, our work highlights and discusses the urgency of developing effective countermeasures. We recommend using FPGAs as trust anchors only when bitstream manipulation attacks can be reliably excluded.

Title: Beyond Human-Like Processing: Large Language Models Perform Equivalently on Forward and Backward Scientific Text

Authors: Xiaoliang Luo, Michael Ramscar, Bradley C. Love
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2411.11061
Pdf URL: https://arxiv.org/pdf/2411.11061
Copy Paste: [[2411.11061]] Beyond Human-Like Processing: Large Language Models Perform Equivalently on Forward and Backward Scientific Text(https://arxiv.org/abs/2411.11061)
Keywords: transformer, large language model
Abstract: The impressive performance of large language models (LLMs) has led to their consideration as models of human language processing. Instead, we suggest that the success of LLMs arises from the flexibility of the transformer learning architecture. To evaluate this conjecture, we trained LLMs on scientific texts that were either in a forward or backward format. Despite backward text being inconsistent with the structure of human languages, we found that LLMs performed equally well in either format on a neuroscience benchmark, eclipsing human expert performance for both forward and backward orders. Our results are consistent with the success of transformers across diverse domains, such as weather prediction and protein design. This widespread success is attributable to LLMs' ability to extract predictive patterns from any sufficiently structured input. Given their generality, we suggest caution in interpreting LLMs' success in linguistic tasks as evidence for human-like mechanisms.

Title: TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

Authors: Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11066
Pdf URL: https://arxiv.org/pdf/2411.11066
Copy Paste: [[2411.11066]] TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models(https://arxiv.org/abs/2411.11066)
Keywords: large language model
Abstract: Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents. For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data. In contrast, paired image-text data are much easier to obtain, and there is substantial similarity between images and videos. Consequently, extending image LLMs for video understanding tasks presents an appealing alternative. Developing effective strategies for compressing visual tokens from multiple frames is a promising way to leverage the powerful pre-trained image LLM. In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM. The findings lead to our method TS-LLaVA, which constructs visual tokens through a Thumbnail-and-Sampling strategy. Given a video, we select a few equidistant frames from all input frames to construct a Thumbnail image as a detailed visual cue, complemented by Sampled visual tokens from all input frames. Our method establishes the new state-of-the-art performance among training-free video LLMs on various benchmarks. Notably, our 34B model outperforms GPT-4V on the MVBench benchmark, and achieves performance comparable to the 72B training-based video LLM, Video-LLaMA2, on the challenging MLVU benchmark. Code is available at this https URL.
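Illustrative sketch (not from the paper): the "select a few equidistant frames" step can be pictured as picking the centre frame of each of several equal-width segments of the video. The function name and the centre-of-segment rule are assumptions; the abstract does not say exactly how TS-LLaVA places its frames or composes the thumbnail.

```python
def equidistant_indices(num_frames, num_selected):
    """Return num_selected frame indices spread evenly over the video.

    The video is split into num_selected equal-width segments and the
    centre frame of each segment is taken (integer arithmetic only).
    """
    if num_frames <= 0 or num_selected <= 0:
        return []
    if num_selected >= num_frames:
        return list(range(num_frames))
    return [((2 * i + 1) * num_frames) // (2 * num_selected)
            for i in range(num_selected)]


if __name__ == "__main__":
    # Pick 4 frames out of a 12-frame clip for a hypothetical 2x2 thumbnail.
    print(equidistant_indices(12, 4))  # [1, 4, 7, 10]
```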

Title: Skeleton-Guided Spatial-Temporal Feature Learning for Video-Based Visible-Infrared Person Re-Identification

Authors: Wenjia Jiang, Xiaoke Zhu, Jiakang Gao, Di Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11069
Pdf URL: https://arxiv.org/pdf/2411.11069
Copy Paste: [[2411.11069]] Skeleton-Guided Spatial-Temporal Feature Learning for Video-Based Visible-Infrared Person Re-Identification(https://arxiv.org/abs/2411.11069)
Keywords: robust
Abstract: Video-based visible-infrared person re-identification (VVI-ReID) is challenging due to significant modality feature discrepancies. Spatial-temporal information in videos is crucial, but the accuracy of spatial-temporal information is often influenced by issues like low quality and occlusions in videos. Existing methods mainly focus on reducing modality differences, but pay limited attention to improving spatial-temporal features, particularly for infrared videos. To address this, we propose a novel Skeleton-guided spatial-Temporal feAture leaRning (STAR) method for VVI-ReID. By using skeleton information, which is robust to issues such as poor image quality and occlusions, STAR improves the accuracy of spatial-temporal features in videos of both modalities. Specifically, STAR employs two levels of skeleton-guided strategies: frame level and sequence level. At the frame level, the robust structured skeleton information is used to refine the visual features of individual frames. At the sequence level, we design a feature aggregation mechanism based on skeleton key points graph, which learns the contribution of different body parts to spatial-temporal features, further enhancing the accuracy of global features. Experiments on benchmark datasets demonstrate that STAR outperforms state-of-the-art methods. Code will be open source soon.

Title: Multilingual Large Language Models: A Systematic Survey

Authors: Shaolin Zhu, Supryadi, Shaoyang Xu, Haoran Sun, Leiyu Pan, Menglong Cui, Jiangcun Du, Renren Jin, António Branco, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11072
Pdf URL: https://arxiv.org/pdf/2411.11072
Copy Paste: [[2411.11072]] Multilingual Large Language Models: A Systematic Survey(https://arxiv.org/abs/2411.11072)
Keywords: interpretability, large language model
Abstract: This paper provides a comprehensive survey of the latest research on multilingual large language models (MLLMs). MLLMs not only are able to understand and generate language across linguistic boundaries, but also represent an important advancement in artificial intelligence. We first discuss the architecture and pre-training objectives of MLLMs, highlighting the key components and methodologies that contribute to their multilingual capabilities. We then discuss the construction of multilingual pre-training and alignment datasets, underscoring the importance of data quality and diversity in enhancing MLLM performance. An important focus of this survey is on the evaluation of MLLMs. We present a detailed taxonomy and roadmap covering the assessment of MLLMs' cross-lingual knowledge, reasoning, alignment with human values, safety, interpretability and specialized applications. Specifically, we extensively discuss multilingual evaluation benchmarks and datasets, and explore the use of LLMs themselves as multilingual evaluators. To enhance MLLMs from black to white boxes, we also address the interpretability of multilingual capabilities, cross-lingual transfer and language bias within these models. Finally, we provide a comprehensive review of real-world applications of MLLMs across diverse domains, including biology, medicine, computer science, mathematics and law. We showcase how these models have driven innovation and improvements in these specialized fields while also highlighting the challenges and opportunities in deploying MLLMs within diverse language communities and application scenarios. We list the papers related to this survey, which are publicly available at this https URL.

Title: The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection

Authors: Tomas Horych, Christoph Mandl, Terry Ruas, Andre Greiner-Petter, Bela Gipp, Akiko Aizawa, Timo Spinde
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11081
Pdf URL: https://arxiv.org/pdf/2411.11081
Copy Paste: [[2411.11081]] The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection(https://arxiv.org/abs/2411.11081)
Keywords: large language model
Abstract: High annotation costs from hiring or crowdsourcing complicate the creation of large, high-quality datasets needed for training reliable text classifiers. Recent research suggests using Large Language Models (LLMs) to automate the annotation process, reducing these costs while maintaining data quality. LLMs have shown promising results in annotating downstream tasks like hate speech detection and political framing. Building on the success in these areas, this study investigates whether LLMs are viable for annotating the complex task of media bias detection and whether a downstream media bias classifier can be trained on such data. We create annolexical, the first large-scale dataset for media bias classification with over 48000 synthetically annotated examples. Our classifier, fine-tuned on this dataset, surpasses all of the annotator LLMs by 5-9 percent in Matthews Correlation Coefficient (MCC) and performs close to or outperforms the model trained on human-labeled data when evaluated on two media bias benchmark datasets (BABE and BASIL). This study demonstrates how our approach significantly reduces the cost of dataset creation in the media bias domain and, by extension, the development of classifiers, while our subsequent behavioral stress-testing reveals some of its current limitations and trade-offs.
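Illustrative sketch (not from the paper): the Matthews Correlation Coefficient used above for comparison can be computed from a binary confusion matrix as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name and the example counts are made up for illustration and are not results from the study.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Returns a value in [-1, 1]; 0.0 is returned when a marginal count is
    zero and the coefficient is undefined (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts for a biased / not-biased classifier.
    print(round(matthews_corrcoef(tp=6, fp=2, tn=8, fn=4), 3))
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which makes it a reasonable choice when the biased and unbiased classes are imbalanced.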

Title: D-Cube: Exploiting Hyper-Features of Diffusion Model for Robust Medical Classification

Authors: Minhee Jang, Juheon Son, Thanaporn Viriyasaranon, Junho Kim, Jang-Hwan Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11087
Pdf URL: https://arxiv.org/pdf/2411.11087
Copy Paste: [[2411.11087]] D-Cube: Exploiting Hyper-Features of Diffusion Model for Robust Medical Classification(https://arxiv.org/abs/2411.11087)
Keywords: robust, extraction, diffusion
Abstract: The integration of deep learning technologies in medical imaging aims to enhance the efficiency and accuracy of cancer diagnosis, particularly for pancreatic and breast cancers, which present significant diagnostic challenges due to their high mortality rates and complex imaging characteristics. This paper introduces Diffusion-Driven Diagnosis (D-Cube), a novel approach that leverages hyper-features from a diffusion model combined with contrastive learning to improve cancer diagnosis. D-Cube employs advanced feature selection techniques that utilize the robust representational capabilities of diffusion models, enhancing classification performance on medical datasets under challenging conditions such as data imbalance and limited sample availability. The feature selection process optimizes the extraction of clinically relevant features, significantly improving classification accuracy and demonstrating resilience in imbalanced and limited data scenarios. Experimental results validate the effectiveness of D-Cube across multiple medical imaging modalities, including CT, MRI, and X-ray, showing superior performance compared to existing baseline models. D-Cube represents a new strategy in cancer detection, employing advanced deep learning techniques to achieve state-of-the-art diagnostic accuracy and efficiency.

Title: MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

Authors: Xi Fang, Jiankun Wang, Xiaochen Cai, Shangqian Chen, Shuwen Yang, Lin Yao, Linfeng Zhang, Guolin Ke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11098
Pdf URL: https://arxiv.org/pdf/2411.11098
Copy Paste: [[2411.11098]] MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild(https://arxiv.org/abs/2411.11098)
Keywords: extraction, large language model
Abstract: In recent decades, chemistry publications and patents have increased rapidly. A significant portion of key information is embedded in molecular structure figures, complicating large-scale literature searches and limiting the application of large language models in fields such as biology, chemistry, and pharmaceuticals. The automatic extraction of precise chemical structures is of critical importance. However, the presence of numerous Markush structures in real-world documents, along with variations in molecular image quality, drawing styles, and noise, significantly limits the performance of existing optical chemical structure recognition (OCSR) methods. We present MolParser, a novel end-to-end OCSR method that efficiently and accurately recognizes chemical structures from real-world documents, including difficult Markush structures. We use an extended SMILES encoding rule to annotate our training dataset. Under this rule, we build MolParser-7M, the largest annotated molecular image dataset to our knowledge. While utilizing a large amount of synthetic data, we employed active learning methods to incorporate substantial in-the-wild data, specifically samples cropped from real patents and scientific literature, into the training process. We trained an end-to-end molecular image captioning model, MolParser, using a curriculum learning approach. MolParser significantly outperforms classical and learning-based methods across most scenarios, with potential for broader downstream applications. The dataset is publicly available.

Title: Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML

Authors: Prakhar Ganeesh, Usman Gohar, Lu Cheng, Golnoosh Farnadi
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2411.11101
Pdf URL: https://arxiv.org/pdf/2411.11101
Copy Paste: [[2411.11101]] Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML(https://arxiv.org/abs/2411.11101)
Keywords: fair
Abstract: With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of the evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and trends that guide the selection of appropriate algorithms.

Title: Label Sharing Incremental Learning Framework for Independent Multi-Label Segmentation Tasks

Authors: Deepa Anand, Bipul Das, Vyshnav Dangeti, Antony Jerald, Rakesh Mullick, Uday Patil, Pakhi Sharma, Prasad Sudhakar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11105
Pdf URL: https://arxiv.org/pdf/2411.11105
Copy Paste: [[2411.11105]] Label Sharing Incremental Learning Framework for Independent Multi-Label Segmentation Tasks(https://arxiv.org/abs/2411.11105)
Keywords: segmentation
Abstract: In a setting where segmentation models have to be built for multiple datasets, each with its own corresponding label set, a straightforward way is to learn one model for every dataset and its labels. Alternatively, multi-task architectures with shared encoders and multiple segmentation heads or shared weights with compound labels can also be used. This work proposes a novel label sharing framework where a shared common label space is constructed and each of the individual label sets is systematically mapped to the common labels. This transforms multiple datasets with disparate label sets into a single large dataset with shared labels, and therefore all the segmentation tasks can be addressed by learning a single model. This eliminates the need for task specific adaptations in network architectures and also results in parameter and data efficient models. Furthermore, the label sharing framework is naturally amenable to incremental learning where segmentations for new datasets can be easily learnt. We experimentally validate our method on various medical image segmentation datasets, each involving multi-label segmentation. Furthermore, we demonstrate the efficacy of the proposed method in terms of performance and incremental learning ability vis-a-vis alternative methods.
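Illustrative sketch (not from the paper): mapping per-dataset label sets onto a shared label space amounts to a lookup table per dataset applied to every mask. The dataset names, anatomical labels, and the hand-written mapping below are hypothetical; the abstract does not describe how the paper constructs its mapping systematically.

```python
# Hypothetical per-dataset label ids mapped to shared label ids.
# 0 is background in every dataset and in the shared space.
LABEL_MAPS = {
    "knee_mri": {0: 0, 1: 1, 2: 2},      # 1=femur,  2=tibia
    "spine_ct": {0: 0, 1: 1, 2: 2},      # 1=vertebra, 2=disc
    "hip_xray": {0: 0, 5: 1, 9: 2},      # 5=pelvis, 9=femoral head
}


def to_shared_labels(mask, dataset_name):
    """Rewrite a 2-D mask (list of rows) from dataset ids to shared ids."""
    mapping = LABEL_MAPS[dataset_name]
    return [[mapping[value] for value in row] for row in mask]


if __name__ == "__main__":
    hip_mask = [[0, 5, 5],
                [0, 9, 9]]
    print(to_shared_labels(hip_mask, "hip_xray"))  # [[0, 1, 1], [0, 2, 2]]
```

After remapping, masks from all datasets use the same small id set, so they can be pooled to train one model with one output head; a new dataset only needs a new entry in the table.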

Title: JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit

Authors: Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11114
Pdf URL: https://arxiv.org/pdf/2411.11114
Copy Paste: [[2411.11114]] JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit(https://arxiv.org/abs/2411.11114)
Keywords: security, attack, large language model
Abstract: Despite the outstanding performance of large language models (LLMs) in diverse tasks, they are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, the understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreaking behavior (e.g., the degree to which the model refuses to respond) of LLMs by analyzing the representation shifts in their latent space caused by jailbreak prompts or identifying key neurons that contribute to the success of these attacks. However, these studies neither explore diverse jailbreak patterns nor provide a fine-grained explanation from the failure of circuits to the changes in representation, leaving significant gaps in uncovering the jailbreak mechanism. In this paper, we propose JailbreakLens, an interpretation framework that analyzes jailbreak mechanisms from both representation (which reveals how jailbreaks alter the model's harmfulness perception) and circuit perspectives (which uncovers the causes of these deceptions by identifying key circuits contributing to the vulnerability), tracking their evolution throughout the entire response generation process. We then conduct an in-depth evaluation of jailbreak behavior on four mainstream LLMs under seven jailbreak strategies. Our evaluation finds that jailbreak prompts amplify components that reinforce affirmative responses while suppressing those that produce refusal. Although this manipulation shifts model representations toward safe clusters to deceive the LLM, leading it to provide detailed responses instead of refusals, it still produces abnormal activations that can be caught in the circuit analysis.

Title: Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method

Authors: Yan Zheng, Zhenxiao Liang, Xiaoyan Cong, Lanqing Guo, Yuehao Wang, Peihao Wang, Zhangyang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11135
Pdf URL: https://arxiv.org/pdf/2411.11135
Copy Paste: [[2411.11135]] Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method(https://arxiv.org/abs/2411.11135)
Keywords: diffusion
Abstract: We explore the oscillatory behavior observed in inversion methods applied to large-scale text-to-image diffusion models, with a focus on the "Flux" model. By employing a fixed-point-inspired iterative approach to invert real-world images, we observe that the solution does not achieve convergence, instead oscillating between distinct clusters. Through both toy experiments and real-world diffusion models, we demonstrate that these oscillating clusters exhibit notable semantic coherence. We offer theoretical insights, showing that this behavior arises from oscillatory dynamics in rectified flow models. Building on this understanding, we introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing. Furthermore, we provide quantitative results demonstrating the effectiveness of our method for tasks such as image enhancement, makeup transfer, reconstruction quality, and guided sampling quality. Higher-quality examples of videos and images are available at this https URL.

Title: CLMIA: Membership Inference Attacks via Unsupervised Contrastive Learning

Authors: Depeng Chen, Xiao Liu, Jie Cui, Hong Zhong (School of Computer Science and Technology, Anhui University)
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2411.11144
Pdf URL: https://arxiv.org/pdf/2411.11144
Copy Paste: [[2411.11144]] CLMIA: Membership Inference Attacks via Unsupervised Contrastive Learning(https://arxiv.org/abs/2411.11144)
Keywords: attack, membership infer
Abstract: Since a machine learning model is often trained on a limited data set, it is trained multiple times on the same data samples, which causes the model to memorize most of the training set data. Membership Inference Attacks (MIAs) exploit this feature to determine whether a data sample was used for training a machine learning model. However, in realistic scenarios, it is difficult for the adversary to obtain enough qualified samples that mark accurate identity information, especially since most samples are non-members in real-world applications. To address this limitation, in this paper, we propose a new attack method called CLMIA, which uses unsupervised contrastive learning to train an attack model without using extra membership status information. Meanwhile, in CLMIA, we require only a small amount of data with known membership status to fine-tune the attack model. Experimental results demonstrate that CLMIA performs better than existing attack methods for different datasets and model structures, especially on data with less marked identity information. In addition, we experimentally find that the attack performs differently for different proportions of labeled identity information for member and non-member data. Further analysis shows that our attack method performs better with less labeled identity information, which applies to more realistic scenarios.

Title: From Primes to Paths: Enabling Fast Multi-Relational Graph Analysis

Authors: Konstantinos Bougiatiotis, Georgios Paliouras
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2411.11149
Pdf URL: https://arxiv.org/pdf/2411.11149
Copy Paste: [[2411.11149]] From Primes to Paths: Enabling Fast Multi-Relational Graph Analysis(https://arxiv.org/abs/2411.11149)
Keywords: extraction, interpretability
Abstract: Multi-relational networks capture intricate relationships in data and have diverse applications across fields such as biomedical, financial, and social sciences. As networks derived from increasingly large datasets become more common, identifying efficient methods for representing and analyzing them becomes crucial. This work extends the Prime Adjacency Matrices (PAMs) framework, which employs prime numbers to represent distinct relations within a network uniquely. This enables a compact representation of a complete multi-relational graph using a single adjacency matrix, which, in turn, facilitates quick computation of multi-hop adjacency matrices. In this work, we enhance the framework by introducing a lossless algorithm for calculating the multi-hop matrices and propose the Bag of Paths (BoP) representation, a versatile feature extraction methodology for various graph analytics tasks, at the node, edge, and graph level. We demonstrate the efficiency of the framework across various tasks and datasets, showing that simple BoP-based models perform comparably to or better than commonly used neural models while offering improved speed and interpretability.
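
Illustrative sketch (not taken from the paper): the basic prime-adjacency idea described above can be shown on a toy graph. Each relation type is assigned a distinct prime, one matrix stores the whole multi-relational graph, and squaring that matrix gives 2-hop entries whose prime factors name the relations along the path. The graph, the relation names, and the helper functions `prime_adjacency` and `matmul` are hypothetical. This is plain matrix multiplication, so it is not the lossless multi-hop algorithm or the Bag of Paths representation that the abstract mentions.

```python
def prime_adjacency(num_nodes, triples, relation_primes):
    """Build one adjacency matrix whose nonzero entries are relation primes."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for head, relation, tail in triples:
        matrix[head][tail] = relation_primes[relation]
    return matrix


def matmul(a, b):
    """Plain square-matrix product over Python integers."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy graph: 0 -treats-> 1 -causes-> 2, plus 0 -causes-> 2.
relation_primes = {"treats": 2, "causes": 3}
triples = [(0, "treats", 1), (1, "causes", 2), (0, "causes", 2)]

pam = prime_adjacency(3, triples, relation_primes)
two_hop = matmul(pam, pam)

# The only 2-hop path is 0 -> 1 -> 2, so its entry is the product 2 * 3.
print(two_hop[0][2])  # 6, which factors into the primes for "treats" and "causes"
```

When several 2-hop paths connect the same pair of nodes, plain multiplication sums their products and the individual paths can no longer be read off the entry. Information loss of this kind is presumably what a lossless multi-hop computation has to avoid.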

Title: Person Segmentation and Action Classification for Multi-Channel Hemisphere Field of View LiDAR Sensors

Authors: Svetlana Seliunina, Artem Otelepko, Raphael Memmesheimer, Sven Behnke
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2411.11151
Pdf URL: https://arxiv.org/pdf/2411.11151
Copy Paste: [[2411.11151]] Person Segmentation and Action Classification for Multi-Channel Hemisphere Field of View LiDAR Sensors(https://arxiv.org/abs/2411.11151)
Keywords: segmentation
Abstract: Robots need to perceive persons in their surroundings for safety and to interact with them. In this paper, we present a person segmentation and action classification approach that operates on 3D scans of hemisphere field of view LiDAR sensors. We recorded a data set with an Ouster OSDome-64 sensor consisting of scenes where persons perform three different actions and annotated it. We propose a method based on a MaskDINO model to detect and segment persons and to recognize their actions from combined spherical projected multi-channel representations of the LiDAR data with an additional positional encoding. Our approach demonstrates good performance for the person segmentation task and further performs well for the estimation of the person action states walking, waving, and sitting. An ablation study provides insights about the individual channel contributions for the person segmentation task. The trained models, code and dataset are made publicly available.

Title: Federated Learning for UAV-Based Spectrum Sensing: Enhancing Accuracy Through SNR-Weighted Model Aggregation

Authors: Kürşat Tekbıyık, Güneş Karabulut Kurt, Antoine Lesage-Landry
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11159
Pdf URL: https://arxiv.org/pdf/2411.11159
Copy Paste: [[2411.11159]] Federated Learning for UAV-Based Spectrum Sensing: Enhancing Accuracy Through SNR-Weighted Model Aggregation(https://arxiv.org/abs/2411.11159)
Keywords: privacy, federate
Abstract: The increasing demand for data usage in wireless communications requires using wider bands in the spectrum, especially for backhaul links. Yet, allocations in the spectrum for non-communication systems inhibit merging bands to achieve wider bandwidth. To overcome this issue, spectrum-sharing or opportunistic spectrum utilization by secondary users stands out as a promising solution. However, both approaches must minimize interference to primary users. Therefore, spectrum sensing becomes vital for such opportunistic usage, ensuring the proper operation of the primary users. Although this problem has been investigated for 2D networks, unmanned aerial vehicle (UAV) networks need different points of view concerning 3D space, its challenges, and opportunities. For this purpose, we propose a federated learning (FL)-based method for spectrum sensing in UAV networks to account for their distributed nature and limited computational capacity. FL enables local training without sharing raw data while guaranteeing the privacy of local users, lowering communication overhead, and increasing data diversity. Furthermore, we develop a federated aggregation method, namely FedSNR, that considers the signal-to-noise ratio observed by UAVs to acquire a global model. The numerical results show that the proposed architecture and the aggregation method outperform traditional methods.
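
Illustrative sketch (not taken from the paper): one way to read "SNR-weighted model aggregation" is a weighted average of client parameters in which each UAV's weight grows with the signal-to-noise ratio it observed. The abstract does not give the FedSNR weighting rule, so the choice below (weights proportional to linear-scale SNR) is an assumption, and the function name `snr_weighted_average` and the toy numbers are hypothetical.

```python
def snr_weighted_average(client_params, client_snrs_db):
    """Average client parameter vectors with weights proportional to linear SNR.

    Assumption: weight_i = snr_i / sum(snr), with SNR converted from dB to
    linear scale. The actual FedSNR rule may differ.
    """
    linear_snrs = [10 ** (snr_db / 10) for snr_db in client_snrs_db]
    total = sum(linear_snrs)
    weights = [value / total for value in linear_snrs]

    num_params = len(client_params[0])
    global_params = [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(num_params)
    ]
    return global_params, weights


# Two hypothetical UAV clients, each holding a 2-parameter local model.
client_params = [[1.0, 2.0], [3.0, 4.0]]
client_snrs_db = [0.0, 10.0]  # the second client saw a 10x stronger signal

global_params, weights = snr_weighted_average(client_params, client_snrs_db)
print(weights)        # roughly [0.09, 0.91]
print(global_params)  # pulled toward the high-SNR client's parameters
```

With equal SNRs the weights become uniform and the rule reduces to an unweighted average of the client models.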

Title: MPLite: Multi-Aspect Pretraining for Mining Clinical Health Records

Authors: Eric Yang, Pengfei Hu, Xiaoxue Han, Yue Ning
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11161
Pdf URL: https://arxiv.org/pdf/2411.11161
Copy Paste: [[2411.11161]] MPLite: Multi-Aspect Pretraining for Mining Clinical Health Records(https://arxiv.org/abs/2411.11161)
Keywords: robust
Abstract: The adoption of digital systems in healthcare has resulted in the accumulation of vast electronic health records (EHRs), offering valuable data for machine learning methods to predict patient health outcomes. However, single-visit records of patients are often neglected in the training process due to the lack of annotations of next-visit information, thereby limiting the predictive and expressive power of machine learning models. In this paper, we present a novel framework MPLite that utilizes Multi-aspect Pretraining with Lab results through a light-weight neural network to enhance medical concept representation and predict future health outcomes of individuals. By incorporating both structured medical data and additional information from lab results, our approach fully leverages patient admission records. We design a pretraining module that predicts medical codes based on lab results, ensuring robust prediction by fusing multiple aspects of features. Our experimental evaluation using both MIMIC-III and MIMIC-IV datasets demonstrates improvements over existing models in diagnosis prediction and heart failure prediction tasks, achieving a higher weighted-F1 and recall with MPLite. This work reveals the potential of integrating diverse aspects of data to advance predictive modeling in healthcare.

Title: RPN 2: On Interdependence Function Learning Towards Unifying and Advancing CNN, RNN, GNN, and Transformer

Authors: Jiawei Zhang
Subjects: cs.LG, cs.AI, cs.CV, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2411.11162
Pdf URL: https://arxiv.org/pdf/2411.11162
Copy Paste: [[2411.11162]] RPN 2: On Interdependence Function Learning Towards Unifying and Advancing CNN, RNN, GNN, and Transformer(https://arxiv.org/abs/2411.11162)
Keywords: transformer
Abstract: This paper builds upon our previous work on the Reconciled Polynomial Network (RPN). The original RPN model was designed under the assumption of input data independence, presuming the independence among both individual instances within data batches and attributes in each data instance. However, this assumption often proves invalid for function learning tasks involving complex, interdependent data such as language, images, time series, and graphs. Ignoring such data interdependence may inevitably lead to significant performance degradation. To overcome these limitations, we introduce the new Reconciled Polynomial Network (version 2), namely RPN 2, in this paper. By incorporating data and structural interdependence functions, RPN 2 explicitly models data interdependence via new component functions in its architecture. This enhancement not only significantly improves RPN 2's learning performance but also substantially expands its unifying potential, enabling it to encompass a broader range of contemporary dominant backbone models within its canonical representation. These backbones include, but are not limited to, convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and Transformers. Our analysis reveals that the fundamental distinctions among these backbone models primarily stem from their diverse approaches to defining the interdependence functions. Furthermore, this unified representation opens up new opportunities for designing innovative architectures with the potential to surpass the performance of these dominant backbones.

Title: Enhanced Anime Image Generation Using USE-CMHSA-GAN

Authors: J. Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11179
Pdf URL: https://arxiv.org/pdf/2411.11179
Copy Paste: [[2411.11179]] Enhanced Anime Image Generation Using USE-CMHSA-GAN(https://arxiv.org/abs/2411.11179)
Keywords: extraction, generative
Abstract: With the growing popularity of ACG (Anime, Comics, and Games) culture, generating high-quality anime character images has become an important research topic. This paper introduces a novel Generative Adversarial Network model, USE-CMHSA-GAN, designed to produce high-quality anime character images. The model builds upon the traditional DCGAN framework, incorporating USE and CMHSA modules to enhance feature extraction capabilities for anime character images. Experiments were conducted on the anime-face-dataset, and the results demonstrate that USE-CMHSA-GAN outperforms other benchmark models, including DCGAN, VAE-GAN, and WGAN, in terms of FID and IS scores, indicating superior image quality. These findings suggest that USE-CMHSA-GAN is highly effective for anime character image generation and provides new insights for further improving the quality of generative models.

Title: AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers

Authors: Jake Grigsby, Justin Sasek, Samyak Parajuli, Daniel Adebi, Amy Zhang, Yuke Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11188
Pdf URL: https://arxiv.org/pdf/2411.11188
Copy Paste: [[2411.11188]] AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers(https://arxiv.org/abs/2411.11188)
Keywords: transformer
Abstract: Language models trained on diverse datasets unlock generalization by in-context learning. Reinforcement Learning (RL) policies can achieve a similar effect by meta-learning within the memory of a sequence model. However, meta-RL research primarily focuses on adapting to minor variations of a single task. It is difficult to scale towards more general behavior without confronting challenges in multi-task optimization, and few solutions are compatible with meta-RL's goal of learning from large training sets of unlabeled tasks. To address this challenge, we revisit the idea that multi-task RL is bottlenecked by imbalanced training losses created by uneven return scales across different tasks. We build upon recent advancements in Transformer-based (in-context) meta-RL and evaluate a simple yet scalable solution where both an agent's actor and critic objectives are converted to classification terms that decouple optimization from the current scale of returns. Large-scale comparisons in Meta-World ML45, Multi-Game Procgen, Multi-Task POPGym, Multi-Game Atari, and BabyAI find that this design unlocks significant progress in online multi-task adaptation and memory problems without explicit task labels.

Title: Careless Whisper: Exploiting Stealthy End-to-End Leakage in Mobile Instant Messengers

Authors: Gabriel K. Gegenhuber, Maximilian Günther, Markus Maier, Aljosha Judmayer, Florian Holzbauer, Philipp É. Frenzel, Johanna Ullrich
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2411.11194
Pdf URL: https://arxiv.org/pdf/2411.11194
Copy Paste: [[2411.11194]] Careless Whisper: Exploiting Stealthy End-to-End Leakage in Mobile Instant Messengers(https://arxiv.org/abs/2411.11194)
Keywords: security, privacy, attack, steal
Abstract: A majority of the global population relies on mobile instant messengers for personal and professional communication. Besides plain messaging, many services implement convenience features, such as delivery- and read receipts, informing a user when a message has successfully reached its target. Furthermore, they have widely adopted security and privacy improvements, such as end-to-end encryption. In this paper, we show that even when messages are sufficiently encrypted, private information about a user and their devices can still be extracted by an adversary. Using specifically crafted messages that stealthily trigger delivery receipts allows arbitrary users to be pinged without their knowledge or consent. We demonstrate how an attacker could extract private information, such as the number of user devices, their operating system, and their online- and activity status. Moreover, we show the feasibility of resource exhaustion attacks draining a user's battery or data allowance. Due to the widespread adoption of vulnerable messengers (WhatsApp and Signal), we show that over two billion customers can be targeted simply by knowing their phone number.

Title: SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach

Authors: Ruoxi Sun, Jiamin Chang, Hammond Pearce, Chaowei Xiao, Bo Li, Qi Wu, Surya Nepal, Minhui Xue
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11195
Pdf URL: https://arxiv.org/pdf/2411.11195
Copy Paste: [[2411.11195]] SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach(https://arxiv.org/abs/2411.11195)
Keywords: security, protect, defense, robust
Abstract: Multimodal foundation models (MFMs) represent a significant advancement in artificial intelligence, combining diverse data modalities to enhance learning and understanding across a wide range of applications. However, this integration also brings unique safety and security challenges. In this paper, we conceptualize cybersafety and cybersecurity in the context of multimodal learning and present a comprehensive Systematization of Knowledge (SoK) to unify these concepts in MFMs, identifying key threats to these models. We propose a taxonomy framework grounded in information theory, evaluating and categorizing threats through the concepts of channel capacity, signal, noise, and bandwidth. This approach provides a novel framework that unifies model safety and system security in MFMs, offering a more comprehensive and actionable understanding of the risks involved. We used this to explore existing defense mechanisms, and identified gaps in current research - particularly, a lack of protection for alignment between modalities and a need for more systematic defense methods. Our work contributes to a deeper understanding of the security and safety landscape in MFMs, providing researchers and practitioners with valuable insights for improving the robustness and reliability of these models.

Title: Stealing Training Graphs from Graph Neural Networks

Authors: Minhua Lin, Enyan Dai, Junjie Xu, Jinyuan Jia, Xiang Zhang, Suhang Wang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2411.11197
Pdf URL: https://arxiv.org/pdf/2411.11197
Copy Paste: [[2411.11197]] Stealing Training Graphs from Graph Neural Networks(https://arxiv.org/abs/2411.11197)
Keywords: steal, diffusion
Abstract: Graph Neural Networks (GNNs) have shown promising results in modeling graphs in various tasks. The training of GNNs, especially on specialized tasks such as bioinformatics, demands extensive expert annotations, which are expensive and usually contain sensitive information of data providers. The trained GNN models are often shared for deployment in the real world. As neural networks can memorize the training samples, the model parameters of GNNs have a high risk of leaking private training data. Our theoretical analysis shows the strong connections between trained GNN parameters and the training graphs used, confirming the training graph leakage issue. However, explorations into training data leakage from trained GNNs are rather limited. Therefore, we investigate a novel problem of stealing graphs from trained GNNs. To obtain high-quality graphs that resemble the target training set, a graph diffusion model with diffusion noise optimization is deployed as a graph generator. Furthermore, we propose a selection method that effectively leverages GNN model parameters to identify training graphs from samples generated by the graph diffusion model. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed framework in stealing training graphs from the trained GNN.

Title: Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies

Authors: Kealan Dunnett, Reza Arablouei, Dimity Miller, Volkan Dedeoglu, Raja Jurdak
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.11200
Pdf URL: https://arxiv.org/pdf/2411.11200
Copy Paste: [[2411.11200]] Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies(https://arxiv.org/abs/2411.11200)
Keywords: security, protect, attack, explainability
Abstract: The widespread adoption of deep learning across various industries has introduced substantial challenges, particularly in terms of model explainability and security. The inherent complexity of deep learning models, while contributing to their effectiveness, also renders them susceptible to adversarial attacks. Among these, backdoor attacks are especially concerning, as they involve surreptitiously embedding specific triggers within training data, causing the model to exhibit aberrant behavior when presented with input containing the triggers. Such attacks often exploit vulnerabilities in outsourced processes, compromising model integrity without affecting performance on clean (trigger-free) input data. In this paper, we present a comprehensive review of existing mitigation strategies designed to counter backdoor attacks in image recognition. We provide an in-depth analysis of the theoretical foundations, practical efficacy, and limitations of these approaches. In addition, we conduct an extensive benchmarking of sixteen state-of-the-art approaches against eight distinct backdoor attacks, utilizing three datasets, four model architectures, and three poisoning ratios. Our results, derived from 122,236 individual experiments, indicate that while many approaches provide some level of protection, their performance can vary considerably. Furthermore, when compared to two seminal approaches, most newer approaches do not demonstrate substantial improvements in overall performance or consistency across diverse settings. Drawing from these findings, we propose potential directions for developing more effective and generalizable defensive mechanisms in the future.

Title: Capturing Sparks of Abstraction for the ARC Challenge

Authors: Martin Andrews
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.11206
Pdf URL: https://arxiv.org/pdf/2411.11206
Copy Paste: [[2411.11206]] Capturing Sparks of Abstraction for the ARC Challenge(https://arxiv.org/abs/2411.11206)
Keywords: large language model
Abstract: Excellent progress has been made recently in solving ARC Challenge problems. However, it seems that new techniques may be required to push beyond 60% accuracy. Even commercial Large Language Models (LLMs) struggle to 'understand' many of the problems (when given the input and output grids), which makes discovering solutions by LLM-led program search somewhat futile. In this work, LLM 'understanding' is attempted from a stronger starting position: an LLM is given complete solutions to tasks in code, and then asked to explain how the task is being solved at various levels of abstraction. Specifically, the LLM was given code solutions implemented in arc-dsl-llm (an LLM-legible version of Hodel's arc-dsl) to obtain: (a) commented code; (b) code refactored into reusable functional chunks; (c) problem solution steps; and (d) high-level problem-solving tactics. We demonstrate that 'Sparks of Abstraction' can be extracted from the LLM output - in a form that could be used in downstream tasks with Local LLMs eligible to enter the ARC Prize. Both the arc-dsl-llm DSL framework (with the re-engineered solutions) and the Gemini LLM-generated data (along with the generation code) are made Open Source.

Title: Making Sigmoid-MSE Great Again: Output Reset Challenges Softmax Cross-Entropy in Neural Network Classification

Authors: Kanishka Tyagi, Chinmay Rane, Ketaki Vaidya, Jeshwanth Challgundla, Soumitro Swapan Auddy, Michael Manry
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2411.11213
Pdf URL: https://arxiv.org/pdf/2411.11213
Copy Paste: [[2411.11213]] Making Sigmoid-MSE Great Again: Output Reset Challenges Softmax Cross-Entropy in Neural Network Classification(https://arxiv.org/abs/2411.11213)
Keywords: robust
Abstract: This study presents a comparative analysis of two objective functions, Mean Squared Error (MSE) and Softmax Cross-Entropy (SCE) for neural network classification tasks. While SCE combined with softmax activation is the conventional choice for transforming network outputs into class probabilities, we explore an alternative approach using MSE with sigmoid activation. We introduce the Output Reset algorithm, which reduces inconsistent errors and enhances classifier robustness. Through extensive experiments on benchmark datasets (MNIST, CIFAR-10, and Fashion-MNIST), we demonstrate that MSE with sigmoid activation achieves comparable accuracy and convergence rates to SCE, while exhibiting superior performance in scenarios with noisy data. Our findings indicate that MSE, despite its traditional association with regression tasks, serves as a viable alternative for classification problems, challenging conventional wisdom about neural network training strategies.
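
Illustrative sketch (not taken from the paper): the two objectives being compared can be written side by side for a single example. The code computes softmax cross-entropy and sigmoid-plus-MSE against a one-hot target from the same logits. The function names and the toy logits are hypothetical, and the Output Reset algorithm is not implemented because the abstract does not describe its steps.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Negative log-probability of the target class under a softmax."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target_index]


def sigmoid_mse(logits, target_index):
    """Mean squared error between per-class sigmoid outputs and a one-hot target."""
    outputs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    targets = [1.0 if i == target_index else 0.0 for i in range(len(logits))]
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(logits)


# Hypothetical 3-class example with uninformative (all-zero) logits.
logits = [0.0, 0.0, 0.0]
sce = softmax_cross_entropy(logits, target_index=0)
mse = sigmoid_mse(logits, target_index=0)
print(round(sce, 4))  # 1.0986, i.e. log(3)
print(mse)            # 0.25, since every sigmoid output is 0.5
```

One visible difference is that the squared-error term is bounded for each output because sigmoid outputs stay in (0, 1), while cross-entropy grows without bound as the predicted probability of the true class approaches zero. This may be relevant to the noisy-data behavior the abstract reports, although the abstract itself does not state that mechanism.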

Title: DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery

Authors: Jaewoo Heo, George Hu, Zeyu Wang, Serena Yeung-Levy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11214
Pdf URL: https://arxiv.org/pdf/2411.11214
Copy Paste: [[2411.11214]] DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery(https://arxiv.org/abs/2411.11214)
Keywords: transformer
Abstract: Human Mesh Recovery (HMR) is an important yet challenging problem with applications across various domains including motion capture, augmented reality, and biomechanics. Accurately predicting human pose parameters from a single image remains a challenging 3D computer vision task. In this work, we introduce DeforHMR, a novel regression-based monocular HMR framework designed to enhance the prediction of human pose parameters using deformable attention transformers. DeforHMR leverages a novel query-agnostic deformable cross-attention mechanism within the transformer decoder to effectively regress the visual features extracted from a frozen pretrained vision transformer (ViT) encoder. The proposed deformable cross-attention mechanism allows the model to attend to relevant spatial features more flexibly and in a data-dependent manner. Equipped with a transformer decoder capable of spatially-nuanced attention, DeforHMR achieves state-of-the-art performance for single-frame regression-based methods on the widely used 3D HMR benchmarks 3DPW and RICH. By pushing the boundary on the field of 3D human mesh recovery through deformable attention, we introduce a new, effective paradigm for decoding local spatial information from large pretrained vision encoders in computer vision.

Title: Efficient Transfer Learning for Video-language Foundation Models

Authors: Haoxing Chen, Zizheng Huang, Yan Hong, Yanshuo Wang, Zhongcai Lyu, Zhuoer Xu, Jun Lan, Zhangxuan Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11223
Pdf URL: https://arxiv.org/pdf/2411.11223
Copy Paste: [[2411.11223]] Efficient Transfer Learning for Video-language Foundation Models(https://arxiv.org/abs/2411.11223)
Keywords: robust
Abstract: Pre-trained vision-language models provide a robust foundation for efficient transfer learning across various downstream tasks. In the field of video action recognition, mainstream approaches often introduce additional parameter modules to capture temporal information. While the increased model capacity brought by these additional parameters helps better fit the video-specific inductive biases, existing methods require learning a large number of parameters and are prone to catastrophic forgetting of the original generalizable knowledge. In this paper, we propose a simple yet effective Multi-modal Spatio-Temporal Adapter (MSTA) to improve the alignment between representations in the text and vision branches, achieving a balance between general knowledge and task-specific knowledge. Furthermore, to mitigate over-fitting and enhance generalizability, we introduce a spatio-temporal description-guided consistency constraint. This constraint involves feeding template inputs (i.e., "a video of {cls}") into the trainable language branch, while LLM-generated spatio-temporal descriptions are input into the pre-trained language branch, enforcing consistency between the outputs of the two branches. This mechanism prevents over-fitting to downstream tasks and improves the distinguishability of the trainable branch within the spatio-temporal semantic space. We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning. Compared to many state-of-the-art methods, our MSTA achieves outstanding performance across all evaluations, while using only 2-7% of the trainable parameters in the original model. Code will be available at this https URL.

Title: MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis

Authors: Yingjie Zhou, Zicheng Zhang, Jiezhang Cao, Jun Jia, Yanwei Jiang, Farong Wen, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11235
Pdf URL: https://arxiv.org/pdf/2411.11235
Copy Paste: [[2411.11235]] MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis(https://arxiv.org/abs/2411.11235)
Keywords: generative, large language model
Abstract: Artificial Intelligence (AI) has demonstrated significant capabilities in various fields, and in areas such as human-computer interaction (HCI), embodied intelligence, and the design and animation of virtual digital humans, both practitioners and users are increasingly concerned with AI's ability to understand and express emotion. Consequently, the question of whether AI can accurately interpret human emotions remains a critical challenge. To date, two primary classes of AI models have been involved in human emotion analysis: generative models and Multimodal Large Language Models (MLLMs). To assess the emotional capabilities of these two classes of models, this study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions, generated by 12 Text-to-Image (T2I) models. Unlike previous works, MEMO-Bench provides a framework for evaluating both T2I models and MLLMs in the context of sentiment analysis. Additionally, a progressive evaluation approach is employed, moving from coarse-grained to fine-grained metrics, to offer a more detailed and comprehensive assessment of the sentiment analysis capabilities of MLLMs. The experimental results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones. Meanwhile, although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy, particularly in fine-grained emotion analysis. The MEMO-Bench will be made publicly available to support further research in this area.

Title: ZeFaV: Boosting Large Language Models for Zero-shot Fact Verification

Authors: Son T. Luu, Hiep Nguyen, Trung Vo, Le-Minh Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11247
Pdf URL: https://arxiv.org/pdf/2411.11247
Copy Paste: [[2411.11247]] ZeFaV: Boosting Large Language Models for Zero-shot Fact Verification(https://arxiv.org/abs/2411.11247)
Keywords: large language model
Abstract: In this paper, we propose ZeFaV, a zero-shot fact verification framework that enhances the performance of large language models on the fact verification task. ZeFaV leverages the in-context learning ability of large language models to extract the relations among the entities within a claim, reorganize the information from the evidence into a relationally logical form, and combine this information with the original evidence to generate the context from which our fact-checking model provides verdicts for the input claims. We conducted empirical experiments to evaluate our approach on two multi-hop fact-checking datasets, HoVer and FEVEROUS, and achieved promising results comparable to other state-of-the-art fact verification methods.

Title: EXCON: Extreme Instance-based Contrastive Representation Learning of Severely Imbalanced Multivariate Time Series for Solar Flare Prediction

Authors: Onur Vural, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11249
Pdf URL: https://arxiv.org/pdf/2411.11249
Copy Paste: [[2411.11249]] EXCON: Extreme Instance-based Contrastive Representation Learning of Severely Imbalanced Multivariate Time Series for Solar Flare Prediction(https://arxiv.org/abs/2411.11249)
Keywords: robust
Abstract: In heliophysics research, predicting solar flares is crucial due to their potential to impact both space-based systems and Earth's infrastructure substantially. Magnetic field data from solar active regions, recorded by solar imaging observatories, are transformed into multivariate time series to enable solar flare prediction using temporal window-based analysis. In the realm of multivariate time series-driven solar flare prediction, addressing severe class imbalance with effective strategies for multivariate time series representation learning is key to developing robust predictive models. Traditional methods often struggle with overfitting to the majority class in prediction tasks where major solar flares are infrequent. This work presents EXCON, a contrastive representation learning framework designed to enhance classification performance amidst such imbalances. EXCON operates through four stages: obtaining core features from multivariate time series data; selecting distinctive contrastive representations for each class to maximize inter-class separation; training a temporal feature embedding module with a custom extreme reconstruction loss to minimize intra-class variation; and applying a classifier to the learned embeddings for robust classification. The proposed method leverages contrastive learning principles to map similar instances closer in the feature space while distancing dissimilar ones, a strategy not extensively explored in solar flare prediction tasks. This approach not only addresses class imbalance but also offers a versatile solution applicable to univariate and multivariate time series across binary and multiclass classification problems. Experimental results, including evaluations on the benchmark solar flare dataset and multiple time series archive datasets with binary and multiclass labels, demonstrate EXCON's efficacy in enhancing classification performance.

Title: Large corpora and large language models: a replicable method for automating grammatical annotation

Authors: Cameron Morin, Matti Marttinen Larsson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11260
Pdf URL: https://arxiv.org/pdf/2411.11260
Copy Paste: [[2411.11260]] Large corpora and large language models: a replicable method for automating grammatical annotation(https://arxiv.org/abs/2411.11260)
Keywords: large language model
Abstract: Much linguistic research relies on annotated datasets of features extracted from text corpora, but the rapid quantitative growth of these corpora has created practical difficulties for linguists to manually annotate large data samples. In this paper, we present a replicable, supervised method that leverages large language models for assisting the linguist in grammatical annotation through prompt engineering, training, and evaluation. We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y', based on the large language model Claude 3.5 Sonnet and corpus data from Davies' NOW and EnTenTen21 (SketchEngine). Overall, we reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data, validating the method for the annotation of very large quantities of tokens of the construction in the future. We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change, underlining the value of AI copilots as tools for future linguistic research.

Title: VersaTune: Fine-Tuning Multi-Ability LLMs Efficiently

Authors: Keer Lu, Keshi Zhao, Zheng Liang, Da Pan, Shusen Zhang, Xin Wu, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11266
Pdf URL: https://arxiv.org/pdf/2411.11266
Copy Paste: [[2411.11266]] VersaTune: Fine-Tuning Multi-Ability LLMs Efficiently(https://arxiv.org/abs/2411.11266)
Keywords: large language model
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities in handling multiple tasks across domains due to their emergent properties. These capabilities are further augmented during the Supervised Fine-Tuning (SFT) phase. Despite their potential, existing work mainly focuses on domain-specific enhancements during fine-tuning, the challenge of which lies in catastrophic forgetting of knowledge across other domains. In this study, we introduce VersaTune, a novel data composition framework designed for enhancing LLMs' overall multi-ability performances during fine-tuning. We categorize knowledge into distinct domains including law, medicine, finance, science, code. We begin with detecting the distribution of domain-specific knowledge within the base model, followed by the composition of training data that aligns with the model's existing knowledge distribution. During the fine-tuning process, weights of different domains are dynamically adjusted based on their learnable potential and forgetting degree. Experimental results demonstrate that VersaTune achieves significant improvements in multi-domain performance, with a 35.21% enhancement in comprehensive multi-domain tasks. Additionally, in scenarios where specific domain optimization is required, VersaTune reduces the degradation of performance in other domains by 38.77%, without compromising the target domain's training efficacy.

Title: Zero-Shot Automatic Annotation and Instance Segmentation using LLM-Generated Datasets: Eliminating Field Imaging and Manual Annotation for Deep Learning Model Development

Authors: Ranjan Sapkota, Achyut Paudel, Manoj Karkee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11285
Pdf URL: https://arxiv.org/pdf/2411.11285
Copy Paste: [[2411.11285]] Zero-Shot Automatic Annotation and Instance Segmentation using LLM-Generated Datasets: Eliminating Field Imaging and Manual Annotation for Deep Learning Model Development(https://arxiv.org/abs/2411.11285)
Keywords: large language model, segmentation
Abstract: Currently, deep learning-based instance segmentation for various applications (e.g., Agriculture) is predominantly performed using a labor-intensive process involving extensive field data collection using sophisticated sensors, followed by careful manual annotation of images, presenting significant logistical and financial challenges to researchers and organizations. The process also slows down the model development and training process. In this study, we presented a novel method for deep learning-based instance segmentation of apples in commercial orchards that eliminates the need for labor-intensive field data collection and manual annotation. Utilizing a Large Language Model (LLM), we synthetically generated orchard images and automatically annotated them using the Segment Anything Model (SAM) integrated with a YOLO11 base model. This method significantly reduces reliance on physical sensors and manual data processing, presenting a major advancement in "Agricultural AI". The synthetic, auto-annotated dataset was used to train the YOLO11 model for Apple instance segmentation, which was then validated on real orchard images. The results showed that the automatically generated annotations achieved a Dice Coefficient of 0.9513 and an IoU of 0.9303, validating the accuracy and overlap of the mask annotations. All YOLO11 configurations, trained solely on these synthetic datasets with automated annotations, accurately recognized and delineated apples, highlighting the method's efficacy. Specifically, the YOLO11m-seg configuration achieved a mask precision of 0.902 and a mask mAP@50 of 0.833 on test images collected from a commercial orchard. Additionally, the YOLO11l-seg configuration outperformed other models in validation on 40 LLM-generated images, achieving the highest mask precision and mAP@50 metrics. Keywords: YOLO, SAM, SAMv2, YOLO11, YOLOv11, Segment Anything, YOLO-SAM
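
The Dice coefficient and IoU reported above are standard overlap metrics for segmentation masks. As a minimal sketch of how they are commonly computed for a single pair of binary masks (the function name, the toy masks, and the empty-mask convention are illustrative assumptions, not taken from the paper, whose exact evaluation protocol is not described here):

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equally sized binary masks (nested lists of 0/1)."""
    inter = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += p & t        # pixel is foreground in both masks
            pred_area += p
            truth_area += t
    union = pred_area + truth_area - inter
    if union == 0:
        # Both masks are empty: treated as perfect agreement (an assumed convention).
        return 1.0, 1.0
    dice = 2 * inter / (pred_area + truth_area)
    iou = inter / union
    return dice, iou


# Hypothetical 3x3 masks: the prediction covers one pixel more than the ground truth.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
truth = [[0, 1, 1],
         [0, 1, 0],
         [0, 0, 0]]
dice, iou = dice_and_iou(pred, truth)
print(f"Dice={dice:.4f} IoU={iou:.4f}")
```

For a single mask pair Dice is never lower than IoU, which is why the two figures are usually reported together.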

Title: Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets, Techniques and Applications

Authors: Scarlett Raine, Frederic Maire, Niko Suenderhauf, Tobias Fischer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11287
Pdf URL: https://arxiv.org/pdf/2411.11287
Copy Paste: [[2411.11287]] Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets, Techniques and Applications(https://arxiv.org/abs/2411.11287)
Keywords: segmentation
Abstract: Underwater surveys provide long-term data for informing management strategies, monitoring coral reef health, and estimating blue carbon stocks. Advances in broad-scale survey methods, such as robotic underwater vehicles, have increased the range of marine surveys but generate large volumes of imagery requiring analysis. Computer vision methods such as semantic segmentation aid automated image analysis, but typically rely on fully supervised training with extensive labelled data. While ground truth label masks for tasks like street scene segmentation can be quickly and affordably generated by non-experts through crowdsourcing services like Amazon Mechanical Turk, ecology presents greater challenges. The complexity of underwater images, coupled with the specialist expertise needed to accurately identify species at the pixel level, makes this process costly, time-consuming, and heavily dependent on domain experts. In recent years, some works have performed automated analysis of underwater imagery, and a smaller number of studies have focused on weakly supervised approaches which aim to reduce the expert-provided labelled data required. This survey focuses on approaches which reduce dependency on human expert input, while reviewing the prior and related approaches to position these works in the wider field of underwater perception. Further, we offer an overview of coastal ecosystems and the challenges of underwater imagery. We provide background on weakly and self-supervised deep learning and integrate these elements into a taxonomy that centres on the intersection of underwater monitoring, computer vision, and deep learning, while motivating approaches for weakly supervised deep learning with reduced dependency on domain expert data annotations. Lastly, the survey examines available datasets and platforms, and identifies gaps, barriers, and opportunities for automating underwater surveys.

Title: Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition

Authors: Yang Chen, Jingcai Guo, Song Guo, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11288
Pdf URL: https://arxiv.org/pdf/2411.11288
Copy Paste: [[2411.11288]] Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition(https://arxiv.org/abs/2411.11288)
Keywords: robust
Abstract: Zero-shot skeleton action recognition is a non-trivial task that requires robust unseen generalization with prior knowledge from only seen classes and shared semantics. Existing methods typically build the skeleton-semantics interactions by uncontrollable mappings and conspicuous representations, and thus can hardly capture the intricate and fine-grained relationship for effective cross-modal transferability. To address these issues, we propose a novel dyNamically Evolving dUal skeleton-semantic syneRgistic framework with the guidance of cOntext-aware side informatioN (dubbed Neuron), to explore more fine-grained cross-modal correspondence from micro to macro perspectives at both spatial and temporal levels, respectively. Concretely, 1) we first construct the spatial-temporal evolving micro-prototypes and integrate dynamic context-aware side information to capture the intricate and synergistic skeleton-semantic correlations step-by-step, progressively refining cross-modal alignment; and 2) we introduce the spatial compression and temporal memory mechanisms to guide the growth of spatial-temporal micro-prototypes, enabling them to absorb structure-related spatial representations and regularity-dependent temporal patterns. Notably, such processes are analogous to the learning and growth of neurons, equipping the framework with the capacity to generalize to novel unseen action categories. Extensive experiments on various benchmark datasets demonstrated the superiority of the proposed method.

Title: LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models

Authors: Yungi Kim, Hyunsoo Ha, Seonghoon Yang, Sukyung Lee, Jihoo Kim, Chanjun Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11289
Pdf URL: https://arxiv.org/pdf/2411.11289
Copy Paste: [[2411.11289]] LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models(https://arxiv.org/abs/2411.11289)
Keywords: extraction, large language model
Abstract: Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits accessibility for organizations lacking significant computational infrastructure. To address this issue, we introduce the Lightweight, Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs to streamline the processes of dataset extraction, filtering, and curation. Based on our four core principles, the LP Data Pipeline significantly reduces preparation time and cost while maintaining high data quality. Importantly, our pipeline enables the creation of purpose-driven datasets tailored to specific domains and languages, enhancing the applicability of LLMs in specialized contexts. We anticipate that our pipeline will lower the barriers to LLM development, enabling a wide range of organizations to access LLMs more easily.

Title: SADDE: Semi-supervised Anomaly Detection with Dependable Explanations

Authors: Yachao Yuan, Yu Huang, Yali Yuan, Jin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11293
Pdf URL: https://arxiv.org/pdf/2411.11293
Copy Paste: [[2411.11293]] SADDE: Semi-supervised Anomaly Detection with Dependable Explanations(https://arxiv.org/abs/2411.11293)
Keywords: security, interpretability
Abstract: Semi-supervised learning holds a pivotal position in anomaly detection applications, yet identifying anomaly patterns with a limited number of labeled samples poses a significant challenge. Furthermore, the absence of interpretability poses major obstacles to the practical adoption of semi-supervised frameworks. The majority of existing interpretation techniques are tailored for supervised/unsupervised frameworks or non-security domains, falling short in providing dependable interpretations. In this research paper, we introduce SADDE, a general framework designed to accomplish two primary objectives: (1) to render the anomaly detection process interpretable and enhance the credibility of interpretation outcomes, and (2) to assign high-confidence pseudo labels to unlabeled samples, thereby boosting the performance of anomaly detection systems when supervised data is scarce. To achieve the first objective, we devise a cutting-edge interpretation method that utilizes both global and local interpreters to furnish trustworthy explanations. For the second objective, we conceptualize a novel two-stage semi-supervised learning framework tailored for network anomaly detection, ensuring that the model predictions of both stages align with specific constraints. We apply SADDE to two illustrative network anomaly detection tasks and conduct extensive evaluations in comparison with notable prior works. The experimental findings underscore that SADDE is capable of delivering precise detection results alongside dependable interpretations for semi-supervised network anomaly detection systems. The source code for SADDE is accessible at: this https URL.

Title: Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation

Authors: Peng Shu, Junhao Chen, Zhengliang Liu, Hui Wang, Zihao Wu, Tianyang Zhong, Yiwei Li, Huaqin Zhao, Hanqi Jiang, Yi Pan, Yifan Zhou, Constance Owl, Xiaoming Zhai, Ninghao Liu, Claudio Saunt, Tianming Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11295
Pdf URL: https://arxiv.org/pdf/2411.11295
Copy Paste: [[2411.11295]] Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation(https://arxiv.org/abs/2411.11295)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across a wide range of tasks and domains. However, their performance in low-resource language translation, particularly when translating into these languages, remains underexplored. This gap poses significant challenges, as linguistic barriers hinder the cultural preservation and development of minority communities. To address this issue, this paper introduces a novel retrieval-based method that enhances translation quality for low-resource languages by focusing on key terms, which involves translating keywords and retrieving corresponding examples from existing data. To evaluate the effectiveness of this method, we conducted experiments translating from English into three low-resource languages: Cherokee, a critically endangered indigenous language of North America; Tibetan, a historically and culturally significant language in Asia; and Manchu, a language with few remaining speakers. Our comparison with the zero-shot performance of GPT-4o and LLaMA 3.1 405B highlights the significant challenges these models face when translating into low-resource languages. In contrast, our retrieval-based method shows promise in improving both word-level accuracy and overall semantic understanding by leveraging existing resources more effectively.

Title: Steering Language Model Refusal with Sparse Autoencoders

Authors: Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11296
Pdf URL: https://arxiv.org/pdf/2411.11296
Copy Paste: [[2411.11296]] Steering Language Model Refusal with Sparse Autoencoders(https://arxiv.org/abs/2411.11296)
Keywords: attack, robust
Abstract: Responsible practices for deploying language models include guiding models to recognize and refuse answering prompts that are considered unsafe, while complying with safe prompts. Achieving such behavior typically requires updating model weights, which is costly and inflexible. We explore opportunities to steer model activations at inference time, which does not require updating weights. Using sparse autoencoders, we identify and steer features in Phi-3 Mini that mediate refusal behavior. We find that feature steering can improve Phi-3 Mini's robustness to jailbreak attempts across various harms, including challenging multi-turn attacks. However, we discover that feature steering can adversely affect overall performance on benchmarks. These results suggest that identifying steerable mechanisms for refusal via sparse autoencoders is a promising approach for enhancing language model safety, but that more research is needed to mitigate feature steering's adverse effects on performance.
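
Steering activations at inference time can be pictured as nudging a hidden activation vector along a chosen feature direction while the weights stay fixed. The sketch below shows one simple form of that idea, adding a scaled unit direction to an activation. It is an illustrative assumption rather than the procedure used in the paper: the function name `steer`, the toy vectors, and the additive rule are all hypothetical, and a real sparse-autoencoder setup would take the direction from a learned feature.

```python
import math


def steer(activation, direction, strength):
    """Shift `activation` by `strength` along the unit-normalised `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    # Model weights are untouched; only this activation vector changes.
    return [a + strength * d / norm for a, d in zip(activation, direction)]


# Hypothetical 2-dimensional activation and feature direction.
hidden = [1.0, 2.0]
refusal_direction = [3.0, 4.0]
print(steer(hidden, refusal_direction, strength=5.0))
```

A larger `strength` moves the activation further from where the unmodified model would have put it, which is one intuition for why steering may also affect performance on unrelated benchmarks.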

Title: Toward Personalized Federated Node Classification in One-shot Communication

Authors: Guochen Yan, Xunkai Li, Luyuan Xie, Wentao Zhang, Qingni Shen, Yuejian Fang, Zhonghai Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11304
Pdf URL: https://arxiv.org/pdf/2411.11304
Copy Paste: [[2411.11304]] Toward Personalized Federated Node Classification in One-shot Communication(https://arxiv.org/abs/2411.11304)
Keywords: secure, security, federate
Abstract: Federated Graph Learning (FGL) has become a promising paradigm for collaborative training with distributed and private graph data. One-shot Federated Learning (OFL) enables collaboration in a single communication round to largely reduce communication costs and potential security concerns. However, existing OFL methods are not designed for graph data and existing FGL methods are ineffective within one communication round under both data and model heterogeneity. To bridge this gap, we are the first to propose a one-shot personalized federated graph learning method for node classification, which is also compatible with the Secure Aggregation scheme. We estimate and aggregate the statistics of class-wise feature distribution to generate a global pseudo-graph on the server, which could be used to train a global graph model. Furthermore, we reveal the under-explored problem of existing personalized FGL methods that their personalized models are biased and neglect the ability to generalize to minorities. To achieve better personalization and generalization simultaneously, we propose a two-stage personalized training to adaptively utilize the personal information from local data and global information from the global pseudo-graph. Comprehensive experiments on 8 multi-scale graph datasets under different partitions with various settings demonstrate our superior performance over state-of-the-art baselines.
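
To make the idea of aggregating class-wise feature statistics in a single round more concrete, the sketch below has each client report, per class, a node count and a feature sum, which the server combines into global class means. This is an illustrative assumption only: the abstract does not say which statistics are exchanged, and the function name and data layout are hypothetical.

```python
def aggregate_class_means(client_stats):
    """Combine per-client {label: (count, feature_sum)} reports into global class means."""
    totals = {}
    for stats in client_stats:
        for label, (count, feat_sum) in stats.items():
            if label not in totals:
                totals[label] = [0, [0.0] * len(feat_sum)]
            totals[label][0] += count
            totals[label][1] = [s + x for s, x in zip(totals[label][1], feat_sum)]
    # Counts and sums are additive, so the server only ever needs their totals.
    return {
        label: [s / n for s in sums]
        for label, (n, sums) in totals.items()
        if n > 0
    }


# Two hypothetical clients with 2-dimensional node features.
client_a = {0: (2, [2.0, 4.0])}
client_b = {0: (2, [6.0, 0.0]), 1: (1, [1.0, 1.0])}
print(aggregate_class_means([client_a, client_b]))
```

Because only sums are combined, statistics of this shape are the kind of quantity a secure aggregation protocol can total without the server seeing any individual client's report.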

Title: TP-UNet: Temporal Prompt Guided UNet for Medical Image Segmentation

Authors: Ranmin Wang, Limin Zhuang, Hongkun Chen, Boyan Xu, Ruichu Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11305
Pdf URL: https://arxiv.org/pdf/2411.11305
Copy Paste: [[2411.11305]] TP-UNet: Temporal Prompt Guided UNet for Medical Image Segmentation(https://arxiv.org/abs/2411.11305)
Keywords: segmentation
Abstract: The advancement of medical image segmentation techniques has been propelled by the adoption of deep learning techniques, particularly UNet-based approaches, which exploit semantic information to improve the accuracy of segmentations. However, the order of organs in scanned images has been disregarded by current medical image segmentation approaches based on UNet. Furthermore, the inherent network structure of UNet does not provide direct capabilities for integrating temporal information. To efficiently integrate temporal information, we propose TP-UNet that utilizes temporal prompts, encompassing organ-construction relationships, to guide the segmentation UNet model. Specifically, our framework features cross-attention and semantic alignment based on unsupervised contrastive learning to combine temporal prompts and image features effectively. Extensive evaluations on two medical image segmentation datasets demonstrate the state-of-the-art performance of TP-UNet. Our implementation will be open-sourced after acceptance.

Title: A Review on Machine Unlearning

Authors: Haibo Zhang, Toru Nakamura, Takamasa Isohara, Kouichi Sakurai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11315
Pdf URL: https://arxiv.org/pdf/2411.11315
Copy Paste: [[2411.11315]] A Review on Machine Unlearning(https://arxiv.org/abs/2411.11315)
Keywords: security, privacy, protect
Abstract: Recently, an increasing number of laws have governed the useability of users' privacy. For example, Article 17 of the General Data Protection Regulation (GDPR), the right to be forgotten, requires machine learning applications to remove a portion of data from a dataset and retrain it if the user makes such a request. Furthermore, from the security perspective, training data for machine learning models, i.e., data that may contain user privacy, should be effectively protected, including appropriate erasure. Therefore, researchers propose various privacy-preserving methods to deal with such issues as machine unlearning. This paper provides an in-depth review of the security and privacy concerns in machine learning models. First, we present how machine learning can use users' private data in daily life and the role that the GDPR plays in this problem. Then, we introduce the concept of machine unlearning by describing the security threats in machine learning models and how to protect users' privacy from being violated using machine learning platforms. As the core content of the paper, we introduce and analyze current machine unlearning approaches and several representative research results and discuss them in the context of the data lineage. Furthermore, we also discuss the future research challenges in this field.

Title: Establishing Minimum Elements for Effective Vulnerability Management in AI Software

Authors: Mohamad Fazelnia, Sara Moshtari, Mehdi Mirakhorli
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11317
Pdf URL: https://arxiv.org/pdf/2411.11317
Copy Paste: [[2411.11317]] Establishing Minimum Elements for Effective Vulnerability Management in AI Software(https://arxiv.org/abs/2411.11317)
Keywords: secure, robust
Abstract: In the rapidly evolving field of artificial intelligence (AI), the identification, documentation, and mitigation of vulnerabilities are paramount to ensuring robust and secure systems. This paper discusses the minimum elements for AI vulnerability management and the establishment of an Artificial Intelligence Vulnerability Database (AIVD). It presents standardized formats and protocols for disclosing, analyzing, cataloging, and documenting AI vulnerabilities. It discusses how such an AI incident database must extend beyond the traditional scope of vulnerabilities by focusing on the unique aspects of AI systems. Additionally, this paper highlights challenges and gaps in AI Vulnerability Management, including the need for new severity scores, weakness enumeration systems, and comprehensive mitigation strategies specifically designed to address the multifaceted nature of AI vulnerabilities.

Title: Enhancing Decision Transformer with Diffusion-Based Trajectory Branch Generation

Authors: Zhihong Liu, Long Qian, Zeyang Liu, Lipeng Wan, Xingyu Chen, Xuguang Lan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11327
Pdf URL: https://arxiv.org/pdf/2411.11327
Copy Paste: [[2411.11327]] Enhancing Decision Transformer with Diffusion-Based Trajectory Branch Generation(https://arxiv.org/abs/2411.11327)
Keywords: diffusion, transformer
Abstract: Decision Transformer (DT) can learn effective policy from offline datasets by converting the offline reinforcement learning (RL) into a supervised sequence modeling task, where the trajectory elements are generated auto-regressively conditioned on the return-to-go (RTG). However, the sequence modeling learning approach tends to learn policies that converge on the sub-optimal trajectories within the dataset, for lack of bridging data to move to better trajectories, even if the condition is set to the highest RTG. To address this issue, we introduce Diffusion-Based Trajectory Branch Generation (BG), which expands the trajectories of the dataset with branches generated by a diffusion model. The trajectory branch is generated based on the segment of the trajectory within the dataset, and leads to trajectories with higher returns. We concatenate the generated branch with the trajectory segment as an expansion of the dataset. After expanding, DT has more opportunities to learn policies to move to better trajectories, preventing it from converging to the sub-optimal trajectories. After processing with BG, DT outperforms state-of-the-art sequence modeling methods on D4RL benchmark, demonstrating the effectiveness of adding branches to the dataset without further modifications.
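
The return-to-go (RTG) that conditions a Decision Transformer is, at each timestep, the sum of rewards from that step to the end of the trajectory. A minimal sketch of that computation follows; the function name and the example rewards are illustrative, and the undiscounted sum is the usual Decision Transformer convention rather than a detail stated in this abstract.

```python
def returns_to_go(rewards):
    """Return the undiscounted return-to-go for every timestep of one trajectory."""
    rtg = []
    running = 0.0
    # Accumulate from the final step backwards so each entry covers all later rewards.
    for reward in reversed(rewards):
        running += reward
        rtg.append(running)
    rtg.reverse()
    return rtg


# Hypothetical 4-step trajectory.
print(returns_to_go([1.0, 0.0, 2.0, 3.0]))  # [6.0, 5.0, 5.0, 3.0]
```

The first entry equals the total return of the trajectory, which is the quantity a user raises when asking the model for better behaviour.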

Title: Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge

Authors: Qinglong Cao, Ding Wang, Xirui Li, Yuntian Chen, Chao Ma, Xiaokang Yang
Subjects: cs.CV, stat.AP
Abstract URL: https://arxiv.org/abs/2411.11343
Pdf URL: https://arxiv.org/pdf/2411.11343
Copy Paste: [[2411.11343]] Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge(https://arxiv.org/abs/2411.11343)
Keywords: diffusion
Abstract: Video diffusion models have exhibited tremendous progress in various video generation tasks. However, existing models struggle to capture latent physical knowledge, failing to infer physical phenomena that are challenging to articulate with natural language. Generating videos following the fundamental physical laws is still an open challenge. To address this challenge, we propose a novel method to teach video diffusion models with latent physical phenomenon knowledge, enabling the accurate generation of physically informed phenomena. Specifically, we first pretrain Masked Autoencoders (MAE) to reconstruct the physical phenomena, resulting in output embeddings that encapsulate latent physical phenomenon knowledge. Leveraging these embeddings, we could generate the pseudo-language prompt features based on the aligned spatial relationships between CLIP vision and language encoders. Particularly, given that diffusion models typically use CLIP's language encoder for text prompt embeddings, our approach integrates the CLIP visual features informed by latent physical knowledge into a quaternion hidden space. This enables the modeling of spatial relationships to produce physical knowledge-informed pseudo-language prompts. By incorporating these prompt features and fine-tuning the video diffusion model in a parameter-efficient manner, the physical knowledge-informed videos are successfully generated. We validate our method extensively through both numerical simulations and real-world observations of physical phenomena, demonstrating its remarkable performance across diverse scenarios.

Title: Zero-Shot Load Forecasting with Large Language Models

Authors: Wenlong Liao, Zhe Yang, Mengshuo Jia, Christian Rehtanz, Jiannong Fang, Fernando Porté-Agel
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.11350
Pdf URL: https://arxiv.org/pdf/2411.11350
Copy Paste: [[2411.11350]] Zero-Shot Load Forecasting with Large Language Models(https://arxiv.org/abs/2411.11350)
Keywords: large language model
Abstract: Deep learning models have shown strong performance in load forecasting, but they generally require large amounts of data for model training before being applied to new scenarios, which limits their effectiveness in data-scarce scenarios. Inspired by the great success of pre-trained language models (LLMs) in natural language processing, this paper proposes a zero-shot load forecasting approach using an advanced LLM framework denoted as the Chronos model. By utilizing its extensive pre-trained knowledge, the Chronos model enables accurate load forecasting in data-scarce scenarios without the need for extensive data-specific training. Simulation results across five real-world datasets demonstrate that the Chronos model significantly outperforms nine popular baseline models for both deterministic and probabilistic load forecasting with various forecast horizons (e.g., 1 to 48 hours), even though the Chronos model is neither tailored nor fine-tuned to these specific load datasets. Notably, Chronos reduces root mean squared error (RMSE), continuous ranked probability score (CRPS), and quantile score (QS) by approximately 7.34%-84.30%, 19.63%-60.06%, and 22.83%-54.49%, respectively, compared to baseline models. These results highlight the superiority and flexibility of the Chronos model, positioning it as an effective solution in data-scarce scenarios.
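
Two of the metrics named above can be sketched briefly: RMSE for deterministic forecasts and the quantile score, commonly defined as the pinball loss, for probabilistic ones. The code uses the textbook definitions; the function names and the toy load values are illustrative assumptions, and the paper's exact evaluation setup may differ.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def pinball_loss(actual, quantile_forecast, q):
    """Average pinball (quantile) loss of a forecast for quantile level q in (0, 1)."""
    total = 0.0
    for a, f in zip(actual, quantile_forecast):
        diff = a - f
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(actual)


# Hypothetical hourly loads and forecasts.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(pinball_loss([10.0, 20.0], [12.0, 18.0], q=0.9))
```

Lower is better for both metrics, so the percentage reductions quoted in the abstract correspond to improvements.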

Title: Visual-Semantic Graph Matching Net for Zero-Shot Learning

Authors: Bowen Duan, Shiming Chen, Yufei Guo, Guo-Sen Xie, Weiping Ding, Yisong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11351
Pdf URL: https://arxiv.org/pdf/2411.11351
Copy Paste: [[2411.11351]] Visual-Semantic Graph Matching Net for Zero-Shot Learning(https://arxiv.org/abs/2411.11351)
Keywords: robust
Abstract: Zero-shot learning (ZSL) aims to leverage additional semantic information to recognize unseen classes. To transfer knowledge from seen to unseen classes, most ZSL methods learn a shared embedding space by simply aligning visual embeddings with semantic prototypes. However, methods trained under this paradigm often struggle to learn a robust embedding space because they align the two modalities in an isolated manner among classes, which ignores the crucial class relationship during the alignment process. To address the aforementioned challenges, this paper proposes a Visual-Semantic Graph Matching Net, termed VSGMN, which leverages semantic relationships among classes to aid in visual-semantic embedding. VSGMN employs a Graph Build Network (GBN) and a Graph Matching Network (GMN) to achieve two-stage visual-semantic alignment. Specifically, GBN first utilizes an embedding-based approach to build visual and semantic graphs in the semantic space and align the embedding with its prototype for first-stage alignment. Additionally, to supplement unseen class relations in these graphs, GBN also builds the unseen class nodes based on semantic relationships. In the second stage, GMN continuously integrates neighbor and cross-graph information into the constructed graph nodes, and aligns the node relationships between the two graphs under the class relationship constraint. Extensive experiments on three benchmark datasets demonstrate that VSGMN achieves superior performance in both conventional and generalized ZSL scenarios. The implementation of our VSGMN and experimental results are available at github: this https URL

Title: CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset

Authors: Zhiming Wang, Mingze Wang, Sheng Xu, Yanjing Li, Baochang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11360
Pdf URL: https://arxiv.org/pdf/2411.11360
Copy Paste: [[2411.11360]] CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset(https://arxiv.org/abs/2411.11360)
Keywords: large language model
Abstract: Remote Sensing Image Change Captioning (RSICC) aims to generate natural language descriptions of surface changes between multi-temporal remote sensing images, detailing the categories, locations, and dynamics of changed objects (e.g., additions or disappearances). Many current methods attempt to leverage the long-sequence understanding and reasoning capabilities of multimodal large language models (MLLMs) for this task. However, without comprehensive data support, these approaches often alter the essential feature transmission pathways of MLLMs, disrupting the intrinsic knowledge within the models and limiting their potential in RSICC. In this paper, we propose a novel model, CCExpert, based on a new, advanced multimodal large model framework. Firstly, we design a difference-aware integration module to capture multi-scale differences between bi-temporal images and incorporate them into the original image context, thereby enhancing the signal-to-noise ratio of differential features. Secondly, we construct a high-quality, diversified dataset called CC-Foundation, containing 200,000 image pairs and 1.2 million captions, to provide substantial data support for continued pretraining in this domain. Lastly, we employ a three-stage progressive training process to ensure the deep integration of the difference-aware integration module with the pretrained MLLM. CCExpert achieved a notable performance of $S^*_m=81.80$ on the LEVIR-CC benchmark, significantly surpassing previous state-of-the-art methods. The code and part of the dataset will soon be open-sourced at this https URL.

Title: MAIRA-Seg: Enhancing Radiology Report Generation with Segmentation-Aware Multimodal Large Language Models

Authors: Harshita Sharma, Valentina Salvatelli, Shaury Srivastav, Kenza Bouzid, Shruthi Bannur, Daniel C. Castro, Maximilian Ilse, Sam Bond-Taylor, Mercy Prasanna Ranjit, Fabian Falck, Fernando Pérez-García, Anton Schwaighofer, Hannah Richardson, Maria Teodora Wetscherek, Stephanie L. Hyland, Javier Alvarez-Valle
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.11362
Pdf URL: https://arxiv.org/pdf/2411.11362
Copy Paste: [[2411.11362]] MAIRA-Seg: Enhancing Radiology Report Generation with Segmentation-Aware Multimodal Large Language Models(https://arxiv.org/abs/2411.11362)
Keywords: large language model, segmentation
Abstract: There is growing interest in applying AI to radiology report generation, particularly for chest X-rays (CXRs). This paper investigates whether incorporating pixel-level information through segmentation masks can improve fine-grained image interpretation of multimodal large language models (MLLMs) for radiology report generation. We introduce MAIRA-Seg, a segmentation-aware MLLM framework designed to utilize semantic segmentation masks alongside CXRs for generating radiology reports. We train expert segmentation models to obtain mask pseudolabels for radiology-specific structures in CXRs. Subsequently, building on the architectures of MAIRA, a CXR-specialised model for report generation, we integrate a trainable segmentation tokens extractor that leverages these mask pseudolabels, and employ mask-aware prompting to generate draft radiology reports. Our experiments on the publicly available MIMIC-CXR dataset show that MAIRA-Seg outperforms non-segmentation baselines. We also investigate set-of-marks prompting with MAIRA and find that MAIRA-Seg consistently demonstrates comparable or superior performance. The results confirm that using segmentation masks enhances the nuanced reasoning of MLLMs, potentially contributing to better clinical outcomes.

Title: Continual Task Learning through Adaptive Policy Self-Composition

Authors: Shengchao Hu, Yuhang Zhou, Ziqing Fan, Jifeng Hu, Li Shen, Ya Zhang, Dacheng Tao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11364
Pdf URL: https://arxiv.org/pdf/2411.11364
Copy Paste: [[2411.11364]] Continual Task Learning through Adaptive Policy Self-Composition(https://arxiv.org/abs/2411.11364)
Keywords: transformer
Abstract: Training a generalizable agent to continually learn a sequence of tasks from offline trajectories is a natural requirement for long-lived agents, yet remains a significant challenge for current offline reinforcement learning (RL) algorithms. Specifically, an agent must be able to rapidly adapt to new tasks using newly collected trajectories (plasticity), while retaining knowledge from previously learned tasks (stability). However, systematic analyses of this setting are scarce, and it remains unclear whether conventional continual learning (CL) methods are effective in continual offline RL (CORL) scenarios. In this study, we develop the Offline Continual World benchmark and demonstrate that traditional CL methods struggle with catastrophic forgetting, primarily due to the unique distribution shifts inherent to CORL scenarios. To address this challenge, we introduce CompoFormer, a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network. Upon encountering a new task, CompoFormer leverages semantic correlations to selectively integrate relevant prior policies alongside newly trained parameters, thereby enhancing knowledge sharing and accelerating the learning process. Our experiments reveal that CompoFormer outperforms conventional CL methods, particularly in longer task sequences, showcasing a promising balance between plasticity and stability.

Title: Adapting to Cyber Threats: A Phishing Evolution Network (PEN) Framework for Phishing Generation and Analyzing Evolution Patterns using Large Language Models

Authors: Fengchao Chen, Tingmin Wu, Van Nguyen, Shuo Wang, Hongsheng Hu, Alsharif Abuadbba, Carsten Rudolph
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11389
Pdf URL: https://arxiv.org/pdf/2411.11389
Copy Paste: [[2411.11389]] Adapting to Cyber Threats: A Phishing Evolution Network (PEN) Framework for Phishing Generation and Analyzing Evolution Patterns using Large Language Models(https://arxiv.org/abs/2411.11389)
Keywords: privacy, defense, attack, robust, large language model
Abstract: Phishing remains a pervasive cyber threat, as attackers craft deceptive emails to lure victims into revealing sensitive information. While Artificial Intelligence (AI), particularly deep learning, has become a key component in defending against phishing attacks, these approaches face critical limitations. The scarcity of publicly available, diverse, and updated data, largely due to privacy concerns, constrains their effectiveness. As phishing tactics evolve rapidly, models trained on limited, outdated data struggle to detect new, sophisticated deception strategies, leaving systems vulnerable to an ever-growing array of attacks. Addressing this gap is essential to strengthening defenses in an increasingly hostile cyber landscape. To this end, we propose the Phishing Evolution Network (PEN), a framework leveraging large language models (LLMs) and adversarial training mechanisms to continuously generate high-quality, realistic, and diverse phishing samples, and to analyze features of LLM-provided phishing to understand evolving phishing patterns. We evaluate the quality and diversity of phishing samples generated by PEN and find that it produces over 80% realistic phishing samples, effectively expanding phishing datasets across seven dominant types. These PEN-generated samples enhance the performance of current phishing detectors, leading to a 40% improvement in detection accuracy. Additionally, the use of PEN significantly boosts model robustness, reducing detectors' sensitivity to perturbations by up to 60%, thereby decreasing attack success rates under adversarial conditions. Our analysis of the patterns used in LLM-generated phishing finds statistically significant differences from existing phishing in cognitive complexity and in the tone of time limitation.

Title: The GECo algorithm for Graph Neural Networks Explanation

Authors: Salvatore Calderaro, Domenico Amato, Giosuè Lo Bosco, Riccardo Rizzo, Filippo Vella
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11391
Pdf URL: https://arxiv.org/pdf/2411.11391
Copy Paste: [[2411.11391]] The GECo algorithm for Graph Neural Networks Explanation(https://arxiv.org/abs/2411.11391)
Keywords: interpretability, explainability
Abstract: Graph Neural Networks (GNNs) are powerful models that can manage complex data sources and their interconnection links. One of GNNs' main drawbacks is their lack of interpretability, which limits their application in sensitive fields. In this paper, we introduce a new methodology involving graph communities to address the interpretability of graph classification problems. The proposed method, called GECo, exploits the idea that if a community is a subset of graph nodes densely connected, this property should play a role in graph classification. This is reasonable, especially if we consider the message-passing mechanism, which is the basic mechanism of GNNs. GECo analyzes the contribution to the classification result of the communities in the graph, building a mask that highlights graph-relevant structures. GECo is tested for Graph Convolutional Networks on six artificial and four real-world graph datasets and is compared to the main explainability methods such as PGMExplainer, PGExplainer, GNNExplainer, and SubgraphX using four different metrics. The obtained results outperform the other methods for artificial graph datasets and most real-world datasets.

Title: Bridging the Resource Gap: Deploying Advanced Imitation Learning Models onto Affordable Embedded Platforms

Authors: Haizhou Ge, Ruixiang Wang, Zhu-ang Xu, Hongrui Zhu, Ruichen Deng, Yuhang Dong, Zeyu Pang, Guyue Zhou, Junyu Zhang, Lu Shi
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2411.11406
Pdf URL: https://arxiv.org/pdf/2411.11406
Copy Paste: [[2411.11406]] Bridging the Resource Gap: Deploying Advanced Imitation Learning Models onto Affordable Embedded Platforms(https://arxiv.org/abs/2411.11406)
Keywords: transformer
Abstract: Advanced imitation learning with structures like the transformer is increasingly demonstrating its advantages in robotics. However, deploying these large-scale models on embedded platforms remains a major challenge. In this paper, we propose a pipeline that facilitates the migration of advanced imitation learning algorithms to edge devices. The process is achieved via an efficient model compression method and a practical asynchronous parallel method Temporal Ensemble with Dropped Actions (TEDA) that enhances the smoothness of operations. To show the efficiency of the proposed pipeline, large-scale imitation learning models are trained on a server and deployed on an edge device to complete various manipulation tasks.

Title: The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models

Authors: Xikang Yang, Xuehai Tang, Jizhong Han, Songlin Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11407
Pdf URL: https://arxiv.org/pdf/2411.11407
Copy Paste: [[2411.11407]] The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models(https://arxiv.org/abs/2411.11407)
Keywords: defense, attack, large language model
Abstract: The widespread deployment of large language models (LLMs) across various domains has showcased their immense potential while exposing significant safety vulnerabilities. A major concern is ensuring that LLM-generated content aligns with human values. Existing jailbreak techniques reveal how this alignment can be compromised through specific prompts or adversarial suffixes. In this study, we introduce a new threat: LLMs' bias toward authority. While this inherent bias can improve the quality of outputs generated by LLMs, it also introduces a potential vulnerability, increasing the risk of producing harmful content. Notably, this bias appears as varying levels of trust given to different types of authoritative information in harmful queries. For example, in queries about malware development, LLMs often favor trusting GitHub. To better reveal the risks with LLMs, we propose DarkCite, an adaptive authority citation matcher and generator designed for a black-box setting. DarkCite matches optimal citation types to specific risk types and generates authoritative citations relevant to harmful instructions, enabling more effective jailbreak attacks on aligned LLMs. Our experiments show that DarkCite achieves a higher attack success rate (e.g., Llama-2 at 76% versus 68%) than previous methods. To counter this risk, we propose an authenticity and harm verification defense strategy, raising the average defense pass rate (DPR) from 11% to 74%. More importantly, the ability to link citations to the content they encompass has become a foundational function in LLMs, amplifying the influence of LLMs' bias toward authority.

Title: IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Authors: Yunong Liu, Cristobal Eyzaguirre, Manling Li, Shubh Khanna, Juan Carlos Niebles, Vineeth Ravi, Saumitra Mishra, Weiyu Liu, Jiajun Wu
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2411.11409
Pdf URL: https://arxiv.org/pdf/2411.11409
Copy Paste: [[2411.11409]] IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos(https://arxiv.org/abs/2411.11409)
Keywords: segmentation
Abstract: Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences.

Title: TEEMATE: Fast and Efficient Confidential Container using Shared Enclave

Authors: Chulmin Lee, Jaewon Hur, Sangho Lee, Byoungyoung Lee
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11423
Pdf URL: https://arxiv.org/pdf/2411.11423
Copy Paste: [[2411.11423]] TEEMATE: Fast and Efficient Confidential Container using Shared Enclave(https://arxiv.org/abs/2411.11423)
Keywords: protect
Abstract: Confidential containers are becoming increasingly popular as they meet both the need for efficient resource management by cloud providers and the need for data protection by cloud users. Specifically, confidential containers integrate the container and the enclave, aiming to inherit the design-wise advantages of both (i.e., resource management and data protection). However, current confidential containers suffer from large performance overheads caused by i) a larger startup latency due to the enclave creation, and ii) a larger memory footprint due to the non-shareable characteristics of enclave memory. This paper explores a design conundrum of confidential containers, examining why they impose such large performance overheads. Surprisingly, we found there is a universal misconception that an enclave can only be used by the single (containerized) process that created it. However, an enclave can be shared across multiple processes, because an enclave is merely a set of physical resources while the process is an abstraction constructed by the host kernel. To this end, we introduce TeeMate, a new approach to utilizing enclaves on the host system. Specifically, TeeMate designs primitives to i) share the enclave memory between processes, thus preserving memory abstraction, and ii) assign the threads in the enclave between processes, thus preserving thread abstraction. We concretized TeeMate on Intel SGX, and implemented confidential serverless computing and a confidential database on top of TeeMate-based confidential containers. The evaluation clearly demonstrated the strong practical impact of TeeMate by achieving at least 4.5 times lower latency and 2.8 times lower memory usage compared to the applications built on the conventional confidential containers.

Title: Membership Inference Attack against Long-Context Large Language Models

Authors: Zixiong Wang, Gaoyang Liu, Yang Yang, Chen Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11424
Pdf URL: https://arxiv.org/pdf/2411.11424
Copy Paste: [[2411.11424]] Membership Inference Attack against Long-Context Large Language Models(https://arxiv.org/abs/2411.11424)
Keywords: privacy, attack, membership infer, large language model
Abstract: Recent advances in Large Language Models (LLMs) have enabled them to overcome their context window limitations, and demonstrate exceptional retrieval and reasoning capacities on longer context. Question-answering systems augmented with Long-Context Language Models (LCLMs) can automatically search massive external data and incorporate it into their contexts, enabling faithful predictions and reducing issues such as hallucinations and knowledge staleness. Existing studies targeting LCLMs mainly concentrate on addressing the so-called lost-in-the-middle problem or improving the inference efficiency, leaving their privacy risks largely unexplored. In this paper, we aim to bridge this gap and argue that integrating all information into the long context makes it a repository of sensitive information, which often contains private data such as medical records or personal identities. We further investigate the membership privacy within LCLMs' external context, with the aim of determining whether a given document or sequence is included in the LCLMs' context. Our basic idea is that if a document lies in the context, it will exhibit a low generation loss or a high degree of semantic similarity to the contents generated by LCLMs. We propose, for the first time, six membership inference attack (MIA) strategies tailored for LCLMs and conduct extensive experiments on various popular models. Empirical results demonstrate that our attacks can accurately infer membership status in most cases, e.g., 90.66% attack F1-score on Multi-document QA datasets with LongChat-7b-v1.5-32k, highlighting significant risks of membership leakage within LCLMs' input contexts. Furthermore, we examine the underlying reasons why LCLMs are susceptible to revealing such membership information.
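
Illustrative sketch (not from the paper): the loss-based intuition in the abstract above, that a document present in the context should show a low generation loss, can be expressed as a simple threshold test. The function names, the per-token log-probabilities, and the threshold value are all hypothetical; this is not one of the six attack strategies the authors propose, only a minimal instance of the general idea.

```python
from typing import Sequence


def generation_loss(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood of a candidate document's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def infer_membership(token_logprobs: Sequence[float], threshold: float = 1.0) -> bool:
    """Guess 'in context' when the loss falls below a (hypothetical) threshold."""
    return generation_loss(token_logprobs) < threshold


# Made-up log-probabilities a model might assign to a candidate's tokens.
in_context = [-0.1, -0.2, -0.05, -0.15]      # tokens are easy to reproduce
not_in_context = [-2.3, -1.9, -2.8, -2.1]    # tokens are hard to reproduce

print(infer_membership(in_context))      # low loss, flagged as a member
print(infer_membership(not_in_context))  # high loss, flagged as a non-member
```

In practice the threshold would have to be calibrated on documents with known membership status; the value used here is arbitrary.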

Title: CLUE-MARK: Watermarking Diffusion Models using CLWE

Authors: Kareem Shehata, Aashish Kolluri, Prateek Saxena
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11434
Pdf URL: https://arxiv.org/pdf/2411.11434
Copy Paste: [[2411.11434]] CLUE-MARK: Watermarking Diffusion Models using CLWE(https://arxiv.org/abs/2411.11434)
Keywords: attack, robust, watermark, diffusion
Abstract: As AI-generated images become widespread, reliable watermarking is essential for content verification, copyright enforcement, and combating disinformation. Existing techniques rely on heuristic approaches and lack formal guarantees of undetectability, making them vulnerable to steganographic attacks that can expose or erase the watermark. Additionally, these techniques often degrade output quality by introducing perceptible changes, which is not only undesirable but an important barrier to adoption in practice. In this work, we introduce CLUE-Mark, the first provably undetectable watermarking scheme for diffusion models. CLUE-Mark requires no changes to the model being watermarked, is computationally efficient, and because it is provably undetectable is guaranteed to have no impact on model output quality. Our approach leverages the Continuous Learning With Errors (CLWE) problem -- a cryptographically hard lattice problem -- to embed watermarks in the latent noise vectors used by diffusion models. By proving undetectability via reduction to a cryptographically hard problem we ensure that the watermark is imperceptible not only to human observers or ad hoc heuristics, but to any efficient detector that does not have the secret key. CLUE-Mark allows multiple keys to be embedded, enabling traceability of images to specific users without altering model parameters. Empirical evaluations on state-of-the-art diffusion models confirm that CLUE-Mark achieves high recoverability, preserves image quality, and is robust to minor perturbations such as JPEG compression and brightness adjustments. Uniquely, CLUE-Mark can be neither detected nor removed by recent steganographic attacks.

Title: Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting

Authors: Hongjun Wang, Jiyuan Chen, Lingyu Zhang, Renhe Jiang, Xuan Song
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11448
Pdf URL: https://arxiv.org/pdf/2411.11448
Copy Paste: [[2411.11448]] Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting(https://arxiv.org/abs/2411.11448)
Keywords: robust, transformer
Abstract: Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have shown significant promise in traffic forecasting by effectively modeling temporal and spatial correlations. However, rapid urbanization in recent years has led to dynamic shifts in traffic patterns and travel demand, posing major challenges for accurate long-term traffic prediction. The generalization capability of ST-GNNs in extended temporal scenarios and cross-city applications remains largely unexplored. In this study, we evaluate state-of-the-art models on an extended traffic benchmark and observe substantial performance degradation in existing ST-GNNs over time, which we attribute to their limited inductive capabilities. Our analysis reveals that this degradation stems from an inability to adapt to evolving spatial relationships within urban environments. To address this limitation, we reconsider the design of adaptive embeddings and propose a Principal Component Analysis (PCA) embedding approach that enables models to adapt to new scenarios without retraining. We incorporate PCA embeddings into existing ST-GNN and Transformer architectures, achieving marked improvements in performance. Notably, PCA embeddings allow for flexibility in graph structures between training and testing, enabling models trained on one city to perform zero-shot predictions on other cities. This adaptability demonstrates the potential of PCA embeddings in enhancing the robustness and generalization of spatiotemporal models.

Title: Upside-Down Reinforcement Learning for More Interpretable Optimal Control

Authors: Juan Cardenas-Cartagena, Massimiliano Falzari, Marco Zullich, Matthia Sabatelli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11457
Pdf URL: https://arxiv.org/pdf/2411.11457
Copy Paste: [[2411.11457]] Upside-Down Reinforcement Learning for More Interpretable Optimal Control(https://arxiv.org/abs/2411.11457)
Keywords: robust
Abstract: Model-Free Reinforcement Learning (RL) algorithms either learn how to map states to expected rewards or search for policies that can maximize a certain performance function. Model-Based algorithms, instead, aim to learn an approximation of the underlying model of the RL environment and then use it in combination with planning algorithms. Upside-Down Reinforcement Learning (UDRL) is a novel learning paradigm that aims to learn how to predict actions from states and desired commands. This task is formulated as a Supervised Learning problem and has successfully been tackled by Neural Networks (NNs). In this paper, we investigate whether function approximation algorithms other than NNs can also be used within a UDRL framework. Our experiments, performed over several popular optimal control benchmarks, show that tree-based methods like Random Forests and Extremely Randomized Trees can perform just as well as NNs, with the significant benefit of yielding policies that are inherently more interpretable than NNs, therefore paving the way for more transparent, safe, and robust RL.

Title: Re-examining learning linear functions in context

Authors: Omar Naim, Guilhem Fouilhé, Nicholas Asher
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2411.11465
Pdf URL: https://arxiv.org/pdf/2411.11465
Copy Paste: [[2411.11465]] Re-examining learning linear functions in context(https://arxiv.org/abs/2411.11465)
Keywords: transformer
Abstract: In-context learning (ICL) is an attractive method of solving a wide range of problems. Inspired by Garg et al. (2022), we look closely at ICL in a variety of train and test settings for several transformer models of different sizes trained from scratch. Our study complements prior work by pointing out several systematic failures of these models to generalize to data not in the training distribution, thereby showing some limitations of ICL. We find that models adopt a strategy for this task that is very different from standard solutions.

Title: MGNiceNet: Unified Monocular Geometric Scene Understanding

Authors: Markus Schön, Michael Buchholz, Klaus Dietmayer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11466
Pdf URL: https://arxiv.org/pdf/2411.11466
Copy Paste: [[2411.11466]] MGNiceNet: Unified Monocular Geometric Scene Understanding(https://arxiv.org/abs/2411.11466)
Keywords: segmentation
Abstract: Monocular geometric scene understanding combines panoptic segmentation and self-supervised depth estimation, focusing on real-time application in autonomous vehicles. We introduce MGNiceNet, a unified approach that uses a linked kernel formulation for panoptic segmentation and self-supervised depth estimation. MGNiceNet is based on the state-of-the-art real-time panoptic segmentation method RT-K-Net and extends the architecture to cover both panoptic segmentation and self-supervised monocular depth estimation. To this end, we introduce a tightly coupled self-supervised depth estimation predictor that explicitly uses information from the panoptic path for depth prediction. Furthermore, we introduce a panoptic-guided motion masking method to improve depth estimation without relying on video panoptic segmentation annotations. We evaluate our method on two popular autonomous driving datasets, Cityscapes and KITTI. Our model shows state-of-the-art results compared to other real-time methods and closes the gap to computationally more demanding methods. Source code and trained models are available at this https URL.

Title: Generalizable Person Re-identification via Balancing Alignment and Uniformity

Authors: Yoonki Cho, Jaeyoon Kim, Woo Jae Kim, Junsik Jung, Sung-eui Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11471
Pdf URL: https://arxiv.org/pdf/2411.11471
Copy Paste: [[2411.11471]] Generalizable Person Re-identification via Balancing Alignment and Uniformity(https://arxiv.org/abs/2411.11471)
Keywords: robust
Abstract: Domain generalizable person re-identification (DG re-ID) aims to learn discriminative representations that are robust to distributional shifts. While data augmentation is a straightforward solution to improve generalization, certain augmentations exhibit a polarized effect in this task, enhancing in-distribution performance while deteriorating out-of-distribution performance. In this paper, we investigate this phenomenon and reveal that it leads to sparse representation spaces with reduced uniformity. To address this issue, we propose a novel framework, Balancing Alignment and Uniformity (BAU), which effectively mitigates this effect by maintaining a balance between alignment and uniformity. Specifically, BAU incorporates alignment and uniformity losses applied to both original and augmented images and integrates a weighting strategy to assess the reliability of augmented samples, further improving the alignment loss. Additionally, we introduce a domain-specific uniformity loss that promotes uniformity within each source domain, thereby enhancing the learning of domain-invariant features. Extensive experimental results demonstrate that BAU effectively exploits the advantages of data augmentation, which previous studies could not fully utilize, and achieves state-of-the-art performance without requiring complex training procedures. The code is available at this https URL.
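
Illustrative sketch (not from the paper): the abstract above does not give formulas for its alignment and uniformity losses, so the code below assumes the formulation commonly used in the representation-learning literature, with alignment as the mean squared distance between an embedding and its augmented view, and uniformity as the log of the mean Gaussian potential over pairs of unit-norm embeddings. BAU's weighting strategy and domain-specific uniformity loss are not modeled, and all names and values are hypothetical.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(originals, augmented):
    """Mean squared distance between each embedding and its augmented view."""
    pairs = list(zip(originals, augmented))
    return sum(sq_dist(a, b) for a, b in pairs) / len(pairs)


def uniformity_loss(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower is more uniform)."""
    pairs = list(combinations(embeddings, 2))
    mean_potential = sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs) / len(pairs)
    return math.log(mean_potential)


# Toy 2-D embeddings (assumed values), normalized onto the unit circle.
orig = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0])]
aug = [l2_normalize(v) for v in ([0.9, 0.1], [0.1, 0.9], [-0.9, 0.1])]
collapsed = [l2_normalize([1.0, 0.0]) for _ in range(3)]

print(alignment_loss(orig, aug))   # small positive value: views stay close
print(uniformity_loss(orig))       # negative: embeddings are spread out
print(uniformity_loss(collapsed))  # zero: all embeddings coincide
```

A collapsed representation minimizes the alignment term but gives the worst uniformity value, which is the tension a balance between the two losses is meant to manage.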

Title: Graph Artificial Intelligence for Quantifying Compatibility Mechanisms in Traditional Chinese Medicine

Authors: Jingqi Zeng, Xiaobin Jia
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.11474
Pdf URL: https://arxiv.org/pdf/2411.11474
Copy Paste: [[2411.11474]] Graph Artificial Intelligence for Quantifying Compatibility Mechanisms in Traditional Chinese Medicine(https://arxiv.org/abs/2411.11474)
Keywords: robust
Abstract: Traditional Chinese Medicine (TCM) involves complex compatibility mechanisms characterized by multi-component and multi-target interactions, which are challenging to quantify. To address this challenge, we applied graph artificial intelligence to develop a TCM multi-dimensional knowledge graph that bridges traditional TCM theory and modern biomedical science (this https URL). Using feature engineering and embedding, we processed key TCM terminology and Chinese herbal pieces (CHP), introducing medicinal properties as virtual nodes and employing graph neural networks with attention mechanisms to model and analyze 6,080 Chinese herbal formulas (CHF). Our method quantitatively assessed the roles of CHP within CHF and was validated using 215 CHF designed for COVID-19 management. With interpretable models, open-source data, and code (this https URL), this study provides robust tools for advancing TCM theory and drug discovery.

Title: MVLight: Relightable Text-to-3D Generation via Light-conditioned Multi-View Diffusion

Authors: Dongseok Shim, Yichun Shi, Kejie Li, H. Jin Kim, Peng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11475
Pdf URL: https://arxiv.org/pdf/2411.11475
Copy Paste: [[2411.11475]] MVLight: Relightable Text-to-3D Generation via Light-conditioned Multi-View Diffusion(https://arxiv.org/abs/2411.11475)
Keywords: diffusion, generative
Abstract: Recent advancements in text-to-3D generation, building on the success of high-performance text-to-image generative models, have made it possible to create imaginative and richly textured 3D objects from textual descriptions. However, a key challenge remains in effectively decoupling light-independent and lighting-dependent components to enhance the quality of generated 3D models and their relighting performance. In this paper, we present MVLight, a novel light-conditioned multi-view diffusion model that explicitly integrates lighting conditions directly into the generation process. This enables the model to synthesize high-quality images that faithfully reflect the specified lighting environment across multiple camera views. By applying this capability to Score Distillation Sampling (SDS), we can effectively synthesize 3D models with improved geometric precision and relighting capabilities. We validate the effectiveness of MVLight through extensive experiments and a user study.

Title: SoK: On the Role and Future of AIGC Watermarking in the Era of Gen-AI

Authors: Kui Ren, Ziqi Yang, Li Lu, Jian Liu, Yiming Li, Jie Wan, Xiaodi Zhao, Xianheng Feng, Shuo Shao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11478
Pdf URL: https://arxiv.org/pdf/2411.11478
Copy Paste: [[2411.11478]] SoK: On the Role and Future of AIGC Watermarking in the Era of Gen-AI(https://arxiv.org/abs/2411.11478)
Keywords: security, watermark
Abstract: The rapid advancement of AI technology, particularly in producing AI-generated content (AIGC), has transformed numerous fields, e.g., art video generation, but also brings new risks, including the misuse of AI for misinformation and intellectual property theft. To address these concerns, AIGC watermarks offer an effective solution to mitigate malicious activities. However, existing watermarking surveys focus more on traditional watermarks, overlooking AIGC-specific challenges. In this work, we propose a systematic investigation into AIGC watermarking and provide the first formal definition of AIGC watermarking. Unlike previous surveys, we provide a taxonomy based on the core properties of the watermark, which we summarize from a comprehensive review of the literature across various AIGC modalities. Building on these properties, we discuss the functionality and security threats of AIGC watermarking. Finally, we thoroughly investigate the AIGC governance of different countries and practitioners. We believe this taxonomy better aligns with the practical demands for watermarking in the era of GenAI, thus providing a clearer summary of existing work and uncovering potential future research directions for the community.

Title: Exploring Emerging Trends and Research Opportunities in Visual Place Recognition

Authors: Antonios Gasteratos, Konstantinos A. Tsintotas, Tobias Fischer, Yiannis Aloimonos, Michael Milford
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2411.11481
Pdf URL: https://arxiv.org/pdf/2411.11481
Copy Paste: [[2411.11481]] Exploring Emerging Trends and Research Opportunities in Visual Place Recognition(https://arxiv.org/abs/2411.11481)
Keywords: robust
Abstract: Visual-based recognition, e.g., image classification, object detection, etc., is a long-standing challenge in computer vision and robotics communities. Concerning the roboticists, since the knowledge of the environment is a prerequisite for complex navigation tasks, visual place recognition is vital for most localization implementations or re-localization and loop closure detection pipelines within simultaneous localization and mapping (SLAM). More specifically, it corresponds to the system's ability to identify and match a previously visited location using computer vision tools. Towards developing novel techniques with enhanced accuracy and robustness, while motivated by the success presented in natural language processing methods, researchers have recently turned their attention to vision-language models, which integrate visual and textual data.

Title: Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

Authors: Chenhang Cui, Gelei Deng, An Zhang, Jingnan Zheng, Yicong Li, Lianli Gao, Tianwei Zhang, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11496
Pdf URL: https://arxiv.org/pdf/2411.11496
Copy Paste: [[2411.11496]] Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models(https://arxiv.org/abs/2411.11496)
Keywords: defense, generative
Abstract: Recent advances in Large Vision-Language Models (LVLMs) have showcased strong reasoning abilities across multiple modalities, achieving significant breakthroughs in various real-world applications. Despite this great success, the safety guardrail of LVLMs may not cover the unforeseen domains introduced by the visual modality. Existing studies primarily focus on eliciting LVLMs to generate harmful responses via carefully crafted image-based jailbreaks designed to bypass alignment defenses. In this study, we reveal that a safe image can be exploited to achieve the same jailbreak consequence when combined with additional safe images and prompts. This stems from two fundamental properties of LVLMs: universal reasoning capabilities and safety snowball effect. Building on these insights, we propose Safety Snowball Agent (SSA), a novel agent-based framework leveraging agents' autonomous and tool-using abilities to jailbreak LVLMs. SSA operates through two principal stages: (1) initial response generation, where tools generate or retrieve jailbreak images based on potential harmful intents, and (2) harmful snowballing, where refined subsequent prompts induce progressively harmful outputs. Our experiments demonstrate that SSA can use nearly any image to induce LVLMs to produce unsafe content, achieving high jailbreak success rates against the latest LVLMs. Unlike prior works that exploit alignment flaws, SSA leverages the inherent properties of LVLMs, presenting a profound challenge for enforcing safety in generative multimodal systems. Our code is available at this https URL.

Title: LaVin-DiT: Large Vision Diffusion Transformer

Authors: Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11505
Pdf URL: https://arxiv.org/pdf/2411.11505
Copy Paste: [[2411.11505]] LaVin-DiT: Large Vision Diffusion Transformer(https://arxiv.org/abs/2411.11505)
Keywords: diffusion, transformer, generative
Abstract: This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models will be open-sourced.

Title: Cascaded Diffusion Models for 2D and 3D Microscopy Image Synthesis to Enhance Cell Segmentation

Authors: Rüveyda Yilmaz, Kaan Keven, Yuli Wu, Johannes Stegmaier
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.11515
Pdf URL: https://arxiv.org/pdf/2411.11515
Copy Paste: [[2411.11515]] Cascaded Diffusion Models for 2D and 3D Microscopy Image Synthesis to Enhance Cell Segmentation(https://arxiv.org/abs/2411.11515)
Keywords: diffusion, segmentation
Abstract: Automated cell segmentation in microscopy images is essential for biomedical research, yet conventional methods are labor-intensive and prone to error. While deep learning-based approaches have proven effective, they often require large annotated datasets, which are scarce due to the challenges of manual annotation. To overcome this, we propose a novel framework for synthesizing densely annotated 2D and 3D cell microscopy images using cascaded diffusion models. Our method synthesizes 2D and 3D cell masks from sparse 2D annotations using multi-level diffusion models and NeuS, a 3D surface reconstruction approach. Following that, a pretrained 2D Stable Diffusion model is finetuned to generate realistic cell textures and the final outputs are combined to form cell populations. We show that training a segmentation model with a combination of our synthetic data and real data improves cell segmentation performance by up to 9% across multiple datasets. Additionally, the FID scores indicate that the synthetic data closely resembles real data. The code for our proposed approach will be available at this https URL_diffusion.

Title: Preempting Text Sanitization Utility in Resource-Constrained Privacy-Preserving LLM Interactions

Authors: Robin Carpentier, Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Dali Kaafar
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.11521
Pdf URL: https://arxiv.org/pdf/2411.11521
Copy Paste: [[2411.11521]] Preempting Text Sanitization Utility in Resource-Constrained Privacy-Preserving LLM Interactions(https://arxiv.org/abs/2411.11521)
Keywords: privacy, large language model
Abstract: Individuals have been increasingly interacting with online Large Language Models (LLMs), both in their work and personal lives. These interactions raise privacy issues as the LLMs are typically hosted by third parties who can gather a variety of sensitive information about users and their companies. Text Sanitization techniques have been proposed in the literature and can be used to sanitize user prompts before sending them to the LLM. However, sanitization has an impact on the downstream task performed by the LLM, often to such an extent that it leads to unacceptable results for the user. This is not just a minor annoyance: it has clear monetary consequences, as LLM services charge on a per-use basis, and it wastes a great amount of computing resources. We propose an architecture leveraging a Small Language Model (SLM) at the user-side to help estimate the impact of sanitization on a prompt before it is sent to the LLM, thus preventing resource losses. Our evaluation of this architecture revealed a significant problem with text sanitization based on Differential Privacy, to which we want to draw the attention of the community for further investigation.

Title: Reliable Poisoned Sample Detection against Backdoor Attacks Enhanced by Sharpness Aware Minimization

Authors: Mingda Zhang, Mingli Zhu, Zihao Zhu, Baoyuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11525
Pdf URL: https://arxiv.org/pdf/2411.11525
Copy Paste: [[2411.11525]] Reliable Poisoned Sample Detection against Backdoor Attacks Enhanced by Sharpness Aware Minimization(https://arxiv.org/abs/2411.11525)
Keywords: security, attack
Abstract: Backdoor attacks have been considered a serious security threat to deep neural networks (DNNs). Poisoned sample detection (PSD), which aims at filtering out poisoned samples from an untrustworthy training dataset, has shown very promising performance for defending against data poisoning based backdoor attacks. However, we observe that the detection performance of many advanced methods is likely to be unstable when facing weak backdoor attacks, such as low poisoning ratio or weak trigger strength. To further verify this observation, we make a statistical investigation among various backdoor attacks and poisoned sample detections, showing a positive correlation between backdoor effect and detection performance. This inspires us to strengthen the backdoor effect to enhance detection performance. Since we cannot achieve that goal via directly manipulating poisoning ratio or trigger strength, we propose to train one model using the Sharpness-Aware Minimization (SAM) algorithm, rather than the vanilla training algorithm. We also provide both empirical and theoretical analysis about how SAM training strengthens the backdoor effect. Then, this SAM trained model can be seamlessly integrated with any off-the-shelf PSD method that extracts discriminative features from the trained model for detection, called SAM-enhanced PSD. Extensive experiments on several benchmark datasets show the reliable detection performance of the proposed method against both weak and strong backdoor attacks, with significant improvements against various attacks (+34.38% TPR on average), over the conventional PSD methods (i.e., without SAM enhancement). Overall, this work provides new insights about PSD and proposes a novel approach that can complement existing detection methods, which may inspire more in-depth explorations in this field.
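Illustrative note (not from the paper): the abstract relies on Sharpness-Aware Minimization as the training algorithm. A minimal sketch of one generic SAM update on a toy quadratic loss is given below, assuming the standard two-step form (ascend to a nearby worst-case point, then descend using the gradient taken there). The loss, the function names, and the values of `lr` and `rho` are hypothetical choices for illustration and say nothing about the authors' models or settings.

```python
import math

def loss(w):
    # Toy quadratic loss standing in for a network's training loss (assumption).
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [w[0], 10.0 * w[1]]

def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update: perturb the weights along the normalized gradient
    by radius rho, then descend using the gradient at the perturbed point."""
    g = grad(w)
    g_norm = math.sqrt(sum(x * x for x in g))
    if g_norm == 0.0:
        return list(w)  # already at a stationary point
    # Ascent step to the approximate worst-case neighbor within radius rho.
    w_adv = [wi + rho * gi / g_norm for wi, gi in zip(w, g)]
    # Descent step from the original weights using the perturbed gradient.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
start_loss = loss(w)
for _ in range(100):
    w = sam_step(w)
final_loss = loss(w)
```

Vanilla gradient descent would use `grad(w)` directly in the final update; the only difference in SAM is that the gradient is evaluated at the perturbed point `w_adv`.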

Title: Addressing Hallucinations in Language Models with Knowledge Graph Embeddings as an Additional Modality

Authors: Viktoriia Chekalina, Anton Razzigaev, Elizaveta Goncharova, Andrey Kuznetsov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11531
Pdf URL: https://arxiv.org/pdf/2411.11531
Copy Paste: [[2411.11531]] Addressing Hallucinations in Language Models with Knowledge Graph Embeddings as an Additional Modality(https://arxiv.org/abs/2411.11531)
Keywords: large language model
Abstract: In this paper we present an approach to reduce hallucinations in Large Language Models (LLMs) by incorporating Knowledge Graphs (KGs) as an additional modality. Our method involves transforming input text into a set of KG embeddings and using an adapter to integrate these embeddings into the language model space, without relying on external retrieval processes. To facilitate this, we created WikiEntities, a dataset containing over 3 million Wikipedia texts annotated with entities from Wikidata and their corresponding embeddings from PyTorch-BigGraph. This dataset serves as a valuable resource for training Entity Linking models and adapting the described method to various LLMs using specialized adapters. Our method does not require fine-tuning of the language models themselves; instead, we only train the adapter. This ensures that the model's performance on other tasks is not affected. We trained an adapter for the Mistral 7B, LLaMA 2-7B (chat), and LLaMA 3-8B (instruct) models using this dataset and demonstrated that our approach improves performance on the HaluEval, True-False benchmarks and FEVER dataset. The results indicate that incorporating KGs as a new modality can effectively reduce hallucinations and improve the factual accuracy of language models, all without the need for external retrieval.

Title: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment

Authors: Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11543
Pdf URL: https://arxiv.org/pdf/2411.11543
Copy Paste: [[2411.11543]] Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment(https://arxiv.org/abs/2411.11543)
Keywords: defense, attack, explainability, large language model
Abstract: Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs form Vision Language Models (VLMs). However, recent research shows that the visual modality in VLMs is highly vulnerable, allowing attackers to bypass safety alignment in LLMs through visually transmitted content, launching harmful attacks. To address this challenge, we propose a progressive concept-based alignment strategy, PSA-VLM, which incorporates safety modules as concept bottlenecks to enhance visual modality safety alignment. By aligning model predictions with specific safety concepts, we improve defenses against risky images, enhancing explainability and controllability while minimally impacting general performance. Our method is obtained through two-stage training. The low computational cost of the first stage brings very effective performance improvement, and the fine-tuning of the language model in the second stage further improves the safety performance. Our method achieves state-of-the-art results on popular VLM safety benchmark.

Title: Real-Time Fitness Exercise Classification and Counting from Video Frames

Authors: Riccardo Riccio
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.11548
Pdf URL: https://arxiv.org/pdf/2411.11548
Copy Paste: [[2411.11548]] Real-Time Fitness Exercise Classification and Counting from Video Frames(https://arxiv.org/abs/2411.11548)
Keywords: robust
Abstract: This paper introduces a novel method for real-time exercise classification using a Bidirectional Long Short-Term Memory (BiLSTM) neural network. Existing exercise recognition approaches often rely on synthetic datasets and on raw coordinate inputs that are sensitive to user and camera variations, and they fail to fully exploit the temporal dependencies in exercise movements. These issues limit their generalizability and robustness in real-world conditions, where lighting, camera angles, and user body types vary. To address these challenges, we propose a BiLSTM-based model that leverages invariant features, such as joint angles, alongside raw coordinates. By using both angles and (x, y, z) coordinates, the model adapts to changes in perspective, user positioning, and body differences, improving generalization. Training on 30-frame sequences enables the BiLSTM to capture the temporal context of exercises and recognize patterns evolving over time. We compiled a dataset combining synthetic data from the InfiniteRep dataset and real-world videos from Kaggle and other sources. This dataset includes four common exercises: squat, push-up, shoulder press, and bicep curl. The model was trained and validated on these diverse datasets, achieving an accuracy of over 99% on the test set. To assess generalizability, the model was tested on 2 separate test sets representative of typical usage conditions. Comparisons with the previous approach from the literature are presented in the results section and show that the proposed model performs best. The classifier is integrated into a web application providing real-time exercise classification and repetition counting without manual exercise selection. Demo and datasets are available at the following GitHub Repository: this https URL.
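Illustrative note (not from the paper): the abstract describes joint angles as features that are invariant to user positioning. A minimal sketch of how such an angle could be computed from three (x, y, z) keypoints is given below. The keypoint names, the sample coordinates, and the helper `shift` are hypothetical; the paper's actual feature pipeline is not specified in the abstract.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by keypoints a-b-c, each (x, y, z)."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(p * q for p, q in zip(ba, bc))
    norm = math.sqrt(sum(p * p for p in ba)) * math.sqrt(sum(q * q for q in bc))
    if norm == 0.0:
        raise ValueError("degenerate joint: two keypoints coincide")
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

def shift(point, offset):
    # Translate a keypoint, simulating the user standing elsewhere in frame.
    return tuple(p + o for p, o in zip(point, offset))

# Hypothetical keypoints for an arm bent at a right angle.
shoulder, elbow, wrist = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
angle = joint_angle(shoulder, elbow, wrist)

# The same pose translated in space yields the same angle, whereas the
# raw coordinates all change.
offset = (2.5, -1.0, 4.0)
shifted_angle = joint_angle(*(shift(p, offset) for p in (shoulder, elbow, wrist)))
```

Because the angle depends only on the relative directions of the two limb segments, it is also unchanged by rotating the whole pose or by uniformly scaling it, which is the sense in which it is less sensitive to camera placement and body size than raw coordinates.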

Title: Simple But Not Secure: An Empirical Security Analysis of Two-factor Authentication Systems

Authors: Zhi Wang, Xin Yang, Du Chen, Han Gao, Meiqi Tian, Yan Jia, Wanpeng Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11551
Pdf URL: https://arxiv.org/pdf/2411.11551
Copy Paste: [[2411.11551]] Simple But Not Secure: An Empirical Security Analysis of Two-factor Authentication Systems(https://arxiv.org/abs/2411.11551)
Keywords: secure, security, protect, attack
Abstract: To protect users from data breaches and phishing attacks, service providers typically implement two-factor authentication (2FA) to add an extra layer of security against suspicious login attempts. However, since 2FA can sometimes hinder user experience by introducing additional steps, many websites aim to reduce inconvenience by minimizing the frequency of 2FA prompts. One approach to achieve this is by storing the user's "Remember the Device" preference in a cookie. As a result, users are only prompted for 2FA when this cookie expires or if they log in from a new device. To understand and improve the security of 2FA systems in real-world settings, we propose SE2FA, a vulnerability evaluation framework designed to detect vulnerabilities in 2FA systems. This framework enables us to analyze the security of 407 2FA systems across popular websites from the Tranco Top 10,000 list. Our analysis and evaluation found three zero-day vulnerabilities on three service providers that could allow an attacker to access a victim's account without possessing the victim's second authentication factor, thereby bypassing 2FA protections entirely. A further investigation found that these vulnerabilities stem from design choices aimed at simplifying 2FA for users but that unintentionally reduce its security effectiveness. We have disclosed these findings to the affected websites and assisted them in mitigating the risks. Based on the insights from this research, we provide practical recommendations for countermeasures to strengthen 2FA security and address these newly identified threats.

Title: GNN-Based Code Annotation Logic for Establishing Security Boundaries in C Code

Authors: Varun Gadey, Raphael Goetz, Christoph Sendner, Sampo Sovio, Alexandra Dmitrienko
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.11567
Pdf URL: https://arxiv.org/pdf/2411.11567
Copy Paste: [[2411.11567]] GNN-Based Code Annotation Logic for Establishing Security Boundaries in C Code(https://arxiv.org/abs/2411.11567)
Keywords: secure, security, attack
Abstract: Securing sensitive operations in today's interconnected software landscape is crucial yet challenging. Modern platforms rely on Trusted Execution Environments (TEEs), such as Intel SGX and ARM TrustZone, to isolate security sensitive code from the main system, reducing the Trusted Computing Base (TCB) and providing stronger assurances. However, identifying which code should reside in TEEs is complex and requires specialized expertise, which is not supported by current automated tools. Existing solutions often migrate entire applications to TEEs, leading to suboptimal use and an increased TCB. To address this gap, we propose Code Annotation Logic (CAL), a pioneering tool that automatically identifies security sensitive components for TEE isolation. CAL analyzes codebases, leveraging a graph-based approach with novel feature construction and employing a custom graph neural network model to accurately determine which parts of the code should be isolated. CAL effectively optimizes TCB, reducing the burden of manual analysis and enhancing overall security. Our contributions include the definition of security sensitive code, the construction and labeling of a comprehensive dataset of source files, a feature rich graph based data preparation pipeline, and the CAL model for TEE integration. Evaluation results demonstrate CAL's efficacy in identifying sensitive code with a recall of 86.05%, an F1 score of 81.56%, and an identification rate of 91.59% for security sensitive functions. By enabling efficient code isolation, CAL advances the secure development of applications using TEEs, offering a practical solution for developers to reduce attack vectors.

Title: OASIS: Open Agents Social Interaction Simulations on One Million Agents

Authors: Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Wanli Ouyang, Yu Qiao, Philip Torr, Jing Shao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11581
Pdf URL: https://arxiv.org/pdf/2411.11581
Copy Paste: [[2411.11581]] OASIS: Open Agents Social Interaction Simulations on One Million Agents(https://arxiv.org/abs/2411.11581)
Keywords: large language model
Abstract: There has been a growing interest in enhancing rule-based agent-based models (ABMs) for social media platforms (i.e., X, Reddit) with more realistic large language model (LLM) agents, thereby allowing for a more nuanced study of complex systems. As a result, several LLM-based ABMs have been proposed in the past year. While they hold promise, each simulator is specifically designed to study a particular scenario, making it time-consuming and resource-intensive to explore other phenomena using the same ABM. Additionally, these models simulate only a limited number of agents, whereas real-world social media platforms involve millions of users. To this end, we propose OASIS, a generalizable and scalable social media simulator. OASIS is designed based on real-world social media platforms, incorporating dynamically updated environments (i.e., dynamic social networks and post information), diverse action spaces (i.e., following, commenting), and recommendation systems (i.e., interest-based and hot-score-based). Additionally, OASIS supports large-scale user simulations, capable of modeling up to one million users. With these features, OASIS can be easily extended to different social media platforms to study large-scale group phenomena and behaviors. We replicate various social phenomena, including information spreading, group polarization, and herd effects across X and Reddit platforms. Moreover, we provide observations of social phenomena at different agent group scales. We observe that the larger agent group scale leads to more enhanced group dynamics and more diverse and helpful agents' opinions. These findings demonstrate OASIS's potential as a powerful tool for studying complex systems in digital environments.

Title: Generative Spatio-temporal GraphNet for Transonic Wing Pressure Distribution Forecasting

Authors: Gabriele Immordino, Andrea Vaiuso, Andrea Da Ronch, Marcello Righi
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2411.11592
Pdf URL: https://arxiv.org/pdf/2411.11592
Copy Paste: [[2411.11592]] Generative Spatio-temporal GraphNet for Transonic Wing Pressure Distribution Forecasting(https://arxiv.org/abs/2411.11592)
Keywords: generative
Abstract: This study presents a framework for predicting unsteady transonic wing pressure distributions, integrating an autoencoder architecture with graph convolutional networks and graph-based temporal layers to model time dependencies. The framework compresses high-dimensional pressure distribution data into a lower-dimensional latent space using an autoencoder, ensuring efficient data representation while preserving essential features. Within this latent space, graph-based temporal layers are employed to predict future wing pressures based on past data, effectively capturing temporal dependencies and improving predictive accuracy. This combined approach leverages the strengths of autoencoders for dimensionality reduction, graph convolutional networks for handling unstructured grid data, and temporal layers for modeling time-based sequences. The effectiveness of the proposed framework is validated through its application to the Benchmark Super Critical Wing test case, achieving accuracy comparable to computational fluid dynamics, while significantly reducing prediction time. This framework offers a scalable, computationally efficient solution for the aerodynamic analysis of unsteady phenomena.

Title: Feature Selection for Network Intrusion Detection

Authors: Charles Westphal, Stephen Hailes, Mirco Musolesi
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2411.11603
Pdf URL: https://arxiv.org/pdf/2411.11603
Copy Paste: [[2411.11603]] Feature Selection for Network Intrusion Detection(https://arxiv.org/abs/2411.11603)
Keywords: security, attack
Abstract: Network Intrusion Detection (NID) remains a key area of research within the information security community, while also being relevant to Machine Learning (ML) practitioners. The latter generally aim to detect attacks using network features, which have been extracted from raw network data typically using dimensionality reduction methods, such as principal component analysis (PCA). However, PCA is not able to assess the relevance of features for the task at hand. Consequently, the features available are of varying quality, with some being entirely non-informative. From this, two major drawbacks arise. Firstly, trained and deployed models have to process large amounts of unnecessary data, thereby draining potentially costly resources. Secondly, the noise caused by the presence of irrelevant features can, in some cases, impede a model's ability to detect an attack. In order to deal with these challenges, we present Feature Selection for Network Intrusion Detection (FSNID), a novel information-theoretic method that facilitates the exclusion of non-informative features when detecting network intrusions. The proposed method is based on function approximation using a neural network, which enables a version of our approach that incorporates a recurrent layer. Consequently, this version uniquely enables the integration of temporal dependencies. Through an extensive set of experiments, we demonstrate that the proposed method selects a significantly reduced feature set, while maintaining NID performance. Code will be made available upon publication.
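Illustrative note (not from the paper): the abstract contrasts PCA, which ignores the label, with an information-theoretic criterion that scores features by their relevance to the task. The sketch below shows the general idea using a simple plug-in mutual information estimate on discrete toy data. This is a swapped-in, simplified technique and not FSNID itself, which the abstract says is based on neural function approximation; the feature names, data, and the 0.1-bit threshold are all hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy data: 1 marks an attack, 0 marks benign traffic (hypothetical).
labels = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "informative": [0, 0, 1, 1, 0, 0, 1, 1],  # tracks the label exactly
    "noise":       [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
}

# Score every feature against the label and keep those above a threshold.
scores = {name: mutual_information(values, labels) for name, values in features.items()}
selected = [name for name, score in scores.items() if score > 0.1]
```

Note that both toy features have the same variance, so a purely variance-based criterion could not tell them apart, while the label-aware score separates them.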

Title: ST-Tree with Interpretability for Multivariate Time Series Classification

Authors: Mingsen Du, Yanxuan Wei, Yingxia Tang, Xiangwei Zheng, Shoushui Wei, Cun Ji
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11620
Pdf URL: https://arxiv.org/pdf/2411.11620
Copy Paste: [[2411.11620]] ST-Tree with Interpretability for Multivariate Time Series Classification(https://arxiv.org/abs/2411.11620)
Keywords: interpretability, transformer
Abstract: Multivariate time series classification is of great importance in practical applications and is a challenging task. However, deep neural network models such as Transformers exhibit high accuracy in multivariate time series classification but lack interpretability and fail to provide insights into the decision-making process. On the other hand, traditional approaches based on decision tree classifiers offer clear decision processes but relatively lower accuracy. Swin Transformer (ST) addresses these issues by leveraging self-attention mechanisms to capture both fine-grained local patterns and global patterns. It can also model multi-scale feature representation learning, thereby providing a more comprehensive representation of time series features. To tackle the aforementioned challenges, we propose ST-Tree with interpretability for multivariate time series classification. Specifically, the ST-Tree model combines ST as the backbone network with an additional neural tree model. This integration allows us to fully leverage the advantages of ST in learning time series context while providing interpretable decision processes through the neural tree. This enables researchers to gain clear insights into the model's decision-making process and extract meaningful interpretations. Through experimental evaluations on 10 UEA datasets, we demonstrate that the ST-Tree model improves accuracy in multivariate time series classification tasks and provides interpretability through visualizing the decision-making process across different datasets.

Title: Federated Incremental Named Entity Recognition

Authors: Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11623
Pdf URL: https://arxiv.org/pdf/2411.11623
Copy Paste: [[2411.11623]] Federated Incremental Named Entity Recognition(https://arxiv.org/abs/2411.11623)
Keywords: privacy, protect, defense, federate
Abstract: Federated Named Entity Recognition (FNER) boosts model training within each local client by aggregating the model updates of decentralized local clients, without sharing their private data. However, existing FNER methods assume fixed entity types and local clients in advance, leading to their ineffectiveness in practical applications. In a more realistic scenario, local clients receive new entity types continuously, while new local clients collecting novel data may irregularly join the global FNER training. This challenging setup, referred to here as Federated Incremental NER, causes the global model to suffer from heterogeneous forgetting of old entity types from both intra-client and inter-client perspectives. To overcome these challenges, we propose a Local-Global Forgetting Defense (LGFD) model. Specifically, to address intra-client forgetting, we develop a structural knowledge distillation loss to retain the latent space's feature structure and a pseudo-label-guided inter-type contrastive loss to enhance discriminative capability over different entity types, effectively preserving previously learned knowledge within local clients. To tackle inter-client forgetting, we propose a task switching monitor that can automatically identify new entity types under privacy protection and store the latest old global model for knowledge distillation and pseudo-labeling. Experiments demonstrate significant improvement of our LGFD model over comparison methods.

Title: Teapot: Efficiently Uncovering Spectre Gadgets in COTS Binaries

Authors: Fangzheng Lin, Zhongfa Wang, Hiroshi Sasaki
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2411.11624
Pdf URL: https://arxiv.org/pdf/2411.11624
Copy Paste: [[2411.11624]] Teapot: Efficiently Uncovering Spectre Gadgets in COTS Binaries(https://arxiv.org/abs/2411.11624)
Keywords: attack
Abstract: Speculative execution is crucial in enhancing modern processor performance but can introduce Spectre-type vulnerabilities that may leak sensitive information. Detecting Spectre gadgets from programs has been a research focus to enhance the analysis and understanding of Spectre attacks. However, one of the problems of existing approaches is that they rely on the presence of source code (or are impractical in terms of run-time performance and gadget detection ability). This paper presents Teapot, the first Spectre gadget scanner that works on COTS binaries with comparable performance to compiler-based alternatives. As its core principle, we introduce Speculation Shadows, a novel approach that separates the binary code for normal execution and speculation simulation in order to improve run-time efficiency. Teapot is based on static binary rewriting. It instruments the program to simulate the effects of speculative execution and also adds integrity checks to detect Spectre gadgets at run time. By leveraging fuzzing, Teapot succeeds in efficiently detecting Spectre gadgets. Evaluations show that Teapot outperforms a previously proposed binary-based approach in both performance (more than 20x as performant) and gadget detection ability.

Title: Chapter 7 Review of Data-Driven Generative AI Models for Knowledge Extraction from Scientific Literature in Healthcare

Authors: Leon Kopitar, Primoz Kocbek, Lucija Gosak, Gregor Stiglic
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11635
Pdf URL: https://arxiv.org/pdf/2411.11635
Copy Paste: [[2411.11635]] Chapter 7 Review of Data-Driven Generative AI Models for Knowledge Extraction from Scientific Literature in Healthcare(https://arxiv.org/abs/2411.11635)
Keywords: extraction, transformer, generative
Abstract: This review examines the development of abstractive NLP-based text summarization approaches and compares them to existing techniques for extractive summarization. A brief history of text summarization from the 1950s to the introduction of pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT) is presented. In total, 60 studies were identified in PubMed and Web of Science, of which 29 were excluded and 24 were read and evaluated for eligibility, resulting in the use of seven studies for further analysis. This chapter also includes a section with examples, including a comparison between GPT-3 and state-of-the-art GPT-4 solutions in scientific text summarisation. Natural language processing has not yet reached its full potential in the generation of brief textual summaries. As there are acknowledged concerns that must be addressed, we can expect a gradual introduction of such models in practice.

Title: SP$^3$: Superpixel-propagated pseudo-label learning for weakly semi-supervised medical image segmentation

Authors: Shiman Li, Jiayue Zhao, Shaolei Liu, Xiaokun Dai, Chenxi Zhang, Zhijian Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11636
Pdf URL: https://arxiv.org/pdf/2411.11636
Copy Paste: [[2411.11636]] SP$^3$: Superpixel-propagated pseudo-label learning for weakly semi-supervised medical image segmentation(https://arxiv.org/abs/2411.11636)
Keywords: segmentation
Abstract: Deep learning-based medical image segmentation helps assist diagnosis and accelerate the treatment process, while the model training usually requires large-scale dense annotation datasets. Weakly semi-supervised medical image segmentation is an essential application because it only requires a small amount of scribbles and a large amount of unlabeled data to train the model, which greatly reduces the clinician's effort to fully annotate images. To handle the inadequate supervisory information challenge in weakly semi-supervised segmentation (WSSS), a SuperPixel-Propagated Pseudo-label (SP$^3$) learning method is proposed, using the structural information contained in superpixels as supplemental information. Specifically, the annotation of scribbles is propagated to superpixels, thus obtaining a dense annotation for supervised training. Since the quality of pseudo-labels is limited by the low-quality annotation, the beneficial superpixels selected by dynamic thresholding are used to refine pseudo-labels. Furthermore, aiming to alleviate the negative impact of noise in pseudo-labels, superpixel-level uncertainty is incorporated to guide the pseudo-label supervision for stable learning. Our method achieves state-of-the-art performance on both tumor and organ segmentation datasets under the WSSS setting, using only 3% of the annotation workload compared to fully supervised methods and attaining approximately 80% Dice score. Additionally, our method outperforms eight weakly and semi-supervised methods under both weakly supervised and semi-supervised settings. Results of extensive experiments validate the effectiveness and annotation efficiency of our weakly semi-supervised segmentation, which can assist clinicians in achieving automated segmentation for organs or tumors quickly and ultimately benefit patients.

Title: TSINR: Capturing Temporal Continuity via Implicit Neural Representations for Time Series Anomaly Detection

Authors: Mengxuan Li, Ke Liu, Hongyang Chen, Jiajun Bu, Hongwei Wang, Haishuai Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11641
Pdf URL: https://arxiv.org/pdf/2411.11641
Copy Paste: [[2411.11641]] TSINR: Capturing Temporal Continuity via Implicit Neural Representations for Time Series Anomaly Detection(https://arxiv.org/abs/2411.11641)
Keywords: transformer, large language model
Abstract: Time series anomaly detection aims to identify unusual patterns in data or deviations from systems' expected behavior. Reconstruction-based methods, which learn point-wise representations via unsupervised learning, are the mainstream for this task. However, the unlabeled anomaly points in training data may cause these reconstruction-based methods to learn and reconstruct anomalous data, resulting in the challenge of capturing normal patterns. In this paper, we propose a time series anomaly detection method based on implicit neural representation (INR) reconstruction, named TSINR, to address this challenge. Due to the property of spectral bias, TSINR prioritizes low-frequency signals and performs worse on high-frequency abnormal data. Specifically, we adopt INR to parameterize time series data as a continuous function and employ a transformer-based architecture to predict the INR of given data. As a result, the proposed TSINR method achieves the advantage of capturing the temporal continuity and thus is more sensitive to discontinuous anomaly data. In addition, we further design a novel form of INR continuous function to learn inter- and intra-channel information, and leverage a pre-trained large language model to amplify the intense fluctuations in anomalies. Extensive experiments demonstrate that TSINR achieves superior overall performance on both univariate and multivariate time series anomaly detection benchmarks compared to other state-of-the-art reconstruction-based methods. Our codes are available.

Title: Can Highlighting Help GitHub Maintainers Track Security Fixes?

Authors: Xueqing Liu, Yuchen Xiong, Qiushi Liu, Jiangrui Zheng
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2411.11646
Pdf URL: https://arxiv.org/pdf/2411.11646
Copy Paste: [[2411.11646]] Can Highlighting Help GitHub Maintainers Track Security Fixes?(https://arxiv.org/abs/2411.11646)
Keywords: security
Abstract: In recent years, the rapid growth of security vulnerabilities poses great challenges to tracing and managing them. For example, it was reported that the NVD database experienced significant delays due to the shortage of maintainers. Such delay creates challenges for third-party security personnel (e.g., administrators) to trace the information related to the CVE. To help security personnel trace a vulnerability patch, we build a retrieval system that automatically retrieves the patch in the repository. Inspired by existing work on explainable machine learning, we ask the following research question: can explanations help security maintainers make decisions in patch tracing? First, we investigate using LIME (a widely used explainable machine learning method) to highlight the rationale tokens in the commit message and code. In addition, we propose an explanation method called TfIdf-Highlight, which leverages the Tf-Idf statistics to select the most informative words in the repository and the dataset. We evaluate the effectiveness of highlighting using two experiments. First, we compare LIME and TfIdf-Highlight using a faithfulness score (i.e., sufficiency and comprehensiveness) defined for ranking. We find that TfIdf-Highlight significantly outperforms LIME's sufficiency scores by 15% and slightly outperforms the comprehensiveness scores. Second, we conduct a blind human labeling experiment by asking the annotators to guess the patch under 3 settings (TfIdf-Highlight, LIME, and no highlight). We find that the helpfulness score for TfIdf-Highlight is higher than LIME while the labeling accuracies of LIME and TfIdf-Highlight are similar. Nevertheless, highlighting does not improve the accuracy over non-highlighting.
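
Illustrative note: the abstract does not spell out how TfIdf-Highlight scores words, so the following is only a minimal sketch of the general idea of picking the highest TF-IDF tokens of one commit message relative to a corpus of messages. The function name `tfidf_highlight`, the smoothed IDF formula, and the toy corpus are assumptions for illustration, not the authors' method.

```python
import math
from collections import Counter

def tfidf_highlight(doc_tokens, corpus, k=3):
    """Return the k tokens of doc_tokens with the highest TF-IDF score.

    Hypothetical helper: the scoring formula is a common smoothed TF-IDF,
    not necessarily the one used in the paper.
    """
    n_docs = len(corpus)
    # Document frequency: number of corpus documents containing each token.
    df = Counter(tok for doc in corpus for tok in set(doc))
    tf = Counter(doc_tokens)
    scores = {
        tok: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df[tok]))
        for tok, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy commit messages (made up for this example).
corpus = [
    ["fix", "typo", "in", "readme"],
    ["fix", "buffer", "overflow", "in", "parser"],
    ["update", "readme", "badge"],
]
highlighted = tfidf_highlight(corpus[1], corpus, k=3)
```

Words that appear in many messages (such as "fix" or "in") receive a low inverse document frequency, so the rarer, more specific tokens are the ones selected for highlighting.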

Title: No-regret Exploration in Shuffle Private Reinforcement Learning

Authors: Shaojie Bai, Mohammad Sadegh Talebi, Chengcheng Zhao, Peng Cheng, Jiming Chen
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2411.11647
Pdf URL: https://arxiv.org/pdf/2411.11647
Copy Paste: [[2411.11647]] No-regret Exploration in Shuffle Private Reinforcement Learning(https://arxiv.org/abs/2411.11647)
Keywords: privacy, protect
Abstract: Differential privacy (DP) has recently been introduced into episodic reinforcement learning (RL) to formally address user privacy concerns in personalized services. Previous work mainly focuses on two trust models of DP: the central model, where a central agent is responsible for protecting users' sensitive data, and the (stronger) local model, where the protection occurs directly on the user side. However, they either require a trusted central agent or incur a significantly higher privacy cost, making them unsuitable for many scenarios. This work introduces a trust model stronger than the central model but with a lower privacy cost than the local model, leveraging the emerging shuffle model of privacy. We present the first generic algorithm for episodic RL under the shuffle model, where a trusted shuffler randomly permutes a batch of users' data before sending it to the central agent. We then instantiate the algorithm using our proposed shuffle Privatizer, relying on a shuffle private binary summation mechanism. Our analysis shows that the algorithm achieves a near-optimal regret bound comparable to that of the centralized model and significantly outperforms the local model in terms of privacy cost.
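
Illustrative note: the abstract names a shuffle private binary summation mechanism without giving its details. The sketch below shows only the textbook pattern such mechanisms build on: each user randomizes a bit locally, a shuffler permutes the reports, and the analyzer debiases the sum. The names `randomize_bit` and `shuffle_sum` and the choice of flip probability are assumptions; no privacy parameter is calibrated here, and this is not the paper's Privatizer.

```python
import random

def randomize_bit(bit, flip_prob, rng):
    """Randomized response: flip the bit with probability flip_prob."""
    return 1 - bit if rng.random() < flip_prob else bit

def shuffle_sum(bits, flip_prob, rng):
    """Estimate sum(bits) from locally randomized, shuffled reports."""
    reports = [randomize_bit(b, flip_prob, rng) for b in bits]
    # The shuffler permutes reports so they cannot be linked to users.
    rng.shuffle(reports)
    n = len(reports)
    # E[sum(reports)] = flip_prob * n + (1 - 2 * flip_prob) * sum(bits),
    # so inverting this relation gives an unbiased estimate of sum(bits).
    return (sum(reports) - flip_prob * n) / (1 - 2 * flip_prob)

rng = random.Random(0)
true_bits = [1] * 6000 + [0] * 14000
estimate = shuffle_sum(true_bits, flip_prob=0.25, rng=rng)
```

The estimate is noisy but unbiased; its error shrinks relative to the true sum as the batch of users grows, which is the intuition behind batching users before each shuffle.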

Title: Dissecting Misalignment of Multimodal Large Language Models via Influence Function

Authors: Lijie Hu, Chenyang Ren, Huanyi Xie, Khouloud Saadi, Shu Yang, Jingfeng Zhang, Di Wang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.11667
Pdf URL: https://arxiv.org/pdf/2411.11667
Copy Paste: [[2411.11667]] Dissecting Misalignment of Multimodal Large Language Models via Influence Function(https://arxiv.org/abs/2411.11667)
Keywords: robust, interpretability, large language model
Abstract: Multi-modal Large Language models (MLLMs) are always trained on data from diverse and unreliable sources, which may contain misaligned or mislabeled text-image pairs. This frequently causes robustness issues and hallucinations, leading to performance degradation. Data valuation is an efficient way to detect and trace these misalignments. Nevertheless, existing methods are computationally expensive for MLLMs. While computationally efficient, the classical influence functions are inadequate for contrastive learning models because they were originally designed for pointwise loss. Additionally, contrastive learning involves minimizing the distance between the modalities of positive samples and maximizing the distance between the modalities of negative samples. This requires us to evaluate the influence of samples from both perspectives. To tackle these challenges, we introduce the Extended Influence Function for Contrastive Loss (ECIF), an influence function crafted for contrastive loss. ECIF considers both positive and negative samples and provides a closed-form approximation of contrastive learning models, eliminating the need for retraining. Building upon ECIF, we develop a series of algorithms for data evaluation in MLLM, misalignment detection, and misprediction trace-back tasks. Experimental results demonstrate our ECIF advances the transparency and interpretability of MLLMs by offering a more accurate assessment of data impact and model alignment compared to traditional baseline methods.

Title: Efficient and Robust Continual Graph Learning for Graph Classification in Biology

Authors: Ding Zhang, Jane Downer, Can Chen, Ren Wang
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.11668
Pdf URL: https://arxiv.org/pdf/2411.11668
Copy Paste: [[2411.11668]] Efficient and Robust Continual Graph Learning for Graph Classification in Biology(https://arxiv.org/abs/2411.11668)
Keywords: attack, robust
Abstract: Graph classification is essential for understanding complex biological systems, where molecular structures and interactions are naturally represented as graphs. Traditional graph neural networks (GNNs) perform well on static tasks but struggle in dynamic settings due to catastrophic forgetting. We present Perturbed and Sparsified Continual Graph Learning (PSCGL), a robust and efficient continual graph learning framework for graph data classification, specifically targeting biological datasets. We introduce a perturbed sampling strategy to identify critical data points that contribute to model learning and a motif-based graph sparsification technique to reduce storage needs while maintaining performance. Additionally, our PSCGL framework inherently defends against graph backdoor attacks, which is crucial for applications in sensitive biological contexts. Extensive experiments on biological datasets demonstrate that PSCGL not only retains knowledge across tasks but also enhances the efficiency and robustness of graph classification models in biology.

Title: Few-shot Model Extraction Attacks against Sequential Recommender Systems

Authors: Hui Zhang, Fu Liu
Subjects: cs.LG, cs.CR, cs.IR
Abstract URL: https://arxiv.org/abs/2411.11677
Pdf URL: https://arxiv.org/pdf/2411.11677
Copy Paste: [[2411.11677]] Few-shot Model Extraction Attacks against Sequential Recommender Systems(https://arxiv.org/abs/2411.11677)
Keywords: attack, extraction, data-free
Abstract: Among adversarial attacks against sequential recommender systems, model extraction attacks represent a method to attack sequential recommendation models without prior knowledge. Existing research has primarily concentrated on the adversary's execution of black-box attacks through data-free model extraction. However, a significant gap remains in the literature concerning the development of surrogate models by adversaries with access to few-shot raw data (10% or even less). That is, the challenge of how to construct a surrogate model with high functional similarity within the context of few-shot data scenarios remains an open issue. This study addresses this gap by introducing a novel few-shot model extraction framework against sequential recommenders, which is designed to construct a superior surrogate model with the utilization of few-shot data. The proposed few-shot model extraction framework is comprised of two components: an autoregressive augmentation generation strategy and a bidirectional repair loss-facilitated model distillation procedure. Specifically, to generate synthetic data that closely approximate the distribution of raw data, the autoregressive augmentation generation strategy integrates a probabilistic interaction sampler to extract inherent dependencies and a synthesis determinant signal module to characterize user behavioral patterns. Subsequently, the bidirectional repair loss, which targets the discrepancies between the recommendation lists, is designed as an auxiliary loss to rectify erroneous predictions from surrogate models, transferring knowledge from the victim model to the surrogate model effectively. Experiments on three datasets show that the proposed few-shot model extraction framework yields superior surrogate models.

Title: Conceptwm: A Diffusion Model Watermark for Concept Protection

Authors: Liangqi Lei, Keke Gai, Jing Yu, Liehuang Zhu, Qi Wu
Subjects: cs.CR, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2411.11688
Pdf URL: https://arxiv.org/pdf/2411.11688
Copy Paste: [[2411.11688]] Conceptwm: A Diffusion Model Watermark for Concept Protection(https://arxiv.org/abs/2411.11688)
Keywords: protect, watermark, diffusion
Abstract: The personalization techniques of diffusion models succeed in generating specific concepts but also pose threats to copyright protection and illegal use. Model Watermarking is an effective method to prevent the unauthorized use of subject-driven or style-driven image generation, safeguarding concept copyrights. However, under the goal of concept-oriented protection, current watermarking schemes typically add watermarks to all images rather than applying them in a refined manner targeted at specific concepts. Additionally, the personalization techniques of diffusion models can easily remove watermarks. Existing watermarking methods struggle to achieve fine-grained watermark embedding with a few images of a specific concept and to prevent removal of watermarks through personalized fine-tuning. Therefore, we introduce a novel concept-oriented watermarking framework that seamlessly embeds imperceptible watermarks into the concept of diffusion models. We conduct extensive experiments and ablation studies to verify our framework. Our code is available at this https URL.

Title: Towards Degradation-Robust Reconstruction in Generalizable NeRF

Authors: Chan Ho Park, Ka Leong Cheng, Zhicheng Wang, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11691
Pdf URL: https://arxiv.org/pdf/2411.11691
Copy Paste: [[2411.11691]] Towards Degradation-Robust Reconstruction in Generalizable NeRF(https://arxiv.org/abs/2411.11691)
Keywords: robust
Abstract: Generalizable Neural Radiance Field (GNeRF) across scenes has been proven to be an effective way to avoid per-scene optimization by representing a scene with deep image features of source images. However, despite its potential for real-world applications, there has been limited research on the robustness of GNeRFs to different types of degradation present in the source images. The lack of such research is primarily attributed to the absence of a large-scale dataset fit for training a degradation-robust generalizable NeRF model. To address this gap and facilitate investigations into the degradation robustness of 3D reconstruction tasks, we construct the Objaverse Blur Dataset, comprising 50,000 images from over 1000 settings featuring multiple levels of blur degradation. In addition, we design a simple and model-agnostic module for enhancing the degradation robustness of GNeRFs. Specifically, by extracting 3D-aware features through a lightweight depth estimator and denoiser, the proposed module shows improvement on different popular methods in GNeRFs in terms of both quantitative and visual quality over varying degradation types and levels. Our dataset and code will be made publicly available.

Title: Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search

Authors: Jinhao Jiang, Zhipeng Chen, Yingqian Min, Jie Chen, Xiaoxue Cheng, Jiapeng Wang, Yiru Tang, Haoxiang Sun, Jia Deng, Wayne Xin Zhao, Zheng Liu, Dong Yan, Jian Xie, Zhongyuan Wang, Ji-Rong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11694
Pdf URL: https://arxiv.org/pdf/2411.11694
Copy Paste: [[2411.11694]] Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search(https://arxiv.org/abs/2411.11694)
Keywords: large language model
Abstract: Recently, test-time scaling has garnered significant attention from the research community, largely due to the substantial advancements of the o1 model released by OpenAI. By allocating more computational resources during the inference phase, large language models (LLMs) can extensively explore the solution space by generating more thought tokens or diverse solutions, thereby producing more accurate responses. However, developing an o1-like reasoning approach is challenging, and researchers have been making various attempts to advance this open area of research. In this paper, we present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms. This framework is implemented by integrating the policy model, reward model, and search algorithm. It is primarily constructed around a tree search algorithm, where the policy model navigates a dynamically expanding tree guided by a specially trained reward model. We thoroughly explore various design considerations necessary for implementing this framework and provide a detailed report of the technical aspects. To assess the effectiveness of our approach, we focus on mathematical reasoning tasks and conduct extensive evaluations on four challenging datasets, significantly enhancing the reasoning abilities of LLMs.
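
Illustrative note: the abstract describes a policy model expanding a tree under the guidance of a reward model but does not specify the search procedure. The sketch below uses plain best-first search as a stand-in. The callables `expand`, `reward`, and `is_terminal` are hypothetical placeholders for the policy model, the reward model, and an answer check; the toy arithmetic task is made up for this example.

```python
import heapq

def reward_guided_search(root, expand, reward, is_terminal, max_expansions=100):
    """Best-first search: always expand the frontier node with the highest reward."""
    # heapq is a min-heap, so rewards are negated; the counter breaks ties
    # in insertion order and keeps nodes themselves from being compared.
    counter = 0
    frontier = [(-reward(root), counter, root)]
    while frontier and max_expansions > 0:
        _, _, node = heapq.heappop(frontier)
        if is_terminal(node):
            return node
        max_expansions -= 1
        for child in expand(node):
            counter += 1
            heapq.heappush(frontier, (-reward(child), counter, child))
    return None  # budget exhausted or no candidates left

# Toy task: reach 10 from 1 using the moves "+1" and "*2".
TARGET = 10
result = reward_guided_search(
    root=1,
    expand=lambda v: [v + 1, v * 2] if v < 2 * TARGET else [],
    reward=lambda v: -abs(TARGET - v),
    is_terminal=lambda v: v == TARGET,
)
```

In an LLM setting a node would be a partial reasoning trace, `expand` would sample candidate next steps from the policy model, and `reward` would score each partial trace; the expansion budget plays the role of the test-time compute limit.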

Title: Robust Reinforcement Learning under Diffusion Models for Data with Jumps

Authors: Chenyang Jiang, Donggyu Kim, Alejandra Quintos, Yazhen Wang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.11697
Pdf URL: https://arxiv.org/pdf/2411.11697
Copy Paste: [[2411.11697]] Robust Reinforcement Learning under Diffusion Models for Data with Jumps(https://arxiv.org/abs/2411.11697)
Keywords: robust, diffusion
Abstract: Reinforcement Learning (RL) has proven effective in solving complex decision-making tasks across various domains, but challenges remain in continuous-time settings, particularly when state dynamics are governed by stochastic differential equations (SDEs) with jump components. In this paper, we address this challenge by introducing the Mean-Square Bipower Variation Error (MSBVE) algorithm, which enhances robustness and convergence in scenarios involving significant stochastic noise and jumps. We first revisit the Mean-Square TD Error (MSTDE) algorithm, commonly used in continuous-time RL, and highlight its limitations in handling jumps in state dynamics. The proposed MSBVE algorithm minimizes the mean-square quadratic variation error, offering improved performance over MSTDE in environments characterized by SDEs with jumps. Simulations and formal proofs demonstrate that the MSBVE algorithm reliably estimates the value function in complex settings, surpassing MSTDE's performance when faced with jump processes. These findings underscore the importance of alternative error metrics to improve the resilience and effectiveness of RL algorithms in continuous-time frameworks.

Title: Bitcoin Under Volatile Block Rewards: How Mempool Statistics Can Influence Bitcoin Mining

Authors: Roozbeh Sarenche, Alireza Aghabagherloo, Svetla Nikova, Bart Preneel
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2411.11702
Pdf URL: https://arxiv.org/pdf/2411.11702
Copy Paste: [[2411.11702]] Bitcoin Under Volatile Block Rewards: How Mempool Statistics Can Influence Bitcoin Mining(https://arxiv.org/abs/2411.11702)
Keywords: security, attack
Abstract: As Bitcoin experiences more halving events, the protocol reward converges to zero, making transaction fees the primary source of miner rewards. This shift in Bitcoin's incentivization mechanism, which introduces volatility into block rewards, could lead to the emergence of new security threats or intensify existing ones. Previous security analyses of Bitcoin have either considered a fixed block reward model or a highly simplified volatile model, overlooking the complexities of Bitcoin's mempool behavior. In this paper, we present a reinforcement learning-based tool designed to analyze mining strategies under a more realistic volatile model. Our tool uses the Asynchronous Advantage Actor-Critic (A3C) algorithm to derive near-optimal mining strategies while interacting with an environment that models the complexity of the Bitcoin mempool. This tool enables the analysis of adversarial mining strategies, such as selfish mining and undercutting, both before and after difficulty adjustments, providing insights into the effects of mining attacks in both the short and long term. Our analysis reveals that Bitcoin users' trend of offering higher fees to speed up the inclusion of their transactions in the chain can incentivize payoff-maximizing miners to deviate from the honest strategy. In the fixed reward model, a disincentive for the selfish mining attack is the initial loss period of at least two weeks, during which the attack is not profitable. However, our analysis shows that once the protocol reward diminishes to zero in the future, or even currently on days when transaction fees are comparable to the protocol reward, mining pools might be incentivized to abandon honest mining to gain an immediate profit.

Title: FedCoLLM: A Parameter-Efficient Federated Co-tuning Framework for Large and Small Language Models

Authors: Tao Fan, Yan Kang, Guoqiang Ma, Lixin Fan, Kai Chen, Qiang Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11707
Pdf URL: https://arxiv.org/pdf/2411.11707
Copy Paste: [[2411.11707]] FedCoLLM: A Parameter-Efficient Federated Co-tuning Framework for Large and Small Language Models(https://arxiv.org/abs/2411.11707)
Keywords: privacy, federate, large language model
Abstract: By adapting Large Language Models (LLMs) to domain-specific tasks or enriching them with domain-specific knowledge, we can fully harness the capabilities of LLMs. Nonetheless, a gap persists in achieving simultaneous mutual enhancement between the server's LLM and the downstream clients' Small Language Models (SLMs). To address this, we propose FedCoLLM, a novel and parameter-efficient federated framework designed for co-tuning LLMs and SLMs. This approach is aimed at adaptively transferring the server-side LLMs' knowledge to clients' SLMs while simultaneously enriching the LLMs with domain insights from the clients. To accomplish this, FedCoLLM utilizes lightweight adapters in conjunction with SLMs, facilitating knowledge exchange between server and clients in a manner that respects data privacy while also minimizing computational and communication overhead. Our evaluation of FedCoLLM, utilizing various public LLMs and SLMs across a range of NLP text generation tasks, reveals that the performance of clients' SLMs experiences notable improvements with the assistance of the LLMs. Simultaneously, the LLMs enhanced via FedCoLLM achieve comparable performance to that obtained through direct fine-tuning on clients' data.

Title: FLMarket: Enabling Privacy-preserved Pre-training Data Pricing for Federated Learning

Authors: Zhenyu Wen, Wanglei Feng, Di Wu, Haozhen Hu, Chang Xu, Bin Qian, Zhen Hong, Cong Wang, Shouling Ji
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2411.11713
Pdf URL: https://arxiv.org/pdf/2411.11713
Copy Paste: [[2411.11713]] FLMarket: Enabling Privacy-preserved Pre-training Data Pricing for Federated Learning(https://arxiv.org/abs/2411.11713)
Keywords: security, privacy, federate
Abstract: Federated Learning (FL), as a mainstream privacy-preserving machine learning paradigm, offers promising solutions for privacy-critical domains such as healthcare and finance. Although extensive efforts have been dedicated from both academia and industry to improve the vanilla FL, little work focuses on the data pricing mechanism. In contrast to the straightforward in/post-training pricing techniques, we study a more difficult problem of pre-training pricing without direct information from the learning process. We propose FLMarket that integrates a two-stage, auction-based pricing mechanism with a security protocol to address the utility-privacy conflict. Through comprehensive experiments, we show that the client selection according to FLMarket can achieve more than 10% higher accuracy in subsequent FL training compared to state-of-the-art methods. In addition, it outperforms the in-training baseline with more than 2% accuracy increase and 3x run-time speedup.

Title: RAWMamba: Unified sRGB-to-RAW De-rendering With State Space Model

Authors: Hongjun Chen, Wencheng Han, Huan Zheng, Jianbing Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11717
Pdf URL: https://arxiv.org/pdf/2411.11717
Copy Paste: [[2411.11717]] RAWMamba: Unified sRGB-to-RAW De-rendering With State Space Model(https://arxiv.org/abs/2411.11717)
Keywords: extraction
Abstract: Recent advancements in sRGB-to-RAW de-rendering have increasingly emphasized metadata-driven approaches to reconstruct RAW data from sRGB images, supplemented by partial RAW information. In image-based de-rendering, metadata is commonly obtained through sampling, whereas in video tasks, it is typically derived from the initial frame. The distinct metadata requirements necessitate specialized network architectures, leading to architectural incompatibilities that increase deployment complexity. In this paper, we propose RAWMamba, a Mamba-based unified framework developed for sRGB-to-RAW de-rendering across both image and video domains. The core of RAWMamba is the Unified Metadata Embedding (UME) module, which harmonizes diverse metadata types into a unified representation. In detail, a multi-perspective affinity modeling method is proposed to promote the extraction of reference information. In addition, we introduce the Local Tone-Aware Mamba (LTA-Mamba) module, which captures long-range dependencies to enable effective global propagation of metadata. Experimental results demonstrate that the proposed RAWMamba achieves state-of-the-art performance, yielding high-quality RAW data reconstruction.

Title: Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

Authors: Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Bo Du, Dacheng Tao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.11727
Pdf URL: https://arxiv.org/pdf/2411.11727
Copy Paste: [[2411.11727]] Aligning Few-Step Diffusion Models with Dense Reward Difference Learning(https://arxiv.org/abs/2411.11727)
Keywords: robust, diffusion
Abstract: Aligning diffusion models with downstream objectives is essential for their practical applications. However, standard alignment methods often struggle with step generalization when directly applied to few-step diffusion models, leading to inconsistent performance across different denoising step scenarios. To address this, we introduce Stepwise Diffusion Policy Optimization (SDPO), a novel alignment method tailored for few-step diffusion models. Unlike prior approaches that rely on a single sparse reward from only the final step of each denoising trajectory for trajectory-level optimization, SDPO incorporates dense reward feedback at every intermediate step. By learning the differences in dense rewards between paired samples, SDPO facilitates stepwise optimization of few-step diffusion models, ensuring consistent alignment across all denoising steps. To promote stable and efficient training, SDPO introduces an online reinforcement learning framework featuring several novel strategies designed to effectively exploit the stepwise granularity of dense rewards. Experimental results demonstrate that SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations, underscoring its robust step generalization capabilities. Code is available at this https URL.

Title: Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment

Authors: Allison Huang, Yulu Niki Pi, Carlos Mougan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11731
Pdf URL: https://arxiv.org/pdf/2411.11731
Copy Paste: [[2411.11731]] Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment(https://arxiv.org/abs/2411.11731)
Keywords: large language model
Abstract: We explore how large language models (LLMs) can be influenced by prompting them to alter their initial decisions and align them with established ethical frameworks. Our study is based on two experiments designed to assess the susceptibility of LLMs to moral persuasion. In the first experiment, we examine the susceptibility to moral ambiguity by evaluating a Base Agent LLM on morally ambiguous scenarios and observing how a Persuader Agent attempts to modify the Base Agent's initial decisions. The second experiment evaluates the susceptibility of LLMs to align with predefined ethical frameworks by prompting them to adopt specific value alignments rooted in established philosophical theories. The results demonstrate that LLMs can indeed be persuaded in morally charged scenarios, with the success of persuasion depending on factors such as the model used, the complexity of the scenario, and the conversation length. Notably, LLMs of distinct sizes but from the same company produced markedly different outcomes, highlighting the variability in their susceptibility to ethical persuasion.

Title: Advacheck at GenAI Detection Task 1: AI Detection Powered by Domain-Aware Multi-Tasking

Authors: German Gritsai, Anastasia Voznyuk, Ildar Khabutdinov, Andrey Grabovoy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.11736
Pdf URL: https://arxiv.org/pdf/2411.11736
Copy Paste: [[2411.11736]] Advacheck at GenAI Detection Task 1: AI Detection Powered by Domain-Aware Multi-Tasking(https://arxiv.org/abs/2411.11736)
Keywords: transformer
Abstract: The paper describes a system designed by the Advacheck team to recognise machine-generated and human-written texts in the monolingual subtask of the GenAI Detection Task 1 competition. Our system is a multi-task architecture with a Transformer encoder shared between several classification heads. One head is responsible for binary classification between human-written and machine-generated texts, while the other heads are auxiliary multiclass classifiers for texts of different domains from particular datasets. As the multiclass heads were trained to distinguish the domains presented in the data, they provide a better understanding of the samples. This approach led us to achieve first place in the official ranking with 83.07% macro F1-score on the test set and surpass the baseline by 10%. We further study the obtained system through ablation, error, and representation analyses, finding that multi-task learning outperforms the single-task mode and that simultaneous tasks form a cluster structure in the embedding space.

Title: BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Authors: Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2411.11745
Pdf URL: https://arxiv.org/pdf/2411.11745
Copy Paste: [[2411.11745]] BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration(https://arxiv.org/abs/2411.11745)
Keywords: generative, large language model
Abstract: Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks. Yet the substantial memory footprint of LLMs significantly hinders their deployment. In this paper, we improve the accessibility of LLMs through BitMoD, an algorithm-hardware co-design solution that enables efficient LLM acceleration at low weight precision. On the algorithm side, BitMoD introduces fine-grained data type adaptation that uses a different numerical data type to quantize a group of (e.g., 128) weights. Through the careful design of these new data types, BitMoD is able to quantize LLM weights to very low precision (e.g., 4 bits and 3 bits) while maintaining high accuracy. On the hardware side, BitMoD employs a bit-serial processing element to easily support multiple numerical precisions and data types; our hardware design includes two key innovations: First, it employs a unified representation to process different weight data types, thus reducing the hardware cost. Second, it adopts a bit-serial dequantization unit to rescale the per-group partial sum with minimal hardware overhead. Our evaluation on six representative LLMs demonstrates that BitMoD significantly outperforms state-of-the-art LLM quantization and acceleration methods. For discriminative tasks, BitMoD can quantize LLM weights to 4-bit with $<\!0.5\%$ accuracy loss on average. For generative tasks, BitMoD is able to quantize LLM weights to 3-bit while achieving better perplexity than prior LLM quantization schemes. Combining the superior model performance with an efficient accelerator design, BitMoD achieves an average of $1.69\times$ and $1.48\times$ speedups compared to prior LLM accelerators ANT and OliVe, respectively.
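
Illustration (not from the paper): the idea of picking a numerical data type per weight group can be sketched as below. This is a minimal sketch under stated assumptions: the two candidate grids, the absolute-maximum scaling, and the squared-error selection rule are hypothetical choices made for this example and are not the data types or the selection procedure defined by BitMoD.

```python
# Hypothetical candidate "data types": each is a small grid of representable
# values in [-1, 1]. These grids are illustrative, not BitMoD's actual types.
CANDIDATE_GRIDS = {
    "uniform": [-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0],
    "dense_near_zero": [-1.0, -0.1, 0.1, 1.0],
}


def quantize_group(weights, grids=CANDIDATE_GRIDS):
    """Quantize one group with the grid that gives the lowest squared error.

    Returns (grid_name, dequantized_weights, squared_error).
    """
    # Per-group scale (assumed: absolute maximum; 1.0 for an all-zero group).
    scale = max(abs(w) for w in weights) or 1.0
    best = None
    for name, grid in grids.items():
        # Snap each normalized weight to the nearest grid value, then rescale.
        deq = [scale * min(grid, key=lambda g: abs(w / scale - g)) for w in weights]
        err = sum((w - d) ** 2 for w, d in zip(weights, deq))
        if best is None or err < best[2]:
            best = (name, deq, err)
    return best


def quantize(weights, group_size=128):
    """Split a flat weight list into groups and quantize each independently."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


# Two groups of four weights with different shapes pick different grids.
results = quantize([3.0, 1.0, -1.0, -3.0, 1.0, 0.1, -0.1, -1.0], group_size=4)
chosen = [name for name, _, _ in results]
```

In this toy run the evenly spread first group selects the uniform grid, while the second group, which has small values clustered near zero, selects the grid that is denser near zero.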

Title: Freezing of Gait Detection Using Gramian Angular Fields and Federated Learning from Wearable Sensors

Authors: Shovito Barua Soumma, S M Raihanul Alam, Rudmila Rahman, Umme Niraj Mahi, Sayyed Mostafa Mostafavi, Hassan Ghasemzadeh
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2411.11764
Pdf URL: https://arxiv.org/pdf/2411.11764
Copy Paste: [[2411.11764]] Freezing of Gait Detection Using Gramian Angular Fields and Federated Learning from Wearable Sensors(https://arxiv.org/abs/2411.11764)
Keywords: robust, federate
Abstract: Freezing of gait (FOG) is a debilitating symptom of Parkinson's disease (PD) that impairs mobility and safety. Traditional detection methods face challenges due to intra- and inter-patient variability, and most systems are tested in controlled settings, limiting their real-world applicability. Addressing these gaps, we present FOGSense, a novel FOG detection system designed for uncontrolled, free-living conditions. It uses Gramian Angular Field (GAF) transformations and federated deep learning to capture temporal and spatial gait patterns missed by traditional methods. We evaluated our FOGSense system using a public PD dataset, 'tdcsfog'. FOGSense improves accuracy by 10.4% over a single-axis accelerometer, reduces failure points compared to multi-sensor systems, and demonstrates robustness to missing values. The federated architecture allows personalized model adaptation and efficient smartphone synchronization during off-peak hours, making it effective for long-term monitoring as symptoms evolve. Overall, FOGSense achieves a 22.2% improvement in F1-score compared to state-of-the-art methods, along with enhanced sensitivity for FOG episode detection. Code is available: this https URL.
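
Illustration (not from the paper): a Gramian Angular Field turns a 1-D signal window into a 2-D matrix that an image model can consume. The sketch below computes the summation variant (GASF) in plain Python. The abstract does not say which GAF variant, window length, or rescaling FOGSense uses, so the function name, the min-max rescaling to [-1, 1], and the choice of the summation form are assumptions made for this example.

```python
import math


def gramian_angular_summation_field(series):
    """Return the GASF matrix (list of lists) for a 1-D sequence of numbers."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Constant window: assumed convention, map every sample to 0.
        scaled = [0.0 for _ in series]
    else:
        # Min-max rescale into [-1, 1] so that arccos is defined.
        scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]
    # Polar encoding: each sample becomes an angle phi = arccos(x).
    # The clamp guards against tiny floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    # GASF entry (i, j) is cos(phi_i + phi_j).
    return [[math.cos(a + b) for b in phi] for a in phi]


gasf = gramian_angular_summation_field([0.0, 1.0, 2.0])
```

The resulting matrix is symmetric, and its diagonal encodes each rescaled sample x as cos(2 arccos x) = 2x^2 - 1, which is why the transform preserves information about the original values while also capturing pairwise temporal relations.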

Title: LLM-IE: A Python Package for Generative Information Extraction with Large Language Models

Authors: Enshuo Hsu, Kirk Roberts
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11779
Pdf URL: https://arxiv.org/pdf/2411.11779
Copy Paste: [[2411.11779]] LLM-IE: A Python Package for Generative Information Extraction with Large Language Models(https://arxiv.org/abs/2411.11779)
Keywords: robust, extraction, generative, large language model
Abstract: Objectives: Despite the recent adoption of large language models (LLMs) for biomedical information extraction, challenges in prompt engineering and algorithms persist, with no dedicated software available. To address this, we developed LLM-IE: a Python package for building complete information extraction pipelines. Our key innovation is an interactive LLM agent to support schema definition and prompt design. Materials and Methods: The LLM-IE supports named entity recognition, entity attribute extraction, and relation extraction tasks. We benchmarked on the i2b2 datasets and conducted a system evaluation. Results: The sentence-based prompting algorithm resulted in the best performance while requiring a longer inference time. System evaluation provided intuitive visualization. Discussion: LLM-IE was designed from practical NLP experience in healthcare and has been adopted in internal projects. It should hold great value to the biomedical NLP community. Conclusion: We developed a Python package, LLM-IE, that provides building blocks for robust information extraction pipeline construction.

Title: A Potential Game Perspective in Federated Learning

Authors: Kang Liu, Ziqi Wang, Enrique Zuazua
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.11793
Pdf URL: https://arxiv.org/pdf/2411.11793
Copy Paste: [[2411.11793]] A Potential Game Perspective in Federated Learning(https://arxiv.org/abs/2411.11793)
Keywords: federate
Abstract: Federated learning (FL) is an emerging paradigm for training machine learning models across distributed clients. Traditionally, in FL settings, a central server assigns training efforts (or strategies) to clients. However, from a market-oriented perspective, clients may independently choose their training efforts based on rational self-interest. To explore this, we propose a potential game framework where each client's payoff is determined by their individual efforts and the rewards provided by the server. The rewards are influenced by the collective efforts of all clients and can be modulated through a reward factor. Our study begins by establishing the existence of Nash equilibria (NEs), followed by an investigation of uniqueness in homogeneous settings. We demonstrate a significant improvement in clients' training efforts at a critical reward factor, identifying it as the optimal choice for the server. Furthermore, we prove the convergence of the best-response algorithm to compute NEs for our FL game. Finally, we apply the training efforts derived from specific NEs to a real-world FL scenario, validating the effectiveness of the identified optimal reward factor.
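
Illustration (not from the paper): best-response iteration, the kind of algorithm whose convergence the abstract refers to, can be sketched on a toy game. Everything specific here is hypothetical: the quadratic payoff u_i(e) = e_i (a + b * sum of the other clients' efforts) - e_i^2 / 2, the constants a and b, and the effort range [0, 1] are invented for this example and are not the payoff or reward structure of the paper's FL game. The toy game is an exact potential game because the interaction term is symmetric across clients.

```python
def best_response(i, efforts, a=0.2, b=0.1):
    """Effort in [0, 1] that maximizes client i's toy payoff given the others.

    The toy payoff is concave in e_i, so the maximizer is the unconstrained
    optimum a + b * (sum of other efforts), clipped to [0, 1].
    """
    others = sum(efforts) - efforts[i]
    return min(1.0, max(0.0, a + b * others))


def best_response_dynamics(n_clients=3, tol=1e-10, max_rounds=1000):
    """Update clients one at a time until no effort changes by more than tol."""
    efforts = [0.0] * n_clients
    for _ in range(max_rounds):
        largest_change = 0.0
        for i in range(n_clients):
            new_effort = best_response(i, efforts)
            largest_change = max(largest_change, abs(new_effort - efforts[i]))
            efforts[i] = new_effort
        if largest_change < tol:
            break
    return efforts


equilibrium = best_response_dynamics()
```

With the default constants, the symmetric fixed point satisfies e = a + 2be, so each client's effort settles at a / (1 - 2b) = 0.25, which is a Nash equilibrium of the toy game because no client can gain by deviating unilaterally.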

Title: Tackling prediction tasks in relational databases with LLMs

Authors: Marek Wydmuch, Łukasz Borchmann, Filip Graliński
Subjects: cs.LG, cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2411.11829
Pdf URL: https://arxiv.org/pdf/2411.11829
Copy Paste: [[2411.11829]] Tackling prediction tasks in relational databases with LLMs(https://arxiv.org/abs/2411.11829)
Keywords: large language model
Abstract: Though large language models (LLMs) have demonstrated exceptional performance across numerous problems, their application to predictive tasks in relational databases remains largely unexplored. In this work, we address the notion that LLMs cannot yield satisfactory results on relational databases due to their interconnected tables, complex relationships, and heterogeneous data types. Using the recently introduced RelBench benchmark, we demonstrate that even a straightforward application of LLMs achieves competitive performance on these tasks. These findings establish LLMs as a promising new baseline for ML on relational databases and encourage further research in this direction.

Title: Bi-Mamba: Towards Accurate 1-Bit State Space Models

Authors: Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.11843
Pdf URL: https://arxiv.org/pdf/2411.11843
Copy Paste: [[2411.11843]] Bi-Mamba: Towards Accurate 1-Bit State Space Models(https://arxiv.org/abs/2411.11843)
Keywords: transformer, large language model
Abstract: The typical selective state-space model (SSM) of Mamba addresses several limitations of Transformers, such as quadratic computational complexity with sequence length and significant inference-time memory requirements due to the key-value cache. However, the growing size of Mamba models continues to pose training and deployment challenges and raises environmental concerns due to considerable energy consumption. In this work, we introduce Bi-Mamba, a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models with multiple sizes across 780M, 1.3B, and 2.7B. Bi-Mamba models are trained from scratch on a data volume comparable to regular LLM pretraining using an autoregressive distillation loss. Extensive experimental results on language modeling demonstrate that Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16) and much better accuracy than post-training-binarization (PTB) Mamba baselines, while significantly reducing memory footprint and energy consumption compared to the original Mamba model. Our study pioneers a new linear computational complexity LLM framework under low-bit representation and facilitates the future design of specialized hardware tailored for efficient 1-bit Mamba-based LLMs.

Title: Generative World Explorer

Authors: Taiming Lu, Tianmin Shu, Alan Yuille, Daniel Khashabi, Jieneng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.11844
Pdf URL: https://arxiv.org/pdf/2411.11844
Copy Paste: [[2411.11844]] Generative World Explorer(https://arxiv.org/abs/2411.11844)
Keywords: generative
Abstract: Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world. In contrast, humans can $\textit{imagine}$ unseen parts of the world through a mental exploration and $\textit{revise}$ their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the $\textit{Generative World Explorer (Genex)}$, an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train $\textit{Genex}$, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) $\textit{Genex}$ can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.