2024-09-17

Title: AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding

Authors: Zihan Huang, Tao Wu, Wang Lin, Shengyu Zhang, Jingyuan Chen, Fei Wu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2409.09039
Pdf URL: https://arxiv.org/pdf/2409.09039
Copy Paste: [[2409.09039]] AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding(https://arxiv.org/abs/2409.09039)
Keywords: large language model
Abstract: With the rapid advancement of large language models, there has been a growing interest in their capabilities in mathematical reasoning. However, existing research has primarily focused on text-based algebra problems, neglecting the study of geometry due to the lack of high-quality geometric datasets. To address this gap, this paper introduces AutoGeo, a novel approach for automatically generating mathematical geometric images to fulfill the demand for large-scale and diverse geometric datasets. AutoGeo facilitates the creation of AutoGeo-100k, an extensive repository comprising 100k high-quality geometry image-text pairs. By leveraging precisely defined geometric clauses, AutoGeo-100k contains a wide variety of geometric shapes, including lines, polygons, circles, and complex spatial relationships, etc. Furthermore, this paper demonstrates the efficacy of AutoGeo-100k in enhancing the performance of multimodal large language models through fine-tuning. Experimental results indicate significant improvements in the model's ability in handling geometric images, as evidenced by enhanced accuracy in tasks such as geometric captioning and mathematical reasoning. This research not only fills a critical gap in the availability of geometric datasets but also paves the way for the advancement of sophisticated AI-driven tools in education and research. Project page: this https URL.

Title: HESSO: Towards Automatic Efficient and User Friendly Any Neural Network Training and Pruning

Authors: Tianyi Chen, Xiaoyi Qu, David Aponte, Colby Banbury, Jongwoo Ko, Tianyu Ding, Yong Ma, Vladimir Lyapunov, Ilya Zharkov, Luming Liang
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2409.09085
Pdf URL: https://arxiv.org/pdf/2409.09085
Copy Paste: [[2409.09085]] HESSO: Towards Automatic Efficient and User Friendly Any Neural Network Training and Pruning(https://arxiv.org/abs/2409.09085)
Keywords: large language model
Abstract: Structured pruning is one of the most popular approaches to effectively compress the heavy deep neural networks (DNNs) into compact sub-networks while retaining performance. The existing methods suffer from multi-stage procedures along with significant engineering efforts and human expertise. The Only-Train-Once (OTO) series has been recently proposed to resolve the many pain points by streamlining the workflow by automatically conducting (i) search space generation, (ii) structured sparse optimization, and (iii) sub-network construction. However, the built-in sparse optimizers in the OTO series, i.e., the Half-Space Projected Gradient (HSPG) family, have limitations that require hyper-parameter tuning and the implicit controls of the sparsity exploration, consequently requires intervening by human expertise. To address such limitations, we propose a Hybrid Efficient Structured Sparse Optimizer (HESSO). HESSO could automatically and efficiently train a DNN to produce a high-performing subnetwork. Meanwhile, it is almost tuning-free and enjoys user-friendly integration for generic training applications. To address another common issue of irreversible performance collapse observed in pruning DNNs, we further propose a Corrective Redundant Identification Cycle (CRIC) for reliably identifying indispensable structures. We numerically demonstrate the efficacy of HESSO and its enhanced version HESSO-CRIC on a variety of applications ranging from computer vision to natural language processing, including large language model. The numerical results showcase that HESSO can achieve competitive even superior performance to varying state-of-the-arts and support most DNN architectures. Meanwhile, CRIC can effectively prevent the irreversible performance collapse and further enhance the performance of HESSO on certain applications. The code is available at this https URL.

Title: Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Authors: Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo
Subjects: cs.LG, cs.AI, cs.CV, cs.DC, cs.PF
Abstract URL: https://arxiv.org/abs/2409.09086
Pdf URL: https://arxiv.org/pdf/2409.09086
Copy Paste: [[2409.09086]] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU(https://arxiv.org/abs/2409.09086)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) are distinguished by their multimodal comprehensive ability and widely used in many real-world applications including GPT-4o, autonomous driving and robotics. Despite their impressive performance, the multimodal inputs always incur long context. The inference under long context requires caching massive Key and Value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. Due to this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs, which enable streaming inference of MLLM on a single GPU with infinite context. Inf-MLLM is based on our key observation of the attention pattern in both LLMs and MLLMs called "attention saddles". Thanks to the newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach to enable MLLMs to capture long-term dependency. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality than existing methods such as StreamingLLM and 2x speedup than H2O.

Title: Y-Drop: A Conductance based Dropout for fully connected layers

Authors: Efthymios Georgiou, Georgios Paraskevopoulos, Alexandros Potamianos
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2409.09088
Pdf URL: https://arxiv.org/pdf/2409.09088
Copy Paste: [[2409.09088]] Y-Drop: A Conductance based Dropout for fully connected layers(https://arxiv.org/abs/2409.09088)
Keywords: robust
Abstract: In this work, we introduce Y-Drop, a regularization method that biases the dropout algorithm towards dropping more important neurons with higher probability. The backbone of our approach is neuron conductance, an interpretable measure of neuron importance that calculates the contribution of each neuron towards the end-to-end mapping of the network. We investigate the impact of the uniform dropout selection criterion on performance by assigning higher dropout probability to the more important units. We show that forcing the network to solve the task at hand in the absence of its important units yields a strong regularization effect. Further analysis indicates that Y-Drop yields solutions where more neurons are important, i.e have high conductance, and yields robust networks. In our experiments we show that the regularization effect of Y-Drop scales better than vanilla dropout w.r.t. the architecture size and consistently yields superior performance over multiple datasets and architecture combinations, with little tuning.

Title: Trimming the Risk: Towards Reliable Continuous Training for Deep Learning Inspection Systems

Authors: Altaf Allah Abbassi, Houssem Ben Braiek, Foutse Khomh, Thomas Reid
Subjects: cs.LG, cs.CV, cs.SE
Abstract URL: https://arxiv.org/abs/2409.09108
Pdf URL: https://arxiv.org/pdf/2409.09108
Copy Paste: [[2409.09108]] Trimming the Risk: Towards Reliable Continuous Training for Deep Learning Inspection Systems(https://arxiv.org/abs/2409.09108)
Keywords: robust
Abstract: The industry increasingly relies on deep learning (DL) technology for manufacturing inspections, which are challenging to automate with rule-based machine vision algorithms. DL-powered inspection systems derive defect patterns from labeled images, combining human-like agility with the consistency of a computerized system. However, finite labeled datasets often fail to encompass all natural variations necessitating Continuous Training (CT) to regularly adjust their models with recent data. Effective CT requires fresh labeled samples from the original distribution; otherwise, selfgenerated labels can lead to silent performance degradation. To mitigate this risk, we develop a robust CT-based maintenance approach that updates DL models using reliable data selections through a two-stage filtering process. The initial stage filters out low-confidence predictions, as the model inherently discredits them. The second stage uses variational auto-encoders and histograms to generate image embeddings that capture latent and pixel characteristics, then rejects the inputs of substantially shifted embeddings as drifted data with erroneous overconfidence. Then, a fine-tuning of the original DL model is executed on the filtered inputs while validating on a mixture of recent production and original datasets. This strategy mitigates catastrophic forgetting and ensures the model adapts effectively to new operational conditions. Evaluations on industrial inspection systems for popsicle stick prints and glass bottles using critical real-world datasets showed less than 9% of erroneous self-labeled data are retained after filtering and used for fine-tuning, improving model performance on production data by up to 14% without compromising its results on original validation data.

Title: Neural Message Passing Induced by Energy-Constrained Diffusion

Authors: Qitian Wu, David Wipf, Junchi Yan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09111
Pdf URL: https://arxiv.org/pdf/2409.09111
Copy Paste: [[2409.09111]] Neural Message Passing Induced by Energy-Constrained Diffusion(https://arxiv.org/abs/2409.09111)
Keywords: diffusion, transformer
Abstract: Learning representations for structured data with certain geometries (observed or unobserved) is a fundamental challenge, wherein message passing neural networks (MPNNs) have become a de facto class of model solutions. In this paper, we propose an energy-constrained diffusion model as a principled interpretable framework for understanding the mechanism of MPNNs and navigating novel architectural designs. The model, inspired by physical systems, combines the inductive bias of diffusion on manifolds with layer-wise constraints of energy minimization. As shown by our analysis, the diffusion operators have a one-to-one correspondence with the energy functions implicitly descended by the diffusion process, and the finite-difference iteration for solving the energy-constrained diffusion system induces the propagation layers of various types of MPNNs operated on observed or latent structures. On top of these findings, we devise a new class of neural message passing models, dubbed as diffusion-inspired Transformers, whose global attention layers are induced by the principled energy-constrained diffusion. Across diverse datasets ranging from real-world networks to images and physical particles, we show that the new model can yield promising performance for cases where the data structures are observed (as a graph), partially observed or completely unobserved.

Title: DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification

Authors: Abdelkader El Mahdaouy, Salima Lamsiyah, Meryem Janati Idrissi, Hamza Alami, Zakaria Yartaoui, Ismail Berrada
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2409.09143
Pdf URL: https://arxiv.org/pdf/2409.09143
Copy Paste: [[2409.09143]] DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification(https://arxiv.org/abs/2409.09143)
Keywords: security
Abstract: Detecting and classifying suspicious or malicious domain names and URLs is fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklists maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and Domain Generation Algorithms (DGA) dataset. In order to assess the performance of DomURLs_BERT, we have conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluations results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the experiments source code are publicly available.

Title: PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage

Authors: Denis Zavadski, Damjan Kalšan, Carsten Rother
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09144
Pdf URL: https://arxiv.org/pdf/2409.09144
Copy Paste: [[2409.09144]] PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage(https://arxiv.org/abs/2409.09144)
Keywords: robust, diffusion
Abstract: This work addresses the task of zero-shot monocular depth estimation. A recent advance in this field has been the idea of utilising Text-to-Image foundation models, such as Stable Diffusion. Foundation models provide a rich and generic image representation, and therefore, little training data is required to reformulate them as a depth estimation model that predicts highly-detailed depth maps and has good generalisation capabilities. However, the realisation of this idea has so far led to approaches which are, unfortunately, highly inefficient at test-time due to the underlying iterative denoising process. In this work, we propose a different realisation of this idea and present PrimeDepth, a method that is highly efficient at test time while keeping, or even enhancing, the positive aspects of diffusion-based approaches. Our key idea is to extract from Stable Diffusion a rich, but frozen, image representation by running a single denoising step. This representation, we term preimage, is then fed into a refiner network with an architectural inductive bias, before entering the downstream task. We validate experimentally that PrimeDepth is two orders of magnitude faster than the leading diffusion-based method, Marigold, while being more robust for challenging scenarios and quantitatively marginally superior. Thereby, we reduce the gap to the currently leading data-driven approach, Depth Anything, which is still quantitatively superior, but predicts less detailed depth maps and requires 20 times more labelled data. Due to the complementary nature of our approach, even a simple averaging between PrimeDepth and Depth Anything predictions can improve upon both methods and sets a new state-of-the-art in zero-shot monocular depth estimation. In future, data-driven approaches may also benefit from integrating our preimage.

Title: Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss

Authors: Qifan Fu, Xiaohang Yang, Muhammad Asad, Changjae Oh, Shanxin Yuan, Gregory Slabaugh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09149
Pdf URL: https://arxiv.org/pdf/2409.09149
Copy Paste: [[2409.09149]] Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss(https://arxiv.org/abs/2409.09149)
Keywords: diffusion
Abstract: Diffusion models have shown their remarkable ability to synthesize images, including the generation of humans in specific poses. However, current models face challenges in adequately expressing conditional control for detailed hand pose generation, leading to significant distortion in the hand regions. To tackle this problem, we first curate the How2Sign dataset to provide richer and more accurate hand pose annotations. In addition, we introduce adaptive, multi-modal fusion to integrate characters' physical features expressed in different modalities such as skeleton, depth, and surface normal. Furthermore, we propose a novel Region-Aware Cycle Loss (RACL) that enables the diffusion model training to focus on improving the hand region, resulting in improved quality of generated hand gestures. More specifically, the proposed RACL computes a weighted keypoint distance between the full-body pose keypoints from the generated image and the ground truth, to generate higher-quality hand poses while balancing overall pose accuracy. Moreover, we use two hand region metrics, named hand-PSNR and hand-Distance for hand pose generation evaluations. Our experimental evaluations demonstrate the effectiveness of our proposed approach in improving the quality of digital human pose generation using diffusion models, especially the quality of the hand region. The source code is available at this https URL.

Title: Curricula for Learning Robust Policies over Factored State Representations in Changing Environments

Authors: Panayiotis Panayiotou, Özgür Şimşek
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09169
Pdf URL: https://arxiv.org/pdf/2409.09169
Copy Paste: [[2409.09169]] Curricula for Learning Robust Policies over Factored State Representations in Changing Environments(https://arxiv.org/abs/2409.09169)
Keywords: robust
Abstract: Robust policies enable reinforcement learning agents to effectively adapt to and operate in unpredictable, dynamic, and ever-changing real-world environments. Factored representations, which break down complex state and action spaces into distinct components, can improve generalization and sample efficiency in policy learning. In this paper, we explore how the curriculum of an agent using a factored state representation affects the robustness of the learned policy. We experimentally demonstrate three simple curricula, such as varying only the variable of highest regret between episodes, that can significantly enhance policy robustness, offering practical insights for reinforcement learning in complex environments.

Title: Incorporation of Verifier Functionality in the Software for Operations and Network Attack Results Review and the Autonomous Penetration Testing System

Authors: Jordan Milbrath, Jeremy Straub
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09174
Pdf URL: https://arxiv.org/pdf/2409.09174
Copy Paste: [[2409.09174]] Incorporation of Verifier Functionality in the Software for Operations and Network Attack Results Review and the Autonomous Penetration Testing System(https://arxiv.org/abs/2409.09174)
Keywords: attack
Abstract: The software for operations and network attack results review (SONARR) and the autonomous penetration testing system (APTS) use facts and common properties in digital twin networks to represent real-world entities. However, in some cases fact values will change regularly, making it difficult for objects in SONARR and APTS to consistently and accurately represent their real-world counterparts. This paper proposes and evaluates the addition of verifiers, which check real-world conditions and update network facts, to SONARR. This inclusion allows SONARR to retrieve fact values from its executing environment and update its network, providing a consistent method of ensuring that the operations and, therefore, the results align with the real-world systems being assessed. Verifiers allow arbitrary scripts and dynamic arguments to be added to normal SONARR operations. This provides a layer of flexibility and consistency that results in more reliable output from the software.

Title: Cybersecurity Software Tool Evaluation Using a 'Perfect' Network Model

Authors: Jeremy Straub
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2409.09175
Pdf URL: https://arxiv.org/pdf/2409.09175
Copy Paste: [[2409.09175]] Cybersecurity Software Tool Evaluation Using a 'Perfect' Network Model(https://arxiv.org/abs/2409.09175)
Keywords: security, attack
Abstract: Cybersecurity software tool evaluation is difficult due to the inherently adversarial nature of the field. A penetration testing (or offensive) tool must be tested against a viable defensive adversary and a defensive tool must, similarly, be tested against a viable offensive adversary. Characterizing the tool's performance inherently depends on the quality of the adversary, which can vary from test to test. This paper proposes the use of a 'perfect' network, representing computing systems, a network and the attack pathways through it as a methodology to use for testing cybersecurity decision-making tools. This facilitates testing by providing a known and consistent standard for comparison. It also allows testing to include researcher-selected levels of error, noise and uncertainty to evaluate cybersecurity tools under these experimental conditions.

Title: Transformer with Controlled Attention for Synchronous Motion Captioning

Authors: Karim Radouane, Sylvie Ranwez, Julien Lagarde, Andon Tchechmedjiev
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09177
Pdf URL: https://arxiv.org/pdf/2409.09177
Copy Paste: [[2409.09177]] Transformer with Controlled Attention for Synchronous Motion Captioning(https://arxiv.org/abs/2409.09177)
Keywords: interpretability, transformer, segmentation
Abstract: In this paper, we address a challenging task, synchronous motion captioning, that aim to generate a language description synchronized with human motion sequences. This task pertains to numerous applications, such as aligned sign language transcription, unsupervised action segmentation and temporal grounding. Our method introduces mechanisms to control self- and cross-attention distributions of the Transformer, allowing interpretability and time-aligned text generation. We achieve this through masking strategies and structuring losses that push the model to maximize attention only on the most important frames contributing to the generation of a motion word. These constraints aim to prevent undesired mixing of information in attention maps and to provide a monotonic attention distribution across tokens. Thus, the cross attentions of tokens are used for progressive text generation in synchronization with human motion sequences. We demonstrate the superior performance of our approach through evaluation on the two available benchmark datasets, KIT-ML and HumanML3D. As visual evaluation is essential for this task, we provide a comprehensive set of animated visual illustrations in the code repository: this https URL.

Title: ProcessTBench: An LLM Plan Generation Dataset for Process Mining

Authors: Andrei Cosmin Redis, Mohammadreza Fani Sani, Bahram Zarrin, Andrea Burattin
Subjects: cs.LG, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2409.09191
Pdf URL: https://arxiv.org/pdf/2409.09191
Copy Paste: [[2409.09191]] ProcessTBench: An LLM Plan Generation Dataset for Process Mining(https://arxiv.org/abs/2409.09191)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown significant promise in plan generation. Yet, existing datasets often lack the complexity needed for advanced tool use scenarios - such as handling paraphrased query statements, supporting multiple languages, and managing actions that can be done in parallel. These scenarios are crucial for evaluating the evolving capabilities of LLMs in real-world applications. Moreover, current datasets don't enable the study of LLMs from a process perspective, particularly in scenarios where understanding typical behaviors and challenges in executing the same process under different conditions or formulations is crucial. To address these gaps, we present the ProcessTBench dataset, an extension of the TaskBench dataset specifically designed to evaluate LLMs within a process mining framework.

Title: Batched Online Contextual Sparse Bandits with Sequential Inclusion of Features

Authors: Rowan Swiers, Subash Prabanantham, Andrew Maher
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2409.09199
Pdf URL: https://arxiv.org/pdf/2409.09199
Copy Paste: [[2409.09199]] Batched Online Contextual Sparse Bandits with Sequential Inclusion of Features(https://arxiv.org/abs/2409.09199)
Keywords: fair
Abstract: Multi-armed Bandits (MABs) are increasingly employed in online platforms and e-commerce to optimize decision making for personalized user experiences. In this work, we focus on the Contextual Bandit problem with linear rewards, under conditions of sparsity and batched data. We address the challenge of fairness by excluding irrelevant features from decision-making processes using a novel algorithm, Online Batched Sequential Inclusion (OBSI), which sequentially includes features as confidence in their impact on the reward increases. Our experiments on synthetic data show the superior performance of OBSI compared to other algorithms in terms of regret, relevance of features used, and compute.

Title: Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases

Authors: Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Katherine Heller
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09201
Pdf URL: https://arxiv.org/pdf/2409.09201
Copy Paste: [[2409.09201]] Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases(https://arxiv.org/abs/2409.09201)
Keywords: large language model
Abstract: While large language models (LLMs) have shown promise for medical question answering, there is limited work focused on tropical and infectious disease-specific exploration. We build on an opensource tropical and infectious diseases (TRINDs) dataset, expanding it to include demographic and semantic clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM performance on these, comparing generalist and medical LLMs, as well as LLM outcomes to human experts. We demonstrate through systematic experimentation, the benefit of contextual information such as demographics, location, gender, risk factors for optimal LLM response. Finally we develop a prototype of TRINDs-LM, a research tool that provides a playground to navigate how context impacts LLM outputs for health.

Title: Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?

Authors: Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill
Subjects: cs.CV, cs.CL, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.09221
Pdf URL: https://arxiv.org/pdf/2409.09221
Copy Paste: [[2409.09221]] Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?(https://arxiv.org/abs/2409.09221)
Keywords: transformer
Abstract: Decoder-only discrete-token language models have recently achieved significant success in automatic speech recognition. However, systematic analyses of how different modalities impact performance in specific scenarios remain limited. In this paper, we investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets. Our experiments suggest that: (1) Integrating more modalities can increase accuracy; in particular, our paper is, to our best knowledge, the first to show the benefit of combining audio, image context, and lip information; (2) Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels, moreover, they exhibit a different trend compared to inherently synchronized modalities like lip movements; (3) Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.

Title: Autoregressive + Chain of Thought (CoT) $\simeq$ Recurrent: Recurrence's Role in Language Models and a Revist of Recurrent Transformer

Authors: Xiang Zhang, Muhammad Abdul-Mageed, Laks V.S. Lakshmanan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09239
Pdf URL: https://arxiv.org/pdf/2409.09239
Copy Paste: [[2409.09239]] Autoregressive + Chain of Thought (CoT) $\simeq$ Recurrent: Recurrence's Role in Language Models and a Revist of Recurrent Transformer(https://arxiv.org/abs/2409.09239)
Keywords: transformer
Abstract: The Transformer architecture excels in a variety of language modeling tasks, outperforming traditional neural architectures such as RNN and LSTM. This is partially due to its elimination of recurrent connections, which allows for parallel training and a smoother flow of gradients. However, this move away from recurrent structures places the Transformer model at the lower end of Chomsky's computational hierarchy, imposing limitations on its computational abilities. Consequently, even advanced Transformer-based models face considerable difficulties in tasks like counting, string reversal, bracket pairing, and multiplication. These tasks, though seemingly elementary, require a level of computational complexity that exceeds the capabilities of the Transformer architecture. Concurrently, the emergence of ``Chain of Thought" (CoT) prompting has enabled Transformer-based language models to tackle tasks that were previously impossible or poorly executed. Despite some previous research primarily interpreting CoT from a psychological perspective, a comprehensive understanding of \textit{why} CoT proves so effective in the reasoning process remains elusive. In this work, we thoroughly investigate the influence of recurrent structures in language models on their reasoning abilities, shedding light on how the CoT approach can mimic recurrent computation and act as a bridge between autoregression and recurrence. It is this approximated recurrence that notably improves the model's performance and computational capacity. Moreover, we revisit recent recurrent-based Transformer model designs, focusing on their computational abilities through our proposed concept of ``recurrence-completeness" and identify key theoretical limitations in models like Linear Transformer and RWKV. Through this, we aim to provide insight into the neural model architectures and prompt better model design.

Title: Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery

Authors: Wei Liu, Saurabh Prasad, Melba Crawford
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09244
Pdf URL: https://arxiv.org/pdf/2409.09244
Copy Paste: [[2409.09244]] Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery(https://arxiv.org/abs/2409.09244)
Keywords: robust, transformer
Abstract: In the past three years, there has been significant interest in hyperspectral imagery (HSI) classification using vision Transformers for analysis of remotely sensed data. Previous research predominantly focused on the empirical integration of convolutional neural networks (CNNs) to augment the network's capability to extract local feature information. Yet, the theoretical justification for vision Transformers out-performing CNN architectures in HSI classification remains a question. To address this issue, a unified hierarchical spectral vision Transformer architecture, specifically tailored for HSI classification, is investigated. In this streamlined yet effective vision Transformer architecture, multiple mixer modules are strategically integrated separately. These include the CNN-mixer, which executes convolution operations; the spatial self-attention (SSA)-mixer and channel self-attention (CSA)-mixer, both of which are adaptations of classical self-attention blocks; and hybrid models such as the SSA+CNN-mixer and CSA+CNN-mixer, which merge convolution with self-attention operations. This integration facilitates the development of a broad spectrum of vision Transformer-based models tailored for HSI classification. In terms of the training process, a comprehensive analysis is performed, contrasting classical CNN models and vision Transformer-based counterparts, with particular attention to disturbance robustness and the distribution of the largest eigenvalue of the Hessian. From the evaluations conducted on various mixer models rooted in the unified architecture, it is concluded that the unique strength of vision Transformers can be attributed to their overarching architecture, rather than being exclusively reliant on individual multi-head self-attention (MSA) components.

Title: Robust Training of Neural Networks at Arbitrary Precision and Sparsity

Authors: Chengxi Ye, Grace Chu, Yanfeng Liu, Yichi Zhang, Lukasz Lew, Andrew Howard
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2409.09245
Pdf URL: https://arxiv.org/pdf/2409.09245
Copy Paste: [[2409.09245]] Robust Training of Neural Networks at Arbitrary Precision and Sparsity(https://arxiv.org/abs/2409.09245)
Keywords: robust
Abstract: The discontinuous operations inherent in quantization and sparsification introduce obstacles to backpropagation. This is particularly challenging when training deep neural networks in ultra-low precision and sparse regimes. We propose a novel, robust, and universal solution: a denoising affine transform that stabilizes training under these challenging conditions. By formulating quantization and sparsification as perturbations during training, we derive a perturbation-resilient approach based on ridge regression. Our solution employs a piecewise constant backbone model to ensure a performance lower bound and features an inherent noise reduction mechanism to mitigate perturbation-induced corruption. This formulation allows existing models to be trained at arbitrarily low precision and sparsity levels with off-the-shelf recipes. Furthermore, our method provides a novel perspective on training temporal binary neural networks, contributing to ongoing efforts to narrow the gap between artificial and biological neural networks.

Title: NovAScore: A New Automated Metric for Evaluating Document Level Novelty

Authors: Lin Ai, Ziwei Gong, Harshsaiprasad Deshpande, Alexander Johnson, Emmy Phung, Ahmad Emami, Julia Hirschberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09249
Pdf URL: https://arxiv.org/pdf/2409.09249
Copy Paste: [[2409.09249]] NovAScore: A New Automated Metric for Evaluating Document Level Novelty(https://arxiv.org/abs/2409.09249)
Keywords: interpretability, large language model
Abstract: The rapid expansion of online content has intensified the issue of information redundancy, underscoring the need for solutions that can identify genuinely new information. Despite this challenge, the research community has seen a decline in focus on novelty detection, particularly with the rise of large language models (LLMs). Additionally, previous approaches have relied heavily on human annotation, which is time-consuming, costly, and particularly challenging when annotators must compare a target document against a vast number of historical documents. In this work, we introduce NovAScore (Novelty Evaluation in Atomicity Score), an automated metric for evaluating document-level novelty. NovAScore aggregates the novelty and salience scores of atomic information, providing high interpretability and a detailed analysis of a document's novelty. With its dynamic weight adjustment scheme, NovAScore offers enhanced flexibility and an additional dimension to assess both the novelty level and the importance of information within a document. Our experiments show that NovAScore strongly correlates with human judgments of novelty, achieving a 0.626 Point-Biserial correlation on the TAP-DLND 1.0 dataset and a 0.920 Pearson correlation on an internal human-annotated dataset.

Title: ETAGE: Enhanced Test Time Adaptation with Integrated Entropy and Gradient Norms for Robust Model Performance

Authors: Afshar Shamsi, Rejisa Becirovic, Ahmadreza Argha, Ehsan Abbasnejad, Hamid Alinejad-Rokny, Arash Mohammadi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09251
Pdf URL: https://arxiv.org/pdf/2409.09251
Copy Paste: [[2409.09251]] ETAGE: Enhanced Test Time Adaptation with Integrated Entropy and Gradient Norms for Robust Model Performance(https://arxiv.org/abs/2409.09251)
Keywords: robust
Abstract: Test time adaptation (TTA) equips deep learning models to handle unseen test data that deviates from the training distribution, even when source data is inaccessible. While traditional TTA methods often rely on entropy as a confidence metric, its effectiveness can be limited, particularly in biased scenarios. Extending existing approaches like the Pseudo Label Probability Difference (PLPD), we introduce ETAGE, a refined TTA method that integrates entropy minimization with gradient norms and PLPD, to enhance sample selection and adaptation. Our method prioritizes samples that are less likely to cause instability by combining high entropy with high gradient norms out of adaptation, thus avoiding the overfitting to noise often observed in previous methods. Extensive experiments on CIFAR-10-C and CIFAR-100-C datasets demonstrate that our approach outperforms existing TTA techniques, particularly in challenging and biased scenarios, leading to more robust and consistent model performance across diverse test scenarios. The codebase for ETAGE is available on this https URL.

Title: VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding

Authors: Hongyu Sun, Yongcai Wang, Peng Wang, Haoran Deng, Xudong Cai, Deying Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09254
Pdf URL: https://arxiv.org/pdf/2409.09254
Copy Paste: [[2409.09254]] VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding(https://arxiv.org/abs/2409.09254)
Keywords: transformer
Abstract: View-based methods have demonstrated promising performance in 3D shape understanding. However, they tend to make strong assumptions about the relations between views or learn the multi-view correlations indirectly, which limits the flexibility of exploring inter-view correlations and the effectiveness of target tasks. To overcome the above problems, this paper investigates flexible organization and explicit correlation learning for multiple views. In particular, we propose to incorporate different views of a 3D shape into a permutation-invariant set, referred to as \emph{View Set}, which removes rigid relation assumptions and facilitates adequate information exchange and fusion among views. Based on that, we devise a nimble Transformer model, named \emph{VSFormer}, to explicitly capture pairwise and higher-order correlations of all elements in the set. Meanwhile, we theoretically reveal a natural correspondence between the Cartesian product of a view set and the correlation matrix in the attention mechanism, which supports our model design. Comprehensive experiments suggest that VSFormer has better flexibility, efficient inference efficiency and superior performance. Notably, VSFormer reaches state-of-the-art results on various 3d recognition datasets, including ModelNet40, ScanObjectNN and RGBD. It also establishes new records on the SHREC'17 retrieval benchmark. The code and datasets are available at \url{this https URL}.

Title: Active Learning to Guide Labeling Efforts for Question Difficulty Estimation

Authors: Arthur Thuy, Ekaterina Loginova, Dries F. Benoit
Subjects: cs.LG, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2409.09258
Pdf URL: https://arxiv.org/pdf/2409.09258
Copy Paste: [[2409.09258]] Active Learning to Guide Labeling Efforts for Question Difficulty Estimation(https://arxiv.org/abs/2409.09258)
Keywords: transformer
Abstract: In recent years, there has been a surge in research on Question Difficulty Estimation (QDE) using natural language processing techniques. Transformer-based neural networks achieve state-of-the-art performance, primarily through supervised methods but with an isolated study in unsupervised learning. While supervised methods focus on predictive performance, they require abundant labeled data. On the other hand, unsupervised methods do not require labeled data but rely on a different evaluation metric that is also computationally expensive in practice. This work bridges the research gap by exploring active learning for QDE, a supervised human-in-the-loop approach striving to minimize the labeling efforts while matching the performance of state-of-the-art models. The active learning process iteratively trains on a labeled subset, acquiring labels from human experts only for the most informative unlabeled data points. Furthermore, we propose a novel acquisition function PowerVariance to add the most informative samples to the labeled set, a regression extension to the PowerBALD function popular in classification. We employ DistilBERT for QDE and identify informative samples by applying Monte Carlo dropout to capture epistemic uncertainty in unlabeled samples. The experiments demonstrate that active learning with PowerVariance acquisition achieves a performance close to fully supervised models after labeling only 10% of the training data. The proposed methodology promotes the responsible use of educational resources, makes QDE tools more accessible to course instructors, and is promising for other applications such as personalized support systems and question-answering tools.

Title: Informative Subgraphs Aware Masked Auto-Encoder in Dynamic Graphs

Authors: Pengfe Jiao, Xinxun Zhang, Mengzhou Gao, Tianpeng Li, Zhidong Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.09262
Pdf URL: https://arxiv.org/pdf/2409.09262
Copy Paste: [[2409.09262]] Informative Subgraphs Aware Masked Auto-Encoder in Dynamic Graphs(https://arxiv.org/abs/2409.09262)
Keywords: generative
Abstract: Generative self-supervised learning (SSL), especially masked autoencoders (MAE), has greatly succeeded and garnered substantial research interest in graph machine learning. However, the research of MAE in dynamic graphs is still scant. This gap is primarily due to the dynamic graph not only possessing topological structure information but also encapsulating temporal evolution dependency. Applying a random masking strategy which most MAE methods adopt to dynamic graphs will remove the crucial subgraph that guides the evolution of dynamic graphs, resulting in the loss of crucial spatio-temporal information in node representations. To bridge this gap, in this paper, we propose a novel Informative Subgraphs Aware Masked Auto-Encoder in Dynamic Graph, namely DyGIS. Specifically, we introduce a constrained probabilistic generative model to generate informative subgraphs that guide the evolution of dynamic graphs, successfully alleviating the issue of missing dynamic evolution subgraphs. The informative subgraph identified by DyGIS will serve as the input of dynamic graph masked autoencoder (DGMAE), effectively ensuring the integrity of the evolutionary spatio-temporal information within dynamic graphs. Extensive experiments on eleven datasets demonstrate that DyGIS achieves state-of-the-art performance across multiple tasks.

Title: SafeEar: Content Privacy-Preserving Audio Deepfake Detection

Authors: Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu
Subjects: cs.CR, cs.AI, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.09272
Pdf URL: https://arxiv.org/pdf/2409.09272
Copy Paste: [[2409.09272]] SafeEar: Content Privacy-Preserving Audio Deepfake Detection(https://arxiv.org/abs/2409.09272)
Keywords: privacy
Abstract: Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfake poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech based on complete original audio recordings, which however often contain private content. This oversight may refrain deepfake detection from many applications, particularly in scenarios involving sensitive information like business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audios without relying on accessing the speech content within. Our key idea is to devise a neural audio codec into a novel decoupling model that well separates the semantic and acoustic information from audio samples, and only use the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content will be exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques with an equal error rate (EER) down to 2.02%. Simultaneously, it shields five-language speech content from being deciphered by both machine and human auditory analysis, demonstrated by word error rates (WERs) all above 93.93% and our user study. Furthermore, our benchmark constructed for anti-deepfake and anti-content recovery evaluation helps provide a basis for future research in the realms of audio privacy preservation and deepfake detection.

Title: LabellessFace: Fair Metric Learning for Face Recognition without Attribute Labels

Authors: Tetsushi Ohki, Yuya Sato, Masakatsu Nishigaki, Koichi Ito
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09274
Pdf URL: https://arxiv.org/pdf/2409.09274
Copy Paste: [[2409.09274]] LabellessFace: Fair Metric Learning for Face Recognition without Attribute Labels(https://arxiv.org/abs/2409.09274)
Keywords: fair
Abstract: Demographic bias is one of the major challenges for face recognition systems. The majority of existing studies on demographic biases are heavily dependent on specific demographic groups or demographic classifier, making it difficult to address performance for unrecognised groups. This paper introduces ``LabellessFace'', a novel framework that improves demographic bias in face recognition without requiring demographic group labeling typically required for fairness considerations. We propose a novel fairness enhancement metric called the class favoritism level, which assesses the extent of favoritism towards specific classes across the dataset. Leveraging this metric, we introduce the fair class margin penalty, an extension of existing margin-based metric learning. This method dynamically adjusts learning parameters based on class favoritism levels, promoting fairness across all attributes. By treating each class as an individual in facial recognition systems, we facilitate learning that minimizes biases in authentication accuracy among individuals. Comprehensive experiments have demonstrated that our proposed method is effective for enhancing fairness while maintaining authentication accuracy.

Title: Language Models "Grok" to Copy

Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09281
Pdf URL: https://arxiv.org/pdf/2409.09281
Copy Paste: [[2409.09281]] Language Models "Grok" to Copy(https://arxiv.org/abs/2409.09281)
Keywords: transformer
Abstract: We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context--a fundamental skill for various LLM applications, including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking, which refers to sudden generalization on test set long after the model fit to the training set. Our experiments yield three arguments: (1) The pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates. (2) The speed of developing copying ability is independent of the number of tokens trained, similarly to how grokking speed is unaffected by dataset size as long as the data distribution is preserved. (3) Induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrated that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying.

Title: SAM-OCTA2: Layer Sequence OCTA Segmentation with Fine-tuned Segment Anything Model 2

Authors: Xinrun Chen, Chengliang Wang, Haojian Ning, Mengzhan Zhang, Mei Shen, Shiying Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09286
Pdf URL: https://arxiv.org/pdf/2409.09286
Copy Paste: [[2409.09286]] SAM-OCTA2: Layer Sequence OCTA Segmentation with Fine-tuned Segment Anything Model 2(https://arxiv.org/abs/2409.09286)
Keywords: segmentation
Abstract: Segmentation of indicated targets aids in the precise analysis of optical coherence tomography angiography (OCTA) samples. Existing segmentation methods typically perform on 2D projection targets, making it challenging to capture the variance of segmented objects through the 3D volume. To address this limitation, the low-rank adaptation technique is adopted to fine-tune the Segment Anything Model (SAM) version 2, enabling the tracking and segmentation of specified objects across the OCTA scanning layer sequence. To further this work, a prompt point generation strategy in frame sequence and a sparse annotation method to acquire retinal vessel (RV) layer masks are proposed. This method is named SAM-OCTA2 and has been experimented on the OCTA-500 dataset. It achieves state-of-the-art performance in segmenting the foveal avascular zone (FAZ) on regular 2D en-face and effectively tracks local vessels across scanning layer sequences. The code is available at: this https URL.

Title: Generating API Parameter Security Rules with LLM for API Misuse Detection

Authors: Jinghua Liu, Yi Yang, Kai Chen, Miaoqian Lin
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2409.09288
Pdf URL: https://arxiv.org/pdf/2409.09288
Copy Paste: [[2409.09288]] Generating API Parameter Security Rules with LLM for API Misuse Detection(https://arxiv.org/abs/2409.09288)
Keywords: security
Abstract: In this paper, we present a new framework, named GPTAid, for automatic APSRs generation by analyzing API source code with LLM and detecting API misuse caused by incorrect parameter use. To validate the correctness of the LLM-generated APSRs, we propose an execution feedback-checking approach based on the observation that security-critical API misuse is often caused by APSRs violations, and most of them result in runtime errors. Specifically, GPTAid first uses LLM to generate raw APSRs and the Right calling code, and then generates Violation code for each raw APSR by modifying the Right calling code using LLM. Subsequently, GPTAid performs dynamic execution on each piece of Violation code and further filters out the incorrect APSRs based on runtime errors. To further generate concrete APSRs, GPTAid employs a code differential analysis to refine the filtered ones. Particularly, as the programming language is more precise than natural language, GPTAid identifies the key operations within Violation code by differential analysis, and then generates the corresponding concrete APSR based on the aforementioned operations. These concrete APSRs could be precisely interpreted into applicable detection code, which proven to be effective in API misuse detection. Implementing on the dataset containing 200 randomly selected APIs from eight popular libraries, GPTAid achieves a precision of 92.3%. Moreover, it generates 6 times more APSRs than state-of-the-art detectors on a comparison dataset of previously reported bugs and APSRs. We further evaluated GPTAid on 47 applications, 210 unknown security bugs were found potentially resulting in severe security issues (e.g., system crashes), 150 of which have been confirmed by developers after our reports.

Title: Associate Everything Detected: Facilitating Tracking-by-Detection to the Unknown

Authors: Zimeng Fang, Chao Liang, Xue Zhou, Shuyuan Zhu, Xi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09293
Pdf URL: https://arxiv.org/pdf/2409.09293
Copy Paste: [[2409.09293]] Associate Everything Detected: Facilitating Tracking-by-Detection to the Unknown(https://arxiv.org/abs/2409.09293)
Keywords: robust
Abstract: Multi-object tracking (MOT) emerges as a pivotal and highly promising branch in the field of computer vision. Classical closed-vocabulary MOT (CV-MOT) methods aim to track objects of predefined categories. Recently, some open-vocabulary MOT (OV-MOT) methods have successfully addressed the problem of tracking unknown categories. However, we found that the CV-MOT and OV-MOT methods each struggle to excel in the tasks of the other. In this paper, we present a unified framework, Associate Everything Detected (AED), that simultaneously tackles CV-MOT and OV-MOT by integrating with any off-the-shelf detector and supports unknown categories. Different from existing tracking-by-detection MOT methods, AED gets rid of prior knowledge (e.g. motion cues) and relies solely on highly robust feature learning to handle complex trajectories in OV-MOT tasks while keeping excellent performance in CV-MOT tasks. Specifically, we model the association task as a similarity decoding problem and propose a sim-decoder with an association-centric learning mechanism. The sim-decoder calculates similarities in three aspects: spatial, temporal, and cross-clip. Subsequently, association-centric learning leverages these threefold similarities to ensure that the extracted features are appropriate for continuous tracking and robust enough to generalize to unknown categories. Compared with existing powerful OV-MOT and CV-MOT methods, AED achieves superior performance on TAO, SportsMOT, and DanceTrack without any prior knowledge. Our code is available at this https URL.

Title: ManiDext: Hand-Object Manipulation Synthesis via Continuous Correspondence Embeddings and Residual-Guided Diffusion

Authors: Jiajun Zhang, Yuxiang Zhang, Liang An, Mengcheng Li, Hongwen Zhang, Zonghai Hu, Yebin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09300
Pdf URL: https://arxiv.org/pdf/2409.09300
Copy Paste: [[2409.09300]] ManiDext: Hand-Object Manipulation Synthesis via Continuous Correspondence Embeddings and Residual-Guided Diffusion(https://arxiv.org/abs/2409.09300)
Keywords: diffusion
Abstract: Dynamic and dexterous manipulation of objects presents a complex challenge, requiring the synchronization of hand motions with the trajectories of objects to achieve seamless and physically plausible interactions. In this work, we introduce ManiDext, a unified hierarchical diffusion-based framework for generating hand manipulation and grasp poses based on 3D object trajectories. Our key insight is that accurately modeling the contact correspondences between objects and hands during interactions is crucial. Therefore, we propose a continuous correspondence embedding representation that specifies detailed hand correspondences at the vertex level between the object and the hand. This embedding is optimized directly on the hand mesh in a self-supervised manner, with the distance between embeddings reflecting the geodesic distance. Our framework first generates contact maps and correspondence embeddings on the object's surface. Based on these fine-grained correspondences, we introduce a novel approach that integrates the iterative refinement process into the diffusion process during the second stage of hand pose generation. At each step of the denoising process, we incorporate the current hand pose residual as a refinement target into the network, guiding the network to correct inaccurate hand poses. Introducing residuals into each denoising step inherently aligns with traditional optimization process, effectively merging generation and refinement into a single unified framework. Extensive experiments demonstrate that our approach can generate physically plausible and highly realistic motions for various tasks, including single and bimanual hand grasping as well as manipulating both rigid and articulated objects. Code will be available for research purposes.

Title: Registration between Point Cloud Streams and Sequential Bounding Boxes via Gradient Descent

Authors: Xuesong Li, Xinge Zhu, Yuexin Ma, Subhan Khan, Jose Guivant
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2409.09312
Pdf URL: https://arxiv.org/pdf/2409.09312
Copy Paste: [[2409.09312]] Registration between Point Cloud Streams and Sequential Bounding Boxes via Gradient Descent(https://arxiv.org/abs/2409.09312)
Keywords: robust
Abstract: In this paper, we propose an algorithm for registering sequential bounding boxes with point cloud streams. Unlike popular point cloud registration techniques, the alignment of the point cloud and the bounding box can rely on the properties of the bounding box, such as size, shape, and temporal information, which provides substantial support and performance gains. Motivated by this, we propose a new approach to tackle this problem. Specifically, we model the registration process through an overall objective function that includes the final goal and all constraints. We then optimize the function using gradient descent. Our experiments show that the proposed method performs remarkably well with a 40\% improvement in IoU and demonstrates more robust registration between point cloud streams and sequential bounding boxes

Title: ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

Authors: Yahan Tu, Rui Hu, Jitao Sang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2409.09318
Pdf URL: https://arxiv.org/pdf/2409.09318
Copy Paste: [[2409.09318]] ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models(https://arxiv.org/abs/2409.09318)
Keywords: large language model
Abstract: Hallucination poses a significant challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are static, which can lead to potential data contamination. This paper introduces ODE, an open-set, dynamic protocol for evaluating object existence hallucinations in MLLMs. Our framework employs graph structures to model associations between real-word concepts and generates novel samples for both general and domain-specific scenarios. The dynamic combination of concepts, along with various combination principles, ensures a broad sample distribution. Experimental results show that MLLMs exhibit higher hallucination rates with ODE-generated samples, effectively avoiding data contamination. Moreover, these samples can also be used for fine-tuning to improve MLLM performance on existing benchmarks.

Title: ChildPlay-Hand: A Dataset of Hand Manipulations in the Wild

Authors: Arya Farkhondeh, Samy Tafasca, Jean-Marc Odobez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09319
Pdf URL: https://arxiv.org/pdf/2409.09319
Copy Paste: [[2409.09319]] ChildPlay-Hand: A Dataset of Hand Manipulations in the Wild(https://arxiv.org/abs/2409.09319)
Keywords: segmentation
Abstract: Hand-Object Interaction (HOI) is gaining significant attention, particularly with the creation of numerous egocentric datasets driven by AR/VR applications. However, third-person view HOI has received less attention, especially in terms of datasets. Most third-person view datasets are curated for action recognition tasks and feature pre-segmented clips of high-level daily activities, leaving a gap for in-the-wild datasets. To address this gap, we propose ChildPlay-Hand, a novel dataset that includes person and object bounding boxes, as well as manipulation actions. ChildPlay-Hand is unique in: (1) providing per-hand annotations; (2) featuring videos in uncontrolled settings with natural interactions, involving both adults and children; (3) including gaze labels from the ChildPlay-Gaze dataset for joint modeling of manipulations and gaze. The manipulation actions cover the main stages of an HOI cycle, such as grasping, holding or operating, and different types of releasing. To illustrate the interest of the dataset, we study two tasks: object in hand detection (OiH), i.e. if a person has an object in their hand, and manipulation stages (ManiS), which is more fine-grained and targets the main stages of manipulation. We benchmark various spatio-temporal and segmentation networks, exploring body vs. hand-region information and comparing pose and RGB modalities. Our findings suggest that ChildPlay-Hand is a challenging new benchmark for modeling HOI in the wild.

Title: A Compressive Memory-based Retrieval Approach for Event Argument Extraction

Authors: Wanlong Liu, Enqi Zhang, Li Zhou, Dingyi Zeng, Shaohuan Cheng, Chen Zhang, Malu Zhang, Wenyu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09322
Pdf URL: https://arxiv.org/pdf/2409.09322
Copy Paste: [[2409.09322]] A Compressive Memory-based Retrieval Approach for Event Argument Extraction(https://arxiv.org/abs/2409.09322)
Keywords: extraction
Abstract: Recent works have demonstrated the effectiveness of retrieval augmentation in the Event Argument Extraction (EAE) task. However, existing retrieval-based EAE methods have two main limitations: (1) input length constraints and (2) the gap between the retriever and the inference model. These issues limit the diversity and quality of the retrieved information. In this paper, we propose a Compressive Memory-based Retrieval (CMR) mechanism for EAE, which addresses the two limitations mentioned above. Our compressive memory, designed as a dynamic matrix that effectively caches retrieved information and supports continuous updates, overcomes the limitations of the input length. Additionally, after pre-loading all candidate demonstrations into the compressive memory, the model further retrieves and filters relevant information from memory based on the input query, bridging the gap between the retriever and the inference model. Extensive experiments show that our method achieves new state-of-the-art performance on three public datasets (RAMS, WikiEvents, ACE05), significantly outperforming existing retrieval-based EAE methods.

Title: Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation

Authors: Hui Yi Leong, Yi Fan Gao, Ji Shuai, Uktu Pamuksuz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09324
Pdf URL: https://arxiv.org/pdf/2409.09324
Copy Paste: [[2409.09324]] Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation(https://arxiv.org/abs/2409.09324)
Keywords: large language model
Abstract: Scientific research indicates that for every hour spent in direct patient care, physicians spend nearly two additional hours on administrative tasks, particularly on electronic health records (EHRs) and desk work. This excessive administrative burden not only reduces the time available for patient care but also contributes to physician burnout and inefficiencies in healthcare delivery. To address these challenges, this study introduces MediGen, a fine-tuned large language model (LLM) designed to automate the generation of medical reports from medical dialogues. By leveraging state-of-the-art methodologies for fine-tuning open-source pretrained models, including LLaMA3-8B, MediGen achieves high accuracy in transcribing and summarizing clinical interactions. The fine-tuned LLaMA3-8B model demonstrated promising results, achieving a ROUGE score of 58% and a BERTScore-F1 of 72%, indicating its effectiveness in generating accurate and clinically relevant medical reports. These findings suggest that MediGen has the potential to significantly reduce the administrative workload on physicians, improving both healthcare efficiency and physician well-being.

Title: LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

Authors: Deng Junli, Luo Yihao, Yang Xueting, Li Siyou, Wang Wei, Guo Jinyang, Shi Ping
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09326
Pdf URL: https://arxiv.org/pdf/2409.09326
Copy Paste: [[2409.09326]] LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation(https://arxiv.org/abs/2409.09326)
Keywords: robust
Abstract: In the domain of photorealistic avatar generation, the fidelity of audio-driven lip motion synthesis is essential for realistic virtual interactions. Existing methods face two key challenges: a lack of vivacity due to limited diversity in generated lip poses and noticeable anamorphose motions caused by poor temporal coherence. To address these issues, we propose LawDNet, a novel deep-learning architecture enhancing lip synthesis through a Local Affine Warping Deformation mechanism. This mechanism models the intricate lip movements in response to the audio input by controllable non-linear warping fields. These fields consist of local affine transformations focused on abstract keypoints within deep feature maps, offering a novel universal paradigm for feature warping in networks. Additionally, LawDNet incorporates a dual-stream discriminator for improved frame-to-frame continuity and employs face normalization techniques to handle pose and scene variations. Extensive evaluations demonstrate LawDNet's superior robustness and lip movement dynamism performance compared to previous methods. The advancements presented in this paper, including the methodologies, training data, source codes, and pre-trained models, will be made accessible to the research community.

Title: Schr\"odinger Bridge Flow for Unpaired Data Translation

Authors: Valentin De Bortoli, Iryna Korshunova, Andriy Mnih, Arnaud Doucet
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2409.09347
Pdf URL: https://arxiv.org/pdf/2409.09347
Copy Paste: [[2409.09347]] Schr\"odinger Bridge Flow for Unpaired Data Translation(https://arxiv.org/abs/2409.09347)
Keywords: diffusion, generative
Abstract: Mass transport problems arise in many areas of machine learning whereby one wants to compute a map transporting one distribution to another. Generative modeling techniques like Generative Adversarial Networks (GANs) and Denoising Diffusion Models (DDMs) have been successfully adapted to solve such transport problems, resulting in CycleGAN and Bridge Matching respectively. However, these methods do not approximate Optimal Transport (OT) maps, which are known to have desirable properties. Existing techniques approximating OT maps for high-dimensional data-rich problems, such as DDM-based Rectified Flow and Schrödinger Bridge procedures, require fully training a DDM-type model at each iteration, or use mini-batch techniques which can introduce significant errors. We propose a novel algorithm to compute the Schrödinger Bridge, a dynamic entropy-regularised version of OT, that eliminates the need to train multiple DDM-like models. This algorithm corresponds to a discretisation of a flow of path measures, which we call the Schrödinger Bridge Flow, whose only stationary point is the Schrödinger Bridge. We demonstrate the performance of our algorithm on a variety of unpaired data translation tasks.

Title: OPUS: Occupancy Prediction Using a Sparse Set

Authors: Jiabao Wang, Zhaojiang Liu, Qiang Meng, Liujiang Yan, Ke Wang, Jie Yang, Wei Liu, Qibin Hou, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09350
Pdf URL: https://arxiv.org/pdf/2409.09350
Copy Paste: [[2409.09350]] OPUS: Occupancy Prediction Using a Sparse Set(https://arxiv.org/abs/2409.09350)
Keywords: transformer
Abstract: Occupancy prediction, aiming at predicting the occupancy status within voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment into voxels, then perform classification on such dense grids. However, inspection on sample data reveals that the vast majority of voxels is unoccupied. Performing classification on these empty voxels demands suboptimal computation resource allocation, and reducing such empty voxels necessitates complex algorithm designs. To this end, we present a novel perspective on the occupancy prediction task: formulating it as a streamlined set prediction paradigm without the need for explicit space modeling or complex sparsification procedures. Our proposed framework, called OPUS, utilizes a transformer encoder-decoder architecture to simultaneously predict occupied locations and classes using a set of learnable queries. Firstly, we employ the Chamfer distance loss to scale the set-to-set comparison problem to unprecedented magnitudes, making training such model end-to-end a reality. Subsequently, semantic classes are adaptively assigned using nearest neighbor search based on the learned locations. In addition, OPUS incorporates a suite of non-trivial strategies to enhance model performance, including coarse-to-fine learning, consistent point sampling, and adaptive re-weighting, etc. Finally, compared with current state-of-the-art methods, our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.

Title: Towards Robust Detection of Open Source Software Supply Chain Poisoning Attacks in Industry Environments

Authors: Xinyi Zheng, Chen Wei, Shenao Wang, Yanjie Zhao, Peiming Gao, Yuanchao Zhang, Kailong Wang, Haoyu Wang
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2409.09356
Pdf URL: https://arxiv.org/pdf/2409.09356
Copy Paste: [[2409.09356]] Towards Robust Detection of Open Source Software Supply Chain Poisoning Attacks in Industry Environments(https://arxiv.org/abs/2409.09356)
Keywords: attack, robust
Abstract: The exponential growth of open-source package ecosystems, particularly NPM and PyPI, has led to an alarming increase in software supply chain poisoning attacks. Existing static analysis methods struggle with high false positive rates and are easily thwarted by obfuscation and dynamic code execution techniques. While dynamic analysis approaches offer improvements, they often suffer from capturing non-package behaviors and employing simplistic testing strategies that fail to trigger sophisticated malicious behaviors. To address these challenges, we present OSCAR, a robust dynamic code poisoning detection pipeline for NPM and PyPI ecosystems. OSCAR fully executes packages in a sandbox environment, employs fuzz testing on exported functions and classes, and implements aspect-based behavior monitoring with tailored API hook points. We evaluate OSCAR against six existing tools using a comprehensive benchmark dataset of real-world malicious and benign packages. OSCAR achieves an F1 score of 0.95 in NPM and 0.91 in PyPI, confirming that OSCAR is as effective as the current state-of-the-art technologies. Furthermore, for benign packages exhibiting characteristics typical of malicious packages, OSCAR reduces the false positive rate by an average of 32.06% in NPM (from 34.63% to 2.57%) and 39.87% in PyPI (from 41.10% to 1.23%), compared to other tools, significantly reducing the workload of manual reviews in real-world deployments. In cooperation with Ant Group, a leading financial technology company, we have deployed OSCAR on its NPM and PyPI mirrors since January 2023, identifying 10,404 malicious NPM packages and 1,235 malicious PyPI packages over 18 months. This work not only bridges the gap between academic research and industrial application in code poisoning detection but also provides a robust and practical solution that has been thoroughly tested in a real-world industrial setting.

Title: Symbolic Regression with a Learned Concept Library

Authors: Arya Grayeli, Atharva Sehgal, Omar Costilla-Reyes, Miles Cranmer, Swarat Chaudhuri
Subjects: cs.LG, cs.AI, cs.NE, cs.SC
Abstract URL: https://arxiv.org/abs/2409.09359
Pdf URL: https://arxiv.org/pdf/2409.09359
Copy Paste: [[2409.09359]] Symbolic Regression with a Learned Concept Library(https://arxiv.org/abs/2409.09359)
Keywords: large language model
Abstract: We present a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts. Our algorithm, called LaSR, uses zero-shot queries to a large language model (LLM) to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution. We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms. Moreover, we show that LaSR can be used to discover a novel and powerful scaling law for LLMs.

Title: LACOSTE: Exploiting stereo and temporal contexts for surgical instrument segmentation

Authors: Qiyuan Wang, Shang Zhao, Zikang Xu, S Kevin Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09360
Pdf URL: https://arxiv.org/pdf/2409.09360
Copy Paste: [[2409.09360]] LACOSTE: Exploiting stereo and temporal contexts for surgical instrument segmentation(https://arxiv.org/abs/2409.09360)
Keywords: robust, segmentation
Abstract: Surgical instrument segmentation is instrumental to minimally invasive surgeries and related applications. Most previous methods formulate this task as single-frame-based instance segmentation while ignoring the natural temporal and stereo attributes of a surgical video. As a result, these methods are less robust against the appearance variation through temporal motion and view change. In this work, we propose a novel LACOSTE model that exploits Location-Agnostic COntexts in Stereo and TEmporal images for improved surgical instrument segmentation. Leveraging a query-based segmentation model as core, we design three performance-enhancing modules. Firstly, we design a disparity-guided feature propagation module to enhance depth-aware features explicitly. To generalize well for even only a monocular video, we apply a pseudo stereo scheme to generate complementary right images. Secondly, we propose a stereo-temporal set classifier, which aggregates stereo-temporal contexts in a universal way for making a consolidated prediction and mitigates transient failures. Finally, we propose a location-agnostic classifier to decouple the location bias from mask prediction and enhance the feature semantics. We extensively validate our approach on three public surgical video datasets, including two benchmarks from EndoVis Challenges and one real radical prostatectomy surgery dataset GraSP. Experimental results demonstrate the promising performances of our method, which consistently achieves comparable or favorable results with previous state-of-the-art approaches.

Title: Beta-Sigma VAE: Separating beta and decoder variance in Gaussian variational autoencoder

Authors: Seunghwan Kim, Seungkyu Lee
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2409.09361
Pdf URL: https://arxiv.org/pdf/2409.09361
Copy Paste: [[2409.09361]] Beta-Sigma VAE: Separating beta and decoder variance in Gaussian variational autoencoder(https://arxiv.org/abs/2409.09361)
Keywords: generative
Abstract: Variational autoencoder (VAE) is an established generative model but is notorious for its blurriness. In this work, we investigate the blurry output problem of VAE and resolve it, exploiting the variance of Gaussian decoder and $\beta$ of beta-VAE. Specifically, we reveal that the indistinguishability of decoder variance and $\beta$ hinders appropriate analysis of the model by random likelihood value, and limits performance improvement by omitting the gain from $\beta$. To address the problem, we propose Beta-Sigma VAE (BS-VAE) that explicitly separates $\beta$ and decoder variance $\sigma^2_x$ in the model. Our method demonstrates not only superior performance in natural image synthesis but also controllable parameters and predictable analysis compared to conventional VAE. In our experimental evaluation, we employ the analysis of rate-distortion curve and proxy metrics on computer vision datasets. The code is available on this https URL

Title: Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM

Authors: Yuanjie Lyu, Tong Xu, Zihan Niu, Bo Peng, Jing Ke, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09362
Pdf URL: https://arxiv.org/pdf/2409.09362
Copy Paste: [[2409.09362]] Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM(https://arxiv.org/abs/2409.09362)
Keywords: large language model
Abstract: The prosperity of social media platforms has raised the urgent demand for semantic-rich services, e.g., event and storyline attribution. However, most existing research focuses on clip-level event understanding, primarily through basic captioning tasks, without analyzing the causes of events across an entire movie. This is a significant challenge, as even advanced multimodal large language models (MLLMs) struggle with extensive multimodal information due to limited context length. To address this issue, we propose a Two-Stage Prefix-Enhanced MLLM (TSPE) approach for event attribution, i.e., connecting associated events with their causal semantics, in movie videos. In the local stage, we introduce an interaction-aware prefix that guides the model to focus on the relevant multimodal information within a single clip, briefly summarizing the single event. Correspondingly, in the global stage, we strengthen the connections between associated events using an inferential knowledge graph, and design an event-aware prefix that directs the model to focus on associated events rather than all preceding clips, resulting in accurate event attribution. Comprehensive evaluations of two real-world datasets demonstrate that our framework outperforms state-of-the-art methods.

Title: Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs

Authors: Jian Zhao, Shenao Wang, Yanjie Zhao, Xinyi Hou, Kailong Wang, Peiming Gao, Yuanchao Zhang, Chen Wei, Haoyu Wang
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2409.09368
Pdf URL: https://arxiv.org/pdf/2409.09368
Copy Paste: [[2409.09368]] Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs(https://arxiv.org/abs/2409.09368)
Keywords: security, protect, attack, extraction
Abstract: The proliferation of pre-trained models (PTMs) and datasets has led to the emergence of centralized model hubs like Hugging Face, which facilitate collaborative development and reuse. However, recent security reports have uncovered vulnerabilities and instances of malicious attacks within these platforms, highlighting growing security concerns. This paper presents the first systematic study of malicious code poisoning attacks on pre-trained model hubs, focusing on the Hugging Face platform. We conduct a comprehensive threat analysis, develop a taxonomy of model formats, and perform root cause analysis of vulnerable formats. While existing tools like Fickling and ModelScan offer some protection, they face limitations in semantic-level analysis and comprehensive threat detection. To address these challenges, we propose MalHug, an end-to-end pipeline tailored for Hugging Face that combines dataset loading script extraction, model deserialization, in-depth taint analysis, and heuristic pattern matching to detect and classify malicious code poisoning attacks in datasets and models. In collaboration with Ant Group, a leading financial technology company, we have implemented and deployed MalHug on a mirrored Hugging Face instance within their infrastructure, where it has been operational for over three months. During this period, MalHug has monitored more than 705K models and 176K datasets, uncovering 91 malicious models and 9 malicious dataset loading scripts. These findings reveal a range of security threats, including reverse shell, browser credential theft, and system reconnaissance. This work not only bridges a critical gap in understanding the security of the PTM supply chain but also provides a practical, industry-tested solution for enhancing the security of pre-trained model hubs.

Title: BM$^2$: Coupled Schr\"{o}dinger Bridge Matching

Authors: Stefano Peluchetti
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2409.09376
Pdf URL: https://arxiv.org/pdf/2409.09376
Copy Paste: [[2409.09376]] BM$^2$: Coupled Schr\"{o}dinger Bridge Matching(https://arxiv.org/abs/2409.09376)
Keywords: diffusion
Abstract: A Schrödinger bridge establishes a dynamic transport map between two target distributions via a reference process, simultaneously solving an associated entropic optimal transport problem. We consider the setting where samples from the target distributions are available, and the reference diffusion process admits tractable dynamics. We thus introduce Coupled Bridge Matching (BM$^2$), a simple \emph{non-iterative} approach for learning Schrödinger bridges with neural networks. A preliminary theoretical analysis of the convergence properties of BM$^2$ is carried out, supported by numerical experiments that demonstrate the effectiveness of our proposal.

Title: The Midas Touch: Triggering the Capability of LLMs for RM-API Misuse Detection

Authors: Yi Yang, Jinghua Liu, Kai Chen, Miaoqian Lin
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2409.09380
Pdf URL: https://arxiv.org/pdf/2409.09380
Copy Paste: [[2409.09380]] The Midas Touch: Triggering the Capability of LLMs for RM-API Misuse Detection(https://arxiv.org/abs/2409.09380)
Keywords: security, extraction
Abstract: In this paper, we propose an LLM-empowered RM-API misuse detection solution, ChatDetector, which fully automates LLMs for documentation understanding which helps RM-API constraints retrieval and RM-API misuse detection. To correctly retrieve the RM-API constraints, ChatDetector is inspired by the ReAct framework which is optimized based on Chain-of-Thought (CoT) to decompose the complex task into allocation APIs identification, RM-object (allocated/released by RM APIs) extraction and RM-APIs pairing (RM APIs usually exist in pairs). It first verifies the semantics of allocation APIs based on the retrieved RM sentences from API documentation through LLMs. Inspired by the LLMs' performance on various prompting methods,ChatDetector adopts a two-dimensional prompting approach for cross-validation. At the same time, an inconsistency-checking approach between the LLMs' output and the reasoning process is adopted for the allocation APIs confirmation with an off-the-shelf Natural Language Processing (NLP) tool. To accurately pair the RM-APIs, ChatDetector decomposes the task again and identifies the RM-object type first, with which it can then accurately pair the releasing APIs and further construct the RM-API constraints for misuse detection. With the diminished hallucinations, ChatDetector identifies 165 pairs of RM-APIs with a precision of 98.21% compared with the state-of-the-art API detectors. By employing a static detector CodeQL, we ethically report 115 security bugs on the applications integrating on six popular libraries to the developers, which may result in severe issues, such as Denial-of-Services (DoS) and memory corruption. Compared with the end-to-end benchmark method, the result shows that ChatDetector can retrieve at least 47% more RM sentences and 80.85% more RM-API constraints.

Title: LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach

Authors: Kunlong Chen, Junjun Wang, Zhaoqun Chen, Kunjin Chen, Yitian Chen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2409.09383
Pdf URL: https://arxiv.org/pdf/2409.09383
Copy Paste: [[2409.09383]] LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach(https://arxiv.org/abs/2409.09383)
Keywords: large language model
Abstract: We participated in the KDD CUP 2024 paper source tracing competition and achieved the 3rd place. This competition tasked participants with identifying the reference sources (i.e., ref-sources, as referred to by the organizers of the competition) of given academic papers. Unlike most teams that addressed this challenge by fine-tuning pre-trained neural language models such as BERT or ChatGLM, our primary approach utilized closed-source large language models (LLMs). With recent advancements in LLM technology, closed-source LLMs have demonstrated the capability to tackle complex reasoning tasks in zero-shot or few-shot scenarios. Consequently, in the absence of GPUs, we employed closed-source LLMs to directly generate predicted reference sources from the provided papers. We further refined these predictions through ensemble learning. Notably, our method was the only one among the award-winning approaches that did not require the use of GPUs for model training. Code available at this https URL.

Title: AMBER -- Advanced SegFormer for Multi-Band Image Segmentation: an application to Hyperspectral Imaging

Authors: Andrea Dosi, Massimo Brescia, Stefano Cavuoti, Mariarca D'Aniello, Michele Delli Veneri, Carlo Donadio, Adriano Ettari, Giuseppe Longo, Alvi Rownok, Luca Sannino, Maria Zampella
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09386
Pdf URL: https://arxiv.org/pdf/2409.09386
Copy Paste: [[2409.09386]] AMBER -- Advanced SegFormer for Multi-Band Image Segmentation: an application to Hyperspectral Imaging(https://arxiv.org/abs/2409.09386)
Keywords: extraction, transformer, segmentation
Abstract: Deep learning has revolutionized the field of hyperspectral image (HSI) analysis, enabling the extraction of complex and hierarchical features. While convolutional neural networks (CNNs) have been the backbone of HSI classification, their limitations in capturing global contextual features have led to the exploration of Vision Transformers (ViTs). This paper introduces AMBER, an advanced SegFormer specifically designed for multi-band image segmentation. AMBER enhances the original SegFormer by incorporating three-dimensional convolutions to handle hyperspectral data. Our experiments, conducted on the Indian Pines, Pavia University, and PRISMA datasets, show that AMBER outperforms traditional CNN-based methods in terms of Overall Accuracy, Kappa coefficient, and Average Accuracy on the first two datasets, and achieves state-of-the-art performance on the PRISMA dataset.

Title: Tran-GCN: A Transformer-Enhanced Graph Convolutional Network for Person Re-Identification in Monitoring Videos

Authors: Xiaobin Hong, Tarmizi Adam, Masitah Ghazali
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09391
Pdf URL: https://arxiv.org/pdf/2409.09391
Copy Paste: [[2409.09391]] Tran-GCN: A Transformer-Enhanced Graph Convolutional Network for Person Re-Identification in Monitoring Videos(https://arxiv.org/abs/2409.09391)
Keywords: robust, transformer
Abstract: Person Re-Identification (Re-ID) has gained popularity in computer vision, enabling cross-camera pedestrian recognition. Although the development of deep learning has provided a robust technical foundation for person Re-ID research, most existing person Re-ID methods overlook the potential relationships among local person features, failing to adequately address the impact of pedestrian pose variations and local body parts occlusion. Therefore, we propose a Transformer-enhanced Graph Convolutional Network (Tran-GCN) model to improve Person Re-Identification performance in monitoring videos. The model comprises four key components: (1) A Pose Estimation Learning branch is utilized to estimate pedestrian pose information and inherent skeletal structure data, extracting pedestrian key point information; (2) A Transformer learning branch learns the global dependencies between fine-grained and semantically meaningful local person features; (3) A Convolution learning branch uses the basic ResNet architecture to extract the person's fine-grained local features; (4) A Graph Convolutional Module (GCM) integrates local feature information, global feature information, and body information for more effective person identification after fusion. Quantitative and qualitative analysis experiments conducted on three different datasets (Market-1501, DukeMTMC-ReID, and MSMT17) demonstrate that the Tran-GCN model can more accurately capture discriminative person features in monitoring videos, significantly improving identification accuracy.

Title: Towards Diverse and Efficient Audio Captioning via Diffusion Models

Authors: Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Ruibo Fu, Wei Liang, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09401
Pdf URL: https://arxiv.org/pdf/2409.09401
Copy Paste: [[2409.09401]] Towards Diverse and Efficient Audio Captioning via Diffusion Models(https://arxiv.org/abs/2409.09401)
Keywords: diffusion, generative
Abstract: We introduce Diffusion-based Audio Captioning (DAC), a non-autoregressive diffusion model tailored for diverse and efficient audio captioning. Although existing captioning models relying on language backbones have achieved remarkable success in various captioning tasks, their insufficient performance in terms of generation speed and diversity impede progress in audio understanding and multimedia applications. Our diffusion-based framework offers unique advantages stemming from its inherent stochasticity and holistic context modeling in captioning. Through rigorous evaluation, we demonstrate that DAC not only achieves SOTA performance levels compared to existing benchmarks in the caption quality, but also significantly outperforms them in terms of generation speed and diversity. The success of DAC illustrates that text generation can also be seamlessly integrated with audio and visual generation tasks using a diffusion backbone, paving the way for a unified, audio-related generative model across different modalities.

Title: AI-Driven Virtual Teacher for Enhanced Educational Efficiency: Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction

Authors: Tianlong Xu, Yi-Fan Zhang, Zhendong Chu, Shen Wang, Qingsong Wen
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2409.09403
Pdf URL: https://arxiv.org/pdf/2409.09403
Copy Paste: [[2409.09403]] AI-Driven Virtual Teacher for Enhanced Educational Efficiency: Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction(https://arxiv.org/abs/2409.09403)
Keywords: large language model
Abstract: Students frequently make mistakes while solving mathematical problems, and traditional error correction methods are both time-consuming and labor-intensive. This paper introduces an innovative \textbf{V}irtual \textbf{A}I \textbf{T}eacher system designed to autonomously analyze and correct student \textbf{E}rrors (VATE). Leveraging advanced large language models (LLMs), the system uses student drafts as a primary source for error analysis, which enhances understanding of the student's learning process. It incorporates sophisticated prompt engineering and maintains an error pool to reduce computational overhead. The AI-driven system also features a real-time dialogue component for efficient student interaction. Our approach demonstrates significant advantages over traditional and machine learning-based error correction methods, including reduced educational costs, high scalability, and superior generalizability. The system has been deployed on the Squirrel AI learning platform for elementary mathematics education, where it achieves 78.3\% accuracy in error analysis and shows a marked improvement in student learning efficiency. Satisfaction surveys indicate a strong positive reception, highlighting the system's potential to transform educational practices.

Title: Real-world Adversarial Defense against Patch Attacks based on Diffusion Model

Authors: Xingxing Wei, Caixin Kang, Yinpeng Dong, Zhengyi Wang, Shouwei Ruan, Yubo Chen, Hang Su
Subjects: cs.CV, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09406
Pdf URL: https://arxiv.org/pdf/2409.09406
Copy Paste: [[2409.09406]] Real-world Adversarial Defense against Patch Attacks based on Diffusion Model(https://arxiv.org/abs/2409.09406)
Keywords: defense, attack, robust, diffusion
Abstract: Adversarial patches present significant challenges to the robustness of deep learning models, making the development of effective defenses become critical for real-world applications. This paper introduces DIFFender, a novel DIFfusion-based DeFender framework that leverages the power of a text-guided diffusion model to counter adversarial patch attacks. At the core of our approach is the discovery of the Adversarial Anomaly Perception (AAP) phenomenon, which enables the diffusion model to accurately detect and locate adversarial patches by analyzing distributional anomalies. DIFFender seamlessly integrates the tasks of patch localization and restoration within a unified diffusion model framework, enhancing defense efficacy through their close interaction. Additionally, DIFFender employs an efficient few-shot prompt-tuning algorithm, facilitating the adaptation of the pre-trained diffusion model to defense tasks without the need for extensive retraining. Our comprehensive evaluation, covering image classification and face recognition tasks, as well as real-world scenarios, demonstrates DIFFender's robust performance against adversarial attacks. The framework's versatility and generalizability across various settings, classifiers, and attack methodologies mark a significant advancement in adversarial patch defense strategies. Except for the popular visible domain, we have identified another advantage of DIFFender: its capability to easily expand into the infrared domain. Consequently, we demonstrate the good flexibility of DIFFender, which can defend against both infrared and visible adversarial patch attacks alternatively using a universal defense framework.

Title: Weather Prediction Using CNN-LSTM for Time Series Analysis: A Case Study on Delhi Temperature Data

Authors: Bangyu Li, Yang Qian
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2409.09414
Pdf URL: https://arxiv.org/pdf/2409.09414
Copy Paste: [[2409.09414]] Weather Prediction Using CNN-LSTM for Time Series Analysis: A Case Study on Delhi Temperature Data(https://arxiv.org/abs/2409.09414)
Keywords: protect, robust, extraction
Abstract: As global climate change intensifies, accurate weather forecasting is increasingly crucial for sectors such as agriculture, energy management, and environmental protection. Traditional methods, which rely on physical and statistical models, often struggle with complex, nonlinear, and time-varying data, underscoring the need for more advanced techniques. This study explores a hybrid CNN-LSTM model to enhance temperature forecasting accuracy for the Delhi region, using historical meteorological data from 1996 to 2017. We employed both direct and indirect methods, including comprehensive data preprocessing and exploratory analysis, to construct and train our model. The CNN component effectively extracts spatial features, while the LSTM captures temporal dependencies, leading to improved prediction accuracy. Experimental results indicate that the CNN-LSTM model significantly outperforms traditional forecasting methods in terms of both accuracy and stability, with a mean square error (MSE) of 3.26217 and a root mean square error (RMSE) of 1.80615. The hybrid model demonstrates its potential as a robust tool for temperature prediction, offering valuable insights for meteorological forecasting and related fields. Future research should focus on optimizing model architecture, exploring additional feature extraction techniques, and addressing challenges such as overfitting and computational complexity. This approach not only advances temperature forecasting but also provides a foundation for applying deep learning to other time series forecasting tasks.

Title: Enhancing LLM Problem Solving with REAP: Reflection, Explicit Problem Deconstruction, and Advanced Prompting

Authors: Ryan Lingo, Martin Arroyo, Rajeev Chhajer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09415
Pdf URL: https://arxiv.org/pdf/2409.09415
Copy Paste: [[2409.09415]] Enhancing LLM Problem Solving with REAP: Reflection, Explicit Problem Deconstruction, and Advanced Prompting(https://arxiv.org/abs/2409.09415)
Keywords: large language model
Abstract: Large Language Models (LLMs) have transformed natural language processing, yet improving their problem-solving capabilities, particularly for complex, reasoning-intensive tasks, remains a persistent challenge. This paper introduces the REAP (Reflection, Explicit Problem Deconstruction, and Advanced Prompting) method, an innovative approach within the dynamic context generation framework. REAP guides LLMs through reflection on the query, deconstructing it into manageable components, and generating relevant context to enhance the solution process. We evaluated REAP using a dataset designed to expose LLM limitations, comparing zero-shot prompting with REAP-enhanced prompts across six state-of-the-art models: OpenAI's o1-preview, o1-mini, GPT-4o, GPT-4o-mini, Google's Gemini 1.5 Pro, and Claude 3.5 Sonnet. The results demonstrate notable performance gains, with o1-mini improving by 40.97%, GPT-4o by 66.26%, and GPT-4o-mini by 112.93%. Despite the already strong baseline performance of OpenAI's o1-preview, modest gains were observed. Beyond performance improvements, REAP offers a cost-effective solution; for example, GPT-4o-mini, which is approximately 100 times cheaper than o1-preview, delivered competitive results. REAP also improves the clarity of model outputs, making it easier for humans to understand the reasoning behind the results and simplifying the process of identifying and addressing any issues. These findings demonstrate REAP's potential to greatly improve the capabilities of LLMs, providing both better performance and increased cost-efficiency across a wide range of applications.

Title: MulCPred: Learning Multi-modal Concepts for Explainable Pedestrian Action Prediction

Authors: Yan Feng, Alexander Carballo, Keisuke Fujii, Robin Karlsson, Ming Ding, Kazuya Takeda
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09446
Pdf URL: https://arxiv.org/pdf/2409.09446
Copy Paste: [[2409.09446]] MulCPred: Learning Multi-modal Concepts for Explainable Pedestrian Action Prediction(https://arxiv.org/abs/2409.09446)
Keywords: explainability
Abstract: Pedestrian action prediction is of great significance for many applications such as autonomous driving. However, state-of-the-art methods lack explainability to make trustworthy predictions. In this paper, a novel framework called MulCPred is proposed that explains its predictions based on multi-modal concepts represented by training samples. Previous concept-based methods have limitations including: 1) they cannot directly apply to multi-modal cases; 2) they lack locality to attend to details in the inputs; 3) they suffer from mode collapse. These limitations are tackled accordingly through the following approaches: 1) a linear aggregator to integrate the activation results of the concepts into predictions, which associates concepts of different modalities and provides ante-hoc explanations of the relevance between the concepts and the predictions; 2) a channel-wise recalibration module that attends to local spatiotemporal regions, which enables the concepts with locality; 3) a feature regularization loss that encourages the concepts to learn diverse patterns. MulCPred is evaluated on multiple datasets and tasks. Both qualitative and quantitative results demonstrate that MulCPred is promising in improving the explainability of pedestrian action prediction without obvious performance degradation. Furthermore, by removing unrecognizable concepts from MulCPred, the cross-dataset prediction performance is improved, indicating the feasibility of further generalizability of MulCPred.

Title: Learning Keypoints for Multi-Agent Behavior Analysis using Self-Supervision

Authors: Daniel Khalil, Christina Liu, Pietro Perona, Jennifer J. Sun, Markus Marks
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09455
Pdf URL: https://arxiv.org/pdf/2409.09455
Copy Paste: [[2409.09455]] Learning Keypoints for Multi-Agent Behavior Analysis using Self-Supervision(https://arxiv.org/abs/2409.09455)
Keywords: segmentation
Abstract: The study of social interactions and collective behaviors through multi-agent video analysis is crucial in biology. While self-supervised keypoint discovery has emerged as a promising solution to reduce the need for manual keypoint annotations, existing methods often struggle with videos containing multiple interacting agents, especially those of the same species and color. To address this, we introduce B-KinD-multi, a novel approach that leverages pre-trained video segmentation models to guide keypoint discovery in multi-agent scenarios. This eliminates the need for time-consuming manual annotations on new experimental settings and organisms. Extensive evaluations demonstrate improved keypoint regression and downstream behavioral classification in videos of flies, mice, and rats. Furthermore, our method generalizes well to other species, including ants, bees, and humans, highlighting its potential for broad applications in automated keypoint annotation for multi-agent behavior analysis. Code available under: this https URL

Title: TX-Gen: Multi-Objective Optimization for Sparse Counterfactual Explanations for Time-Series Classification

Authors: Qi Huang, Sofoklis Kitharidis, Thomas Bäck, Niki van Stein
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2409.09461
Pdf URL: https://arxiv.org/pdf/2409.09461
Copy Paste: [[2409.09461]] TX-Gen: Multi-Objective Optimization for Sparse Counterfactual Explanations for Time-Series Classification(https://arxiv.org/abs/2409.09461)
Keywords: interpretability
Abstract: In time-series classification, understanding model decisions is crucial for their application in high-stakes domains such as healthcare and finance. Counterfactual explanations, which provide insights by presenting alternative inputs that change model predictions, offer a promising solution. However, existing methods for generating counterfactual explanations for time-series data often struggle with balancing key objectives like proximity, sparsity, and validity. In this paper, we introduce TX-Gen, a novel algorithm for generating counterfactual explanations based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II). TX-Gen leverages evolutionary multi-objective optimization to find a diverse set of counterfactuals that are both sparse and valid, while maintaining minimal dissimilarity to the original time series. By incorporating a flexible reference-guided mechanism, our method improves the plausibility and interpretability of the counterfactuals without relying on predefined assumptions. Extensive experiments on benchmark datasets demonstrate that TX-Gen outperforms existing methods in generating high-quality counterfactuals, making time-series models more transparent and interpretable.

Title: Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI

Authors: Nicholas Pangakis, Samuel Wolken
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09467
Pdf URL: https://arxiv.org/pdf/2409.09467
Copy Paste: [[2409.09467]] Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI(https://arxiv.org/abs/2409.09467)
Keywords: protect, generative, large language model
Abstract: Automated text annotation is a compelling use case for generative large language models (LLMs) in social media research. Recent work suggests that LLMs can achieve strong performance on annotation tasks; however, these studies evaluate LLMs on a small number of tasks and likely suffer from contamination due to a reliance on public benchmark datasets. Here, we test a human-centered framework for responsibly evaluating artificial intelligence tools used in automated annotation. We use GPT-4 to replicate 27 annotation tasks across 11 password-protected datasets from recently published computational social science articles in high-impact journals. For each task, we compare GPT-4 annotations against human-annotated ground-truth labels and against annotations from separate supervised classification models fine-tuned on human-generated labels. Although the quality of LLM labels is generally high, we find significant variation in LLM performance across tasks, even within datasets. Our findings underscore the importance of a human-centered workflow and careful evaluation standards: Automated annotations significantly diverge from human judgment in numerous scenarios, despite various optimization strategies such as prompt tuning. Grounding automated annotation in validation labels generated by humans is essential for responsible evaluation.

Title: Hacking, The Lazy Way: LLM Augmented Pentesting

Authors: Dhruva Goyal, Sitaraman Subramanian, Aditya Peela
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09493
Pdf URL: https://arxiv.org/pdf/2409.09493
Copy Paste: [[2409.09493]] Hacking, The Lazy Way: LLM Augmented Pentesting(https://arxiv.org/abs/2409.09493)
Keywords: security, robust, large language model
Abstract: Security researchers are continually challenged by the need to stay current with rapidly evolving cybersecurity research, tools, and techniques. This constant cycle of learning, unlearning, and relearning, combined with the repetitive tasks of sifting through documentation and analyzing data, often hinders productivity and innovation. This has led to a disparity where only organizations with substantial resources can access top-tier security experts, while others rely on firms with less skilled researchers who focus primarily on compliance rather than actual security. We introduce "LLM Augmented Pentesting," demonstrated through a tool named "Pentest Copilot," to address this gap. This approach integrates Large Language Models into penetration testing workflows. Our research includes a "chain of thought" mechanism to streamline token usage and boost performance, as well as unique Retrieval Augmented Generation implementation to minimize hallucinations and keep models aligned with the latest techniques. Additionally, we propose a novel file analysis approach, enabling LLMs to understand files. Furthermore, we highlight a unique infrastructure system that supports if implemented, can support in-browser assisted penetration testing, offering a robust platform for cybersecurity professionals, These advancements mark a significant step toward bridging the gap between automated tools and human expertise, offering a powerful solution to the challenges faced by modern cybersecurity teams.

Title: Protecting Vehicle Location Privacy with Contextually-Driven Synthetic Location Generation

Authors: Sourabh Yadav, Chenyang Yu, Xinpeng Xie, Yan Huang, Chenxi Qiu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2409.09495
Pdf URL: https://arxiv.org/pdf/2409.09495
Copy Paste: [[2409.09495]] Protecting Vehicle Location Privacy with Contextually-Driven Synthetic Location Generation(https://arxiv.org/abs/2409.09495)
Keywords: privacy, protect, attack
Abstract: Geo-obfuscation is a Location Privacy Protection Mechanism used in location-based services that allows users to report obfuscated locations instead of exact ones. A formal privacy criterion, geoindistinguishability (Geo-Ind), requires real locations to be hard to distinguish from nearby locations (by attackers) based on their obfuscated representations. However, Geo-Ind often fails to consider context, such as road networks and vehicle traffic conditions, making it less effective in protecting the location privacy of vehicles, of which the mobility are heavily influenced by these factors. In this paper, we introduce VehiTrack, a new threat model to demonstrate the vulnerability of Geo-Ind in protecting vehicle location privacy from context-aware inference attacks. Our experiments demonstrate that VehiTrack can accurately determine exact vehicle locations from obfuscated data, reducing average inference errors by 61.20% with Laplacian noise and 47.35% with linear programming (LP) compared to traditional Bayesian attacks. By using contextual data like road networks and traffic flow, VehiTrack effectively eliminates a significant number of seemingly "impossible" locations during its search for the actual location of the vehicles. Based on these insights, we propose TransProtect, a new geo-obfuscation approach that limits obfuscation to realistic vehicle movement patterns, complicating attackers' ability to differentiate obfuscated from actual locations. Our results show that TransProtect increases VehiTrack's inference error by 57.75% with Laplacian noise and 27.21% with LP, significantly enhancing protection against these attacks.

Title: Multi-Scale Grouped Prototypes for Interpretable Semantic Segmentation

Authors: Hugo Porta, Emanuele Dalsasso, Diego Marcos, Devis Tuia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09497
Pdf URL: https://arxiv.org/pdf/2409.09497
Copy Paste: [[2409.09497]] Multi-Scale Grouped Prototypes for Interpretable Semantic Segmentation(https://arxiv.org/abs/2409.09497)
Keywords: interpretability, segmentation
Abstract: Prototypical part learning is emerging as a promising approach for making semantic segmentation interpretable. The model selects real patches seen during training as prototypes and constructs the dense prediction map based on the similarity between parts of the test image and the prototypes. This improves interpretability since the user can inspect the link between the predicted output and the patterns learned by the model in terms of prototypical information. In this paper, we propose a method for interpretable semantic segmentation that leverages multi-scale image representation for prototypical part learning. First, we introduce a prototype layer that explicitly learns diverse prototypical parts at several scales, leading to multi-scale representations in the prototype activation output. Then, we propose a sparse grouping mechanism that produces multi-scale sparse groups of these scale-specific prototypical parts. This provides a deeper understanding of the interactions between multi-scale object representations while enhancing the interpretability of the segmentation model. The experiments conducted on Pascal VOC, Cityscapes, and ADE20K demonstrate that the proposed method increases model sparsity, improves interpretability over existing prototype-based methods, and narrows the performance gap with the non-interpretable counterpart models. Code is available at this http URL.

Title: One missing piece in Vision and Language: A Survey on Comics Understanding

Authors: Emanuele Vivoli, Andrey Barsky, Mohamed Ali Souibgui, Artemis LLabres, Marco Bertini, Dimosthenis Karatzas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09502
Pdf URL: https://arxiv.org/pdf/2409.09502
Copy Paste: [[2409.09502]] One missing piece in Vision and Language: A Survey on Comics Understanding(https://arxiv.org/abs/2409.09502)
Keywords: segmentation
Abstract: Vision-language models have recently evolved into versatile systems capable of high performance across a range of tasks, such as document understanding, visual question answering, and grounding, often in zero-shot settings. Comics Understanding, a complex and multifaceted field, stands to greatly benefit from these advances. Comics, as a medium, combine rich visual and textual narratives, challenging AI models with tasks that span image classification, object detection, instance segmentation, and deeper narrative comprehension through sequential panels. However, the unique structure of comics -- characterized by creative variations in style, reading order, and non-linear storytelling -- presents a set of challenges distinct from those in other visual-language domains. In this survey, we present a comprehensive review of Comics Understanding from both dataset and task perspectives. Our contributions are fivefold: (1) We analyze the structure of the comics medium, detailing its distinctive compositional elements; (2) We survey the widely used datasets and tasks in comics research, emphasizing their role in advancing the field; (3) We introduce the Layer of Comics Understanding (LoCU) framework, a novel taxonomy that redefines vision-language tasks within comics and lays the foundation for future work; (4) We provide a detailed review and categorization of existing methods following the LoCU framework; (5) Finally, we highlight current research challenges and propose directions for future exploration, particularly in the context of vision-language models applied to comics. This survey is the first to propose a task-oriented framework for comics intelligence and aims to guide future research by addressing critical gaps in data availability and task definition. A project associated with this survey is available at this https URL.

Title: Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models

Authors: Alireza Salemi, Hamed Zamani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09510
Pdf URL: https://arxiv.org/pdf/2409.09510
Copy Paste: [[2409.09510]] Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models(https://arxiv.org/abs/2409.09510)
Keywords: privacy, large language model
Abstract: Privacy-preserving methods for personalizing large language models (LLMs) are relatively under-explored. There are two schools of thought on this topic: (1) generating personalized outputs by personalizing the input prompt through retrieval augmentation from the user's personal information (RAG-based methods), and (2) parameter-efficient fine-tuning of LLMs per user that considers efficiency and space limitations (PEFT-based methods). This paper presents the first systematic comparison between two approaches on a wide range of personalization tasks using seven diverse datasets. Our results indicate that RAG-based and PEFT-based personalization methods on average yield 14.92% and 1.07% improvements over the non-personalized LLM, respectively. We find that combining RAG with PEFT elevates these improvements to 15.98%. Additionally, we identify a positive correlation between the amount of user data and PEFT's effectiveness, indicating that RAG is a better choice for cold-start users (i.e., user's with limited personal data).

Title: Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens

Authors: Joseph Clinton, Robert Lieck
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2409.09513
Pdf URL: https://arxiv.org/pdf/2409.09513
Copy Paste: [[2409.09513]] Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens(https://arxiv.org/abs/2409.09513)
Keywords: interpretability, transformer
Abstract: Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent's future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model's policy through the interpretable plan visualisations and attention map.

Title: Deep Learning Under Siege: Identifying Security Vulnerabilities and Risk Mitigation Strategies

Authors: Jamal Al-Karaki, Muhammad Al-Zafar Khan, Mostafa Mohamad, Dababrata Chowdhury
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09517
Pdf URL: https://arxiv.org/pdf/2409.09517
Copy Paste: [[2409.09517]] Deep Learning Under Siege: Identifying Security Vulnerabilities and Risk Mitigation Strategies(https://arxiv.org/abs/2409.09517)
Keywords: security
Abstract: With the rise in the wholesale adoption of Deep Learning (DL) models in nearly all aspects of society, a unique set of challenges is imposed. Primarily centered around the architectures of these models, these risks pose a significant challenge, and addressing these challenges is key to their successful implementation and usage in the future. In this research, we present the security challenges associated with the current DL models deployed into production, as well as anticipate the challenges of future DL technologies based on the advancements in computing, AI, and hardware technologies. In addition, we propose risk mitigation techniques to inhibit these challenges and provide metrical evaluations to measure the effectiveness of these metrics.

Title: Enhancing Skin Disease Diagnosis: Interpretable Visual Concept Discovery with SAM Empowerment

Authors: Xin Hu, Janet Wang, Jihun Hamm, Rie R Yotsu, Zhengming Ding
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09520
Pdf URL: https://arxiv.org/pdf/2409.09520
Copy Paste: [[2409.09520]] Enhancing Skin Disease Diagnosis: Interpretable Visual Concept Discovery with SAM Empowerment(https://arxiv.org/abs/2409.09520)
Keywords: interpretability, segmentation
Abstract: Current AI-assisted skin image diagnosis has achieved dermatologist-level performance in classifying skin cancer, driven by rapid advancements in deep learning architectures. However, unlike traditional vision tasks, skin images in general present unique challenges due to the limited availability of well-annotated datasets, complex variations in conditions, and the necessity for detailed interpretations to ensure patient safety. Previous segmentation methods have sought to reduce image noise and enhance diagnostic performance, but these techniques require fine-grained, pixel-level ground truth masks for training. In contrast, with the rise of foundation models, the Segment Anything Model (SAM) has been introduced to facilitate promptable segmentation, enabling the automation of the segmentation process with simple yet effective prompts. Efforts applying SAM predominantly focus on dermatoscopy images, which present more easily identifiable lesion boundaries than clinical photos taken with smartphones. This limitation constrains the practicality of these approaches to real-world applications. To overcome the challenges posed by noisy clinical photos acquired via non-standardized protocols and to improve diagnostic accessibility, we propose a novel Cross-Attentive Fusion framework for interpretable skin lesion diagnosis. Our method leverages SAM to generate visual concepts for skin diseases using prompts, integrating local visual concepts with global image features to enhance model performance. Extensive evaluation on two skin disease datasets demonstrates our proposed method's effectiveness on lesion diagnosis and interpretability.

Title: An Augmentation-based Model Re-adaptation Framework for Robust Image Segmentation

Authors: Zheming Zuo, Joseph Smith, Jonathan Stonehouse, Boguslaw Obara
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09530
Pdf URL: https://arxiv.org/pdf/2409.09530
Copy Paste: [[2409.09530]] An Augmentation-based Model Re-adaptation Framework for Robust Image Segmentation(https://arxiv.org/abs/2409.09530)
Keywords: robust, segmentation
Abstract: Image segmentation is a crucial task in computer vision, with wide-ranging applications in industry. The Segment Anything Model (SAM) has recently attracted intensive attention; however, its application in industrial inspection, particularly for segmenting commercial anti-counterfeit codes, remains challenging. Unlike open-source datasets, industrial settings often face issues such as small sample sizes and complex textures. Additionally, computational cost is a key concern due to the varying number of trainable parameters. To address these challenges, we propose an Augmentation-based Model Re-adaptation Framework (AMRF). This framework leverages data augmentation techniques during training to enhance the generalisation of segmentation models, allowing them to adapt to newly released datasets with temporal disparity. By observing segmentation masks from conventional models (FCN and U-Net) and a pre-trained SAM model, we determine a minimal augmentation set that optimally balances training efficiency and model performance. Our results demonstrate that the fine-tuned FCN surpasses its baseline by 3.29% and 3.02% in cropping accuracy, and 5.27% and 4.04% in classification accuracy on two temporally continuous datasets. Similarly, the fine-tuned U-Net improves upon its baseline by 7.34% and 4.94% in cropping, and 8.02% and 5.52% in classification. Both models outperform the top-performing SAM models (ViT-Large and ViT-Base) by an average of 11.75% and 9.01% in cropping accuracy, and 2.93% and 4.83% in classification accuracy, respectively.

Title: Using Synthetic Data to Mitigate Unfairness and Preserve Privacy through Single-Shot Federated Learning

Authors: Chia-Yuan Wu, Frank E. Curtis, Daniel P. Robinson
Subjects: cs.LG, cs.CY, math.OC
Abstract URL: https://arxiv.org/abs/2409.09532
Pdf URL: https://arxiv.org/pdf/2409.09532
Copy Paste: [[2409.09532]] Using Synthetic Data to Mitigate Unfairness and Preserve Privacy through Single-Shot Federated Learning(https://arxiv.org/abs/2409.09532)
Keywords: privacy, federate, fair
Abstract: To address unfairness issues in federated learning (FL), contemporary approaches typically use frequent model parameter updates and transmissions between the clients and server. In such a process, client-specific information (e.g., local dataset size or data-related fairness metrics) must be sent to the server to compute, e.g., aggregation weights. All of this results in high transmission costs and the potential leakage of client information. As an alternative, we propose a strategy that promotes fair predictions across clients without the need to pass information between the clients and server iteratively and prevents client data leakage. For each client, we first use their local dataset to obtain a synthetic dataset by solving a bilevel optimization problem that addresses unfairness concerns during the learning process. We then pass each client's synthetic dataset to the server, the collection of which is used to train the server model using conventional machine learning techniques (that do not take fairness metrics into account). Thus, we eliminate the need to handle fairness-specific aggregation weights while preserving client privacy. Our approach requires only a single communication between the clients and the server, thus making it computationally cost-effective, able to maintain privacy, and able to ensuring fairness. We present empirical evidence to demonstrate the advantages of our approach. The results illustrate that our method effectively uses synthetic data as a means to mitigate unfairness and preserve client privacy.

Title: COMFORT: A Continual Fine-Tuning Framework for Foundation Models Targeted at Consumer Healthcare

Authors: Chia-Hao Li, Niraj K. Jha
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2409.09549
Pdf URL: https://arxiv.org/pdf/2409.09549
Copy Paste: [[2409.09549]] COMFORT: A Continual Fine-Tuning Framework for Foundation Models Targeted at Consumer Healthcare(https://arxiv.org/abs/2409.09549)
Keywords: privacy, transformer
Abstract: Wearable medical sensors (WMSs) are revolutionizing smart healthcare by enabling continuous, real-time monitoring of user physiological signals, especially in the field of consumer healthcare. The integration of WMSs and modern machine learning (ML) enables unprecedented solutions to efficient early-stage disease detection. Despite the success of Transformers in various fields, their application to sensitive domains, such as smart healthcare, remains underexplored due to limited data accessibility and privacy concerns. To bridge the gap between Transformer-based foundation models and WMS-based disease detection, we propose COMFORT, a continual fine-tuning framework for foundation models targeted at consumer healthcare. COMFORT introduces a novel approach for pre-training a Transformer-based foundation model on a large dataset of physiological signals exclusively collected from healthy individuals with commercially available WMSs. We adopt a masked data modeling (MDM) objective to pre-train this health foundation model. We then fine-tune the model using various parameter-efficient fine-tuning (PEFT) methods, such as low-rank adaptation (LoRA) and its variants, to adapt it to various downstream disease detection tasks that rely on WMS data. In addition, COMFORT continually stores the low-rank decomposition matrices obtained from the PEFT algorithms to construct a library for multi-disease detection. The COMFORT library enables scalable and memory-efficient disease detection on edge devices. Our experimental results demonstrate that COMFORT achieves highly competitive performance while reducing memory overhead by up to 52% relative to conventional methods. Thus, COMFORT paves the way for personalized and proactive solutions to efficient and effective early-stage disease detection for consumer healthcare.

Title: ASR Error Correction using Large Language Models

Authors: Rao Ma, Mengjie Qian, Mark Gales, Kate Knill
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.09554
Pdf URL: https://arxiv.org/pdf/2409.09554
Copy Paste: [[2409.09554]] ASR Error Correction using Large Language Models(https://arxiv.org/abs/2409.09554)
Keywords: large language model
Abstract: Error correction (EC) models play a crucial role in refining Automatic Speech Recognition (ASR) transcriptions, enhancing the readability and quality of transcriptions. Without requiring access to the underlying code or model weights, EC can improve performance and provide domain adaptation for black-box ASR systems. This work investigates the use of large language models (LLMs) for error correction across diverse scenarios. 1-best ASR hypotheses are commonly used as the input to EC models. We propose building high-performance EC models using ASR N-best lists which should provide more contextual information for the correction process. Additionally, the generation process of a standard EC model is unrestricted in the sense that any output sequence can be generated. For some scenarios, such as unseen domains, this flexibility may impact performance. To address this, we introduce a constrained decoding approach based on the N-best list or an ASR lattice. Finally, most EC models are trained for a specific ASR system requiring retraining whenever the underlying ASR system is changed. This paper explores the ability of EC models to operate on the output of different ASR systems. This concept is further extended to zero-shot error correction using LLMs, such as ChatGPT. Experiments on three standard datasets demonstrate the efficacy of our proposed methods for both Transducer and attention-based encoder-decoder ASR systems. In addition, the proposed method can serve as an effective method for model ensembling.

Title: A Statistical Viewpoint on Differential Privacy: Hypothesis Testing, Representation and Blackwell's Theorem

Authors: Weijie J. Su
Subjects: cs.CR, cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2409.09558
Pdf URL: https://arxiv.org/pdf/2409.09558
Copy Paste: [[2409.09558]] A Statistical Viewpoint on Differential Privacy: Hypothesis Testing, Representation and Blackwell's Theorem(https://arxiv.org/abs/2409.09558)
Keywords: privacy, robust
Abstract: Differential privacy is widely considered the formal privacy for privacy-preserving data analysis due to its robust and rigorous guarantees, with increasingly broad adoption in public services, academia, and industry. Despite originating in the cryptographic context, in this review paper we argue that, fundamentally, differential privacy can be considered a \textit{pure} statistical concept. By leveraging a theorem due to David Blackwell, our focus is to demonstrate that the definition of differential privacy can be formally motivated from a hypothesis testing perspective, thereby showing that hypothesis testing is not merely convenient but also the right language for reasoning about differential privacy. This insight leads to the definition of $f$-differential privacy, which extends other differential privacy definitions through a representation theorem. We review techniques that render $f$-differential privacy a unified framework for analyzing privacy bounds in data analysis and machine learning. Applications of this differential privacy definition to private deep learning, private convex optimization, shuffled mechanisms, and U.S.~Census data are discussed to highlight the benefits of analyzing privacy bounds under this framework compared to existing alternatives.

Title: Thesis proposal: Are We Losing Textual Diversity to Natural Language Processing?

Authors: Josef Jon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09568
Pdf URL: https://arxiv.org/pdf/2409.09568
Copy Paste: [[2409.09568]] Thesis proposal: Are We Losing Textual Diversity to Natural Language Processing?(https://arxiv.org/abs/2409.09568)
Keywords: large language model
Abstract: This thesis argues that the currently widely used Natural Language Processing algorithms possibly have various limitations related to the properties of the texts they handle and produce. With the wide adoption of these tools in rapid progress, we must ask what these limitations are and what are the possible implications of integrating such tools even more deeply into our daily lives. As a testbed, we have chosen the task of Neural Machine Translation (NMT). Nevertheless, we aim for general insights and outcomes, applicable even to current Large Language Models (LLMs). We ask whether the algorithms used in NMT have inherent inductive biases that are beneficial for most types of inputs but might harm the processing of untypical texts. To explore this hypothesis, we define a set of measures to quantify text diversity based on its statistical properties, like uniformity or rhythmicity of word-level surprisal, on multiple scales (sentence, discourse, language). We then conduct a series of experiments to investigate whether NMT systems struggle with maintaining the diversity of such texts, potentially reducing the richness of the language generated by these systems, compared to human translators. We search for potential causes of these limitations rooted in training objectives and decoding algorithms. Our ultimate goal is to develop alternatives that do not enforce uniformity in the distribution of statistical properties in the output and that allow for better global planning of the translation, taking into account the intrinsic ambiguity of the translation task.

Title: Bias Begets Bias: The Impact of Biased Embeddings on Diffusion Models

Authors: Sahil Kuchlous, Marvin Li, Jeffrey G. Wang
Subjects: cs.LG, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2409.09569
Pdf URL: https://arxiv.org/pdf/2409.09569
Copy Paste: [[2409.09569]] Bias Begets Bias: The Impact of Biased Embeddings on Diffusion Models(https://arxiv.org/abs/2409.09569)
Keywords: protect, fair, diffusion, generative
Abstract: With the growing adoption of Text-to-Image (TTI) systems, the social biases of these models have come under increased scrutiny. Herein we conduct a systematic investigation of one such source of bias for diffusion models: embedding spaces. First, because traditional classifier-based fairness definitions require true labels not present in generative modeling, we propose statistical group fairness criteria based on a model's internal representation of the world. Using these definitions, we demonstrate theoretically and empirically that an unbiased text embedding space for input prompts is a necessary condition for representationally balanced diffusion models, meaning the distribution of generated images satisfy diversity requirements with respect to protected attributes. Next, we investigate the impact of biased embeddings on evaluating the alignment between generated images and prompts, a process which is commonly used to assess diffusion models. We find that biased multimodal embeddings like CLIP can result in lower alignment scores for representationally balanced TTI models, thus rewarding unfair behavior. Finally, we develop a theoretical framework through which biases in alignment evaluation can be studied and propose bias mitigation methods. By specifically adapting the perspective of embedding spaces, we establish new fairness conditions for diffusion model development and evaluation.

Title: NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training

Authors: Yiyi Tao, Zhuoyue Wang, Hang Zhang, Lun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09582
Pdf URL: https://arxiv.org/pdf/2409.09582
Copy Paste: [[2409.09582]] NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training(https://arxiv.org/abs/2409.09582)
Keywords: robust, transformer, large language model
Abstract: The success of Vision Language Models (VLMs) on various vision-language tasks heavily relies on pre-training with large scale web-crawled datasets. However, the noisy and incomplete nature of web data makes dataset scale crucial for performance, rendering end-to-end training increasingly prohibitive. In this paper, we propose NEVLP, a noise-robust framework for efficient vision-language pre-training that requires less pre-training data. Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer and introduce two innovative learning strategies: noise-adaptive learning and concept-enhanced learning to mitigate the impact of noise. In noise-adaptive learning, we estimate the noise probability of each image-text pair based on the transformer's memorization effect and employ noise-adaptive regularization on image-text contrastive learning to condition cross-modal alignment. In concept-enhanced learning, we enrich incomplete text by incorporating visual concepts (objects in the image) to provide prior information about existing objects for image-text matching and image-grounded text generation, thereby mitigating text incompletion. Our framework effectively utilizes noisy web data and achieves state-of-the-art performance with less pre-training data across a wide range of vision-language tasks, including image-text retrieval, image captioning, and visual question answering.

Title: Open-World Test-Time Training: Self-Training with Contrast Learning

Authors: Houcheng Su, Mengzhu Wang, Jiao Li, Bingli Wang, Daixian Liu, Zeheng Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09591
Pdf URL: https://arxiv.org/pdf/2409.09591
Copy Paste: [[2409.09591]] Open-World Test-Time Training: Self-Training with Contrast Learning(https://arxiv.org/abs/2409.09591)
Keywords: robust, extraction
Abstract: Traditional test-time training (TTT) methods, while addressing domain shifts, often assume a consistent class set, limiting their applicability in real-world scenarios characterized by infinite variety. Open-World Test-Time Training (OWTTT) addresses the challenge of generalizing deep learning models to unknown target domain distributions, especially in the presence of strong Out-of-Distribution (OOD) data. Existing TTT methods often struggle to maintain performance when confronted with strong OOD data. In OWTTT, the focus has predominantly been on distinguishing between overall strong and weak OOD data. However, during the early stages of TTT, initial feature extraction is hampered by interference from strong OOD and corruptions, resulting in diminished contrast and premature classification of certain classes as strong OOD. To address this, we introduce Open World Dynamic Contrastive Learning (OWDCL), an innovative approach that utilizes contrastive learning to augment positive sample pairs. This strategy not only bolsters contrast in the early stages but also significantly enhances model robustness in subsequent stages. In comparison datasets, our OWDCL model has produced the most advanced performance.

Title: Security Testbed for Preempting Attacks against Supercomputing Infrastructure

Authors: Phuong Cao, Zbigniew Kalbarczyk, Ravishankar Iyer
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2409.09602
Pdf URL: https://arxiv.org/pdf/2409.09602
Copy Paste: [[2409.09602]] Security Testbed for Preempting Attacks against Supercomputing Infrastructure(https://arxiv.org/abs/2409.09602)
Keywords: security, attack
Abstract: Preempting attacks targeting supercomputing systems before damage remains the top security priority. The main challenge is that noisy attack attempts and unreliable alerts often mask real attacks, causing permanent damages such as system integrity violations and data breaches. This paper describes a security testbed embedded in live traffic of a supercomputer at the National Center for Supercomputing Applications (NCSA). The objective is to demonstrate attack preemption, i.e., stopping system compromise and data breaches at petascale supercomputers. Deployment of our testbed at NCSA enables the following key contributions: 1) Insights from characterizing unique attack patterns found in real security logs of over 200 security incidents curated in the past two decades at NCSA. 2) Deployment of an attack visualization tool to illustrate the challenges of identifying real attacks in HPC environments and to support security operators in interactive attack analyses. 3) Demonstrate the testbed's utility by running novel models, such as Factor Graph-Based models, to preempt a real-world ransomware family.

Title: DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion

Authors: Liao Shen, Tianqi Liu, Huiqiang Sun, Xinyi Ye, Baopu Li, Jianming Zhang, Zhiguo Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09605
Pdf URL: https://arxiv.org/pdf/2409.09605
Copy Paste: [[2409.09605]] DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion(https://arxiv.org/abs/2409.09605)
Keywords: diffusion
Abstract: We study the problem of generating intermediate images from image pairs with large motion while maintaining semantic consistency. Due to the large motion, the intermediate semantic information may be absent in input images. Existing methods either limit to small motion or focus on topologically similar objects, leading to artifacts and inconsistency in the interpolation results. To overcome this challenge, we delve into pre-trained image diffusion models for their capabilities in semantic cognition and representations, ensuring consistent expression of the absent intermediate semantic representations with the input. To this end, we propose DreamMover, a novel image interpolation framework with three main components: 1) A natural flow estimator based on the diffusion model that can implicitly reason about the semantic correspondence between two images. 2) To avoid the loss of detailed information during fusion, our key insight is to fuse information in two parts, high-level space and low-level space. 3) To enhance the consistency between the generated images and input, we propose the self-attention concatenation and replacement approach. Lastly, we present a challenging benchmark dataset InterpBench to evaluate the semantic consistency of generated results. Extensive experiments demonstrate the effectiveness of our method. Our project is available at this https URL .

Title: BULKHEAD: Secure, Scalable, and Efficient Kernel Compartmentalization with PKS

Authors: Yinggang Guo, Zicheng Wang, Weiheng Bai, Qingkai Zeng, Kangjie Lu
Subjects: cs.CR, cs.OS
Abstract URL: https://arxiv.org/abs/2409.09606
Pdf URL: https://arxiv.org/pdf/2409.09606
Copy Paste: [[2409.09606]] BULKHEAD: Secure, Scalable, and Efficient Kernel Compartmentalization with PKS(https://arxiv.org/abs/2409.09606)
Keywords: secure, security, attack
Abstract: The endless stream of vulnerabilities urgently calls for principled mitigation to confine the effect of exploitation. However, the monolithic architecture of commodity OS kernels, like the Linux kernel, allows an attacker to compromise the entire system by exploiting a vulnerability in any kernel component. Kernel compartmentalization is a promising approach that follows the least-privilege principle. However, existing mechanisms struggle with the trade-off on security, scalability, and performance, given the challenges stemming from mutual untrustworthiness among numerous and complex components. In this paper, we present BULKHEAD, a secure, scalable, and efficient kernel compartmentalization technique that offers bi-directional isolation for unlimited compartments. It leverages Intel's new hardware feature PKS to isolate data and code into mutually untrusted compartments and benefits from its fast compartment switching. With untrust in mind, BULKHEAD introduces a lightweight in-kernel monitor that enforces multiple important security invariants, including data integrity, execute-only memory, and compartment interface integrity. In addition, it provides a locality-aware two-level scheme that scales to unlimited compartments. We implement a prototype system on Linux v6.1 to compartmentalize loadable kernel modules (LKMs). Extensive evaluation confirms the effectiveness of our approach. As the system-wide impacts, BULKHEAD incurs an average performance overhead of 2.44% for real-world applications with 160 compartmentalized LKMs. While focusing on a specific compartment, ApacheBench tests on ipv6 show an overhead of less than 2%. Moreover, the performance is almost unaffected by the number of compartments, which makes it highly scalable.

Title: TextureDiffusion: Target Prompt Disentangled Editing for Various Texture Transfer

Authors: Zihan Su, Junhao Zhuang, Chun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09610
Pdf URL: https://arxiv.org/pdf/2409.09610
Copy Paste: [[2409.09610]] TextureDiffusion: Target Prompt Disentangled Editing for Various Texture Transfer(https://arxiv.org/abs/2409.09610)
Keywords: diffusion
Abstract: Recently, text-guided image editing has achieved significant success. However, existing methods can only apply simple textures like wood or gold when changing the texture of an object. Complex textures such as cloud or fire pose a challenge. This limitation stems from that the target prompt needs to contain both the input image content and , restricting the texture representation. In this paper, we propose TextureDiffusion, a tuning-free image editing method applied to various texture transfer. Initially, the target prompt is directly set to "", making the texture disentangled from the input image content to enhance texture representation. Subsequently, query features in self-attention and features in residual blocks are utilized to preserve the structure of the input image. Finally, to maintain the background, we introduce an edit localization technique which blends the self-attention results and the intermediate latents. Comprehensive experiments demonstrate that TextureDiffusion can harmoniously transfer various textures with excellent structure and background preservation.

Title: Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora

Authors: Yungi Kim, Hyunsoo Ha, Sukyung Lee, Jihoo Kim, Seonghoon Yang, Chanjun Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09613
Pdf URL: https://arxiv.org/pdf/2409.09613
Copy Paste: [[2409.09613]] Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora(https://arxiv.org/abs/2409.09613)
Keywords: large language model
Abstract: With the increasing demand for substantial amounts of high-quality data to train large language models (LLMs), efficiently filtering large web corpora has become a critical challenge. For this purpose, KenLM, a lightweight n-gram-based language model that operates on CPUs, is widely used. However, the traditional method of training KenLM utilizes only high-quality data and, consequently, does not explicitly learn the linguistic patterns of low-quality data. To address this issue, we propose an ensemble approach that leverages two contrasting KenLMs: (i) Good KenLM, trained on high-quality data; and (ii) Bad KenLM, trained on low-quality data. Experimental results demonstrate that our approach significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method. This indicates that our method can be a practical solution with minimal computational overhead for resource-constrained environments.

Title: HJ-sampler: A Bayesian sampler for inverse problems of a stochastic process by leveraging Hamilton-Jacobi PDEs and score-based generative models

Authors: Tingwei Meng, Zongren Zou, Jérôme Darbon, George Em Karniadakis
Subjects: cs.LG, math.OC, stat.CO
Abstract URL: https://arxiv.org/abs/2409.09614
Pdf URL: https://arxiv.org/pdf/2409.09614
Copy Paste: [[2409.09614]] HJ-sampler: A Bayesian sampler for inverse problems of a stochastic process by leveraging Hamilton-Jacobi PDEs and score-based generative models(https://arxiv.org/abs/2409.09614)
Keywords: diffusion, generative
Abstract: The interplay between stochastic processes and optimal control has been extensively explored in the literature. With the recent surge in the use of diffusion models, stochastic processes have increasingly been applied to sample generation. This paper builds on the log transform, known as the Cole-Hopf transform in Brownian motion contexts, and extends it within a more abstract framework that includes a linear operator. Within this framework, we found that the well-known relationship between the Cole-Hopf transform and optimal transport is a particular instance where the linear operator acts as the infinitesimal generator of a stochastic process. We also introduce a novel scenario where the linear operator is the adjoint of the generator, linking to Bayesian inference under specific initial and terminal conditions. Leveraging this theoretical foundation, we develop a new algorithm, named the HJ-sampler, for Bayesian inference for the inverse problem of a stochastic differential equation with given terminal observations. The HJ-sampler involves two stages: (1) solving the viscous Hamilton-Jacobi partial differential equations, and (2) sampling from the associated stochastic optimal control problem. Our proposed algorithm naturally allows for flexibility in selecting the numerical solver for viscous HJ PDEs. We introduce two variants of the solver: the Riccati-HJ-sampler, based on the Riccati method, and the SGM-HJ-sampler, which utilizes diffusion models. We demonstrate the effectiveness and flexibility of the proposed methods by applying them to solve Bayesian inverse problems involving various stochastic processes and prior distributions, including applications that address model misspecifications and quantifying model uncertainty.

Title: Enhancing Text Annotation through Rationale-Driven Collaborative Few-Shot Prompting

Authors: Jianfei Wu, Xubin Wang, Weijia Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09615
Pdf URL: https://arxiv.org/pdf/2409.09615
Copy Paste: [[2409.09615]] Enhancing Text Annotation through Rationale-Driven Collaborative Few-Shot Prompting(https://arxiv.org/abs/2409.09615)
Keywords: robust, large language model
Abstract: The traditional data annotation process is often labor-intensive, time-consuming, and susceptible to human bias, which complicates the management of increasingly complex datasets. This study explores the potential of large language models (LLMs) as automated data annotators to improve efficiency and consistency in annotation tasks. By employing rationale-driven collaborative few-shot prompting techniques, we aim to improve the performance of LLMs in text annotation. We conduct a rigorous evaluation of six LLMs across four benchmark datasets, comparing seven distinct methodologies. Our results demonstrate that collaborative methods consistently outperform traditional few-shot techniques and other baseline approaches, particularly in complex annotation tasks. Our work provides valuable insights and a robust framework for leveraging collaborative learning methods to tackle challenging text annotation tasks.

Title: Can Large Language Models Grasp Event Signals? Exploring Pure Zero-Shot Event-based Recognition

Authors: Zongyou Yu, Qiang Qu, Xiaoming Chen, Chen Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09628
Pdf URL: https://arxiv.org/pdf/2409.09628
Copy Paste: [[2409.09628]] Can Large Language Models Grasp Event Signals? Exploring Pure Zero-Shot Event-based Recognition(https://arxiv.org/abs/2409.09628)
Keywords: large language model
Abstract: Recent advancements in event-based zero-shot object recognition have demonstrated promising results. However, these methods heavily depend on extensive training and are inherently constrained by the characteristics of CLIP. To the best of our knowledge, this research is the first study to explore the understanding capabilities of large language models (LLMs) for event-based visual content. We demonstrate that LLMs can achieve event-based object recognition without additional training or fine-tuning in conjunction with CLIP, effectively enabling pure zero-shot event-based recognition. Particularly, we evaluate the ability of GPT-4o / 4turbo and two other open-source LLMs to directly recognize event-based visual content. Extensive experiments are conducted across three benchmark datasets, systematically assessing the recognition accuracy of these models. The results show that LLMs, especially when enhanced with well-designed prompts, significantly improve event-based zero-shot recognition performance. Notably, GPT-4o outperforms the compared models and exceeds the recognition accuracy of state-of-the-art event-based zero-shot methods on N-ImageNet by five orders of magnitude. The implementation of this paper is available at \url{this https URL}.

Title: Confidence Estimation for LLM-Based Dialogue State Tracking

Authors: Yi-Jyun Sun, Suvodip Dey, Dilek Hakkani-Tur, Gokhan Tur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09629
Pdf URL: https://arxiv.org/pdf/2409.09629
Copy Paste: [[2409.09629]] Confidence Estimation for LLM-Based Dialogue State Tracking(https://arxiv.org/abs/2409.09629)
Keywords: large language model
Abstract: Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these with a self-probing mechanism, proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.

Title: A Simple HMM with Self-Supervised Representations for Phone Segmentation

Authors: Gene-Ping Yang, Hao Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09646
Pdf URL: https://arxiv.org/pdf/2409.09646
Copy Paste: [[2409.09646]] A Simple HMM with Self-Supervised Representations for Phone Segmentation(https://arxiv.org/abs/2409.09646)
Keywords: segmentation
Abstract: Despite the recent advance in self-supervised representations, unsupervised phonetic segmentation remains challenging. Most approaches focus on improving phonetic representations with self-supervised learning, with the hope that the improvement can transfer to phonetic segmentation. In this paper, contrary to recent approaches, we show that peak detection on Mel spectrograms is a strong baseline, better than many self-supervised approaches. Based on this finding, we propose a simple hidden Markov model that uses self-supervised representations and features at the boundaries for phone segmentation. Our results demonstrate consistent improvements over previous approaches, with a generalized formulation allowing versatile design adaptations.

Title: SparX: A Sparse Cross-Layer Connection Mechanism for Hierarchical Vision Mamba and Transformer Networks

Authors: Meng Lou, Yunxiang Fu, Yizhou Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09649
Pdf URL: https://arxiv.org/pdf/2409.09649
Copy Paste: [[2409.09649]] SparX: A Sparse Cross-Layer Connection Mechanism for Hierarchical Vision Mamba and Transformer Networks(https://arxiv.org/abs/2409.09649)
Keywords: transformer
Abstract: Due to the capability of dynamic state space models (SSMs) in capturing long-range dependencies with near-linear computational complexity, Mamba has shown notable performance in NLP tasks. This has inspired the rapid development of Mamba-based vision models, resulting in promising results in visual recognition tasks. However, such models are not capable of distilling features across layers through feature aggregation, interaction, and selection. Moreover, existing cross-layer feature aggregation methods designed for CNNs or ViTs are not practical in Mamba-based models due to high computational costs. Therefore, this paper aims to introduce an efficient cross-layer feature aggregation mechanism for Mamba-based vision backbone networks. Inspired by the Retinal Ganglion Cells (RGCs) in the human visual system, we propose a new sparse cross-layer connection mechanism termed SparX to effectively improve cross-layer feature interaction and reuse. Specifically, we build two different types of network layers: ganglion layers and normal layers. The former has higher connectivity and complexity, enabling multi-layer feature aggregation and interaction in an input-dependent manner. In contrast, the latter has lower connectivity and complexity. By interleaving these two types of layers, we design a new vision backbone network with sparsely cross-connected layers, achieving an excellent trade-off among model size, computational cost, memory cost, and accuracy in comparison to its counterparts. For instance, with fewer parameters, SparX-Mamba-T improves the top-1 accuracy of VMamba-T from 82.5% to 83.5%, while SparX-Swin-T achieves a 1.3% increase in top-1 accuracy compared to Swin-T. Extensive experimental results demonstrate that our new connection mechanism possesses both superior performance and generalization capabilities on various vision tasks.

Title: Unveiling Gender Bias in Large Language Models: Using Teacher's Evaluation in Higher Education As an Example

Authors: Yuanning Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09652
Pdf URL: https://arxiv.org/pdf/2409.09652
Copy Paste: [[2409.09652]] Unveiling Gender Bias in Large Language Models: Using Teacher's Evaluation in Higher Education As an Example(https://arxiv.org/abs/2409.09652)
Keywords: large language model
Abstract: This paper investigates gender bias in Large Language Model (LLM)-generated teacher evaluations in higher education setting, focusing on evaluations produced by GPT-4 across six academic subjects. By applying a comprehensive analytical framework that includes Odds Ratio (OR) analysis, Word Embedding Association Test (WEAT), sentiment analysis, and contextual analysis, this paper identified patterns of gender-associated language reflecting societal stereotypes. Specifically, words related to approachability and support were used more frequently for female instructors, while words related to entertainment were predominantly used for male instructors, aligning with the concepts of communal and agentic behaviors. The study also found moderate to strong associations between male salient adjectives and male names, though career and family words did not distinctly capture gender biases. These findings align with prior research on societal norms and stereotypes, reinforcing the notion that LLM-generated text reflects existing biases.

Title: Leveraging Open-Source Large Language Models for Native Language Identification

Authors: Yee Man Ng, Ilia Markov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09659
Pdf URL: https://arxiv.org/pdf/2409.09659
Copy Paste: [[2409.09659]] Leveraging Open-Source Large Language Models for Native Language Identification(https://arxiv.org/abs/2409.09659)
Keywords: transformer, generative, large language model
Abstract: Native Language Identification (NLI) - the task of identifying the native language (L1) of a person based on their writing in the second language (L2) - has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that heavily rely on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and undisclosed nature of training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.

Title: EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models

Authors: Yupeng Chen, Penglin Chen, Xiaoyu Zhang, Yixian Huang, Qian Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09668
Pdf URL: https://arxiv.org/pdf/2409.09668
Copy Paste: [[2409.09668]] EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models(https://arxiv.org/abs/2409.09668)
Keywords: robust, diffusion, generative
Abstract: The rapid development of diffusion models has significantly advanced AI-generated content (AIGC), particularly in Text-to-Image (T2I) and Text-to-Video (T2V) generation. Text-based video editing, leveraging these generative capabilities, has emerged as a promising field, enabling precise modifications to videos based on text prompts. Despite the proliferation of innovative video editing models, there is a conspicuous lack of comprehensive evaluation benchmarks that holistically assess these models' performance across various dimensions. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score, which obscures models' effectiveness on individual editing tasks. To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four dimensions, evaluating models on four task categories and introducing three new metrics to assess fidelity. This task-oriented benchmark facilitates objective evaluation by detailing model performance and providing insights into each model's strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models.

Title: SITSMamba for Crop Classification based on Satellite Image Time Series

Authors: Xiaolei Qin, Xin Su, Liangpei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09673
Pdf URL: https://arxiv.org/pdf/2409.09673
Copy Paste: [[2409.09673]] SITSMamba for Crop Classification based on Satellite Image Time Series(https://arxiv.org/abs/2409.09673)
Keywords: transformer
Abstract: Satellite image time series (SITS) data provides continuous observations over time, allowing for the tracking of vegetation changes and growth patterns throughout the seasons and years. Numerous deep learning (DL) approaches using SITS for crop classification have emerged recently, with the latest approaches adopting Transformer for SITS classification. However, the quadratic complexity of self-attention in Transformer poses challenges for classifying long time series. While the cutting-edge Mamba architecture has demonstrated strength in various domains, including remote sensing image interpretation, its capacity to learn temporal representations in SITS data remains unexplored. Moreover, the existing SITS classification methods often depend solely on crop labels as supervision signals, which fails to fully exploit the temporal information. In this paper, we proposed a Satellite Image Time Series Mamba (SITSMamba) method for crop classification based on remote sensing time series data. The proposed SITSMamba contains a spatial encoder based on Convolutional Neural Networks (CNN) and a Mamba-based temporal encoder. To exploit richer temporal information from SITS, we design two branches of decoder used for different tasks. The first branch is a crop Classification Branch (CBranch), which includes a ConvBlock to decode the feature to a crop map. The second branch is a SITS Reconstruction Branch that uses a Linear layer to transform the encoded feature to predict the original input values. Furthermore, we design a Positional Weight (PW) applied to the RBranch to help the model learn rich latent knowledge from SITS. We also design two weighting factors to control the balance of the two branches during training. The code of SITSMamba is available at: this https URL.

Title: Nebula: Efficient, Private and Accurate Histogram Estimation

Authors: Ali Shahin Shamsabadi, Peter Snyder, Ralph Giles, Aurélien Bellet, Hamed Haddadi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2409.09676
Pdf URL: https://arxiv.org/pdf/2409.09676
Copy Paste: [[2409.09676]] Nebula: Efficient, Private and Accurate Histogram Estimation(https://arxiv.org/abs/2409.09676)
Keywords: privacy
Abstract: We present Nebula, a system for differential private histogram estimation of data distributed among clients. Nebula enables clients to locally subsample and encode their data such that an untrusted server learns only data values that meet an aggregation threshold to satisfy differential privacy guarantees. Compared with other private histogram estimation systems, Nebula uniquely achieves all of the following: \textit{i)} a strict upper bound on privacy leakage; \textit{ii)} client privacy under realistic trust assumptions; \textit{iii)} significantly better utility compared to standard local differential privacy systems; and \textit{iv)} avoiding trusted third-parties, multi-party computation, or trusted hardware. We provide both a formal evaluation of Nebula's privacy, utility and efficiency guarantees, along with an empirical evaluation on three real-world datasets. We demonstrate that clients can encode and upload their data efficiently (only 0.0058 seconds running time and 0.0027 MB data communication) and privately (strong differential privacy guarantees $\varepsilon=1$). On the United States Census dataset, the Nebula's untrusted aggregation server estimates histograms with above 88\% better utility than the existing local deployment of differential privacy. Additionally, we describe a variant that allows clients to submit multi-dimensional data, with similar privacy, utility, and performance. Finally, we provide an open source implementation of Nebula.

Title: E-Commerce Inpainting with Mask Guidance in Controlnet for Reducing Overcompletion

Authors: Guandong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09681
Pdf URL: https://arxiv.org/pdf/2409.09681
Copy Paste: [[2409.09681]] E-Commerce Inpainting with Mask Guidance in Controlnet for Reducing Overcompletion(https://arxiv.org/abs/2409.09681)
Keywords: diffusion
Abstract: E-commerce image generation has always been one of the core demands in the e-commerce field. The goal is to restore the missing background that matches the main product given. In the post-AIGC era, diffusion models are primarily used to generate product images, achieving impressive results. This paper systematically analyzes and addresses a core pain point in diffusion model generation: overcompletion, which refers to the difficulty in maintaining product features. We propose two solutions: 1. Using an instance mask fine-tuned inpainting model to mitigate this phenomenon; 2. Adopting a train-free mask guidance approach, which incorporates refined product masks as constraints when combining ControlNet and UNet to generate the main product, thereby avoiding overcompletion of the product. Our method has achieved promising results in practical applications and we hope it can serve as an inspiring technical report in this field.

Title: Training Safe Neural Networks with Global SDP Bounds

Authors: Roman Soletskyi, David "davidad" Dalrymple
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2409.09687
Pdf URL: https://arxiv.org/pdf/2409.09687
Copy Paste: [[2409.09687]] Training Safe Neural Networks with Global SDP Bounds(https://arxiv.org/abs/2409.09687)
Keywords: robust
Abstract: This paper presents a novel approach to training neural networks with formal safety guarantees using semidefinite programming (SDP) for verification. Our method focuses on verifying safety over large, high-dimensional input regions, addressing limitations of existing techniques that focus on adversarial robustness bounds. We introduce an ADMM-based training scheme for an accurate neural network classifier on the Adversarial Spheres dataset, achieving provably perfect recall with input dimensions up to $d=40$. This work advances the development of reliable neural network verification methods for high-dimensional systems, with potential applications in safe RL policies.

Title: Predicting building types and functions at transnational scale

Authors: Jonas Fill, Michael Eichelbeck, Michael Ebner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.09692
Pdf URL: https://arxiv.org/pdf/2409.09692
Copy Paste: [[2409.09692]] Predicting building types and functions at transnational scale(https://arxiv.org/abs/2409.09692)
Keywords: transformer
Abstract: Building-specific knowledge such as building type and function information is important for numerous energy applications. However, comprehensive datasets containing this information for individual households are missing in many regions of Europe. For the first time, we investigate whether it is feasible to predict building types and functional classes at a European scale based on only open GIS datasets available across countries. We train a graph neural network (GNN) classifier on a large-scale graph dataset consisting of OpenStreetMap (OSM) buildings across the EU, Norway, Switzerland, and the UK. To efficiently perform training using the large-scale graph, we utilize localized subgraphs. A graph transformer model achieves a high Cohen's kappa coefficient of 0.754 when classifying buildings into 9 classes, and a very high Cohen's kappa coefficient of 0.844 when classifying buildings into the residential and non-residential classes. The experimental results imply three core novel contributions to literature. Firstly, we show that building classification across multiple countries is possible using a multi-source dataset consisting of information about 2D building shape, land use, degree of urbanization, and countries as input, and OSM tags as ground truth. Secondly, our results indicate that GNN models that consider contextual information about building neighborhoods improve predictive performance compared to models that only consider individual buildings and ignore the neighborhood. Thirdly, we show that training with GNNs on localized subgraphs instead of standard GNNs improves performance for the task of building classification.

Title: GFlowNet Pretraining with Inexpensive Rewards

Authors: Mohit Pandey, Gopeshh Subbaraj, Emmanuel Bengio
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2409.09702
Pdf URL: https://arxiv.org/pdf/2409.09702
Copy Paste: [[2409.09702]] GFlowNet Pretraining with Inexpensive Rewards(https://arxiv.org/abs/2409.09702)
Keywords: robust, generative
Abstract: Generative Flow Networks (GFlowNets), a class of generative models have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from unnormalized reward distributions. Previous works in this direction often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using offline drug-like molecule datasets, which conditions A-GFNs on inexpensive yet informative molecular descriptors such as drug-likeliness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further our method by implementing a goal-conditioned fine-tuning process, which adapts A-GFNs to optimize for specific target properties. In this work, we pretrain A-GFN on the ZINC15 offline dataset and employ robust evaluation metrics to show the effectiveness of our approach when compared to other relevant baseline methods in drug design.

Title: AlpaPICO: Extraction of PICO Frames from Clinical Trial Documents Using LLMs

Authors: Madhusudan Ghosh, Shrimon Mukherjee, Asmit Ganguly, Partha Basuchowdhuri, Sudip Kumar Naskar, Debasis Ganguly
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09704
Pdf URL: https://arxiv.org/pdf/2409.09704
Copy Paste: [[2409.09704]] AlpaPICO: Extraction of PICO Frames from Clinical Trial Documents Using LLMs(https://arxiv.org/abs/2409.09704)
Keywords: extraction, large language model
Abstract: In recent years, there has been a surge in the publication of clinical trial reports, making it challenging to conduct systematic reviews. Automatically extracting Population, Intervention, Comparator, and Outcome (PICO) from clinical trial studies can alleviate the traditionally time-consuming process of manually scrutinizing systematic reviews. Existing approaches of PICO frame extraction involves supervised approach that relies on the existence of manually annotated data points in the form of BIO label tagging. Recent approaches, such as In-Context Learning (ICL), which has been shown to be effective for a number of downstream NLP tasks, require the use of labeled examples. In this work, we adopt ICL strategy by employing the pretrained knowledge of Large Language Models (LLMs), gathered during the pretraining phase of an LLM, to automatically extract the PICO-related terminologies from clinical trial documents in unsupervised set up to bypass the availability of large number of annotated data instances. Additionally, to showcase the highest effectiveness of LLM in oracle scenario where large number of annotated samples are available, we adopt the instruction tuning strategy by employing Low Rank Adaptation (LORA) to conduct the training of gigantic model in low resource environment for the PICO frame extraction task. Our empirical results show that our proposed ICL-based framework produces comparable results on all the version of EBM-NLP datasets and the proposed instruction tuned version of our framework produces state-of-the-art results on all the different EBM-NLP datasets. Our project is available at \url{this https URL}.

Title: ELSA: Exploiting Layer-wise N:M Sparsity for Vision Transformer Acceleration

Authors: Ning-Chi Huang, Chi-Chih Chang, Wei-Cheng Lin, Endri Taka, Diana Marculescu, Kai-Chiang Wu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09708
Pdf URL: https://arxiv.org/pdf/2409.09708
Copy Paste: [[2409.09708]] ELSA: Exploiting Layer-wise N:M Sparsity for Vision Transformer Acceleration(https://arxiv.org/abs/2409.09708)
Keywords: transformer
Abstract: $N{:}M$ sparsity is an emerging model compression method supported by more and more accelerators to speed up sparse matrix multiplication in deep neural networks. Most existing $N{:}M$ sparsity methods compress neural networks with a uniform setting for all layers in a network or heuristically determine the layer-wise configuration by considering the number of parameters in each layer. However, very few methods have been designed for obtaining a layer-wise customized $N{:}M$ sparse configuration for vision transformers (ViTs), which usually consist of transformer blocks involving the same number of parameters. In this work, to address the challenge of selecting suitable sparse configuration for ViTs on $N{:}M$ sparsity-supporting accelerators, we propose ELSA, Exploiting Layer-wise $N{:}M$ Sparsity for ViTs. Considering not only all $N{:}M$ sparsity levels supported by a given accelerator but also the expected throughput improvement, our methodology can reap the benefits of accelerators supporting mixed sparsity by trading off negligible accuracy loss with both memory usage and inference time reduction for ViT models. For instance, our approach achieves a noteworthy 2.9$\times$ reduction in FLOPs for both Swin-B and DeiT-B with only a marginal degradation of accuracy on ImageNet. Our code will be released upon paper acceptance.

Title: Finetuning CLIP to Reason about Pairwise Differences

Authors: Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2409.09721
Pdf URL: https://arxiv.org/pdf/2409.09721
Copy Paste: [[2409.09721]] Finetuning CLIP to Reason about Pairwise Differences(https://arxiv.org/abs/2409.09721)
Keywords: large language model
Abstract: Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of their purely text-based alternatives. For instance, while text embeddings have been long noted to satisfy \emph{analogies} in embedding space using vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that the differences in image embedding space correspond to \emph{text descriptions of the image differences}, which we synthetically generate with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or constructing attribute-based classifiers, and improved zeroshot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a larger degree of geometric properties in embedding space, such as in text-to-image generation.

Title: MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection

Authors: Yaning Zhang, Tianyi Wang, Zitong Yu, Zan Gao, Linlin Shen, Shengyong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09724
Pdf URL: https://arxiv.org/pdf/2409.09724
Copy Paste: [[2409.09724]] MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection(https://arxiv.org/abs/2409.09724)
Keywords: robust, diffusion
Abstract: The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia, highlighting the urgent need for robust and generalizable face forgery detection (FFD) techniques. Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored, which limits the generalization capability of the model. In addition, most FFD methods tend to identify facial images generated by GAN, but struggle to detect unseen diffusion-synthesized ones. To address the limitations, we aim to leverage the cutting-edge foundation model, contrastive language-image pre-training (CLIP), to achieve generalizable diffusion face forgery detection (DFFD). In this paper, we propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities via language-guided face forgery representation learning, to facilitate the advancement of DFFD. Specifically, we devise a fine-grained language encoder (FLE) that extracts fine global language features from hierarchical text prompts. We design a multi-modal vision encoder (MVE) to capture global image forgery embeddings as well as fine-grained noise forgery patterns extracted from the richest patch, and integrate them to mine general visual forgery traces. Moreover, we build an innovative plug-and-play sample pair attention (SPA) method to emphasize relevant negative pairs and suppress irrelevant ones, allowing cross-modality sample pairs to conduct more flexible alignment. Extensive experiments and visualizations show that our model outperforms the state of the arts on different settings like cross-generator, cross-forgery, and cross-dataset evaluations.

Title: From Challenges and Pitfalls to Recommendations and Opportunities: Implementing Federated Learning in Healthcare

Authors: Ming Li, Pengcheng Xu, Junjie Hu, Zeyu Tang, Guang Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09727
Pdf URL: https://arxiv.org/pdf/2409.09727
Copy Paste: [[2409.09727]] From Challenges and Pitfalls to Recommendations and Opportunities: Implementing Federated Learning in Healthcare(https://arxiv.org/abs/2409.09727)
Keywords: security, privacy, federate
Abstract: Federated learning holds great potential for enabling large-scale healthcare research and collaboration across multiple centres while ensuring data privacy and security are not compromised. Although numerous recent studies suggest or utilize federated learning based methods in healthcare, it remains unclear which ones have potential clinical utility. This review paper considers and analyzes the most recent studies up to May 2024 that describe federated learning based methods in healthcare. After a thorough review, we find that the vast majority are not appropriate for clinical use due to their methodological flaws and/or underlying biases which include but are not limited to privacy concerns, generalization issues, and communication costs. As a result, the effectiveness of federated learning in healthcare is significantly compromised. To overcome these challenges, we provide recommendations and promising opportunities that might be implemented to resolve these problems and improve the quality of model development in federated learning with healthcare.

Title: PersonaMark: Personalized LLM watermarking for model protection and user attribution

Authors: Yuehan Zhang, Peizhuo Lv, Yinpeng Liu, Yongqiang Ma, Wei Lu, Xiaofeng Wang, Xiaozhong Liu, Jiawei Liu
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2409.09739
Pdf URL: https://arxiv.org/pdf/2409.09739
Copy Paste: [[2409.09739]] PersonaMark: Personalized LLM watermarking for model protection and user attribution(https://arxiv.org/abs/2409.09739)
Keywords: protect, watermark
Abstract: The rapid development of LLMs brings both convenience and potential threats. As costumed and private LLMs are widely applied, model copyright protection has become important. Text watermarking is emerging as a promising solution to AI-generated text detection and model protection issues. However, current text watermarks have largely ignored the critical need for injecting different watermarks for different users, which could help attribute the watermark to a specific individual. In this paper, we explore the personalized text watermarking scheme for LLM copyright protection and other scenarios, ensuring accountability and traceability in content generation. Specifically, we propose a novel text watermarking method PersonaMark that utilizes sentence structure as the hidden medium for the watermark information and optimizes the sentence-level generation algorithm to minimize disruption to the model's natural generation process. By employing a personalized hashing function to inject unique watermark signals for different users, personalized watermarked text can be obtained. Since our approach performs on sentence level instead of token probability, the text quality is highly preserved. The injection process of unique watermark signals for different users is time-efficient for a large number of users with the designed multi-user hashing function. As far as we know, we achieved personalized text watermarking for the first time through this. We conduct an extensive evaluation of four different LLMs in terms of perplexity, sentiment polarity, alignment, readability, etc. The results demonstrate that our method maintains performance with minimal perturbation to the model's behavior, allows for unbiased insertion of watermark information, and exhibits strong watermark recognition capabilities.

Title: OML-AD: Online Machine Learning for Anomaly Detection in Time Series Data

Authors: Sebastian Wette, Florian Heinrichs
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2409.09742
Pdf URL: https://arxiv.org/pdf/2409.09742
Copy Paste: [[2409.09742]] OML-AD: Online Machine Learning for Anomaly Detection in Time Series Data(https://arxiv.org/abs/2409.09742)
Keywords: segmentation
Abstract: Time series are ubiquitous and occur naturally in a variety of applications -- from data recorded by sensors in manufacturing processes, over financial data streams to climate data. Different tasks arise, such as regression, classification or segmentation of the time series. However, to reliably solve these challenges, it is important to filter out abnormal observations that deviate from the usual behavior of the time series. While many anomaly detection methods exist for independent data and stationary time series, these methods are not applicable to non-stationary time series. To allow for non-stationarity in the data, while simultaneously detecting anomalies, we propose OML-AD, a novel approach for anomaly detection (AD) based on online machine learning (OML). We provide an implementation of OML-AD within the Python library River and show that it outperforms state-of-the-art baseline methods in terms of accuracy and computational efficiency.

Title: Taming the Ransomware Threats: Leveraging Prospect Theory for Rational Payment Decisions

Authors: Pranjal Sharma
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2409.09744
Pdf URL: https://arxiv.org/pdf/2409.09744
Copy Paste: [[2409.09744]] Taming the Ransomware Threats: Leveraging Prospect Theory for Rational Payment Decisions(https://arxiv.org/abs/2409.09744)
Keywords: attack
Abstract: Day by day, the frequency of ransomware attacks on organizations is experiencing a significant surge. High-profile incidents involving major entities like Las Vegas giants MGM Resorts, Caesar Entertainment, and Boeing underscore the profound impact, posing substantial business barriers. When a sudden cyberattack occurs, organizations often find themselves at a loss, with a looming countdown to pay the ransom, leading to a cascade of impromptu and unfavourable decisions. This paper adopts a novel approach, leveraging Prospect Theory, to elucidate the tactics employed by cyber attackers to entice organizations into paying the ransom. Furthermore, it introduces an algorithm based on Prospect Theory and an Attack Recovery Plan, enabling organizations to make informed decisions on whether to consent to the ransom demands or resist. This algorithm Ransomware Risk Analysis and Decision Support (RADS) uses Prospect Theory to re-instantiate the shifted reference manipulated as perceived gains by attackers and adjusts for the framing effect created due to time urgency. Additionally, leveraging application criticality and incorporating Prospect Theory's insights into under/over weighing probabilities, RADS facilitates informed decision-making that transcends the simplistic framework of "consent" or "resistance," enabling organizations to achieve optimal decisions.

Title: Explore the Hallucination on Low-level Perception for MLLMs

Authors: Yinan Sun, Zicheng Zhang, Haoning Wu, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09748
Pdf URL: https://arxiv.org/pdf/2409.09748
Copy Paste: [[2409.09748]] Explore the Hallucination on Low-level Perception for MLLMs(https://arxiv.org/abs/2409.09748)
Keywords: robust, large language model
Abstract: The rapid development of Multi-modality Large Language Models (MLLMs) has significantly influenced various aspects of industry and daily life, showcasing impressive capabilities in visual perception and understanding. However, these models also exhibit hallucinations, which limit their reliability as AI systems, especially in tasks involving low-level visual perception and understanding. We believe that hallucinations stem from a lack of explicit self-awareness in these models, which directly impacts their overall performance. In this paper, we aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks. To this end, we present QL-Bench, a benchmark settings to simulate human responses to low-level vision, investigating self-awareness in low-level visual perception through visual question answering related to low-level attributes such as clarity and lighting. Specifically, we construct the LLSAVisionQA dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features. Through the evaluation of 15 MLLMs, we demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped. Notably, for the same model, simpler questions are often answered more accurately than complex ones. However, self-awareness appears to improve when addressing more challenging questions. We hope that our benchmark will motivate further research, particularly focused on enhancing the self-awareness of MLLMs in tasks involving low-level visual perception and understanding.

Title: Automated Lesion Segmentation in Whole-Body PET/CT in a multitracer setting

Authors: Qiaoyi Xue, Youdan Feng, Jiayi Liu, Tianming Xu, Kaixin Shen, Chuyun Shen, Yuhang Shi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09766
Pdf URL: https://arxiv.org/pdf/2409.09766
Copy Paste: [[2409.09766]] Automated Lesion Segmentation in Whole-Body PET/CT in a multitracer setting(https://arxiv.org/abs/2409.09766)
Keywords: segmentation
Abstract: This study explores a workflow for automated segmentation of lesions in FDG and PSMA PET/CT images. Due to the substantial differences in image characteristics between FDG and PSMA, specialized preprocessing steps are required. Utilizing YOLOv8 for data classification, the FDG and PSMA images are preprocessed separately before feeding them into the segmentation models, aiming to improve lesion segmentation accuracy. The study focuses on evaluating the performance of automated segmentation workflow for multitracer PET images. The findings are expected to provide critical insights for enhancing diagnostic workflows and patient-specific treatment plans. Our code will be open-sourced and available at this https URL.

Title: Towards Multi-view Graph Anomaly Detection with Similarity-Guided Contrastive Clustering

Authors: Lecheng Zheng, John R. Birge, Yifang Zhang, Jingrui He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.09770
Pdf URL: https://arxiv.org/pdf/2409.09770
Copy Paste: [[2409.09770]] Towards Multi-view Graph Anomaly Detection with Similarity-Guided Contrastive Clustering(https://arxiv.org/abs/2409.09770)
Keywords: robust
Abstract: Anomaly detection on graphs plays an important role in many real-world applications. Usually, these data are composed of multiple types (e.g., user information and transaction records for financial data), thus exhibiting view heterogeneity. Therefore, it can be challenging to leverage such multi-view information and learn the graph's contextual information to identify rare anomalies. To tackle this problem, many deep learning-based methods utilize contrastive learning loss as a regularization term to learn good representations. However, many existing contrastive-based methods show that traditional contrastive learning losses fail to consider the semantic information (e.g., class membership information). In addition, we theoretically show that clustering-based contrastive learning also easily leads to a sub-optimal solution. To address these issues, in this paper, we proposed an autoencoder-based clustering framework regularized by a similarity-guided contrastive loss to detect anomalous nodes. Specifically, we build a similarity map to help the model learn robust representations without imposing a hard margin constraint between the positive and negative pairs. Theoretically, we show that the proposed similarity-guided loss is a variant of contrastive learning loss, and how it alleviates the issue of unreliable pseudo-labels with the connection to graph spectral clustering. Experimental results on several datasets demonstrate the effectiveness and efficiency of our proposed framework.

Title: Generalizing Alignment Paradigm of Text-to-Image Generation with Preferences through $f$-divergence Minimization

Authors: Haoyuan Sun, Bo Xia, Yongzhe Chang, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09774
Pdf URL: https://arxiv.org/pdf/2409.09774
Copy Paste: [[2409.09774]] Generalizing Alignment Paradigm of Text-to-Image Generation with Preferences through $f$-divergence Minimization(https://arxiv.org/abs/2409.09774)
Keywords: large language model
Abstract: Direct Preference Optimization (DPO) has recently expanded its successful application from aligning large language models (LLMs) to aligning text-to-image models with human preferences, which has generated considerable interest within the community. However, we have observed that these approaches rely solely on minimizing the reverse Kullback-Leibler divergence during alignment process between the fine-tuned model and the reference model, neglecting the incorporation of other divergence constraints. In this study, we focus on extending reverse Kullback-Leibler divergence in the alignment paradigm of text-to-image models to $f$-divergence, which aims to garner better alignment performance as well as good generation diversity. We provide the generalized formula of the alignment paradigm under the $f$-divergence condition and thoroughly analyze the impact of different divergence constraints on alignment process from the perspective of gradient fields. We conduct comprehensive evaluation on image-text alignment performance, human value alignment performance and generation diversity performance under different divergence constraints, and the results indicate that alignment based on Jensen-Shannon divergence achieves the best trade-off among them. The option of divergence employed for aligning text-to-image models significantly impacts the trade-off between alignment performance (especially human value alignment) and generation diversity, which highlights the necessity of selecting an appropriate divergence for practical applications.

Title: DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Autonomous Driving

Authors: Haisheng Su, Wei Wu, Junchi Yan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2409.09777
Pdf URL: https://arxiv.org/pdf/2409.09777
Copy Paste: [[2409.09777]] DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Autonomous Driving(https://arxiv.org/abs/2409.09777)
Keywords: diffusion
Abstract: Current end-to-end autonomous driving methods resort to unifying modular designs for various tasks (e.g. perception, prediction and planning). Although optimized in a planning-oriented spirit with a fully differentiable framework, existing end-to-end driving systems without ego-centric designs still suffer from unsatisfactory performance and inferior efficiency, owing to the rasterized scene representation learning and redundant information transmission. In this paper, we revisit the human driving behavior and propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner. The sparse perception module performs detection, tracking and online mapping based on sparse representation of the driving scene. The hierarchical interaction module aims to select the Closest In-Path Vehicle / Stationary (CIPV / CIPS) from coarse to fine, benefiting from an additional geometric prior. As for the iterative motion planner, both selected interactive agents and ego-vehicle are considered for joint motion prediction, where the output multi-modal ego-trajectories are optimized in an iterative fashion. Besides, both position-level motion diffusion and trajectory-level planning denoising are introduced for uncertainty modeling, thus facilitating the training stability and convergence of the whole framework. Extensive experiments conducted on nuScenes dataset demonstrate the superior planning performance and great efficiency of DiFSD, which significantly reduces the average L2 error by \textbf{66\%} and collision rate by \textbf{77\%} than UniAD while achieves \textbf{8.2$\times$} faster running efficiency.

Title: Rewind-to-Delete: Certified Machine Unlearning for Nonconvex Functions

Authors: Siqiao Mu, Diego Klabjan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.09778
Pdf URL: https://arxiv.org/pdf/2409.09778
Copy Paste: [[2409.09778]] Rewind-to-Delete: Certified Machine Unlearning for Nonconvex Functions(https://arxiv.org/abs/2409.09778)
Keywords: privacy
Abstract: Machine unlearning algorithms aim to efficiently remove data from a model without retraining it from scratch, in order to enforce data privacy, remove corrupted or outdated data, or respect a user's ``right to be forgotten." Certified machine unlearning is a strong theoretical guarantee that quantifies the extent to which data is erased from the model weights. Most prior works in certified unlearning focus on models trained on convex or strongly convex loss functions, which benefit from convenient convergence guarantees and the existence of global minima. For nonconvex objectives, existing algorithms rely on limiting assumptions and expensive computations that hinder practical implementations. In this work, we propose a simple first-order algorithm for unlearning on general nonconvex loss functions which unlearns by ``rewinding" to an earlier step during the learning process and then performs gradient descent on the loss function of the retained data points. Our algorithm is black-box, in that it can be directly applied to models pretrained with vanilla gradient descent with no prior consideration of unlearning. We prove $(\epsilon, \delta)$ certified unlearning and performance guarantees that establish the privacy-utility-complexity tradeoff of our algorithm, with special consideration for nonconvex functions that satisfy the Polyak-Lojasiewicz inequality.

Title: Underwater Image Enhancement via Dehazing and Color Restoration

Authors: Chengqin Wu, Shuai Yu, Qingson Hu, Jingxiang Xu, Lijun Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2409.09779
Pdf URL: https://arxiv.org/pdf/2409.09779
Copy Paste: [[2409.09779]] Underwater Image Enhancement via Dehazing and Color Restoration(https://arxiv.org/abs/2409.09779)
Keywords: extraction, transformer
Abstract: With the rapid development of marine engineering projects such as marine resource extraction and oceanic surveys, underwater visual imaging and analysis has become a critical technology. Unfortunately, due to the inevitable non-linear attenuation of light in underwater environments, underwater images and videos often suffer from low contrast, blurriness, and color degradation, which significantly complicate the subsequent research. Existing underwater image enhancement methods often treat the haze and color cast as a unified degradation process and disregard their independence and interdependence, which limits the performance improvement. Here, we propose a Vision Transformer (ViT)-based network (referred to as WaterFormer) to improve the underwater image quality. WaterFormer contains three major components: a dehazing block (DehazeFormer Block) to capture the self-correlated haze features and extract deep-level features, a Color Restoration Block (CRB) to capture self-correlated color cast features, and a Channel Fusion Block (CFB) to capture fusion features within the network. To ensure authenticity, a soft reconstruction layer based on the underwater imaging physics model is included. To improve the quality of the enhanced images, we introduce the Chromatic Consistency Loss and Sobel Color Loss to train the network. Comprehensive experimental results demonstrate that WaterFormer outperforms other state-of-the-art methods in enhancing underwater images.

Title: Enhancing Lesion Segmentation in PET/CT Imaging with Deep Learning and Advanced Data Preprocessing Techniques

Authors: Jiayi Liu, Qiaoyi Xue, Youdan Feng, Tianming Xu, Kaixin Shen, Chuyun Shen, Yuhang Shi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09784
Pdf URL: https://arxiv.org/pdf/2409.09784
Copy Paste: [[2409.09784]] Enhancing Lesion Segmentation in PET/CT Imaging with Deep Learning and Advanced Data Preprocessing Techniques(https://arxiv.org/abs/2409.09784)
Keywords: robust, segmentation
Abstract: The escalating global cancer burden underscores the critical need for precise diagnostic tools in oncology. This research employs deep learning to enhance lesion segmentation in PET/CT imaging, utilizing a dataset of 900 whole-body FDG-PET/CT and 600 PSMA-PET/CT studies from the AutoPET challenge III. Our methodical approach includes robust preprocessing and data augmentation techniques to ensure model robustness and generalizability. We investigate the influence of non-zero normalization and modifications to the data augmentation pipeline, such as the introduction of RandGaussianSharpen and adjustments to the Gamma transform parameter. This study aims to contribute to the standardization of preprocessing and augmentation strategies in PET/CT imaging, potentially improving the diagnostic accuracy and the personalized management of cancer patients. Our code will be open-sourced and available at this https URL.

Title: Large Language Model Based Generative Error Correction: A Challenge and Baselines forSpeech Recognition, Speaker Tagging, and Emotion Recognition

Authors: Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.09785
Pdf URL: https://arxiv.org/pdf/2409.09785
Copy Paste: [[2409.09785]] Large Language Model Based Generative Error Correction: A Challenge and Baselines forSpeech Recognition, Speaker Tagging, and Emotion Recognition(https://arxiv.org/abs/2409.09785)
Keywords: generative, large language model
Abstract: Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.

Title: BEnDEM:A Boltzmann Sampler Based on Bootstrapped Denoising Energy Matching

Authors: RuiKang OuYang, Bo Qiang, José Miguel Hernández-Lobato
Subjects: cs.LG, cs.AI, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2409.09787
Pdf URL: https://arxiv.org/pdf/2409.09787
Copy Paste: [[2409.09787]] BEnDEM:A Boltzmann Sampler Based on Bootstrapped Denoising Energy Matching(https://arxiv.org/abs/2409.09787)
Keywords: robust, diffusion
Abstract: Developing an efficient sampler capable of generating independent and identically distributed (IID) samples from a Boltzmann distribution is a crucial challenge in scientific research, e.g. molecular dynamics. In this work, we intend to learn neural samplers given energy functions instead of data sampled from the Boltzmann distribution. By learning the energies of the noised data, we propose a diffusion-based sampler, ENERGY-BASED DENOISING ENERGY MATCHING, which theoretically has lower variance and more complexity compared to related works. Furthermore, a novel bootstrapping technique is applied to EnDEM to balance between bias and variance. We evaluate EnDEM and BEnDEM on a 2-dimensional 40 Gaussian Mixture Model (GMM) and a 4-particle double-welling potential (DW-4). The experimental results demonstrate that BEnDEM can achieve state-of-the-art performance while being more robust.

Title: Multiple Rotation Averaging with Constrained Reweighting Deep Matrix Factorization

Authors: Shiqi Li, Jihua Zhu, Yifan Xie, Naiwen Hu, Mingchen Zhu, Zhongyu Li, Di Wang
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2409.09790
Pdf URL: https://arxiv.org/pdf/2409.09790
Copy Paste: [[2409.09790]] Multiple Rotation Averaging with Constrained Reweighting Deep Matrix Factorization(https://arxiv.org/abs/2409.09790)
Keywords: robust
Abstract: Multiple rotation averaging plays a crucial role in computer vision and robotics domains. The conventional optimization-based methods optimize a nonlinear cost function based on certain noise assumptions, while most previous learning-based methods require ground truth labels in the supervised training process. Recognizing the handcrafted noise assumption may not be reasonable in all real-world scenarios, this paper proposes an effective rotation averaging method for mining data patterns in a learning manner while avoiding the requirement of labels. Specifically, we apply deep matrix factorization to directly solve the multiple rotation averaging problem in unconstrained linear space. For deep matrix factorization, we design a neural network model, which is explicitly low-rank and symmetric to better suit the background of multiple rotation averaging. Meanwhile, we utilize a spanning tree-based edge filtering to suppress the influence of rotation outliers. What's more, we also adopt a reweighting scheme and dynamic depth selection strategy to further improve the robustness. Our method synthesizes the merit of both optimization-based and learning-based methods. Experimental results on various datasets validate the effectiveness of our proposed method.

Title: Enhancing Data Quality through Self-learning on Imbalanced Financial Risk Data

Authors: Xu Sun, Zixuan Qin, Shun Zhang, Yuexian Wang, Li Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.09792
Pdf URL: https://arxiv.org/pdf/2409.09792
Copy Paste: [[2409.09792]] Enhancing Data Quality through Self-learning on Imbalanced Financial Risk Data(https://arxiv.org/abs/2409.09792)
Keywords: robust
Abstract: In the financial risk domain, particularly in credit default prediction and fraud detection, accurate identification of high-risk class instances is paramount, as their occurrence can have significant economic implications. Although machine learning models have gained widespread adoption for risk prediction, their performance is often hindered by the scarcity and diversity of high-quality data. This limitation stems from factors in datasets such as small risk sample sizes, high labeling costs, and severe class imbalance, which impede the models' ability to learn effectively and accurately forecast critical events. This study investigates data pre-processing techniques to enhance existing financial risk datasets by introducing TriEnhance, a straightforward technique that entails: (1) generating synthetic samples specifically tailored to the minority class, (2) filtering using binary feedback to refine samples, and (3) self-learning with pseudo-labels. Our experiments across six benchmark datasets reveal the efficacy of TriEnhance, with a notable focus on improving minority class calibration, a key factor for developing more robust financial risk prediction systems.

Title: Federated Learning in Adversarial Environments: Testbed Design and Poisoning Resilience in Cybersecurity

Authors: Hao Jian Huang, Bekzod Iskandarov, Mizanur Rahman, Hakan T. Otal, M. Abdullah Canbaz
Subjects: cs.CR, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09794
Pdf URL: https://arxiv.org/pdf/2409.09794
Copy Paste: [[2409.09794]] Federated Learning in Adversarial Environments: Testbed Design and Poisoning Resilience in Cybersecurity(https://arxiv.org/abs/2409.09794)
Keywords: security, privacy, attack, robust, federate
Abstract: This paper presents the design and implementation of a Federated Learning (FL) testbed, focusing on its application in cybersecurity and evaluating its resilience against poisoning attacks. Federated Learning allows multiple clients to collaboratively train a global model while keeping their data decentralized, addressing critical needs for data privacy and security, particularly in sensitive fields like cybersecurity. Our testbed, built using the Flower framework, facilitates experimentation with various FL frameworks, assessing their performance, scalability, and ease of integration. Through a case study on federated intrusion detection systems, we demonstrate the testbed's capabilities in detecting anomalies and securing critical infrastructure without exposing sensitive network data. Comprehensive poisoning tests, targeting both model and data integrity, evaluate the system's robustness under adversarial conditions. Our results show that while federated learning enhances data privacy and distributed learning, it remains vulnerable to poisoning attacks, which must be mitigated to ensure its reliability in real-world applications.

Title: Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion

Authors: Hui Shen, Zhongwei Wan, Xin Wang, Mi Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09808
Pdf URL: https://arxiv.org/pdf/2409.09808
Copy Paste: [[2409.09808]] Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion(https://arxiv.org/abs/2409.09808)
Keywords: transformer
Abstract: Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to methods based on Transformer architecture. This work introduces Fast Mamba for Vision (Famba-V), a cross-layer token fusion technique to enhance the training efficiency of Vim models. The key idea of Famba-V is to identify and fuse similar tokens across different Vim layers based on a suit of cross-layer strategies instead of simply applying token fusion uniformly across all the layers that existing works propose. We evaluate the performance of Famba-V on CIFAR-100. Our results show that Famba-V is able to enhance the training efficiency of Vim models by reducing both training time and peak memory usage during training. Moreover, the proposed cross-layer strategies allow Famba-V to deliver superior accuracy-efficiency trade-offs. These results all together demonstrate Famba-V as a promising efficiency enhancement technique for Vim models.

Title: PROSE-FD: A Multimodal PDE Foundation Model for Learning Multiple Operators for Forecasting Fluid Dynamics

Authors: Yuxuan Liu, Jingmin Sun, Xinjie He, Griffin Pinney, Zecheng Zhang, Hayden Schaeffer
Subjects: cs.LG, math.NA, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2409.09811
Pdf URL: https://arxiv.org/pdf/2409.09811
Copy Paste: [[2409.09811]] PROSE-FD: A Multimodal PDE Foundation Model for Learning Multiple Operators for Forecasting Fluid Dynamics(https://arxiv.org/abs/2409.09811)
Keywords: transformer
Abstract: We propose PROSE-FD, a zero-shot multimodal PDE foundational model for simultaneous prediction of heterogeneous two-dimensional physical systems related to distinct fluid dynamics settings. These systems include shallow water equations and the Navier-Stokes equations with incompressible and compressible flow, regular and complex geometries, and different buoyancy settings. This work presents a new transformer-based multi-operator learning approach that fuses symbolic information to perform operator-based data prediction, i.e. non-autoregressive. By incorporating multiple modalities in the inputs, the PDE foundation model builds in a pathway for including mathematical descriptions of the physical behavior. We pre-train our foundation model on 6 parametric families of equations collected from 13 datasets, including over 60K trajectories. Our model outperforms popular operator learning, computer vision, and multi-physics models, in benchmark forward prediction tasks. We test our architecture choices with ablation studies.

Title: Causal Inference with Large Language Model: A Survey

Authors: Jing Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09822
Pdf URL: https://arxiv.org/pdf/2409.09822
Copy Paste: [[2409.09822]] Causal Inference with Large Language Model: A Survey(https://arxiv.org/abs/2409.09822)
Keywords: large language model
Abstract: Causal inference has been a pivotal challenge across diverse domains such as medicine and economics, demanding a complicated integration of human knowledge, mathematical reasoning, and data mining capabilities. Recent advancements in natural language processing (NLP), particularly with the advent of large language models (LLMs), have introduced promising opportunities for traditional causal inference tasks. This paper reviews recent progress in applying LLMs to causal inference, encompassing various tasks spanning different levels of causation. We summarize the main causal problems and approaches, and present a comparison of their evaluation results in different causal scenarios. Furthermore, we discuss key findings and outline directions for future research, underscoring the potential implications of integrating LLMs in advancing causal inference methodologies.

Title: GP-GPT: Large Language Model for Gene-Phenotype Mapping

Authors: Yanjun Lyu, Zihao Wu, Lu Zhang, Jing Zhang, Yiwei Li, Wei Ruan, Zhengliang Liu, Xiaowei Yu, Chao Cao, Tong Chen, Minheng Chen, Yan Zhuang, Xiang Li, Rongjie Liu, Chao Huang, Wentao Li, Tianming Liu, Dajiang Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09825
Pdf URL: https://arxiv.org/pdf/2409.09825
Copy Paste: [[2409.09825]] GP-GPT: Large Language Model for Gene-Phenotype Mapping(https://arxiv.org/abs/2409.09825)
Keywords: large language model
Abstract: Pre-trained large language models(LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-sources genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical field. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and relationship determination. Comparative experiments across domain-specific tasks reveal that GP-GPT outperforms state-of-the-art LLMs, including Llama2, Llama3 and GPT-4. These results highlight GP-GPT's potential to enhance genetic disease relation research and facilitate accurate and efficient analysis in the fields of genomics and medical genetics. Our investigation demonstrated the subtle changes of bio-factor entities' representations in the GP-GPT, which suggested the opportunities for the application of LLMs to advancing gene-phenotype research.

Title: Latent Diffusion Models for Controllable RNA Sequence Generation

Authors: Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2409.09828
Pdf URL: https://arxiv.org/pdf/2409.09828
Copy Paste: [[2409.09828]] Latent Diffusion Models for Controllable RNA Sequence Generation(https://arxiv.org/abs/2409.09828)
Keywords: diffusion
Abstract: This paper presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences. RNA is a particularly dynamic and versatile molecule in biological processes. RNA sequences exhibit high variability and diversity, characterized by their variable lengths, flexible three-dimensional structures, and diverse functions. We utilize pretrained BERT-type models to encode raw RNAs into token-level biologically meaningful representations. A Q-Former is employed to compress these representations into a fixed-length set of latent vectors, with an autoregressive decoder trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we train reward networks to estimate functional properties of RNA from the latent variables. We employ gradient-based guidance during the backward diffusion process, aiming to generate RNA sequences that are optimized for higher rewards. Empirical experiments confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological indicators. We fine-tuned the diffusion model on untranslated regions (UTRs) of mRNA and optimize sample sequences for protein translation efficiencies. Our guided diffusion model effectively generates diverse UTR sequences with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), surpassing baselines. These results hold promise for studies on RNA sequence-function relationships, protein synthesis, and enhancing therapeutic RNA design.

Title: Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling

Authors: Samuel Belkadi, Libo Ren, Nicolo Micheletti, Lifeng Han, Goran Nenadic
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09831
Pdf URL: https://arxiv.org/pdf/2409.09831
Copy Paste: [[2409.09831]] Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling(https://arxiv.org/abs/2409.09831)
Keywords: privacy, protect
Abstract: In this paper, we present a system that generates synthetic free-text medical records, such as discharge summaries, admission notes and doctor correspondences, using Masked Language Modeling (MLM). Our system is designed to preserve the critical information of the records while introducing significant diversity and minimizing re-identification risk. The system incorporates a de-identification component that uses Philter to mask Protected Health Information (PHI), followed by a Medical Entity Recognition (NER) model to retain key medical information. We explore various masking ratios and mask-filling techniques to balance the trade-off between diversity and fidelity in the synthetic outputs without affecting overall readability. Our results demonstrate that the system can produce high-quality synthetic data with significant diversity while achieving a HIPAA-compliant PHI recall rate of 0.96 and a low re-identification risk of 0.035. Furthermore, downstream evaluations using a NER task reveal that the synthetic data can be effectively used to train models with performance comparable to those trained on real data. The flexibility of the system allows it to be adapted for specific use cases, making it a valuable tool for privacy-preserving data generation in medical research and healthcare applications.

Title: A Benchmark Dataset with Larger Context for Non-Factoid Question Answering over Islamic Text

Authors: Faiza Qamar, Seemab Latif, Rabia Latif
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09844
Pdf URL: https://arxiv.org/pdf/2409.09844
Copy Paste: [[2409.09844]] A Benchmark Dataset with Larger Context for Non-Factoid Question Answering over Islamic Text(https://arxiv.org/abs/2409.09844)
Keywords: robust
Abstract: Accessing and comprehending religious texts, particularly the Quran (the sacred scripture of Islam) and Ahadith (the corpus of the sayings or traditions of the Prophet Muhammad), in today's digital era necessitates efficient and accurate Question-Answering (QA) systems. Yet, the scarcity of QA systems tailored specifically to the detailed nature of inquiries about the Quranic Tafsir (explanation, interpretation, context of Quran for clarity) and Ahadith poses significant challenges. To address this gap, we introduce a comprehensive dataset meticulously crafted for QA purposes within the domain of Quranic Tafsir and Ahadith. This dataset comprises a robust collection of over 73,000 question-answer pairs, standing as the largest reported dataset in this specialized domain. Importantly, both questions and answers within the dataset are meticulously enriched with contextual information, serving as invaluable resources for training and evaluating tailored QA systems. However, while this paper highlights the dataset's contributions and establishes a benchmark for evaluating QA performance in the Quran and Ahadith domains, our subsequent human evaluation uncovered critical insights regarding the limitations of existing automatic evaluation techniques. The discrepancy between automatic evaluation metrics, such as ROUGE scores, and human assessments became apparent. The human evaluation indicated significant disparities: the model's verdict consistency with expert scholars ranged between 11% to 20%, while its contextual understanding spanned a broader spectrum of 50% to 90%. These findings underscore the necessity for evaluation techniques that capture the nuances and complexities inherent in understanding religious texts, surpassing the limitations of traditional automatic metrics.

Title: A Survey of Out-of-distribution Generalization for Graph Machine Learning from a Causal View

Authors: Jing Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09858
Pdf URL: https://arxiv.org/pdf/2409.09858
Copy Paste: [[2409.09858]] A Survey of Out-of-distribution Generalization for Graph Machine Learning from a Causal View(https://arxiv.org/abs/2409.09858)
Keywords: robust, fair
Abstract: Graph machine learning (GML) has been successfully applied across a wide range of tasks. Nonetheless, GML faces significant challenges in generalizing over out-of-distribution (OOD) data, which raises concerns about its wider applicability. Recent advancements have underscored the crucial role of causality-driven approaches in overcoming these generalization challenges. Distinct from traditional GML methods that primarily rely on statistical dependencies, causality-focused strategies delve into the underlying causal mechanisms of data generation and model prediction, thus significantly improving the generalization of GML across different environments. This paper offers a thorough review of recent progress in causality-involved GML generalization. We elucidate the fundamental concepts of employing causality to enhance graph model generalization and categorize the various approaches, providing detailed descriptions of their methodologies and the connections among them. Furthermore, we explore the incorporation of causality in other related important areas of trustworthy GML, such as explanation, fairness, and robustness. Concluding with a discussion on potential future research directions, this review seeks to articulate the continuing development and future potential of causality in enhancing the trustworthiness of graph machine learning.

Title: Revisiting Physical-World Adversarial Attack on Traffic Sign Recognition: A Commercial Systems Perspective

Authors: Ningfei Wang, Shaoyuan Xie, Takami Sato, Yunpeng Luo, Kaidi Xu, Qi Alfred Chen
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2409.09860
Pdf URL: https://arxiv.org/pdf/2409.09860
Copy Paste: [[2409.09860]] Revisiting Physical-World Adversarial Attack on Traffic Sign Recognition: A Commercial Systems Perspective(https://arxiv.org/abs/2409.09860)
Keywords: attack
Abstract: Traffic Sign Recognition (TSR) is crucial for safe and correct driving automation. Recent works revealed a general vulnerability of TSR models to physical-world adversarial attacks, which can be low-cost, highly deployable, and capable of causing severe attack effects such as hiding a critical traffic sign or spoofing a fake one. However, so far existing works generally only considered evaluating the attack effects on academic TSR models, leaving the impacts of such attacks on real-world commercial TSR systems largely unclear. In this paper, we conduct the first large-scale measurement of physical-world adversarial attacks against commercial TSR systems. Our testing results reveal that it is possible for existing attack works from academia to have highly reliable (100\%) attack success against certain commercial TSR system functionality, but such attack capabilities are not generalizable, leading to much lower-than-expected attack success rates overall. We find that one potential major factor is a spatial memorization design that commonly exists in today's commercial TSR systems. We design new attack success metrics that can mathematically model the impacts of such design on the TSR system-level attack success, and use them to revisit existing attacks. Through these efforts, we uncover 7 novel observations, some of which directly challenge the observations or claims in prior works due to the introduction of the new metrics.

Title: Towards Kinetic Manipulation of the Latent Space

Authors: Diego Porres
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09867
Pdf URL: https://arxiv.org/pdf/2409.09867
Copy Paste: [[2409.09867]] Towards Kinetic Manipulation of the Latent Space(https://arxiv.org/abs/2409.09867)
Keywords: extraction, generative
Abstract: The latent space of many generative models are rich in unexplored valleys and mountains. The majority of tools used for exploring them are so far limited to Graphical User Interfaces (GUIs). While specialized hardware can be used for this task, we show that a simple feature extraction of pre-trained Convolutional Neural Networks (CNNs) from a live RGB camera feed does a very good job at manipulating the latent space with simple changes in the scene, with vast room for improvement. We name this new paradigm Visual-reactive Interpolation, and the full code can be found at this https URL.

Title: REG: Refined Generalized Focal Loss for Road Asset Detection on Thai Highways Using Vision-Based Detection and Segmentation Models

Authors: Teerapong Panboonyuen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09877
Pdf URL: https://arxiv.org/pdf/2409.09877
Copy Paste: [[2409.09877]] REG: Refined Generalized Focal Loss for Road Asset Detection on Thai Highways Using Vision-Based Detection and Segmentation Models(https://arxiv.org/abs/2409.09877)
Keywords: robust, segmentation
Abstract: This paper introduces a novel framework for detecting and segmenting critical road assets on Thai highways using an advanced Refined Generalized Focal Loss (REG) formulation. Integrated into state-of-the-art vision-based detection and segmentation models, the proposed method effectively addresses class imbalance and the challenges of localizing small, underrepresented road elements, including pavilions, pedestrian bridges, information signs, single-arm poles, bus stops, warning signs, and concrete guardrails. To improve both detection and segmentation accuracy, a multi-task learning strategy is adopted, optimizing REG across multiple tasks. REG is further enhanced by incorporating a spatial-contextual adjustment term, which accounts for the spatial distribution of road assets, and a probabilistic refinement that captures prediction uncertainty in complex environments, such as varying lighting conditions and cluttered backgrounds. Our rigorous mathematical formulation demonstrates that REG minimizes localization and classification errors by applying adaptive weighting to hard-to-detect instances while down-weighting easier examples. Experimental results show a substantial performance improvement, achieving a mAP50 of 80.34 and an F1-score of 77.87, significantly outperforming conventional methods. This research underscores the capability of advanced loss function refinements to enhance the robustness and accuracy of road asset detection and segmentation, thereby contributing to improved road safety and infrastructure management. For an in-depth discussion of the mathematical background and related methods, please refer to previous work available at \url{this https URL}.

Title: Proximal Ranking Policy Optimization for Practical Safety in Counterfactual Learning to Rank

Authors: Shashank Gupta, Harrie Oosterhuis, Maarten de Rijke
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2409.09881
Pdf URL: https://arxiv.org/pdf/2409.09881
Copy Paste: [[2409.09881]] Proximal Ranking Policy Optimization for Practical Safety in Counterfactual Learning to Rank(https://arxiv.org/abs/2409.09881)
Keywords: robust
Abstract: Counterfactual learning to rank (CLTR) can be risky and, in various circumstances, can produce sub-optimal models that hurt performance when deployed. Safe CLTR was introduced to mitigate these risks when using inverse propensity scoring to correct for position bias. However, the existing safety measure for CLTR is not applicable to state-of-the-art CLTR methods, cannot handle trust bias, and relies on specific assumptions about user behavior. We propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model. Thereby, PRPO imposes a limit on how much learned models can degrade performance metrics, without relying on any specific user assumptions. Our experiments show that PRPO provides higher performance than the existing safe inverse propensity scoring approach. PRPO always maintains safety, even in maximally adversarial situations. By avoiding assumptions, PRPO is the first method with unconditional safety in deployment that translates to robust safety for real-world applications.

Title: Flexible Diffusion Scopes with Parameterized Laplacian for Heterophilic Graph Learning

Authors: Qincheng Lu, Jiaqi Zhu, Sitao Luan, Xiao-Wen Chang
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2409.09888
Pdf URL: https://arxiv.org/pdf/2409.09888
Copy Paste: [[2409.09888]] Flexible Diffusion Scopes with Parameterized Laplacian for Heterophilic Graph Learning(https://arxiv.org/abs/2409.09888)
Keywords: diffusion
Abstract: The ability of Graph Neural Networks (GNNs) to capture long-range and global topology information is limited by the scope of conventional graph Laplacian, leading to unsatisfactory performance on some datasets, particularly on heterophilic graphs. To address this limitation, we propose a new class of parameterized Laplacian matrices, which provably offers more flexibility in controlling the diffusion distance between nodes than the conventional graph Laplacian, allowing long-range information to be adaptively captured through diffusion on graph. Specifically, we first prove that the diffusion distance and spectral distance on graph have an order-preserving relationship. With this result, we demonstrate that the parameterized Laplacian can accelerate the diffusion of long-range information, and the parameters in the Laplacian enable flexibility of the diffusion scopes. Based on the theoretical results, we propose topology-guided rewiring mechanism to capture helpful long-range neighborhood information for heterophilic graphs. With this mechanism and the new Laplacian, we propose two GNNs with flexible diffusion scopes: namely the Parameterized Diffusion based Graph Convolutional Networks (PD-GCN) and Graph Attention Networks (PD-GAT). Synthetic experiments reveal the high correlations between the parameters of the new Laplacian and the performance of parameterized GNNs under various graph homophily levels, which verifies that our new proposed GNNs indeed have the ability to adjust the parameters to adaptively capture the global information for different levels of heterophilic graphs. They also outperform the state-of-the-art (SOTA) models on 6 out of 7 real-world benchmark datasets, which further confirms their superiority.

Title: Resolving Inconsistent Semantics in Multi-Dataset Image Segmentation

Authors: Qilong Zhangli, Di Liu, Abhishek Aich, Dimitris Metaxas, Samuel Schulter
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09893
Pdf URL: https://arxiv.org/pdf/2409.09893
Copy Paste: [[2409.09893]] Resolving Inconsistent Semantics in Multi-Dataset Image Segmentation(https://arxiv.org/abs/2409.09893)
Keywords: robust, segmentation
Abstract: Leveraging multiple training datasets to scale up image segmentation models is beneficial for increasing robustness and semantic understanding. Individual datasets have well-defined ground truth with non-overlapping mask layouts and mutually exclusive semantics. However, merging them for multi-dataset training disrupts this harmony and leads to semantic inconsistencies; for example, the class "person" in one dataset and class "face" in another will require multilabel handling for certain pixels. Existing methods struggle with this setting, particularly when evaluated on label spaces mixed from the individual training sets. To overcome these issues, we introduce a simple yet effective multi-dataset training approach by integrating language-based embeddings of class names and label space-specific query embeddings. Our method maintains high performance regardless of the underlying inconsistencies between training datasets. Notably, on four benchmark datasets with label space inconsistencies during inference, we outperform previous methods by 1.6% mIoU for semantic segmentation, 9.1% PQ for panoptic segmentation, 12.1% AP for instance segmentation, and 3.0% in the newly proposed PIQ metric.

Title: Estimating Wage Disparities Using Foundation Models

Authors: Keyon Vafa, Susan Athey, David M. Blei
Subjects: cs.LG, econ.EM, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2409.09894
Pdf URL: https://arxiv.org/pdf/2409.09894
Copy Paste: [[2409.09894]] Estimating Wage Disparities Using Foundation Models(https://arxiv.org/abs/2409.09894)
Keywords: large language model
Abstract: One thread of empirical work in social science focuses on decomposing group differences in outcomes into unexplained components and components explained by observable factors. In this paper, we study gender wage decompositions, which require estimating the portion of the gender wage gap explained by career histories of workers. Classical methods for decomposing the wage gap employ simple predictive models of wages which condition on a small set of simple summaries of labor history. The problem is that these predictive models cannot take advantage of the full complexity of a worker's history, and the resulting decompositions thus suffer from omitted variable bias (OVB), where covariates that are correlated with both gender and wages are not included in the model. Here we explore an alternative methodology for wage gap decomposition that employs powerful foundation models, such as large language models, as the predictive engine. Foundation models excel at making accurate predictions from complex, high-dimensional inputs. We use a custom-built foundation model, designed to predict wages from full labor histories, to decompose the gender wage gap. We prove that the way such models are usually trained might still lead to OVB, but develop fine-tuning algorithms that empirically mitigate this issue. Our model captures a richer representation of career history than simple models and predicts wages more accurately. In detail, we first provide a novel set of conditions under which an estimator of the wage gap based on a fine-tuned foundation model is $\sqrt{n}$-consistent. Building on the theory, we then propose methods for fine-tuning foundation models that minimize OVB. Using data from the Panel Study of Income Dynamics, we find that history explains more of the gender wage gap than standard econometric models can measure, and we identify elements of history that are important for reducing OVB.

Title: GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion

Authors: Vitor Guizilini, Pavel Tokmakov, Achal Dave, Rares Ambrus
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2409.09896
Pdf URL: https://arxiv.org/pdf/2409.09896
Copy Paste: [[2409.09896]] GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion(https://arxiv.org/abs/2409.09896)
Keywords: diffusion
Abstract: 3D reconstruction from a single image is a long-standing problem in computer vision. Learning-based methods address its inherent scale ambiguity by leveraging increasingly large labeled and unlabeled datasets, to produce geometric priors capable of generating accurate predictions across domains. As a result, state of the art approaches show impressive performance in zero-shot relative and metric depth estimation. Recently, diffusion models have exhibited remarkable scalability and generalizable properties in their learned representations. However, because these models repurpose tools originally designed for image generation, they can only operate on dense ground-truth, which is not available for most depth labels, especially in real-world settings. In this paper we present GRIN, an efficient diffusion model designed to ingest sparse unstructured training data. We use image features with 3D geometric positional encodings to condition the diffusion process both globally and locally, generating depth predictions at a pixel-level. With comprehensive experiments across eight indoor and outdoor datasets, we show that GRIN establishes a new state of the art in zero-shot metric monocular depth estimation even when trained from scratch.

Title: Rediscovering the Latent Dimensions of Personality with Large Language Models as Trait Descriptors

Authors: Joseph Suh, Suhong Moon, Minwoo Kang, David M. Chan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.09905
Pdf URL: https://arxiv.org/pdf/2409.09905
Copy Paste: [[2409.09905]] Rediscovering the Latent Dimensions of Personality with Large Language Models as Trait Descriptors(https://arxiv.org/abs/2409.09905)
Keywords: large language model
Abstract: Assessing personality traits using large language models (LLMs) has emerged as an interesting and challenging area of research. While previous methods employ explicit questionnaires, often derived from the Big Five model of personality, we hypothesize that LLMs implicitly encode notions of personality when modeling next-token responses. To demonstrate this, we introduce a novel approach that uncovers latent personality dimensions in LLMs by applying singular value de-composition (SVD) to the log-probabilities of trait-descriptive adjectives. Our experiments show that LLMs "rediscover" core personality traits such as extraversion, agreeableness, conscientiousness, neuroticism, and openness without relying on direct questionnaire inputs, with the top-5 factors corresponding to Big Five traits explaining 74.3% of the variance in the latent space. Moreover, we can use the derived principal components to assess personality along the Big Five dimensions, and achieve improvements in average personality prediction accuracy of up to 5% over fine-tuned models, and up to 21% over direct LLM-based scoring techniques.

Title: Rapid Adaptation of Earth Observation Foundation Models for Segmentation

Authors: Karthick Panner Selvam, Raul Ramos-Pollan, Freddie Kalaitzis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.09907
Pdf URL: https://arxiv.org/pdf/2409.09907
Copy Paste: [[2409.09907]] Rapid Adaptation of Earth Observation Foundation Models for Segmentation(https://arxiv.org/abs/2409.09907)
Keywords: segmentation
Abstract: This study investigates the efficacy of Low-Rank Adaptation (LoRA) in fine-tuning Earth Observation (EO) foundation models for flood segmentation. We hypothesize that LoRA, a parameter-efficient technique, can significantly accelerate the adaptation of large-scale EO models to this critical task while maintaining high performance. We apply LoRA to fine-tune a state-of-the-art EO foundation model pre-trained on diverse satellite imagery, using a curated dataset of flood events. Our results demonstrate that LoRA-based fine-tuning (r-256) improves F1 score by 6.66 points and IoU by 0.11 compared to a frozen encoder baseline, while significantly reducing computational costs. Notably, LoRA outperforms full fine-tuning, which proves computationally infeasible on our hardware. We further assess generalization through out-of-distribution (OOD) testing on a geographically distinct flood event. While LoRA configurations show improved OOD performance over the baseline. This work contributes to research on efficient adaptation of foundation models for specialized EO tasks, with implications for rapid response systems in disaster management. Our findings demonstrate LoRA's potential for enabling faster deployment of accurate flood segmentation models in resource-constrained, time-critical scenarios.

Title: SFR-RAG: Towards Contextually Faithful LLMs

Authors: Xuan-Phi Nguyen, Shrey Pandit, Senthil Purushwalkam, Austin Xu, Hailin Chen, Yifei Ming, Zixuan Ke, Silvio Savarese, Caiming Xong, Shafiq Joty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09916
Pdf URL: https://arxiv.org/pdf/2409.09916
Copy Paste: [[2409.09916]] SFR-RAG: Towards Contextually Faithful LLMs(https://arxiv.org/abs/2409.09916)
Keywords: generative, large language model
Abstract: Retrieval Augmented Generation (RAG), a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance, has emerged as a pivotal area in generative AI. The LLMs used in RAG applications are required to faithfully and completely comprehend the provided context and users' questions, avoid hallucination, handle unanswerable, counterfactual or otherwise low-quality and irrelevant contexts, perform complex multi-hop reasoning and produce reliable citations. In this paper, we introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization. We also present ContextualBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks, such as HotpotQA and TriviaQA, with consistent RAG settings to ensure reproducibility and consistency in model assessments. Experimental results demonstrate that our SFR-RAG-9B model outperforms leading baselines such as Command-R+ (104B) and GPT-4o, achieving state-of-the-art results in 3 out of 7 benchmarks in ContextualBench with significantly fewer parameters. The model is also shown to be resilient to alteration in the contextual information and behave appropriately when relevant context is removed. Additionally, the SFR-RAG model maintains competitive performance in general instruction-following tasks and function-calling capabilities.

Title: Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

Authors: Vinay Samuel, Yue Zhou, Henry Peng Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09927
Pdf URL: https://arxiv.org/pdf/2409.09927
Copy Paste: [[2409.09927]] Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges(https://arxiv.org/abs/2409.09927)
Keywords: robust, large language model
Abstract: As large language models achieve increasingly impressive results, questions arise about whether such performance is from generalizability or mere data memorization. Thus, numerous data contamination detection methods have been proposed. However, these approaches are often validated with traditional benchmarks and early-stage LLMs, leaving uncertainty about their effectiveness when evaluating state-of-the-art LLMs on the contamination of more challenging benchmarks. To address this gap and provide a dual investigation of SOTA LLM contamination status and detection method robustness, we evaluate five contamination detection approaches with four state-of-the-art LLMs across eight challenging datasets often used in modern LLM evaluation. Our analysis reveals that (1) Current methods have non-trivial limitations in their assumptions and practical applications; (2) Notable difficulties exist in detecting contamination introduced during instruction fine-tuning with answer augmentation; and (3) Limited consistencies between SOTA contamination detection techniques. These findings highlight the complexity of contamination detection in advanced LLMs and the urgent need for further research on robust and generalizable contamination evaluation. Our code is available at this https URL.

Title: High-Security Hardware Module with PUF and Hybrid Cryptography for Data Security

Authors: Joshua Tito Amael, Oskar Natan, Jazi Eko Istiyanto
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2409.09928
Pdf URL: https://arxiv.org/pdf/2409.09928
Copy Paste: [[2409.09928]] High-Security Hardware Module with PUF and Hybrid Cryptography for Data Security(https://arxiv.org/abs/2409.09928)
Keywords: security, attack
Abstract: This research highlights the rapid development of technology in the industry, particularly Industry 4.0, supported by fundamental technologies such as the Internet of Things (IoT), cloud computing, big data, and data analysis. Despite providing efficiency, these developments also bring negative impacts, such as increased cyber-attacks, especially in manufacturing. One standard attack in the industry is the man-in-the-middle (MITM) attack, which can have severe consequences for the physical data transfer, particularly on the integrity of sensor and actuator data in industrial machines. This research proposes a solution by developing a hardware security module (HSM) using a field-programmable gate array (FPGA) with physical unclonable function (PUF) authentication and a hybrid encryption data security system. Experimental results show that this research improves some criteria in industrial cybersecurity, ensuring critical data security from cyber-attacks in industrial machines.

Title: Fault Analysis And Predictive Maintenance Of Induction Motor Using Machine Learning

Authors: Kavana Venkatesh, Neethi M
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09944
Pdf URL: https://arxiv.org/pdf/2409.09944
Copy Paste: [[2409.09944]] Fault Analysis And Predictive Maintenance Of Induction Motor Using Machine Learning(https://arxiv.org/abs/2409.09944)
Keywords: protect
Abstract: Induction motors are one of the most crucial electrical equipment and are extensively used in industries in a wide range of applications. This paper presents a machine learning model for the fault detection and classification of induction motor faults by using three phase voltages and currents as inputs. The aim of this work is to protect vital electrical components and to prevent abnormal event progression through early detection and diagnosis. This work presents a fast forward artificial neural network model to detect some of the commonly occurring electrical faults like overvoltage, under voltage, single phasing, unbalanced voltage, overload, ground fault. A separate model free monitoring system wherein the motor itself acts like a sensor is presented and the only monitored signals are the input given to the motor. Limits for current and voltage values are set for the faulty and healthy conditions, which is done by a classifier. Real time data from a 0.33 HP induction motor is used to train and test the neural network. The model so developed analyses the voltage and current values given at a particular instant and classifies the data into no fault or the specific fault. The model is then interfaced with a real motor to accurately detect and classify the faults so that further necessary action can be taken.

Title: Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations

Authors: Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2409.09947
Pdf URL: https://arxiv.org/pdf/2409.09947
Copy Paste: [[2409.09947]] Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations(https://arxiv.org/abs/2409.09947)
Keywords: large language model
Abstract: Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways difficult to recognize by non-professionals and existing text evaluation metrics. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find around 80% contain hallucinations of different kinds.

Title: Enhancing Industrial Cybersecurity: SoftHSM Implementation on SBCs for Mitigating MITM Attacks

Authors: Joshua Tito Amael, Jazi Eko Istiyanto, Oskar Natan
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2409.09948
Pdf URL: https://arxiv.org/pdf/2409.09948
Copy Paste: [[2409.09948]] Enhancing Industrial Cybersecurity: SoftHSM Implementation on SBCs for Mitigating MITM Attacks(https://arxiv.org/abs/2409.09948)
Keywords: security, protect, attack, extraction
Abstract: The rapid growth of industrial technology, driven by automation, IoT, and cloud computing, has also increased the risk of cyberattacks, such as Man-in-the-Middle (MITM) attacks. A standard solution to protect data is using a Hardware Security Module (HSM), but its high implementation cost has led to the development of a more affordable alternative: SoftHSM. This software-based module manages encryption and decryption keys using cryptographic algorithms. This study simulates the use of SoftHSM on a single-board computer (SBC) to enhance industrial system security and cost-effectively mitigate MITM attacks. The security system integrates AES and RSA cryptographic algorithms, with SoftHSM handling RSA key storage. The results show that HSM protects RSA private keys from extraction attempts, ensuring data security. In terms of performance, the system achieved an average encryption time of 3.29 seconds, a slot access time of 0.018 seconds, and a decryption time of 2.558 seconds. It also demonstrated efficient memory usage, with 37.24% for encryption and 24.24% for decryption, while consuming 5.20 V and 0.72 A during processing.

Title: Optimal ablation for interpretability

Authors: Maximilian Li, Lucas Janson
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.09951
Pdf URL: https://arxiv.org/pdf/2409.09951
Copy Paste: [[2409.09951]] Optimal ablation for interpretability(https://arxiv.org/abs/2409.09951)
Keywords: interpretability
Abstract: Interpretability studies often involve tracing the flow of information through machine learning models to identify specific model components that perform relevant computations for tasks of interest. Prior work quantifies the importance of a model component on a particular task by measuring the impact of performing ablation on that component, or simulating model inference with the component disabled. We propose a new method, optimal ablation (OA), and show that OA-based component importance has theoretical and empirical advantages over measuring importance via other ablation methods. We also show that OA-based component importance can benefit several downstream interpretability tasks, including circuit discovery, localization of factual recall, and latent prediction.

Title: Artificial Intelligence-Based Opportunistic Coronary Calcium Screening in the Veterans Affairs National Healthcare System

Authors: Raffi Hagopian, Timothy Strebel, Simon Bernatz, Gregory A Myers, Erik Offerman, Eric Zuniga, Cy Y Kim, Angie T Ng, James A Iwaz, Sunny P Singh, Evan P Carey, Michael J Kim, R Spencer Schaefer, Jeannie Yu, Amilcare Gentili, Hugo JWL Aerts
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.09968
Pdf URL: https://arxiv.org/pdf/2409.09968
Copy Paste: [[2409.09968]] Artificial Intelligence-Based Opportunistic Coronary Calcium Screening in the Veterans Affairs National Healthcare System(https://arxiv.org/abs/2409.09968)
Keywords: fair, segmentation
Abstract: Coronary artery calcium (CAC) is highly predictive of cardiovascular events. While millions of chest CT scans are performed annually in the United States, CAC is not routinely quantified from scans done for non-cardiac purposes. A deep learning algorithm was developed using 446 expert segmentations to automatically quantify CAC on non-contrast, non-gated CT scans (AI-CAC). Our study differs from prior works as we leverage imaging data across the Veterans Affairs national healthcare system, from 98 medical centers, capturing extensive heterogeneity in imaging protocols, scanners, and patients. AI-CAC performance on non-gated scans was compared against clinical standard ECG-gated CAC scoring. Non-gated AI-CAC differentiated zero vs. non-zero and less than 100 vs. 100 or greater Agatston scores with accuracies of 89.4% (F1 0.93) and 87.3% (F1 0.89), respectively, in 795 patients with paired gated scans within a year of a non-gated CT scan. Non-gated AI-CAC was predictive of 10-year all-cause mortality (CAC 0 vs. >400 group: 25.4% vs. 60.2%, Cox HR 3.49, p < 0.005), and composite first-time stroke, MI, or death (CAC 0 vs. >400 group: 33.5% vs. 63.8%, Cox HR 3.00, p < 0.005). In a screening dataset of 8,052 patients with low-dose lung cancer-screening CTs (LDCT), 3,091/8,052 (38.4%) individuals had AI-CAC >400. Four cardiologists qualitatively reviewed LDCT images from a random sample of >400 AI-CAC patients and verified that 527/531 (99.2%) would benefit from lipid-lowering therapy. To the best of our knowledge, this is the first non-gated CT CAC algorithm developed across a national healthcare system, on multiple imaging protocols, without filtering intra-cardiac hardware, and compared against a strong gated CT reference. We report superior performance relative to previous CAC algorithms evaluated against paired gated scans that included patients with intra-cardiac hardware.

Title: 2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction

Authors: Atsuya Nakata, Takao Yamanaka
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2409.09969
Pdf URL: https://arxiv.org/pdf/2409.09969
Copy Paste: [[2409.09969]] 2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction(https://arxiv.org/abs/2409.09969)
Keywords: generative
Abstract: Omni-directional images have been increasingly used in various applications, including virtual reality and SNS (Social Networking Services). However, their availability is comparatively limited in contrast to normal field of view (NFoV) images, since specialized cameras are required to take omni-directional images. Consequently, several methods have been proposed based on generative adversarial networks (GAN) to synthesize omni-directional images, but these approaches have shown difficulties in training of the models, due to instability and/or significant time consumption in the training. To address these problems, this paper proposes a novel omni-directional image synthesis method, 2S-ODIS (Two-Stage Omni-Directional Image Synthesis), which generated high-quality omni-directional images but drastically reduced the training time. This was realized by utilizing the VQGAN (Vector Quantized GAN) model pre-trained on a large-scale NFoV image database such as ImageNet without fine-tuning. Since this pre-trained model does not represent distortions of omni-directional images in the equi-rectangular projection (ERP), it cannot be applied directly to the omni-directional image synthesis in ERP. Therefore, two-stage structure was adopted to first create a global coarse image in ERP and then refine the image by integrating multiple local NFoV images in the higher resolution to compensate the distortions in ERP, both of which are based on the pre-trained VQGAN model. As a result, the proposed method, 2S-ODIS, achieved the reduction of the training time from 14 days in OmniDreamer to four days in higher image quality.

Title: Comprehensive Study on Sentiment Analysis: From Rule-based to modern LLM based system

Authors: Shailja Gupta, Rajesh Ranjan, Surya Narayan Singh
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2409.09989
Pdf URL: https://arxiv.org/pdf/2409.09989
Copy Paste: [[2409.09989]] Comprehensive Study on Sentiment Analysis: From Rule-based to modern LLM based system(https://arxiv.org/abs/2409.09989)
Keywords: large language model
Abstract: This paper provides a comprehensive survey of sentiment analysis within the context of artificial intelligence (AI) and large language models (LLMs). Sentiment analysis, a critical aspect of natural language processing (NLP), has evolved significantly from traditional rule-based methods to advanced deep learning techniques. This study examines the historical development of sentiment analysis, highlighting the transition from lexicon-based and pattern-based approaches to more sophisticated machine learning and deep learning models. Key challenges are discussed, including handling bilingual texts, detecting sarcasm, and addressing biases. The paper reviews state-of-the-art approaches, identifies emerging trends, and outlines future research directions to advance the field. By synthesizing current methodologies and exploring future opportunities, this survey aims to understand sentiment analysis in the AI and LLM context thoroughly.

Title: SHIRE: Enhancing Sample Efficiency using Human Intuition in REinforcement Learning

Authors: Amogh Joshi, Adarsh Kumar Kosta, Kaushik Roy
Subjects: cs.LG, cs.NE, cs.RO
Abstract URL: https://arxiv.org/abs/2409.09990
Pdf URL: https://arxiv.org/pdf/2409.09990
Copy Paste: [[2409.09990]] SHIRE: Enhancing Sample Efficiency using Human Intuition in REinforcement Learning(https://arxiv.org/abs/2409.09990)
Keywords: explainability
Abstract: The ability of neural networks to perform robotic perception and control tasks such as depth and optical flow estimation, simultaneous localization and mapping (SLAM), and automatic control has led to their widespread adoption in recent years. Deep Reinforcement Learning has been used extensively in these settings, as it does not have the unsustainable training costs associated with supervised learning. However, DeepRL suffers from poor sample efficiency, i.e., it requires a large number of environmental interactions to converge to an acceptable solution. Modern RL algorithms such as Deep Q Learning and Soft Actor-Critic attempt to remedy this shortcoming but can not provide the explainability required in applications such as autonomous robotics. Humans intuitively understand the long-time-horizon sequential tasks common in robotics. Properly using such intuition can make RL policies more explainable while enhancing their sample efficiency. In this work, we propose SHIRE, a novel framework for encoding human intuition using Probabilistic Graphical Models (PGMs) and using it in the Deep RL training pipeline to enhance sample efficiency. Our framework achieves 25-78% sample efficiency gains across the environments we evaluate at negligible overhead cost. Additionally, by teaching RL agents the encoded elementary behavior, SHIRE enhances policy explainability. A real-world demonstration further highlights the efficacy of policies trained using our framework.

Title: FreeMark: A Non-Invasive White-Box Watermarking for Deep Neural Networks

Authors: Yuzhang Chen, Jiangnan Zhu, Yujie Gu, Minoru Kuribayashi, Kouichi Sakurai
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.09996
Pdf URL: https://arxiv.org/pdf/2409.09996
Copy Paste: [[2409.09996]] FreeMark: A Non-Invasive White-Box Watermarking for Deep Neural Networks(https://arxiv.org/abs/2409.09996)
Keywords: secure, protect, attack, extraction, watermark
Abstract: Deep neural networks (DNNs) have achieved significant success in real-world applications. However, safeguarding their intellectual property (IP) remains extremely challenging. Existing DNN watermarking for IP protection often require modifying DNN models, which reduces model performance and limits their practicality. This paper introduces FreeMark, a novel DNN watermarking framework that leverages cryptographic principles without altering the original host DNN model, thereby avoiding any reduction in model performance. Unlike traditional DNN watermarking methods, FreeMark innovatively generates secret keys from a pre-generated watermark vector and the host model using gradient descent. These secret keys, used to extract watermark from the model's activation values, are securely stored with a trusted third party, enabling reliable watermark extraction from suspect models. Extensive experiments demonstrate that FreeMark effectively resists various watermark removal attacks while maintaining high watermark capacity.

Title: SelECT-SQL: Self-correcting ensemble Chain-of-Thought for Text-to-SQL

Authors: Ke Shen, Mayank Kejriwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10007
Pdf URL: https://arxiv.org/pdf/2409.10007
Copy Paste: [[2409.10007]] SelECT-SQL: Self-correcting ensemble Chain-of-Thought for Text-to-SQL(https://arxiv.org/abs/2409.10007)
Keywords: large language model
Abstract: In recent years,Text-to-SQL, the problem of automatically converting questions posed in natural language to formal SQL queries, has emerged as an important problem at the intersection of natural language processing and data management research. Large language models (LLMs) have delivered impressive performance when used in an off-the-shelf performance, but still fall significantly short of expected expert-level performance. Errors are especially probable when a nuanced understanding is needed of database schemas, questions, and SQL clauses to do proper Text-to-SQL conversion. We introduce SelECT-SQL, a novel in-context learning solution that uses an algorithmic combination of chain-of-thought (CoT) prompting, self-correction, and ensemble methods to yield a new state-of-the-art result on challenging Text-to-SQL benchmarks. Specifically, when configured using GPT-3.5-Turbo as the base LLM, SelECT-SQL achieves 84.2% execution accuracy on the Spider leaderboard's development set, exceeding both the best results of other baseline GPT-3.5-Turbo-based solutions (81.1%), and the peak performance (83.5%) of the GPT-4 result reported on the leaderboard.

Title: HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making

Authors: Sumera Anjum, Hanzhi Zhang, Wenjun Zhou, Eun Jin Paek, Xiaopeng Zhao, Yunhe Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10011
Pdf URL: https://arxiv.org/pdf/2409.10011
Copy Paste: [[2409.10011]] HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making(https://arxiv.org/abs/2409.10011)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have significantly advanced natural language processing tasks, yet they are susceptible to generating inaccurate or unreliable responses, a phenomenon known as hallucination. In critical domains such as health and medicine, these hallucinations can pose serious risks. This paper introduces HALO, a novel framework designed to enhance the accuracy and reliability of medical question-answering (QA) systems by focusing on the detection and mitigation of hallucinations. Our approach generates multiple variations of a given query using LLMs and retrieves relevant information from external open knowledge bases to enrich the context. We utilize maximum marginal relevance scoring to prioritize the retrieved context, which is then provided to LLMs for answer generation, thereby reducing the risk of hallucinations. The integration of LangChain further streamlines this process, resulting in a notable and robust increase in the accuracy of both open-source and commercial LLMs, such as Llama-3.1 (from 44% to 65%) and ChatGPT (from 56% to 70%). This framework underscores the critical importance of addressing hallucinations in medical QA systems, ultimately improving clinical decision-making and patient care. The open-source HALO is available at: this https URL.

Title: AttnMod: Attention-Based New Art Styles

Authors: Shih-Chieh Su
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10028
Pdf URL: https://arxiv.org/pdf/2409.10028
Copy Paste: [[2409.10028]] AttnMod: Attention-Based New Art Styles(https://arxiv.org/abs/2409.10028)
Keywords: diffusion
Abstract: Imagine a human artist looking at the generated photo of a diffusion model, and hoping to create a painting out of it. There could be some feature of the object in the photo that the artist wants to emphasize, some color to disperse, some silhouette to twist, or some part of the scene to be materialized. These intentions can be viewed as the modification of the cross attention from the text prompt onto UNet, during the desoising diffusion. This work presents AttnMod, to modify attention for creating new unpromptable art styles out of existing diffusion models. The style-creating behavior is studied across different setups.

Title: On the Diagram of Thought

Authors: Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2409.10038
Pdf URL: https://arxiv.org/pdf/2409.10038
Copy Paste: [[2409.10038]] On the Diagram of Thought(https://arxiv.org/abs/2409.10038)
Keywords: robust, large language model
Abstract: We introduce Diagram of Thought (DoT), a framework that models iterative reasoning in large language models (LLMs) as the construction of a directed acyclic graph (DAG) within a single model. Unlike traditional approaches that represent reasoning as linear chains or trees, DoT organizes propositions, critiques, refinements, and verifications into a cohesive DAG structure, allowing the model to explore complex reasoning pathways while maintaining logical consistency. Each node in the diagram corresponds to a proposition that has been proposed, critiqued, refined, or verified, enabling the LLM to iteratively improve its reasoning through natural language feedback. By leveraging auto-regressive next-token prediction with role-specific tokens, DoT facilitates seamless transitions between proposing ideas and critically evaluating them, providing richer feedback than binary signals. Furthermore, we formalize the DoT framework using Topos Theory, providing a mathematical foundation that ensures logical consistency and soundness in the reasoning process. This approach enhances both the training and inference processes within a single LLM, eliminating the need for multiple models or external control mechanisms. DoT offers a conceptual framework for designing next-generation reasoning-specialized models, emphasizing training efficiency, robust reasoning capabilities, and theoretical grounding. The code is available at this https URL.

Title: Benchmarking Large Language Model Uncertainty for Prompt Optimization

Authors: Pei-Fu Guo, Yun-Da Tsai, Shou-De Lin
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2409.10044
Pdf URL: https://arxiv.org/pdf/2409.10044
Copy Paste: [[2409.10044]] Benchmarking Large Language Model Uncertainty for Prompt Optimization(https://arxiv.org/abs/2409.10044)
Keywords: large language model
Abstract: Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis of models like GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that current metrics align more with Answer Uncertainty, which reflects output confidence and diversity, rather than Correctness Uncertainty, highlighting the need for improved metrics that are optimization-objective-aware to better guide prompt optimization. Our code and dataset are available at this https URL.

Title: Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective

Authors: Van-Cuong Pham, Thien Huu Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.10053
Pdf URL: https://arxiv.org/pdf/2409.10053
Copy Paste: [[2409.10053]] Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective(https://arxiv.org/abs/2409.10053)
Keywords: large language model
Abstract: Activation Editing, which involves directly editting the internal representations of large language models (LLMs) to alter their behaviors and achieve desired properties, has emerged as a promising area of research. Existing works primarily treat LLMs' activations as points in space and modify them by adding steering vectors. However, this approach is limited in its ability to achieve greater performance improvement while maintaining the necessary consistency of activation magnitudes. To overcome these issues, we propose a novel editing method that views activations in terms of their directions and magnitudes. Our method, named Householder Pseudo-Rotation (HPR), mimics the rotation transformation, thus preserving activation norms and resulting in an improved performance on various safety benchmarks.

Title: A Response to: A Note on "Privacy Preserving n-Party Scalar Product Protocol"

Authors: Florian van Daalen, Lianne Ippel, Andre Dekker, Inigo Bermejo
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2409.10057
Pdf URL: https://arxiv.org/pdf/2409.10057
Copy Paste: [[2409.10057]] A Response to: A Note on "Privacy Preserving n-Party Scalar Product Protocol"(https://arxiv.org/abs/2409.10057)
Keywords: security, privacy
Abstract: We reply to the comments on our proposed privacy preserving n-party scalar product protocol made by Liu. In their comment Liu raised concerns regarding the security and scalability of the $n$-party scalar product protocol. In this reply, we show that their concerns are unfounded and that the $n$-party scalar product protocol is safe for its intended purposes. Their concerns regarding the security are based on a misunderstanding of the protocol. Additionally, while the scalability of the protocol puts limitations on its use, the protocol still has numerous practical applications when applied in the correct scenarios. Specifically within vertically partitioned scenarios, which often involve few parties, the protocol remains practical. In this reply we clarify Liu's misunderstanding. Additionally, we explain why the protocols scaling is not a practical problem in its intended application.

Title: Towards Physically-Realizable Adversarial Attacks in Embodied Vision Navigation

Authors: Meng Chen, Jiawei Tu, Chao Qi, Yonghao Dang, Feng Zhou, Wei Wei, Jianqin Yin
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2409.10071
Pdf URL: https://arxiv.org/pdf/2409.10071
Copy Paste: [[2409.10071]] Towards Physically-Realizable Adversarial Attacks in Embodied Vision Navigation(https://arxiv.org/abs/2409.10071)
Keywords: attack
Abstract: The deployment of embodied navigation agents in safety-critical environments raises concerns about their vulnerability to adversarial attacks on deep neural networks. However, current attack methods often lack practicality due to challenges in transitioning from the digital to the physical world, while existing physical attacks for object detection fail to achieve both multi-view effectiveness and naturalness. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches with learnable textures and opacity to objects. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which uses feedback from the navigation model to optimize the patch's texture. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, where opacity is refined after texture optimization. Experimental results show our adversarial patches reduce navigation success rates by about 40%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: [this https URL].

Title: Steinmetz Neural Networks for Complex-Valued Data

Authors: Shyam Venkatasubramanian, Ali Pezeshki, Vahid Tarokh
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2409.10075
Pdf URL: https://arxiv.org/pdf/2409.10075
Copy Paste: [[2409.10075]] Steinmetz Neural Networks for Complex-Valued Data(https://arxiv.org/abs/2409.10075)
Keywords: robust
Abstract: In this work, we introduce a new approach to processing complex-valued data using DNNs consisting of parallel real-valued subnetworks with coupled outputs. Our proposed class of architectures, referred to as Steinmetz Neural Networks, leverages multi-view learning to construct more interpretable representations within the latent space. Subsequently, we present the Analytic Neural Network, which implements a consistency penalty that encourages analytic signal representations in the Steinmetz neural network's latent space. This penalty enforces a deterministic and orthogonal relationship between the real and imaginary components. Utilizing an information-theoretic construction, we demonstrate that the upper bound on the generalization error posited by the analytic neural network is lower than that of the general class of Steinmetz neural networks. Our numerical experiments demonstrate the improved performance and robustness to additive noise, afforded by our proposed networks on benchmark datasets and synthetic examples.

Title: LLM-DER:A Named Entity Recognition Method Based on Large Language Models for Chinese Coal Chemical Domain

Authors: Le Xiao, Yunfei Xu, Jing Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10077
Pdf URL: https://arxiv.org/pdf/2409.10077
Copy Paste: [[2409.10077]] LLM-DER:A Named Entity Recognition Method Based on Large Language Models for Chinese Coal Chemical Domain(https://arxiv.org/abs/2409.10077)
Keywords: large language model
Abstract: Domain-specific Named Entity Recognition (NER), whose goal is to recognize domain-specific entities and their categories, provides an important support for constructing domain knowledge graphs. Currently, deep learning-based methods are widely used and effective in NER tasks, but due to the reliance on large-scale labeled data. As a result, the scarcity of labeled data in a specific domain will limit its application.Therefore, many researches started to introduce few-shot methods and achieved some results. However, the entity structures in specific domains are often complex, and the current few-shot methods are difficult to adapt to NER tasks with complex features.Taking the Chinese coal chemical industry domain as an example,there exists a complex structure of multiple entities sharing a single entity, as well as multiple relationships for the same pair of entities, which affects the NER task under the sample less this http URL this paper, we propose a Large Language Models (LLMs)-based entity recognition framework LLM-DER for the domain-specific entity recognition problem in Chinese, which enriches the entity information by generating a list of relationships containing entity types through LLMs, and designing a plausibility and consistency evaluation method to remove misrecognized entities, which can effectively solve the complex structural entity recognition problem in a specific domain.The experimental results of this paper on the Resume dataset and the self-constructed coal chemical dataset Coal show that LLM-DER performs outstandingly in domain-specific entity recognition, not only outperforming the existing GPT-3.5-turbo baseline, but also exceeding the fully-supervised baseline, verifying its effectiveness in entity recognition.

Title: DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality Image Fusion

Authors: Yuchen Guo, Ruoxiang Xu, Rongcheng Li, Zhenghao Wu, Weifeng Su
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10080
Pdf URL: https://arxiv.org/pdf/2409.10080
Copy Paste: [[2409.10080]] DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality Image Fusion(https://arxiv.org/abs/2409.10080)
Keywords: extraction
Abstract: Multi-modality image fusion aims to integrate complementary data information from different imaging modalities into a single image. Existing methods often generate either blurry fused images that lose fine-grained semantic information or unnatural fused images that appear perceptually cropped from the inputs. In this work, we propose a novel two-phase discriminative autoencoder framework, termed DAE-Fuse, that generates sharp and natural fused images. In the adversarial feature extraction phase, we introduce two discriminative blocks into the encoder-decoder architecture, providing an additional adversarial loss to better guide feature extraction by reconstructing the source images. While the two discriminative blocks are adapted in the attention-guided cross-modality fusion phase to distinguish the structural differences between the fused output and the source inputs, injecting more naturalness into the results. Extensive experiments on public infrared-visible, medical image fusion, and downstream object detection datasets demonstrate our method's superiority and generalizability in both quantitative and qualitative evaluations.

Title: MotionCom: Automatic and Motion-Aware Image Composition with LLM and Video Diffusion Prior

Authors: Weijing Tao, Xiaofeng Yang, Miaomiao Cui, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10090
Pdf URL: https://arxiv.org/pdf/2409.10090
Copy Paste: [[2409.10090]] MotionCom: Automatic and Motion-Aware Image Composition with LLM and Video Diffusion Prior(https://arxiv.org/abs/2409.10090)
Keywords: diffusion
Abstract: This work presents MotionCom, a training-free motion-aware diffusion based image composition, enabling automatic and seamless integration of target objects into new scenes with dynamically coherent results without finetuning or optimization. Traditional approaches in this area suffer from two significant limitations: they require manual planning for object placement and often generate static compositions lacking motion realism. MotionCom addresses these issues by utilizing a Large Vision Language Model (LVLM) for intelligent planning, and a Video Diffusion prior for motion-infused image synthesis, streamlining the composition process. Our multi-modal Chain-of-Thought (CoT) prompting with LVLM automates the strategic placement planning of foreground objects, considering their potential motion and interaction within the scenes. Complementing this, we propose a novel method MotionPaint to distill motion-aware information from pretrained video diffusion models in the generation phase, ensuring that these objects are not only seamlessly integrated but also endowed with realistic motion. Extensive quantitative and qualitative results highlight MotionCom's superiority, showcasing its efficiency in streamlining the planning process and its capability to produce compositions that authentically depict motion and interaction.

Title: DDoS: Diffusion Distribution Similarity for Out-of-Distribution Detection

Authors: Kun Fang, Qinghua Tao, Zuopeng Yang, Xiaolin Huang, Jie Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2409.10094
Pdf URL: https://arxiv.org/pdf/2409.10094
Copy Paste: [[2409.10094]] DDoS: Diffusion Distribution Similarity for Out-of-Distribution Detection(https://arxiv.org/abs/2409.10094)
Keywords: protect, diffusion
Abstract: Out-of-Distribution (OoD) detection determines whether the given samples are from the training distribution of the classifier-under-protection, i.e., the In-Distribution (InD), or from a different OoD. Latest researches introduce diffusion models pre-trained on InD data to advocate OoD detection by transferring an OoD image into a generated one that is close to InD, so that one could capture the distribution disparities between original and generated images to detect OoD data. Existing diffusion-based detectors adopt perceptual metrics on the two images to measure such disparities, but ignore a fundamental fact: Perceptual metrics are devised essentially for human-perceived similarities of low-level image patterns, e.g., textures and colors, and are not advisable in evaluating distribution disparities, since images with different low-level patterns could possibly come from the same distribution. To address this issue, we formulate a diffusion-based detection framework that considers the distribution similarity between a tested image and its generated counterpart via a novel proper similarity metric in the informative feature space and probability space learned by the classifier-under-protection. An anomaly-removal strategy is further presented to enlarge such distribution disparities by removing abnormal OoD information in the feature space to facilitate the detection. Extensive empirical results unveil the insufficiency of perceptual metrics and the effectiveness of our distribution similarity framework with new state-of-the-art detection performance.

Title: Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference

Authors: Huy-Dung Nguyen, Anass Bairouk, Mirjana Maras, Wei Xiao, Tsun-Hsuan Wang, Patrick Chareyre, Ramin Hasani, Marc Blanchon, Daniela Rus
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10095
Pdf URL: https://arxiv.org/pdf/2409.10095
Copy Paste: [[2409.10095]] Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference(https://arxiv.org/abs/2409.10095)
Keywords: segmentation
Abstract: Autonomous driving holds great potential to transform road safety and traffic efficiency by minimizing human error and reducing congestion. A key challenge in realizing this potential is the accurate estimation of steering angles, which is essential for effective vehicle navigation and control. Recent breakthroughs in deep learning have made it possible to estimate steering angles directly from raw camera inputs. However, the limited available navigation data can hinder optimal feature learning, impacting the system's performance in complex driving scenarios. In this paper, we propose a shared encoder trained on multiple computer vision tasks critical for urban navigation, such as depth, pose, and 3D scene flow estimation, as well as semantic, instance, panoptic, and motion segmentation. By incorporating diverse visual information used by humans during navigation, this unified encoder might enhance steering angle estimation. To achieve effective multi-task learning within a single encoder, we introduce a multi-scale feature network for pose estimation to improve depth learning. Additionally, we employ knowledge distillation from a multi-backbone model pretrained on these navigation tasks to stabilize training and boost performance. Our findings demonstrate that a shared backbone trained on diverse visual tasks is capable of providing overall perception capabilities. While our performance in steering angle estimation is comparable to existing methods, the integration of human-like perception through multi-task learning holds significant potential for advancing autonomous driving systems. More details and the pretrained model are available at this https URL.

Title: Robust Reinforcement Learning with Dynamic Distortion Risk Measures

Authors: Anthony Coache, Sebastian Jaimungal
Subjects: cs.LG, q-fin.CP, q-fin.PM, q-fin.RM, stat.ML
Abstract URL: https://arxiv.org/abs/2409.10096
Pdf URL: https://arxiv.org/pdf/2409.10096
Copy Paste: [[2409.10096]] Robust Reinforcement Learning with Dynamic Distortion Risk Measures(https://arxiv.org/abs/2409.10096)
Keywords: robust
Abstract: In a reinforcement learning (RL) setting, the agent's optimal strategy heavily depends on her risk preferences and the underlying model dynamics of the training environment. These two aspects influence the agent's ability to make well-informed and time-consistent decisions when facing testing environments. In this work, we devise a framework to solve robust risk-aware RL problems where we simultaneously account for environmental uncertainty and risk with a class of dynamic robust distortion risk measures. Robustness is introduced by considering all models within a Wasserstein ball around a reference model. We estimate such dynamic robust risk measures using neural networks by making use of strictly consistent scoring functions, derive policy gradient formulae using the quantile representation of distortion risk measures, and construct an actor-critic algorithm to solve this class of robust risk-aware RL problems. We demonstrate the performance of our algorithm on a portfolio allocation example.

Title: Adaptive Segmentation-Based Initialization for Steered Mixture of Experts Image Regression

Authors: Yi-Hsin Li, Sebastian Knorr, Mårten Sjöström, Thomas Sikora
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10101
Pdf URL: https://arxiv.org/pdf/2409.10101
Copy Paste: [[2409.10101]] Adaptive Segmentation-Based Initialization for Steered Mixture of Experts Image Regression(https://arxiv.org/abs/2409.10101)
Keywords: segmentation
Abstract: Kernel image regression methods have shown to provide excellent efficiency in many image processing task, such as image and light-field compression, Gaussian Splatting, denoising and super-resolution. The estimation of parameters for these methods frequently employ gradient descent iterative optimization, which poses significant computational burden for many applications. In this paper, we introduce a novel adaptive segmentation-based initialization method targeted for optimizing Steered-Mixture-of Experts (SMoE) gating networks and Radial-Basis-Function (RBF) networks with steering kernels. The novel initialization method allocates kernels into pre-calculated image segments. The optimal number of kernels, kernel positions, and steering parameters are derived per segment in an iterative optimization and kernel sparsification procedure. The kernel information from "local" segments is then transferred into a "global" initialization, ready for use in iterative optimization of SMoE, RBF, and related kernel image regression methods. Results show that drastic objective and subjective quality improvements are achievable compared to widely used regular grid initialization, "state-of-the-art" K-Means initialization and previously introduced segmentation-based initialization methods, while also drastically improving the sparsity of the regression models. For same quality, the novel initialization results in models with around 50% reduction of kernels. In addition, a significant reduction of convergence time is achieved, with overall run-time savings of up to 50%. The segmentation-based initialization strategy itself admits heavy parallel computation; in theory, it may be divided into as many tasks as there are segments in the images. By accessing only four parallel GPUs, run-time savings of already 50% for initialization are achievable.

Title: Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT

Authors: Ryota Komatsu, Takahiro Shinozaki
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.10103
Pdf URL: https://arxiv.org/pdf/2409.10103
Copy Paste: [[2409.10103]] Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT(https://arxiv.org/abs/2409.10103)
Keywords: transformer, segmentation
Abstract: Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio. Recent advances highlight the potential of deriving discrete symbols from the features correlated with linguistic units, which enables text-less training across diverse tasks. In particular, sentence-level Self-Distillation of the pretrained HuBERT (SD-HuBERT) induces syllabic structures within latent speech frame representations extracted from an intermediate Transformer layer. In SD-HuBERT, sentence-level representation is accumulated from speech frame features through self-attention layers using a special CLS token. However, we observe that the information aggregated in the CLS token correlates more with speaker identity than with linguistic content. To address this, we propose a speech-only self-supervised fine-tuning approach that separates syllabic units from speaker information. Our method introduces speaker perturbation as data augmentation and adopts a frame-level training objective to prevent the CLS token from aggregating paralinguistic information. Experimental results show that our approach surpasses the current state-of-the-art method in most syllable segmentation and syllabic unit quality metrics on Librispeech, underscoring its effectiveness in promoting syllabic organization within speech-only models.

Title: Analysing Attacks on Blockchain Systems in a Layer-based Approach

Authors: Joydip Das, Syed Ashraf Al Tasin, Md. Forhad Rabbi, Md Sadek Ferdous
Subjects: cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2409.10109
Pdf URL: https://arxiv.org/pdf/2409.10109
Copy Paste: [[2409.10109]] Analysing Attacks on Blockchain Systems in a Layer-based Approach(https://arxiv.org/abs/2409.10109)
Keywords: attack
Abstract: Blockchain is a growing decentralized system built for transparency and immutability. There have been several major attacks on blockchain-based systems, leaving a gap in the trustability of this system. This article presents a comprehensive study of 23 attacks on blockchain systems and categorizes them using a layer-based approach. This approach provides an in-depth analysis of the feasibility and motivation of these attacks. In addition, a framework is proposed that enables a systematic analysis of the impact and interconnection of these attacks, thereby providing a means of identifying potential attack vectors and designing appropriate countermeasures to strengthen any blockchain system.

Title: Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection

Authors: Kodjo Mawuena Amekoe, Mustapha Lebbah, Gregoire Jaffre, Hanene Azzag, Zaineb Chelly Dagdia
Subjects: cs.LG, cs.CE, cs.NE
Abstract URL: https://arxiv.org/abs/2409.10111
Pdf URL: https://arxiv.org/pdf/2409.10111
Copy Paste: [[2409.10111]] Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection(https://arxiv.org/abs/2409.10111)
Keywords: interpretability
Abstract: Real-world tabular learning production scenarios typically involve evolving data streams, where data arrives continuously and its distribution may change over time. In such a setting, most studies in the literature regarding supervised learning favor the use of instance incremental algorithms due to their ability to adapt to changes in the data distribution. Another significant reason for choosing these algorithms is \textit{avoid storing observations in memory} as commonly done in batch incremental settings. However, the design of instance incremental algorithms often assumes immediate availability of labels, which is an optimistic assumption. In many real-world scenarios, such as fraud detection or credit scoring, labels may be delayed. Consequently, batch incremental algorithms are widely used in many real-world tasks. This raises an important question: "In delayed settings, is instance incremental learning the best option regarding predictive performance and computational efficiency?" Unfortunately, this question has not been studied in depth, probably due to the scarcity of real datasets containing delayed information. In this study, we conduct a comprehensive empirical evaluation and analysis of this question using a real-world fraud detection problem and commonly used generated datasets. Our findings indicate that instance incremental learning is not the superior option, considering on one side state-of-the-art models such as Adaptive Random Forest (ARF) and other side batch learning models such as XGBoost. Additionally, when considering the interpretability of the learning systems, batch incremental solutions tend to be favored. Code: \url{this https URL}

Title: StruEdit: Structured Outputs Enable the Fast and Accurate Knowledge Editing for Large Language Models

Authors: Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Hongcheng Gao, Junfeng Fang, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10132
Pdf URL: https://arxiv.org/pdf/2409.10132
Copy Paste: [[2409.10132]] StruEdit: Structured Outputs Enable the Fast and Accurate Knowledge Editing for Large Language Models(https://arxiv.org/abs/2409.10132)
Keywords: large language model
Abstract: As the modern tool of choice for question answering, large language models (LLMs) are expected to deliver answers with up-to-date knowledge. To achieve such ideal question-answering systems, locating and then editing outdated knowledge in the natural language outputs is a general target of popular knowledge editing methods. However, this target is challenging, as both identifying which tokens to edit in the reasoning steps and ensuring the coherence of the revised reasoning chain are difficult tasks. We argue that these challenges stem from the unstructured nature of natural language outputs. To address the above challenges, we propose $\textbf{Stru}$ctural $\textbf{Edit}$ing ($\textbf{StruEdit}$), an improved baseline for knowledge editing. We first prompt LLMs to produce structured outputs consisting of reasoning triplets. Then, StruEdit removes any potentially outdated knowledge and efficiently refills the structured outputs with up-to-date information in a single step. Experimental results show that StruEdit consistently delivers the highest accuracy with lowest latency compared with other knowledge editing methods.

Title: PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion

Authors: Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, Xingqun Qi, Mengfei Li, Xiaowei Chi, Siyu Xia, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10141
Pdf URL: https://arxiv.org/pdf/2409.10141
Copy Paste: [[2409.10141]] PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion(https://arxiv.org/abs/2409.10141)
Keywords: diffusion, generative
Abstract: Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the problem and sophisticated clothing topology with self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model. It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions, especially on generated faces. To address it, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion. Moreover, to enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multi-view normal and color images, we present SMPLX-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experimental results and quantitative evaluations on CAPE and THuman2.1 datasets demonstrate PSHumans superiority in geometry details, texture fidelity, and generalization capability.

Title: AALF: Almost Always Linear Forecasting

Authors: Matthias Jakobs, Thomas Liebig
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.10142
Pdf URL: https://arxiv.org/pdf/2409.10142
Copy Paste: [[2409.10142]] AALF: Almost Always Linear Forecasting(https://arxiv.org/abs/2409.10142)
Keywords: interpretability
Abstract: Recent works for time-series forecasting more and more leverage the high predictive power of Deep Learning models. With this increase in model complexity, however, comes a lack in understanding of the underlying model decision process, which is problematic for high-stakes decision making. At the same time, simple, interpretable forecasting methods such as Linear Models can still perform very well, sometimes on-par, with Deep Learning approaches. We argue that simple models are good enough most of the time, and forecasting performance can be improved by choosing a Deep Learning method only for certain predictions, increasing the overall interpretability of the forecasting process. In this context, we propose a novel online model selection framework which uses meta-learning to identify these predictions and only rarely uses a non-interpretable, large model. An extensive empirical study on various real-world datasets shows that our selection methodology outperforms state-of-the-art online model selections methods in most cases. We find that almost always choosing a simple Linear Model for forecasting results in competitive performance, suggesting that the need for opaque black-box models in time-series forecasting is smaller than recent works would suggest.

Title: LLMs4OL 2024 Overview: The 1st Large Language Models for Ontology Learning Challenge

Authors: Hamed Babaei Giglou, Jennifer D'Souza, Sören Auer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10146
Pdf URL: https://arxiv.org/pdf/2409.10146
Copy Paste: [[2409.10146]] LLMs4OL 2024 Overview: The 1st Large Language Models for Ontology Learning Challenge(https://arxiv.org/abs/2409.10146)
Keywords: large language model
Abstract: This paper outlines the LLMs4OL 2024, the first edition of the Large Language Models for Ontology Learning Challenge. LLMs4OL is a community development initiative collocated with the 23rd International Semantic Web Conference (ISWC) to explore the potential of Large Language Models (LLMs) in Ontology Learning (OL), a vital process for enhancing the web with structured knowledge to improve interoperability. By leveraging LLMs, the challenge aims to advance understanding and innovation in OL, aligning with the goals of the Semantic Web to create a more intelligent and user-friendly web. In this paper, we give an overview of the 2024 edition of the LLMs4OL challenge and summarize the contributions.

Title: AutoPET Challenge III: Testing the Robustness of Generalized Dice Focal Loss trained 3D Residual UNet for FDG and PSMA Lesion Segmentation from Whole-Body PET/CT Images

Authors: Shadab Ahamed
Subjects: cs.CV, cs.AI, physics.med-ph
Abstract URL: https://arxiv.org/abs/2409.10151
Pdf URL: https://arxiv.org/pdf/2409.10151
Copy Paste: [[2409.10151]] AutoPET Challenge III: Testing the Robustness of Generalized Dice Focal Loss trained 3D Residual UNet for FDG and PSMA Lesion Segmentation from Whole-Body PET/CT Images(https://arxiv.org/abs/2409.10151)
Keywords: robust, segmentation
Abstract: Automated segmentation of cancerous lesions in PET/CT scans is a crucial first step in quantitative image analysis. However, training deep learning models for segmentation with high accuracy is particularly challenging due to the variations in lesion size, shape, and radiotracer uptake. These lesions can appear in different parts of the body, often near healthy organs that also exhibit considerable uptake, making the task even more complex. As a result, creating an effective segmentation model for routine PET/CT image analysis is challenging. In this study, we utilized a 3D Residual UNet model and employed the Generalized Dice Focal Loss function to train the model on the AutoPET Challenge 2024 dataset. We conducted a 5-fold cross-validation and used an average ensembling technique using the models from the five folds. In the preliminary test phase for Task-1, the average ensemble achieved a mean Dice Similarity Coefficient (DSC) of 0.6687, mean false negative volume (FNV) of 10.9522 ml and mean false positive volume (FPV) 2.9684 ml. More details about the algorithm can be found on our GitHub repository: this https URL. The training code has been shared via the repository: this https URL.

Title: Quantile Regression for Distributional Reward Models in RLHF

Authors: Nicolai Dorka
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2409.10164
Pdf URL: https://arxiv.org/pdf/2409.10164
Copy Paste: [[2409.10164]] Quantile Regression for Distributional Reward Models in RLHF(https://arxiv.org/abs/2409.10164)
Keywords: large language model
Abstract: Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences. This distributional approach can better capture the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench. Furthermore, we demonstrate that the additional information provided by the distributional estimates can be utilized in downstream applications, such as risk-aware reinforcement learning, resulting in LLM policies that generate fewer extremely negative responses. Our code and model are released at this https URL.

Title: ExelMap: Explainable Element-based HD-Map Change Detection and Update

Authors: Lena Wild, Ludvig Ericson, Rafael Valencia, Patric Jensfelt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10178
Pdf URL: https://arxiv.org/pdf/2409.10178
Copy Paste: [[2409.10178]] ExelMap: Explainable Element-based HD-Map Change Detection and Update(https://arxiv.org/abs/2409.10178)
Keywords: fair, explainability
Abstract: Acquisition and maintenance are central problems in deploying high-definition (HD) maps for autonomous driving, with two lines of research prevalent in current literature: Online HD map generation and HD map change detection. However, the generated map's quality is currently insufficient for safe deployment, and many change detection approaches fail to precisely localize and extract the changed map elements, hence lacking explainability and hindering a potential fleet-based cooperative HD map update. In this paper, we propose the novel task of explainable element-based HD map change detection and update. In extending recent approaches that use online mapping techniques informed with an outdated map prior for HD map updating, we present ExelMap, an explainable element-based map updating strategy that specifically identifies changed map elements. In this context, we discuss how currently used metrics fail to capture change detection performance, while allowing for unfair comparison between prior-less and prior-informed map generation methods. Finally, we present an experimental study on real-world changes related to pedestrian crossings of the Argoverse 2 Map Change Dataset. To the best of our knowledge, this is the first comprehensive problem investigation of real-world end-to-end element-based HD map change detection and update, and ExelMap the first proposed solution.

Title: RealDiff: Real-world 3D Shape Completion using Self-Supervised Diffusion Models

Authors: Başak Melis Öcal, Maxim Tatarchenko, Sezer Karaoglu, Theo Gevers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10180
Pdf URL: https://arxiv.org/pdf/2409.10180
Copy Paste: [[2409.10180]] RealDiff: Real-world 3D Shape Completion using Self-Supervised Diffusion Models(https://arxiv.org/abs/2409.10180)
Keywords: diffusion
Abstract: Point cloud completion aims to recover the complete 3D shape of an object from partial observations. While approaches relying on synthetic shape priors achieved promising results in this domain, their applicability and generalizability to real-world data are still limited. To tackle this problem, we propose a self-supervised framework, namely RealDiff, that formulates point cloud completion as a conditional generation problem directly on real-world measurements. To better deal with noisy observations without resorting to training on synthetic data, we leverage additional geometric cues. Specifically, RealDiff simulates a diffusion process at the missing object parts while conditioning the generation on the partial input to address the multimodal nature of the task. We further regularize the training by matching object silhouettes and depth maps, predicted by our method, with the externally estimated ones. Experimental results show that our method consistently outperforms state-of-the-art methods in real-world point cloud completion.

Title: Enhancing RL Safety with Counterfactual LLM Reasoning

Authors: Dennis Gross, Helge Spieker
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2409.10188
Pdf URL: https://arxiv.org/pdf/2409.10188
Copy Paste: [[2409.10188]] Enhancing RL Safety with Counterfactual LLM Reasoning(https://arxiv.org/abs/2409.10188)
Keywords: large language model
Abstract: Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety post-training. We show that our approach improves and helps to explain the RL policy safety.

Title: LLMs for clinical risk prediction

Authors: Mohamed Rezk, Patricia Cabanillas Silva, Fried-Michael Dahlweid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.10191
Pdf URL: https://arxiv.org/pdf/2409.10191
Copy Paste: [[2409.10191]] LLMs for clinical risk prediction(https://arxiv.org/abs/2409.10191)
Keywords: large language model
Abstract: This study compares the efficacy of GPT-4 and clinalytix Medical AI in predicting the clinical risk of delirium development. Findings indicate that GPT-4 exhibited significant deficiencies in identifying positive cases and struggled to provide reliable probability estimates for delirium risk, while clinalytix Medical AI demonstrated superior accuracy. A thorough analysis of the large language model's (LLM) outputs elucidated potential causes for these discrepancies, consistent with limitations reported in extant literature. These results underscore the challenges LLMs face in accurately diagnosing conditions and interpreting complex clinical data. While LLMs hold substantial potential in healthcare, they are currently unsuitable for independent clinical decision-making. Instead, they should be employed in assistive roles, complementing clinical expertise. Continued human oversight remains essential to ensure optimal outcomes for both patients and healthcare providers.

Title: PrePaMS: Privacy-Preserving Participant Management System for Studies with Rewards and Prerequisites

Authors: Echo Meißner, Frank Kargl, Benjamin Erb, Felix Engelmann
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2409.10192
Pdf URL: https://arxiv.org/pdf/2409.10192
Copy Paste: [[2409.10192]] PrePaMS: Privacy-Preserving Participant Management System for Studies with Rewards and Prerequisites(https://arxiv.org/abs/2409.10192)
Keywords: secure, privacy, protect
Abstract: Taking part in surveys, experiments, and studies is often compensated by rewards to increase the number of participants and encourage attendance. While privacy requirements are usually considered for participation, privacy aspects of the reward procedure are mostly ignored. To this end, we introduce PrePaMS, an efficient participation management system that supports prerequisite checks and participation rewards in a privacy-preserving way. Our system organizes participations with potential (dis-)qualifying dependencies and enables secure reward payoffs. By leveraging a set of proven cryptographic primitives and mechanisms such as anonymous credentials and zero-knowledge proofs, participations are protected so that service providers and organizers cannot derive the identity of participants even within the reward process. In this paper, we have designed and implemented a prototype of PrePaMS to show its effectiveness and evaluated its performance under realistic workloads. PrePaMS covers the information whether subjects have participated in surveys, experiments, or studies. When combined with other secure solutions for the actual data collection within these events, PrePaMS can represent a cornerstone for more privacy-preserving empirical research.

Title: Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

Authors: Weihao Ye, Qiong Wu, Wenhao Lin, Yiyi Zhou
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2409.10197
Pdf URL: https://arxiv.org/pdf/2409.10197
Copy Paste: [[2409.10197]] Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models(https://arxiv.org/abs/2409.10197)
Keywords: large language model
Abstract: Recent progress in Multimodal Large Language Models(MLLMs) often use large image tokens to compensate the visual shortcoming of MLLMs, which not only exhibits obvious redundancy but also greatly exacerbates the already high computation. Token pruning is an effective solution for speeding up MLLMs, but when and how to drop tokens still remains a challenge. In this paper, we propose a novel and training-free approach for the effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for MLLMs according to a pre-defined budget. Specifically, FitPrune considers token pruning as a statistical problem of MLLM and its objective is to find out an optimal pruning scheme that can minimize the divergence of the attention distributions before and after pruning. In practice, FitPrune can be quickly accomplished based on the attention statistics from a small batch of inference data, avoiding the expensive trials of MLLMs. According to the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that our FitPrune can not only reduce the computational complexity to a large extent, while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at this https URL.

Title: Garment Attribute Manipulation with Multi-level Attention

Authors: Vittorio Casula, Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Chiara Pero, Carmen Bisogni, Marco Bertini, Alberto Del Bimbo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10206
Pdf URL: https://arxiv.org/pdf/2409.10206
Copy Paste: [[2409.10206]] Garment Attribute Manipulation with Multi-level Attention(https://arxiv.org/abs/2409.10206)
Keywords: transformer
Abstract: In the rapidly evolving field of online fashion shopping, the need for more personalized and interactive image retrieval systems has become paramount. Existing methods often struggle with precisely manipulating specific garment attributes without inadvertently affecting others. To address this challenge, we propose GAMMA (Garment Attribute Manipulation with Multi-level Attention), a novel framework that integrates attribute-disentangled representations with a multi-stage attention-based architecture. GAMMA enables targeted manipulation of fashion image attributes, allowing users to refine their searches with high accuracy. By leveraging a dual-encoder Transformer and memory block, our model achieves state-of-the-art performance on popular datasets like Shopping100k and DeepFashion.

Title: Robust Bird's Eye View Segmentation by Adapting DINOv2

Authors: Merve Rabia Barın, Görkay Aydemir, Fatma Güney
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10228
Pdf URL: https://arxiv.org/pdf/2409.10228
Copy Paste: [[2409.10228]] Robust Bird's Eye View Segmentation by Adapting DINOv2(https://arxiv.org/abs/2409.10228)
Keywords: robust, segmentation
Abstract: Extracting a Bird's Eye View (BEV) representation from multiple camera images offers a cost-effective, scalable alternative to LIDAR-based solutions in autonomous driving. However, the performance of the existing BEV methods drops significantly under various corruptions such as brightness and weather changes or camera failures. To improve the robustness of BEV perception, we propose to adapt a large vision foundational model, DINOv2, to BEV estimation using Low Rank Adaptation (LoRA). Our approach builds on the strong representation space of DINOv2 by adapting it to the BEV task in a state-of-the-art framework, SimpleBEV. Our experiments show increased robustness of BEV perception under various corruptions, with increasing gains from scaling up the model and the input resolution. We also showcase the effectiveness of the adapted representations in terms of fewer learnable parameters and faster convergence during training.

Title: From Text to Emoji: How PEFT-Driven Personality Manipulation Unleashes the Emoji Potential in LLMs

Authors: Navya Jain, Zekun Wu, Cristian Munoz, Airlie Hilliard, Adriano Koshiyama, Emre Kazim, Philip Treleaven
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.10245
Pdf URL: https://arxiv.org/pdf/2409.10245
Copy Paste: [[2409.10245]] From Text to Emoji: How PEFT-Driven Personality Manipulation Unleashes the Emoji Potential in LLMs(https://arxiv.org/abs/2409.10245)
Keywords: interpretability, explainability
Abstract: As the demand for human-like interactions with LLMs continues to grow, so does the interest in manipulating their personality traits, which has emerged as a key area of research. Methods like prompt-based In-Context Knowledge Editing (IKE) and gradient-based Model Editor Networks (MEND) have been explored but show irregularity and variability. IKE depends on the prompt, leading to variability and sensitivity, while MEND yields inconsistent and gibberish outputs. To address this, we employed Opinion QA Based Parameter-Efficient Fine-Tuning (PEFT), specifically Quantized Low-Rank Adaptation (QLORA), to manipulate the Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. After PEFT, models such as Mistral-7B-Instruct and Llama-2-7B-chat began generating emojis, despite their absence in the PEFT data. For instance, Llama-2-7B-chat generated emojis in 99.5% of extraversion-related test instances, while Mistral-8B-Instruct did so in 92.5% of openness-related test instances. Explainability analysis indicated that the LLMs used emojis intentionally to express these traits. This paper provides a number of novel contributions. First, introducing an Opinion QA dataset for PEFT-driven personality manipulation; second, developing metric models to benchmark LLM personality traits; third, demonstrating PEFT's superiority over IKE in personality manipulation; and finally, analyzing and validating emoji usage through explainability methods such as mechanistic interpretability and in-context learning explainability methods.

Title: BAFNet: Bilateral Attention Fusion Network for Lightweight Semantic Segmentation of Urban Remote Sensing Images

Authors: Wentao Wang, Xili Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2409.10269
Pdf URL: https://arxiv.org/pdf/2409.10269
Copy Paste: [[2409.10269]] BAFNet: Bilateral Attention Fusion Network for Lightweight Semantic Segmentation of Urban Remote Sensing Images(https://arxiv.org/abs/2409.10269)
Keywords: segmentation
Abstract: Large-scale semantic segmentation networks often achieve high performance, while their application can be challenging when faced with limited sample sizes and computational resources. In scenarios with restricted network size and computational complexity, models encounter significant challenges in capturing long-range dependencies and recovering detailed information in images. We propose a lightweight bilateral semantic segmentation network called bilateral attention fusion network (BAFNet) to efficiently segment high-resolution urban remote sensing images. The model consists of two paths, namely dependency path and remote-local path. The dependency path utilizes large kernel attention to acquire long-range dependencies in the image. Besides, multi-scale local attention and efficient remote attention are designed to construct remote-local path. Finally, a feature aggregation module is designed to effectively utilize the different features of the two paths. Our proposed method was tested on public high-resolution urban remote sensing datasets Vaihingen and Potsdam, with mIoU reaching 83.20% and 86.53%, respectively. As a lightweight semantic segmentation model, BAFNet not only outperforms advanced lightweight models in accuracy but also demonstrates comparable performance to non-lightweight state-of-the-art methods on two datasets, despite a tenfold variance in floating-point operations and a fifteenfold difference in network parameters.

Title: Performance of Human Annotators in Object Detection and Segmentation of Remotely Sensed Data

Authors: Roni Blushtein-Livnon, Tal Svoray, Michael Dorman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10272
Pdf URL: https://arxiv.org/pdf/2409.10272
Copy Paste: [[2409.10272]] Performance of Human Annotators in Object Detection and Segmentation of Remotely Sensed Data(https://arxiv.org/abs/2409.10272)
Keywords: segmentation
Abstract: This study introduces a laboratory experiment designed to assess the influence of annotation strategies, levels of imbalanced data, and prior experience, on the performance of human annotators. The experiment focuses on labeling aerial imagery, using ArcGIS Pro tools, to detect and segment small-scale photovoltaic solar panels, selected as a case study for rectangular objects. The experiment is conducted using images with a pixel size of 0.15\textbf{$m$}, involving both expert and non-expert participants, across different setup strategies and target-background ratio datasets. Our findings indicate that human annotators generally perform more effectively in object detection than in segmentation tasks. A marked tendency to commit more Type II errors (False Negatives, i.e., undetected objects) than Type I errors (False Positives, i.e. falsely detecting objects that do not exist) was observed across all experimental setups and conditions, suggesting a consistent bias in detection and segmentation processes. Performance was better in tasks with higher target-background ratios (i.e., more objects per unit area). Prior experience did not significantly impact performance and may, in some cases, even lead to overestimation in segmentation. These results provide evidence that human annotators are relatively cautious and tend to identify objects only when they are confident about them, prioritizing underestimation over overestimation. Annotators' performance is also influenced by object scarcity, showing a decline in areas with extremely imbalanced datasets and a low ratio of target-to-background. These findings may enhance annotation strategies for remote sensing research while efficient human annotators are crucial in an era characterized by growing demands for high-quality training data to improve segmentation and detection models.

Title: Enhancing Image Classification in Small and Unbalanced Datasets through Synthetic Data Augmentation

Authors: Neil De La Fuente, Mireia Majó, Irina Luzko, Henry Córdova, Gloria Fernández-Esparrach, Jorge Bernal
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2409.10286
Pdf URL: https://arxiv.org/pdf/2409.10286
Copy Paste: [[2409.10286]] Enhancing Image Classification in Small and Unbalanced Datasets through Synthetic Data Augmentation(https://arxiv.org/abs/2409.10286)
Keywords: robust
Abstract: Accurate and robust medical image classification is a challenging task, especially in application domains where available annotated datasets are small and present high imbalance between target classes. Considering that data acquisition is not always feasible, especially for underrepresented classes, our approach introduces a novel synthetic augmentation strategy using class-specific Variational Autoencoders (VAEs) and latent space interpolation to improve discrimination capabilities. By generating realistic, varied synthetic data that fills feature space gaps, we address issues of data scarcity and class imbalance. The method presented in this paper relies on the interpolation of latent representations within each class, thus enriching the training set and improving the model's generalizability and diagnostic accuracy. The proposed strategy was tested in a small dataset of 321 images created to train and validate an automatic method for assessing the quality of cleanliness of esophagogastroduodenoscopy images. By combining real and synthetic data, an increase of over 18\% in the accuracy of the most challenging underrepresented class was observed. The proposed strategy not only benefited the underrepresented class but also led to a general improvement in other metrics, including a 6\% increase in global accuracy and precision.

Title: On Synthetic Texture Datasets: Challenges, Creation, and Curation

Authors: Blaine Hoak, Patrick McDaniel
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10297
Pdf URL: https://arxiv.org/pdf/2409.10297
Copy Paste: [[2409.10297]] On Synthetic Texture Datasets: Challenges, Creation, and Curation(https://arxiv.org/abs/2409.10297)
Keywords: robust, interpretability, diffusion, generative
Abstract: The influence of textures on machine learning models has been an ongoing investigation, specifically in texture bias/learning, interpretability, and robustness. However, due to the lack of large and diverse texture data available, the findings in these works have been limited, as more comprehensive evaluations have not been feasible. Image generative models are able to provide data creation at scale, but utilizing these models for texture synthesis has been unexplored and poses additional challenges both in creating accurate texture images and validating those images. In this work, we introduce an extensible methodology and corresponding new dataset for generating high-quality, diverse texture images capable of supporting a broad set of texture-based tasks. Our pipeline consists of: (1) developing prompts from a range of descriptors to serve as input to text-to-image models, (2) adopting and adapting Stable Diffusion pipelines to generate and filter the corresponding images, and (3) further filtering down to the highest quality images. Through this, we create the Prompted Textures Dataset (PTD), a dataset of 362,880 texture images that span 56 textures. During the process of generating images, we find that NSFW safety filters in image generation pipelines are highly sensitive to texture (and flag up to 60\% of our texture images), uncovering a potential bias in these models and presenting unique challenges when working with texture data. Through both standard metrics and a human evaluation, we find that our dataset is high quality and diverse.

Title: Fuse4Seg: Image-Level Fusion Based Multi-Modality Medical Image Segmentation

Authors: Yuchen Guo, Weifeng Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10328
Pdf URL: https://arxiv.org/pdf/2409.10328
Copy Paste: [[2409.10328]] Fuse4Seg: Image-Level Fusion Based Multi-Modality Medical Image Segmentation(https://arxiv.org/abs/2409.10328)
Keywords: segmentation
Abstract: Although multi-modality medical image segmentation holds significant potential for enhancing the diagnosis and understanding of complex diseases by integrating diverse imaging modalities, existing methods predominantly rely on feature-level fusion strategies. We argue the current feature-level fusion strategy is prone to semantic inconsistencies and misalignments across various imaging modalities because it merges features at intermediate layers in a neural network without evaluative control. To mitigate this, we introduce a novel image-level fusion based multi-modality medical image segmentation method, Fuse4Seg, which is a bi-level learning framework designed to model the intertwined dependencies between medical image segmentation and medical image fusion. The image-level fusion process is seamlessly employed to guide and enhance the segmentation results through a layered optimization approach. Besides, the knowledge gained from the segmentation module can effectively enhance the fusion module. This ensures that the resultant fused image is a coherent representation that accurately amalgamates information from all modalities. Moreover, we construct a BraTS-Fuse benchmark based on BraTS dataset, which includes 2040 paired original images, multi-modal fusion images, and ground truth. This benchmark not only serves image-level medical segmentation but is also the largest dataset for medical image fusion to date. Extensive experiments on several public datasets and our benchmark demonstrate the superiority of our approach over prior state-of-the-art (SOTA) methodologies.

Title: InfoDisent: Explainability of Image Classification Models by Information Disentanglement

Authors: Łukasz Struski, Jacek Tabor
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10329
Pdf URL: https://arxiv.org/pdf/2409.10329
Copy Paste: [[2409.10329]] InfoDisent: Explainability of Image Classification Models by Information Disentanglement(https://arxiv.org/abs/2409.10329)
Keywords: explainability, transformer
Abstract: Understanding the decisions made by image classification networks is a critical area of research in deep learning. This task is traditionally divided into two distinct approaches: post-hoc methods and intrinsic methods. Post-hoc methods, such as GradCam, aim to interpret the decisions of pre-trained models by identifying regions of the image where the network focuses its attention. However, these methods provide only a high-level overview, making it difficult to fully understand the network's decision-making process. Conversely, intrinsic methods, like prototypical parts models, offer a more detailed understanding of network predictions but are constrained by specific architectures, training methods, and datasets. In this paper, we introduce InfoDisent, a hybrid model that combines the advantages of both approaches. By utilizing an information bottleneck, InfoDisent disentangles the information in the final layer of a pre-trained deep network, enabling the breakdown of classification decisions into basic, understandable atomic components. Unlike standard prototypical parts approaches, InfoDisent can interpret the decisions of pre-trained classification networks and be used for making classification decisions, similar to intrinsic models. We validate the effectiveness of InfoDisent on benchmark datasets such as ImageNet, CUB-200-2011, Stanford Cars, and Stanford Dogs for both convolutional and transformer backbones.

Title: Execution-time opacity control for timed automata

Authors: Étienne André, Marie Duflot, Laetitia Laversa, Engel Lefaucheux
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2409.10336
Pdf URL: https://arxiv.org/pdf/2409.10336
Copy Paste: [[2409.10336]] Execution-time opacity control for timed automata(https://arxiv.org/abs/2409.10336)
Keywords: attack
Abstract: Timing leaks in timed automata (TA) can occur whenever an attacker is able to deduce a secret by observing some timed behavior. In execution-time opacity, the attacker aims at deducing whether a private location was visited, by observing only the execution time. It can be decided whether a TA is opaque in this setting. In this work, we tackle control, and show that we are able to decide whether a TA can be controlled at runtime to ensure opacity. Our method is constructive, in the sense that we can exhibit such a controller. We also address the case when the attacker cannot have an infinite precision in its observations.

Title: Security, Trust and Privacy challenges in AI-driven 6G Networks

Authors: Helena Rifa-Pous, Victor Garcia-Font, Carlos Nunez-Gomez, Julian Salas
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2409.10337
Pdf URL: https://arxiv.org/pdf/2409.10337
Copy Paste: [[2409.10337]] Security, Trust and Privacy challenges in AI-driven 6G Networks(https://arxiv.org/abs/2409.10337)
Keywords: security, privacy, attack, robust
Abstract: The advent of 6G networks promises unprecedented advancements in wireless communication, offering wider bandwidth and lower latency compared to its predecessors. This article explores the evolving infrastructure of 6G networks, emphasizing the transition towards a more disaggregated structure and the integration of artificial intelligence (AI) technologies. Furthermore, it explores the security, trust and privacy challenges and attacks in 6G networks, particularly those related to the use of AI. It presents a classification of network attacks stemming from its AI-centric architecture and explores technologies designed to detect or mitigate these emerging threats. The paper concludes by examining the implications and risks linked to the utilization of AI in ensuring a robust network.

Title: The 20 questions game to distinguish large language models

Authors: Gurvan Richardeau, Erwan Le Merrer, Camilla Penzo, Gilles Tredan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10338
Pdf URL: https://arxiv.org/pdf/2409.10338
Copy Paste: [[2409.10338]] The 20 questions game to distinguish large language models(https://arxiv.org/abs/2409.10338)
Keywords: steal, large language model
Abstract: In a parallel with the 20 questions game, we present a method to determine whether two large language models (LLMs), placed in a black-box context, are the same or not. The goal is to use a small set of (benign) binary questions, typically under 20. We formalize the problem and first establish a baseline using a random selection of questions from known benchmark datasets, achieving an accuracy of nearly 100% within 20 questions. After showing optimal bounds for this problem, we introduce two effective questioning heuristics able to discriminate 22 LLMs by using half as many questions for the same task. These methods offer significant advantages in terms of stealth and are thus of interest to auditors or copyright owners facing suspicions of model leaks.

Title: Hyperedge Modeling in Hypergraph Neural Networks by using Densest Overlapping Subgraphs

Authors: Mehrad Soltani, Luis Rueda
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2409.10340
Pdf URL: https://arxiv.org/pdf/2409.10340
Copy Paste: [[2409.10340]] Hyperedge Modeling in Hypergraph Neural Networks by using Densest Overlapping Subgraphs(https://arxiv.org/abs/2409.10340)
Keywords: robust
Abstract: Hypergraphs tackle the limitations of traditional graphs by introducing {\em hyperedges}. While graph edges connect only two nodes, hyperedges connect an arbitrary number of nodes along their edges. Also, the underlying message-passing mechanisms in Hypergraph Neural Networks (HGNNs) are in the form of vertex-hyperedge-vertex, which let HGNNs capture and utilize richer and more complex structural information than traditional Graph Neural Networks (GNNs). More recently, the idea of overlapping subgraphs has emerged. These subgraphs can capture more information about subgroups of vertices without limiting one vertex belonging to just one group, allowing vertices to belong to multiple groups or subgraphs. In addition, one of the most important problems in graph clustering is to find densest overlapping subgraphs (DOS). In this paper, we propose a solution to the DOS problem via Agglomerative Greedy Enumeration (DOSAGE) algorithm as a novel approach to enhance the process of generating the densest overlapping subgraphs and, hence, a robust construction of the hypergraphs. Experiments on standard benchmarks show that the DOSAGE algorithm significantly outperforms the HGNNs and six other methods on the node classification task.

Title: Detecting Sexism in German Online Newspaper Comments with Open-Source Text Embeddings (Team GDA, GermEval2024 Shared Task 1: GerMS-Detect, Subtasks 1 and 2, Closed Track)

Authors: Florian Bremm, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.10341
Pdf URL: https://arxiv.org/pdf/2409.10341
Copy Paste: [[2409.10341]] Detecting Sexism in German Online Newspaper Comments with Open-Source Text Embeddings (Team GDA, GermEval2024 Shared Task 1: GerMS-Detect, Subtasks 1 and 2, Closed Track)(https://arxiv.org/abs/2409.10341)
Keywords: robust
Abstract: Sexism in online media comments is a pervasive challenge that often manifests subtly, complicating moderation efforts as interpretations of what constitutes sexism can vary among individuals. We study monolingual and multilingual open-source text embeddings to reliably detect sexism and misogyny in German-language online comments from an Austrian newspaper. We observed classifiers trained on text embeddings to mimic closely the individual judgements of human annotators. Our method showed robust performance in the GermEval 2024 GerMS-Detect Subtask 1 challenge, achieving an average macro F1 score of 0.597 (4th place, as reported on Codabench). It also accurately predicted the distribution of human annotations in GerMS-Detect Subtask 2, with an average Jensen-Shannon distance of 0.301 (2nd place). The computational efficiency of our approach suggests potential for scalable applications across various languages and linguistic contexts.

Title: Taming Diffusion Models for Image Restoration: A Review

Authors: Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10353
Pdf URL: https://arxiv.org/pdf/2409.10353
Copy Paste: [[2409.10353]] Taming Diffusion Models for Image Restoration: A Review(https://arxiv.org/abs/2409.10353)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved remarkable progress in generative modelling, particularly in enhancing image quality to conform to human preferences. Recently, these models have also been applied to low-level computer vision for photo-realistic image restoration (IR) in tasks such as image denoising, deblurring, dehazing, etc. In this review paper, we introduce key constructions in diffusion models and survey contemporary techniques that make use of diffusion models in solving general IR tasks. Furthermore, we point out the main challenges and limitations of existing diffusion-based IR frameworks and provide potential directions for future work.

Title: 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation?

Authors: Téo Guichoux, Laure Soulier, Nicolas Obin, Catherine Pelachaud
Subjects: cs.CV, cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2409.10357
Pdf URL: https://arxiv.org/pdf/2409.10357
Copy Paste: [[2409.10357]] 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation?(https://arxiv.org/abs/2409.10357)
Keywords: generative
Abstract: Co-speech gestures are fundamental for communication. The advent of recent deep learning techniques has facilitated the creation of lifelike, synchronous co-speech gestures for Embodied Conversational Agents. "In-the-wild" datasets, aggregating video content from platforms like YouTube via human pose detection technologies, provide a feasible solution by offering 2D skeletal sequences aligned with speech. Concurrent developments in lifting models enable the conversion of these 2D sequences into 3D gesture databases. However, it is important to note that the 3D poses estimated from the 2D extracted poses are, in essence, approximations of the ground-truth, which remains in the 2D domain. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions - a topic that, to our knowledge, remains largely unexplored. Our study examines the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models. We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D. We perform an objective evaluation using widely used metrics in the gesture generation field as well as a user study to qualitatively evaluate the different approaches.

Title: Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning

Authors: Amin Karimi Monsefi, Mengxi Zhou, Nastaran Karimi Monsefi, Ser-Nam Lim, Wei-Lun Chao, Rajiv Ramnath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10362
Pdf URL: https://arxiv.org/pdf/2409.10362
Copy Paste: [[2409.10362]] Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning(https://arxiv.org/abs/2409.10362)
Keywords: segmentation
Abstract: We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly enhances its efficacy for pre-training. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, pre-trained with frequency-filtered images, the resulting model needs relatively more data to adapt to naturally looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate the effectiveness of FOLK in achieving competitive performance to many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.

Title: Robust image representations with counterfactual contrastive learning

Authors: Mélanie Roschewitz, Fabio De Sousa Ribeiro, Tian Xia, Galvin Khara, Ben Glocker
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10365
Pdf URL: https://arxiv.org/pdf/2409.10365
Copy Paste: [[2409.10365]] Robust image representations with counterfactual contrastive learning(https://arxiv.org/abs/2409.10365)
Keywords: robust
Abstract: Contrastive pretraining can substantially increase model generalisation and downstream performance. However, the quality of the learned representations is highly dependent on the data augmentation strategy applied to generate positive pairs. Positive contrastive pairs should preserve semantic meaning while discarding unwanted variations related to the data acquisition domain. Traditional contrastive pipelines attempt to simulate domain shifts through pre-defined generic image transformations. However, these do not always mimic realistic and relevant domain variations for medical imaging such as scanner differences. To tackle this issue, we herein introduce counterfactual contrastive learning, a novel framework leveraging recent advances in causal image synthesis to create contrastive positive pairs that faithfully capture relevant domain variations. Our method, evaluated across five datasets encompassing both chest radiography and mammography data, for two established contrastive objectives (SimCLR and DINO-v2), outperforms standard contrastive learning in terms of robustness to acquisition shift. Notably, counterfactual contrastive learning achieves superior downstream performance on both in-distribution and on external datasets, especially for images acquired with scanners under-represented in the training set. Further experiments show that the proposed framework extends beyond acquisition shifts, with models trained with counterfactual contrastive learning substantially improving subgroup performance across biological sex.

Title: Mamba-ST: State Space Model for Efficient Style Transfer

Authors: Filippo Botti, Alex Ergasti, Leonardo Rossi, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10385
Pdf URL: https://arxiv.org/pdf/2409.10385
Copy Paste: [[2409.10385]] Mamba-ST: State Space Model for Efficient Style Transfer(https://arxiv.org/abs/2409.10385)
Keywords: diffusion, transformer
Abstract: The goal of style transfer is, given a content image and a style source, generating a new image preserving the content but with the artistic representation of the style source. Most of the state-of-the-art architectures use transformers or diffusion-based models to perform this task, despite the heavy computational burden that they require. In particular, transformers use self- and cross-attention layers which have large memory footprint, while diffusion models require high inference time. To overcome the above, this paper explores a novel design of Mamba, an emergent State-Space Model (SSM), called Mamba-ST, to perform style transfer. To do so, we adapt Mamba linear equation to simulate the behavior of cross-attention layers, which are able to combine two separate embeddings into a single output, but drastically reducing memory usage and time complexity. We modified the Mamba's inner equations so to accept inputs from, and combine, two separate data streams. To the best of our knowledge, this is the first attempt to adapt the equations of SSMs to a vision task like style transfer without requiring any other module like cross-attention or custom normalization layers. An extensive set of experiments demonstrates the superiority and efficiency of our method in performing style transfer compared to transformers and diffusion models. Results show improved quality in terms of both ArtFID and FID metrics. Code is available at this https URL.

Title: Prompt-and-Transfer: Dynamic Class-aware Enhancement for Few-shot Segmentation

Authors: Hanbo Bi, Yingchao Feng, Wenhui Diao, Peijin Wang, Yongqiang Mao, Kun Fu, Hongqi Wang, Xian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10389
Pdf URL: https://arxiv.org/pdf/2409.10389
Copy Paste: [[2409.10389]] Prompt-and-Transfer: Dynamic Class-aware Enhancement for Few-shot Segmentation(https://arxiv.org/abs/2409.10389)
Keywords: segmentation
Abstract: For more efficient generalization to unseen domains (classes), most Few-shot Segmentation (FSS) would directly exploit pre-trained encoders and only fine-tune the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic, inevitably activating objects that are irrelevant to the target class. In contrast, humans can effortlessly focus on specific objects in the line of sight. This paper mimics the visual perception pattern of human beings and proposes a novel and powerful prompt-driven scheme, called ``Prompt and Transfer" (PAT), which constructs a dynamic class-aware prompting paradigm to tune the encoder for focusing on the interested object (target class) in the current task. Three key points are elaborated to enhance the prompting: 1) Cross-modal linguistic information is introduced to initialize prompts for each task. 2) Semantic Prompt Transfer (SPT) that precisely transfers the class-specific semantics within the images to prompts. 3) Part Mask Generator (PMG) that works in conjunction with SPT to adaptively generate different but complementary part prompts for different individuals. Surprisingly, PAT achieves competitive performance on 4 different tasks including standard FSS, Cross-domain FSS (e.g., CV, medical, and remote sensing domains), Weak-label FSS, and Zero-shot Segmentation, setting new state-of-the-arts on 11 benchmarks.

Title: A Knowledge-Enhanced Disease Diagnosis Method Based on Prompt Learning and BERT Integration

Authors: Zhang Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10403
Pdf URL: https://arxiv.org/pdf/2409.10403
Copy Paste: [[2409.10403]] A Knowledge-Enhanced Disease Diagnosis Method Based on Prompt Learning and BERT Integration(https://arxiv.org/abs/2409.10403)
Keywords: interpretability
Abstract: This paper proposes a knowledge-enhanced disease diagnosis method based on a prompt learning framework. The method retrieves structured knowledge from external knowledge graphs related to clinical cases, encodes it, and injects it into the prompt templates to enhance the language model's understanding and reasoning capabilities for the task.We conducted experiments on three public datasets: CHIP-CTC, IMCS-V2-NER, and KUAKE-QTR. The results show that the proposed method significantly outperforms existing models across multiple evaluation metrics, with an F1 score improvement of 2.4% on the CHIP-CTC dataset, 3.1% on the IMCS-V2-NER dataset,and 4.2% on the KUAKE-QTR dataset. Additionally,ablation studies confirmed the critical role of the knowledge injection module,as the removal of this module resulted in a significant drop in F1 score. The experimental results demonstrate that the proposed method not only effectively improves the accuracy of disease diagnosis but also enhances the interpretability of the predictions, providing more reliable support and evidence for clinical diagnosis.

Title: A Large-Scale Privacy Assessment of Android Third-Party SDKs

Authors: Mark Huasong Meng, Chuan Yan, Yun Hao, Qing Zhang, Zeyu Wang, Kailong Wang, Sin Gee Teo, Guangdong Bai, Jin Song Dong
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2409.10411
Pdf URL: https://arxiv.org/pdf/2409.10411
Copy Paste: [[2409.10411]] A Large-Scale Privacy Assessment of Android Third-Party SDKs(https://arxiv.org/abs/2409.10411)
Keywords: privacy, protect, large language model
Abstract: Third-party Software Development Kits (SDKs) are widely adopted in Android app development, to effortlessly accelerate development pipelines and enhance app functionality. However, this convenience raises substantial concerns about unauthorized access to users' privacy-sensitive information, which could be further abused for illegitimate purposes like user tracking or monetization. Our study offers a targeted analysis of user privacy protection among Android third-party SDKs, filling a critical gap in the Android software supply chain. It focuses on two aspects of their privacy practices, including data exfiltration and behavior-policy compliance (or privacy compliance), utilizing techniques of taint analysis and large language models. It covers 158 widely-used SDKs from two key SDK release platforms, the official one and a large alternative one. From them, we identified 338 instances of privacy data exfiltration. On the privacy compliance, our study reveals that more than 30% of the examined SDKs fail to provide a privacy policy to disclose their data handling practices. Among those that provide privacy policies, 37% of them over-collect user data, and 88% falsely claim access to sensitive data. We revisit the latest versions of the SDKs after 12 months. Our analysis demonstrates a persistent lack of improvement in these concerning trends. Based on our findings, we propose three actionable recommendations to mitigate the privacy leakage risks and enhance privacy protection for Android users. Our research not only serves as an urgent call for industry attention but also provides crucial insights for future regulatory interventions.

Title: Learning Semi-Supervised Medical Image Segmentation from Spatial Registration

Authors: Qianying Liu, Paul Henderson, Xiao Gu, Hang Dai, Fani Deligianni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10422
Pdf URL: https://arxiv.org/pdf/2409.10422
Copy Paste: [[2409.10422]] Learning Semi-Supervised Medical Image Segmentation from Spatial Registration(https://arxiv.org/abs/2409.10422)
Keywords: segmentation
Abstract: Semi-supervised medical image segmentation has shown promise in training models with limited labeled data and abundant unlabeled data. However, state-of-the-art methods ignore a potentially valuable source of unsupervised semantic information -- spatial registration transforms between image volumes. To address this, we propose CCT-R, a contrastive cross-teaching framework incorporating registration information. To leverage the semantic information available in registrations between volume pairs, CCT-R incorporates two proposed modules: Registration Supervision Loss (RSL) and Registration-Enhanced Positive Sampling (REPS). The RSL leverages segmentation knowledge derived from transforms between labeled and unlabeled volume pairs, providing an additional source of pseudo-labels. REPS enhances contrastive learning by identifying anatomically-corresponding positives across volumes using registration transforms. Experimental results on two challenging medical segmentation benchmarks demonstrate the effectiveness and superiority of CCT-R across various semi-supervised settings, with as few as one labeled case. Our code is available at this https URL.

Title: Signed Graph Autoencoder for Explainable and Polarization-Aware Network Embeddings

Authors: Nikolaos Nakis, Chrysoula Kosma, Giannis Nikolentzos, Michalis Chatzianastasis, Iakovos Evdaimon, Michalis Vazirgiannis
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2409.10452
Pdf URL: https://arxiv.org/pdf/2409.10452
Copy Paste: [[2409.10452]] Signed Graph Autoencoder for Explainable and Polarization-Aware Network Embeddings(https://arxiv.org/abs/2409.10452)
Keywords: generative
Abstract: Autoencoders based on Graph Neural Networks (GNNs) have garnered significant attention in recent years for their ability to extract informative latent representations, characterizing the structure of complex topologies, such as graphs. Despite the prevalence of Graph Autoencoders, there has been limited focus on developing and evaluating explainable neural-based graph generative models specifically designed for signed networks. To address this gap, we propose the Signed Graph Archetypal Autoencoder (SGAAE) framework. SGAAE extracts node-level representations that express node memberships over distinct extreme profiles, referred to as archetypes, within the network. This is achieved by projecting the graph onto a learned polytope, which governs its polarization. The framework employs a recently proposed likelihood for analyzing signed networks based on the Skellam distribution, combined with relational archetypal analysis and GNNs. Our experimental evaluation demonstrates the SGAAEs' capability to successfully infer node memberships over the different underlying latent structures while extracting competing communities formed through the participation of the opposing views in the network. Additionally, we introduce the 2-level network polarization problem and show how SGAAE is able to characterize such a setting. The proposed model achieves high performance in different tasks of signed link prediction across four real-world datasets, outperforming several baseline models.

Title: MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion

Authors: Lehong Wu, Lilang Lin, Jiahang Zhang, Yiyang Ma, Jiaying Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10473
Pdf URL: https://arxiv.org/pdf/2409.10473
Copy Paste: [[2409.10473]] MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion(https://arxiv.org/abs/2409.10473)
Keywords: diffusion, generative
Abstract: Self-supervised learning has proved effective for skeleton-based human action understanding. However, previous works either rely on contrastive learning that suffers false negative problems or are based on reconstruction that learns too much unessential low-level clues, leading to limited representations for downstream tasks. Recently, great advances have been made in generative learning, which is naturally a challenging yet meaningful pretext task to model the general underlying data distributions. However, the representation learning capacity of generative models is under-explored, especially for the skeletons with spacial sparsity and temporal redundancy. To this end, we propose Masked Conditional Diffusion (MacDiff) as a unified framework for human skeleton modeling. For the first time, we leverage diffusion models as effective skeleton representation learners. Specifically, we train a diffusion decoder conditioned on the representations extracted by a semantic encoder. Random masking is applied to encoder inputs to introduce a information bottleneck and remove redundancy of skeletons. Furthermore, we theoretically demonstrate that our generative objective involves the contrastive learning objective which aligns the masked and noisy views. Meanwhile, it also enforces the representation to complement for the noisy view, leading to better generalization performance. MacDiff achieves state-of-the-art performance on representation learning benchmarks while maintaining the competence for generative tasks. Moreover, we leverage the diffusion model for data augmentation, significantly enhancing the fine-tuning performance in scenarios with scarce labeled data. Our project is available at this https URL.

Title: SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing

Authors: Qi Qian, Haiyang Xu, Ming Yan, Juhua Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2409.10476
Pdf URL: https://arxiv.org/pdf/2409.10476
Copy Paste: [[2409.10476]] SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing(https://arxiv.org/abs/2409.10476)
Keywords: diffusion
Abstract: Diffusion models demonstrate impressive image generation performance with text guidance. Inspired by the learning process of diffusion, existing images can be edited according to text by DDIM inversion. However, the vanilla DDIM inversion is not optimized for classifier-free guidance and the accumulated error will result in the undesired performance. While many algorithms are developed to improve the framework of DDIM inversion for editing, in this work, we investigate the approximation error in DDIM inversion and propose to disentangle the guidance scale for the source and target branches to reduce the error while keeping the original framework. Moreover, a better guidance scale (i.e., 0.5) than default settings can be derived theoretically. Experiments on PIE-Bench show that our proposal can improve the performance of DDIM inversion dramatically without sacrificing efficiency.

Title: Schrodinger's Memory: Large Language Models

Authors: Wei Wang, Qing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.10482
Pdf URL: https://arxiv.org/pdf/2409.10482
Copy Paste: [[2409.10482]] Schrodinger's Memory: Large Language Models(https://arxiv.org/abs/2409.10482)
Keywords: large language model
Abstract: Memory is the foundation of LLMs' functionality, yet past research has lacked an in-depth exploration of their memory capabilities and underlying theory. In this paper, we apply UAT theory to explain the memory mechanism of LLMs and propose a new approach for evaluating LLM performance by comparing the memory capacities of different models. Through extensive experiments, we validate our theory and the memory abilities of LLMs. Finally, we compare the capabilities of the human brain and LLMs, highlighting both their similarities and differences in terms of working mechanisms.

Title: Do Pre-trained Vision-Language Models Encode Object States?

Authors: Kaleb Newman, Shijie Wang, Yuan Zang, David Heffren, Chen Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10488
Pdf URL: https://arxiv.org/pdf/2409.10488
Copy Paste: [[2409.10488]] Do Pre-trained Vision-Language Models Encode Object States?(https://arxiv.org/abs/2409.10488)
Keywords: generative
Abstract: For a vision-language model (VLM) to understand the physical world, such as cause and effect, a first step is to capture the temporal dynamics of the visual world, for example how the physical states of objects evolve over time (e.g. a whole apple into a sliced apple). Our paper aims to investigate if VLMs pre-trained on web-scale data learn to encode object states, which can be extracted with zero-shot text prompts. We curate an object state recognition dataset ChangeIt-Frames, and evaluate nine open-source VLMs, including models trained with contrastive and generative objectives. We observe that while these state-of-the-art vision-language models can reliably perform object recognition, they consistently fail to accurately distinguish the objects' physical states. Through extensive experiments, we identify three areas for improvements for VLMs to better encode object states, namely the quality of object localization, the architecture to bind concepts to objects, and the objective to learn discriminative visual and language encoders on object states. Data and code are released.

Title: Flash STU: Fast Spectral Transform Units

Authors: Y. Isabel Liu, Windsor Nguyen, Yagiz Devre, Evan Dogariu, Anirudha Majumdar, Elad Hazan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2409.10489
Pdf URL: https://arxiv.org/pdf/2409.10489
Copy Paste: [[2409.10489]] Flash STU: Fast Spectral Transform Units(https://arxiv.org/abs/2409.10489)
Keywords: transformer
Abstract: This paper describes an efficient, open source PyTorch implementation of the Spectral Transform Unit. We investigate sequence prediction tasks over several modalities including language, robotics, and simulated dynamical systems. We find that for the same parameter count, the STU and its variants outperform the Transformer as well as other leading state space models across various modalities.

Title: Partial Distribution Matching via Partial Wasserstein Adversarial Networks

Authors: Zi-Ming Wang, Nan Xue, Ling Lei, Rebecka Jörnsten, Gui-Song Xia
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2409.10499
Pdf URL: https://arxiv.org/pdf/2409.10499
Copy Paste: [[2409.10499]] Partial Distribution Matching via Partial Wasserstein Adversarial Networks(https://arxiv.org/abs/2409.10499)
Keywords: robust
Abstract: This paper studies the problem of distribution matching (DM), which is a fundamental machine learning problem seeking to robustly align two probability distributions. Our approach is established on a relaxed formulation, called partial distribution matching (PDM), which seeks to match a fraction of the distributions instead of matching them completely. We theoretically derive the Kantorovich-Rubinstein duality for the partial Wasserstain-1 (PW) discrepancy, and develop a partial Wasserstein adversarial network (PWAN) that efficiently approximates the PW discrepancy based on this dual form. Partial matching can then be achieved by optimizing the network using gradient descent. Two practical tasks, point set registration and partial domain adaptation are investigated, where the goals are to partially match distributions in 3D space and high-dimensional feature space respectively. The experiment results confirm that the proposed PWAN effectively produces highly robust matching results, performing better or on par with the state-of-the-art methods.

Title: Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles

Authors: Kulin Shah, Nishanth Dikkala, Xin Wang, Rina Panigrahy
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2409.10502
Pdf URL: https://arxiv.org/pdf/2409.10502
Copy Paste: [[2409.10502]] Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles(https://arxiv.org/abs/2409.10502)
Keywords: transformer, large language model
Abstract: Causal language modeling using the Transformer architecture has yielded remarkable capabilities in Large Language Models (LLMs) over the last few years. However, the extent to which fundamental search and reasoning capabilities emerged within LLMs remains a topic of ongoing debate. In this work, we study if causal language modeling can learn a complex task such as solving Sudoku puzzles. To solve a Sudoku, the model is first required to search over all empty cells of the puzzle to decide on a cell to fill and then apply an appropriate strategy to fill the decided cell. Sometimes, the application of a strategy only results in thinning down the possible values in a cell rather than concluding the exact value of the cell. In such cases, multiple strategies are applied one after the other to fill a single cell. We observe that Transformer models trained on this synthetic task can indeed learn to solve Sudokus (our model solves $94.21\%$ of the puzzles fully correctly) when trained on a logical sequence of steps taken by a solver. We find that training Transformers with the logical sequence of steps is necessary and without such training, they fail to learn Sudoku. We also extend our analysis to Zebra puzzles (known as Einstein puzzles) and show that the model solves $92.04 \%$ of the puzzles fully correctly. In addition, we study the internal representations of the trained Transformer and find that through linear probing, we can decode information about the set of possible values in any given cell from them, pointing to the presence of a strong reasoning engine implicit in the Transformer weights.

Title: DILA: Dictionary Label Attention for Mechanistic Interpretability in High-dimensional Multi-label Medical Coding Prediction

Authors: John Wu, David Wu, Jimeng Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2409.10504
Pdf URL: https://arxiv.org/pdf/2409.10504
Copy Paste: [[2409.10504]] DILA: Dictionary Label Attention for Mechanistic Interpretability in High-dimensional Multi-label Medical Coding Prediction(https://arxiv.org/abs/2409.10504)
Keywords: interpretability, large language model
Abstract: Predicting high-dimensional or extreme multilabels, such as in medical coding, requires both accuracy and interpretability. Existing works often rely on local interpretability methods, failing to provide comprehensive explanations of the overall mechanism behind each label prediction within a multilabel set. We propose a mechanistic interpretability module called DIctionary Label Attention (\method) that disentangles uninterpretable dense embeddings into a sparse embedding space, where each nonzero element (a dictionary feature) represents a globally learned medical concept. Through human evaluations, we show that our sparse embeddings are more human understandable than its dense counterparts by at least 50 percent. Our automated dictionary feature identification pipeline, leveraging large language models (LLMs), uncovers thousands of learned medical concepts by examining and summarizing the highest activating tokens for each dictionary feature. We represent the relationships between dictionary features and medical codes through a sparse interpretable matrix, enhancing the mechanistic and global understanding of the model's predictions while maintaining competitive performance and scalability without extensive human annotation.

Title: RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Authors: Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2409.10516
Pdf URL: https://arxiv.org/pdf/2409.10516
Copy Paste: [[2409.10516]] RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval(https://arxiv.org/abs/2409.10516)
Keywords: transformer, large language model
Abstract: Transformer-based large Language Models (LLMs) become increasingly important in various domains. However, the quadratic time complexity of attention operation poses a significant challenge for scaling to longer contexts due to the extremely high inference latency and GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to accelerate attention computation. To leverage the dynamic sparse property of attention, RetrievalAttention builds approximate nearest neighbor search (ANNS) indexes upon KV vectors in CPU memory and retrieves the most relevant ones via vector search during generation. Due to the out-of-distribution (OOD) between query vectors and key vectors, off-the-shelf ANNS indexes still need to scan O(N) (usually 30% of all keys) data for accurate retrieval, which fails to exploit the high sparsity. RetrievalAttention first identifies the OOD challenge of ANNS-based attention, and addresses it via an attention-aware vector search algorithm that can adapt to queries and only access 1--3% of data, thus achieving a sub-linear time complexity. RetrievalAttention greatly reduces the inference cost of long-context LLM with much lower GPU memory requirements while maintaining the model accuracy. Especially, RetrievalAttention only needs 16GB GPU memory for serving 128K tokens in LLMs with 8B parameters, which is capable of generating one token in 0.188 seconds on a single NVIDIA RTX4090 (24GB).