2024-07-23

Title: A Foundation Model for Soccer

Authors: Ethan Baron, Daniel Hocevar, Zach Salehe
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2407.14558
Pdf URL: https://arxiv.org/pdf/2407.14558
Copy Paste: [[2407.14558]] A Foundation Model for Soccer(https://arxiv.org/abs/2407.14558)
Keywords: transformer
Abstract: We propose a foundation model for soccer, which is able to predict subsequent actions in a soccer match from a given input sequence of actions. As a proof of concept, we train a transformer architecture on three seasons of data from a professional soccer league. We quantitatively and qualitatively compare the performance of this transformer architecture to two baseline models: a Markov model and a multi-layer perceptron. Additionally, we discuss potential applications of our model. We provide an open-source implementation of our methods at this https URL.

Title: Automated and Holistic Co-design of Neural Networks and ASICs for Enabling In-Pixel Intelligence

Authors: Shubha R. Kharel, Prashansa Mukim, Piotr Maj, Grzegorz W. Deptuch, Shinjae Yoo, Yihui Ren, Soumyajit Mandal
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2407.14560
Pdf URL: https://arxiv.org/pdf/2407.14560
Copy Paste: [[2407.14560]] Automated and Holistic Co-design of Neural Networks and ASICs for Enabling In-Pixel Intelligence(https://arxiv.org/abs/2407.14560)
Keywords: extraction
Abstract: Extreme edge-AI systems, such as those in readout ASICs for radiation detection, must operate under stringent hardware constraints such as micron-level dimensions, sub-milliwatt power, and nanosecond-scale speed while providing clear accuracy advantages over traditional architectures. Finding ideal solutions means identifying optimal AI and ASIC design choices from a design space that has explosively expanded during the merger of these domains, creating non-trivial couplings which together act upon a small set of solutions as constraints tighten. It is impractical, if not impossible, to manually determine ideal choices among possibilities that easily exceed billions even in small-size problems. Existing methods to bridge this gap have leveraged theoretical understanding of hardware to f architecture search. However, the assumptions made in computing such theoretical metrics are too idealized to provide sufficient guidance during the difficult search for a practical implementation. Meanwhile, theoretical estimates for many other crucial metrics (like delay) do not even exist and are similarly variable, dependent on parameters of the process design kit (PDK). To address these challenges, we present a study that employs intelligent search using multi-objective Bayesian optimization, integrating both neural network search and ASIC synthesis in the loop. This approach provides reliable feedback on the collective impact of all cross-domain design choices. We showcase the effectiveness of our approach by finding several Pareto-optimal design choices for effective and efficient neural networks that perform real-time feature extraction from input pulses within the individual pixels of a readout ASIC.

Title: Learning Visual Grounding from Generative Vision and Language Model

Authors: Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14563
Pdf URL: https://arxiv.org/pdf/2407.14563
Copy Paste: [[2407.14563]] Learning Visual Grounding from Generative Vision and Language Model(https://arxiv.org/abs/2407.14563)
Keywords: generative, segmentation
Abstract: Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of visual grounding data. We find that grounding knowledge already exists in generative VLM and can be elicited by proper prompting. We thus prompt a VLM to generate object-level descriptions by feeding it object regions from existing object detection datasets. We further propose attribute modeling to explicitly capture the important object attributes, and spatial relation modeling to capture inter-object relationship, both of which are common linguistic pattern in referring expression. Our constructed dataset (500K images, 1M objects, 16M referring expressions) is one of the largest grounding datasets to date, and the first grounding dataset with purely model-generated queries and human-annotated objects. To verify the quality of this data, we conduct zero-shot transfer experiments to the popular RefCOCO benchmarks for both referring expression comprehension (REC) and segmentation (RES) tasks. On both tasks, our model significantly outperform the state-of-the-art approaches without using human annotated visual grounding data. Our results demonstrate the promise of generative VLM to scale up visual grounding in the real world. Code and models will be released.

Title: SQLfuse: Enhancing Text-to-SQL Performance through Comprehensive LLM Synergy

Authors: Tingkai Zhang, Chaoyu Chen, Cong Liao, Jun Wang, Xudong Zhao, Hang Yu, Jianchao Wang, Jianguo Li, Wenhui Shi
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2407.14568
Pdf URL: https://arxiv.org/pdf/2407.14568
Copy Paste: [[2407.14568]] SQLfuse: Enhancing Text-to-SQL Performance through Comprehensive LLM Synergy(https://arxiv.org/abs/2407.14568)
Keywords: robust, large language model
Abstract: Text-to-SQL conversion is a critical innovation, simplifying the transition from complex SQL to intuitive natural language queries, especially significant given SQL's prevalence in the job market across various roles. The rise of Large Language Models (LLMs) like GPT-3.5 and GPT-4 has greatly advanced this field, offering improved natural language understanding and the ability to generate nuanced SQL statements. However, the potential of open-source LLMs in Text-to-SQL applications remains underexplored, with many frameworks failing to leverage their full capabilities, particularly in handling complex database queries and incorporating feedback for iterative refinement. Addressing these limitations, this paper introduces SQLfuse, a robust system integrating open-source LLMs with a suite of tools to enhance Text-to-SQL translation's accuracy and usability. SQLfuse features four modules: schema mining, schema linking, SQL generation, and a SQL critic module, to not only generate but also continuously enhance SQL query quality. Demonstrated by its leading performance on the Spider Leaderboard and deployment by Ant Group, SQLfuse showcases the practical merits of open-source LLMs in diverse business contexts.

Title: Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization

Authors: Orson Mengara
Subjects: cs.LG, cs.CR, q-fin.CP, q-fin.PR, q-fin.ST
Abstract URL: https://arxiv.org/abs/2407.14573
Pdf URL: https://arxiv.org/pdf/2407.14573
Copy Paste: [[2407.14573]] Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization(https://arxiv.org/abs/2407.14573)
Keywords: attack, transformer, generative, large language model
Abstract: Since the advent of generative artificial intelligence, every company and researcher has been rushing to develop their own generative models, whether commercial or not. Given the large number of users of these powerful new tools, there is currently no intrinsically verifiable way to explain from the ground up what happens when LLMs (large language models) learn. For example, those based on automatic speech recognition systems, which have to rely on huge and astronomical amounts of data collected from all over the web to produce fast and efficient results, In this article, we develop a backdoor attack called MarketBackFinal 2.0, based on acoustic data poisoning, MarketBackFinal 2.0 is mainly based on modern stock market models. In order to show the possible vulnerabilities of speech-based transformers that may rely on LLMs.

Title: Adversarial Databases Improve Success in Retrieval-based Large Language Models

Authors: Sean Wu, Michael Koo, Li Yo Kao, Andy Black, Lesley Blum, Fabien Scalzo, Ira Kurtz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.14609
Pdf URL: https://arxiv.org/pdf/2407.14609
Copy Paste: [[2407.14609]] Adversarial Databases Improve Success in Retrieval-based Large Language Models(https://arxiv.org/abs/2407.14609)
Keywords: robust, large language model
Abstract: Open-source LLMs have shown great potential as fine-tuned chatbots, and demonstrate robust abilities in reasoning and surpass many existing benchmarks. Retrieval-Augmented Generation (RAG) is a technique for improving the performance of LLMs on tasks that the models weren't explicitly trained on, by leveraging external knowledge databases. Numerous studies have demonstrated the effectiveness of RAG to more successfully accomplish downstream tasks when using vector datasets that consist of relevant background information. It has been implicitly assumed by those in the field that if adversarial background information is utilized in this context, that the success of using a RAG-based approach would be nonexistent or even negatively impact the results. To address this assumption, we tested several open-source LLMs on the ability of RAG to improve their success in answering multiple-choice questions (MCQ) in the medical subspecialty field of Nephrology. Unlike previous studies, we examined the effect of RAG in utilizing both relevant and adversarial background databases. We set up several open-source LLMs, including Llama 3, Phi-3, Mixtral 8x7b, Zephyr$\beta$, and Gemma 7B Instruct, in a zero-shot RAG pipeline. As adversarial sources of information, text from the Bible and a Random Words generated database were used for comparison. Our data show that most of the open-source LLMs improve their multiple-choice test-taking success as expected when incorporating relevant information vector databases. Surprisingly however, adversarial Bible text significantly improved the success of many LLMs and even random word text improved test taking ability of some of the models. In summary, our results demonstrate for the first time the countertintuitive ability of adversarial information datasets to improve the RAG-based LLM success.

Title: Evaluating language models as risk scores

Authors: André F. Cruz, Moritz Hardt, Celestine Mendler-Dünner
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2407.14614
Pdf URL: https://arxiv.org/pdf/2407.14614
Copy Paste: [[2407.14614]] Evaluating language models as risk scores(https://arxiv.org/abs/2407.14614)
Keywords: large language model
Abstract: Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows for any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk sores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.

Title: The Research of Group Re-identification from Multiple Cameras

Authors: Hao Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14620
Pdf URL: https://arxiv.org/pdf/2407.14620
Copy Paste: [[2407.14620]] The Research of Group Re-identification from Multiple Cameras(https://arxiv.org/abs/2407.14620)
Keywords: extraction
Abstract: Object re-identification is of increasing importance in visual surveillance. Most existing works focus on re-identify individual from multiple cameras while the application of group re-identification (Re-ID) is rarely discussed. We redefine Group Re-identification as a process which includes pedestrian detection, feature extraction, graph model construction, and graph matching. Group re-identification is very challenging since it is not only interfered by view-point and human pose variations in the traditional re-identification tasks, but also suffered from the challenges in group layout change and group member variation. To address the above challenges, this paper introduces a novel approach which leverages the multi-granularity information inside groups to facilitate group re-identification. We first introduce a multi-granularity Re-ID process, which derives features for multi-granularity objects (people/people-subgroups) in a group and iteratively evaluates their importances during group Re-ID, so as to handle group-wise misalignments due to viewpoint change and group dynamics. We further introduce a multi-order matching scheme. It adaptively selects representative people/people-subgroups in each group and integrates the multi-granularity information from these people/people-subgroups to obtain group-wise matching, hence achieving a more reliable matching score between groups. Experimental results on various datasets demonstrate the effectiveness of our approach.

Title: BOND: Aligning LLMs with Best-of-N Distillation

Authors: Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shariari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2407.14622
Pdf URL: https://arxiv.org/pdf/2407.14622
Copy Paste: [[2407.14622]] BOND: Aligning LLMs with Best-of-N Distillation(https://arxiv.org/abs/2407.14622)
Keywords: large language model
Abstract: Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.

Title: CVE-LLM : Automatic vulnerability evaluation in medical device industry using large language models

Authors: Rikhiya Ghosh, Oladimeji Farri, Hans-Martin von Stockhausen, Martin Schmitt, George Marica Vasile
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2407.14640
Pdf URL: https://arxiv.org/pdf/2407.14640
Copy Paste: [[2407.14640]] CVE-LLM : Automatic vulnerability evaluation in medical device industry using large language models(https://arxiv.org/abs/2407.14640)
Keywords: security, attack, generative, large language model
Abstract: The healthcare industry is currently experiencing an unprecedented wave of cybersecurity attacks, impacting millions of individuals. With the discovery of thousands of vulnerabilities each month, there is a pressing need to drive the automation of vulnerability assessment processes for medical devices, facilitating rapid mitigation efforts. Generative AI systems have revolutionized various industries, offering unparalleled opportunities for automation and increased efficiency. This paper presents a solution leveraging Large Language Models (LLMs) to learn from historical evaluations of vulnerabilities for the automatic assessment of vulnerabilities in the medical devices industry. This approach is applied within the portfolio of a single manufacturer, taking into account device characteristics, including existing security posture and controls. The primary contributions of this paper are threefold. Firstly, it provides a detailed examination of the best practices for training a vulnerability Language Model (LM) in an industrial context. Secondly, it presents a comprehensive comparison and insightful analysis of the effectiveness of Language Models in vulnerability assessment. Finally, it proposes a new human-in-the-loop framework to expedite vulnerability evaluation processes.

Title: Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

Authors: Nilanjana Das, Edward Raff, Manas Gaur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.14644
Pdf URL: https://arxiv.org/pdf/2407.14644
Copy Paste: [[2407.14644]] Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context(https://arxiv.org/abs/2407.14644)
Keywords: attack, large language model
Abstract: Previous research on testing the vulnerabilities in Large Language Models (LLMs) using adversarial attacks has primarily focused on nonsensical prompt injections, which are easily detected upon manual or automated review (e.g., via byte entropy). However, the exploration of innocuous human-understandable malicious prompts augmented with adversarial injections remains limited. In this research, we explore converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing. This allows us to show suffix conversion without any gradients, using only LLMs to perform the attacks, and thus better understand the scope of possible risks. We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM. The situations are extracted from the IMDB dataset, and prompts are defined following a few-shot chain-of-thought prompting. Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs. We find that across many LLMs, as few as 1 attempt produces an attack and that these attacks transfer between LLMs. The link to our code is available at \url{https://anonymous.4open.science/r/Situation-Driven-Adversarial-Attacks-7BB1/README.md}.

Title: The Collection of a Human Robot Collaboration Dataset for Cooperative Assembly in Glovebox Environments

Authors: Shivansh Sharma, Mathew Huang, Sanat Nair, Alan Wen, Christina Petlowany, Juston Moore, Selma Wanna, Mitch Pryor
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14649
Pdf URL: https://arxiv.org/pdf/2407.14649
Copy Paste: [[2407.14649]] The Collection of a Human Robot Collaboration Dataset for Cooperative Assembly in Glovebox Environments(https://arxiv.org/abs/2407.14649)
Keywords: robust, segmentation
Abstract: Industry 4.0 introduced AI as a transformative solution for modernizing manufacturing processes. Its successor, Industry 5.0, envisions humans as collaborators and experts guiding these AI-driven manufacturing solutions. Developing these techniques necessitates algorithms capable of safe, real-time identification of human positions in a scene, particularly their hands, during collaborative assembly. Although substantial efforts have curated datasets for hand segmentation, most focus on residential or commercial domains. Existing datasets targeting industrial settings predominantly rely on synthetic data, which we demonstrate does not effectively transfer to real-world operations. Moreover, these datasets lack uncertainty estimations critical for safe collaboration. Addressing these gaps, we present HAGS: Hand and Glove Segmentation Dataset. This dataset provides 1200 challenging examples to build applications toward hand and glove segmentation in industrial human-robot collaboration scenarios as well as assess out-of-distribution images, constructed via green screen augmentations, to determine ML-classifier robustness. We study state-of-the-art, real-time segmentation models to evaluate existing methods. Our dataset and baselines are publicly available: this https URL and this https URL.

Title: OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning

Authors: Yihang Yao, Zhepeng Cen, Wenhao Ding, Haohong Lin, Shiqi Liu, Tingnan Zhang, Wenhao Yu, Ding Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2407.14653
Pdf URL: https://arxiv.org/pdf/2407.14653
Copy Paste: [[2407.14653]] OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning(https://arxiv.org/abs/2407.14653)
Keywords: robust, diffusion
Abstract: Offline safe reinforcement learning (RL) aims to train a policy that satisfies constraints using a pre-collected dataset. Most current methods struggle with the mismatch between imperfect demonstrations and the desired safe and rewarding performance. In this paper, we introduce OASIS (cOnditionAl diStributIon Shaping), a new paradigm in offline safe RL designed to overcome these critical limitations. OASIS utilizes a conditional diffusion model to synthesize offline datasets, thus shaping the data distribution toward a beneficial target domain. Our approach makes compliance with safety constraints through effective data utilization and regularization techniques to benefit offline safe RL training. Comprehensive evaluations on public benchmarks and varying datasets showcase OASIS's superiority in benefiting offline safe RL agents to achieve high-reward behavior while satisfying the safety constraints, outperforming established baselines. Furthermore, OASIS exhibits high data efficiency and robustness, making it suitable for real-world applications, particularly in tasks where safety is imperative and high-quality demonstrations are scarce.

Title: LORTSAR: Low-Rank Transformer for Skeleton-based Action Recognition

Authors: Soroush Oraki, Harry Zhuang, Jie Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14655
Pdf URL: https://arxiv.org/pdf/2407.14655
Copy Paste: [[2407.14655]] LORTSAR: Low-Rank Transformer for Skeleton-based Action Recognition(https://arxiv.org/abs/2407.14655)
Keywords: transformer
Abstract: The complexity of state-of-the-art Transformer-based models for skeleton-based action recognition poses significant challenges in terms of computational efficiency and resource utilization. In this paper, we explore the application of Singular Value Decomposition (SVD) to effectively reduce the model sizes of these pre-trained models, aiming to minimize their resource consumption while preserving accuracy. Our method, LORTSAR (LOw-Rank Transformer for Skeleton-based Action Recognition), also includes a fine-tuning step to compensate for any potential accuracy degradation caused by model compression, and is applied to two leading Transformer-based models, "Hyperformer" and "STEP-CATFormer". Experimental results on the "NTU RGB+D" and "NTU RGB+D 120" datasets show that our method can reduce the number of model parameters substantially with negligible degradation or even performance increase in recognition accuracy. This confirms that SVD combined with post-compression fine-tuning can boost model efficiency, paving the way for more sustainable, lightweight, and high-performance technologies in human action recognition.

Title: Is $F_1$ Score Suboptimal for Cybersecurity Models? Introducing $C_{score}$, a Cost-Aware Alternative for Model Assessment

Authors: Manish Marwah, Asad Narayanan, Stephen Jou, Martin Arlitt, Maria Pospelova
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14664
Pdf URL: https://arxiv.org/pdf/2407.14664
Copy Paste: [[2407.14664]] Is $F_1$ Score Suboptimal for Cybersecurity Models? Introducing $C_{score}$, a Cost-Aware Alternative for Model Assessment(https://arxiv.org/abs/2407.14664)
Keywords: security, attack
Abstract: The cost of errors related to machine learning classifiers, namely, false positives and false negatives, are not equal and are application dependent. For example, in cybersecurity applications, the cost of not detecting an attack is very different from marking a benign activity as an attack. Various design choices during machine learning model building, such as hyperparameter tuning and model selection, allow a data scientist to trade-off between these two errors. However, most of the commonly used metrics to evaluate model quality, such as $F_1$ score, which is defined in terms of model precision and recall, treat both these errors equally, making it difficult for users to optimize for the actual cost of these errors. In this paper, we propose a new cost-aware metric, $C_{score}$ based on precision and recall that can replace $F_1$ score for model evaluation and selection. It includes a cost ratio that takes into account the differing costs of handling false positives and false negatives. We derive and characterize the new cost metric, and compare it to $F_1$ score. Further, we use this metric for model thresholding for five cybersecurity related datasets for multiple cost ratios. The results show an average cost savings of 49%.

Title: DefTesPY: Cyber defense model with enhanced data modeling and analysis for Tesla company via Python Language

Authors: Naresh Kshetri, Irin Sultana, Mir Mehedi Rahman, Darshana Shah
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2407.14671
Pdf URL: https://arxiv.org/pdf/2407.14671
Copy Paste: [[2407.14671]] DefTesPY: Cyber defense model with enhanced data modeling and analysis for Tesla company via Python Language(https://arxiv.org/abs/2407.14671)
Keywords: protect, defense, attack
Abstract: Several types of cyber-attacks on automobiles and business firms keep on rising as we are preparing to counter cybercrimes with several new technologies and defense models. Cyber defense (also, counter intelligence) is a computer network defense mechanism that involves response to activities, critical infrastructure protection, and information assurance for corporations, government bodies, and other conceivable networks. Cyber defense focuses on preventing, detecting, and responding to assaults or threats in a timely manner so that no infrastructure or information is compromised. With the increasing volume and complexity of cyber threats, most companies need cyber defense to protect sensitive information and assets. We can control attacker actions by utilizing firewalls at different levels, an intrusion detection system (IDS), with the intrusion prevention system (IPS) which can be installed independently or in combination with other protection approaches. Tesla is an American clean energy and automotive company in Austin, Texas, USA. The recent data breach at Tesla affected over 75,000 individuals as the company pinpoints two former employees as the offender revealing more than 23,000 internal files from 2015 to 2022. In this work, we will emphasize data modeling and data analysis using cyber defense model and python with a survey of the Tesla company. We have proposed a defense model, DefTesPY, with enhanced data modeling and data analysis based on the encountered cyber-attacks and cybercrimes for Tesla company till date.

Title: Compact Language Models via Pruning and Knowledge Distillation

Authors: Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14679
Pdf URL: https://arxiv.org/pdf/2407.14679
Copy Paste: [[2407.14679]] Compact Language Models via Pruning and Knowledge Distillation(https://arxiv.org/abs/2407.14679)
Keywords: large language model
Abstract: Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Huggingface, with corresponding supplementary material including example code available on GitHub.

Title: Data Poisoning: An Overlooked Threat to Power Grid Resilience

Authors: Nora Agah, Javad Mohammadi, Alex Aved, David Ferris, Erika Ardiles Cruz, Philip Morrone
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2407.14684
Pdf URL: https://arxiv.org/pdf/2407.14684
Copy Paste: [[2407.14684]] Data Poisoning: An Overlooked Threat to Power Grid Resilience(https://arxiv.org/abs/2407.14684)
Keywords: secure, attack
Abstract: As the complexities of Dynamic Data Driven Applications Systems increase, preserving their resilience becomes more challenging. For instance, maintaining power grid resilience is becoming increasingly complicated due to the growing number of stochastic variables (such as renewable outputs) and extreme weather events that add uncertainty to the grid. Current optimization methods have struggled to accommodate this rise in complexity. This has fueled the growing interest in data-driven methods used to operate the grid, leading to more vulnerability to cyberattacks. One such disruption that is commonly discussed is the adversarial disruption, where the intruder attempts to add a small perturbation to input data in order to "manipulate" the system operation. During the last few years, work on adversarial training and disruptions on the power system has gained popularity. In this paper, we will first review these applications, specifically on the most common types of adversarial disruptions: evasion and poisoning disruptions. Through this review, we highlight the gap between poisoning and evasion research when applied to the power grid. This is due to the underlying assumption that model training is secure, leading to evasion disruptions being the primary type of studied disruption. Finally, we will examine the impacts of data poisoning interventions and showcase how they can endanger power grid resilience.

Title: A Comprehensive Guide to Combining R and Python code for Data Science, Machine Learning and Reinforcement Learning

Authors: Alejandro L. García Navarro, Nataliia Koneva, Alfonso Sánchez-Macián, José Alberto Hernández
Subjects: cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/2407.14695
Pdf URL: https://arxiv.org/pdf/2407.14695
Copy Paste: [[2407.14695]] A Comprehensive Guide to Combining R and Python code for Data Science, Machine Learning and Reinforcement Learning(https://arxiv.org/abs/2407.14695)
Keywords: robust
Abstract: Python has gained widespread popularity in the fields of machine learning, artificial intelligence, and data engineering due to its effectiveness and extensive libraries. R, on its side, remains a dominant language for statistical analysis and visualization. However, certain libraries have become outdated, limiting their functionality and performance. Users can use Python's advanced machine learning and AI capabilities alongside R's robust statistical packages by combining these two programming languages. This paper explores using R's reticulate package to call Python from R, providing practical examples and highlighting scenarios where this integration enhances productivity and analytical capabilities. With a few hello-world code snippets, we demonstrate how to run Python's scikit-learn, pytorch and OpenAI gym libraries for building Machine Learning, Deep Learning, and Reinforcement Learning projects easily.

Title: $\infty$-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions

Authors: Minh-Quan Le, Alexandros Graikos, Srikar Yellapragada, Rajarsi Gupta, Joel Saltz, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14709
Pdf URL: https://arxiv.org/pdf/2407.14709
Copy Paste: [[2407.14709]] $\infty$-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions(https://arxiv.org/abs/2407.14709)
Keywords: diffusion, generative
Abstract: Synthesizing high-resolution images from intricate, domain-specific information remains a significant challenge in generative modeling, particularly for applications in large-image domains such as digital histopathology and remote sensing. Existing methods face critical limitations: conditional diffusion models in pixel or latent space cannot exceed the resolution on which they were trained without losing fidelity, and computational demands increase significantly for larger image sizes. Patch-based methods offer computational efficiency but fail to capture long-range spatial relationships due to their overreliance on local information. In this paper, we introduce a novel conditional diffusion model in infinite dimensions, $\infty$-Brush for controllable large image synthesis. We propose a cross-attention neural operator to enable conditioning in function space. Our model overcomes the constraints of traditional finite-dimensional diffusion models and patch-based methods, offering scalability and superior capability in preserving global image structures while maintaining fine details. To our best knowledge, $\infty$-Brush is the first conditional diffusion model in function space, that can controllably synthesize images at arbitrary resolutions of up to $4096\times4096$ pixels. The code is available at this https URL.

Title: Universally Harmonizing Differential Privacy Mechanisms for Federated Learning: Boosting Accuracy and Convergence

Authors: Shuya Feng, Meisam Mohammady, Hanbin Hong, Shenao Yan, Ashish Kundu, Binghui Wang, Yuan Hong
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2407.14710
Pdf URL: https://arxiv.org/pdf/2407.14710
Copy Paste: [[2407.14710]] Universally Harmonizing Differential Privacy Mechanisms for Federated Learning: Boosting Accuracy and Convergence(https://arxiv.org/abs/2407.14710)
Keywords: privacy, attack, federate
Abstract: Differentially private federated learning (DP-FL) is a promising technique for collaborative model training while ensuring provable privacy for clients. However, optimizing the tradeoff between privacy and accuracy remains a critical challenge. To our best knowledge, we propose the first DP-FL framework (namely UDP-FL), which universally harmonizes any randomization mechanism (e.g., an optimal one) with the Gaussian Moments Accountant (viz. DP-SGD) to significantly boost accuracy and convergence. Specifically, UDP-FL demonstrates enhanced model performance by mitigating the reliance on Gaussian noise. The key mediator variable in this transformation is the Rényi Differential Privacy notion, which is carefully used to harmonize privacy budgets. We also propose an innovative method to theoretically analyze the convergence for DP-FL (including our UDP-FL ) based on mode connectivity analysis. Moreover, we evaluate our UDP-FL through extensive experiments benchmarked against state-of-the-art (SOTA) methods, demonstrating superior performance on both privacy guarantees and model performance. Notably, UDP-FL exhibits substantial resilience against different inference attacks, indicating a significant advance in safeguarding sensitive data in federated learning environments.

Title: Differential Privacy of Cross-Attention with Provable Guarantee

Authors: Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2407.14717
Pdf URL: https://arxiv.org/pdf/2407.14717
Copy Paste: [[2407.14717]] Differential Privacy of Cross-Attention with Provable Guarantee(https://arxiv.org/abs/2407.14717)
Keywords: security, privacy, attack, robust, diffusion, generative
Abstract: Cross-attention has become a fundamental module nowadays in many important artificial intelligence applications, e.g., retrieval-augmented generation (RAG), system prompt, guided stable diffusion, and many so on. Ensuring cross-attention privacy is crucial and urgently needed because its key and value matrices may contain sensitive information about companies and their users, many of which profit solely from their system prompts or RAG data. In this work, we design a novel differential privacy (DP) data structure to address the privacy security of cross-attention with a theoretical guarantee. In detail, let $n$ be the input token length of system prompt/RAG data, $d$ be the feature dimension, $0 < \alpha \le 1$ be the relative error parameter, $R$ be the maximum value of the query and key matrices, $R_w$ be the maximum value of the value matrix, and $r,s,\epsilon_s$ be parameters of polynomial kernel methods. Then, our data structure requires $\widetilde{O}(ndr^2)$ memory consumption with $\widetilde{O}(nr^2)$ initialization time complexity and $\widetilde{O}(\alpha^{-1} r^2)$ query time complexity for a single token query. In addition, our data structure can guarantee that the user query is $(\epsilon, \delta)$-DP with $\widetilde{O}(n^{-1} \epsilon^{-1} \alpha^{-1/2} R^{2s} R_w r^2)$ additive error and $n^{-1} (\alpha + \epsilon_s)$ relative error between our output and the true answer. Furthermore, our result is robust to adaptive queries in which users can intentionally attack the cross-attention system. To our knowledge, this is the first work to provide DP for cross-attention. We believe it can inspire more privacy algorithm design in large generative models (LGMs).

Title: Downstream-Pretext Domain Knowledge Traceback for Active Learning

Authors: Beichen Zhang, Liang Li, Zheng-Jun Zha, Jiebo Luo, Qingming Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2407.14720
Pdf URL: https://arxiv.org/pdf/2407.14720
Copy Paste: [[2407.14720]] Downstream-Pretext Domain Knowledge Traceback for Active Learning(https://arxiv.org/abs/2407.14720)
Keywords: robust, segmentation
Abstract: Active learning (AL) is designed to construct a high-quality labeled dataset by iteratively selecting the most informative samples. Such sampling heavily relies on data representation, while recently pre-training is popular for robust feature learning. However, as pre-training utilizes low-level pretext tasks that lack annotation, directly using pre-trained representation in AL is inadequate for determining the sampling score. To address this problem, we propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance for selecting diverse and instructive samples near the decision boundary. DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator. The diversity indicator constructs two feature spaces based on the pre-training pretext model and the downstream knowledge from annotation, by which it locates the neighbors of unlabeled data from the downstream space in the pretext space to explore the interaction of samples. With this mechanism, DOKT unifies the data relations of low-level and high-level representations to estimate traceback diversity. Next, in the uncertainty estimator, domain mixing is designed to enforce perceptual perturbing to unlabeled samples with similar visual patches in the pretext space. Then the divergence of perturbed samples is measured to estimate the domain uncertainty. As a result, DOKT selects the most diverse and important samples based on these two modules. The experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods and generalizes well to various application scenarios such as semantic segmentation and image captioning.

Title: CrowdMAC: Masked Crowd Density Completion for Robust Crowd Density Forecasting

Authors: Ryo Fujii, Ryo Hachiuma, Hideo Saito
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2407.14725
Pdf URL: https://arxiv.org/pdf/2407.14725
Copy Paste: [[2407.14725]] CrowdMAC: Masked Crowd Density Completion for Robust Crowd Density Forecasting(https://arxiv.org/abs/2407.14725)
Keywords: robust
Abstract: A crowd density forecasting task aims to predict how the crowd density map will change in the future from observed past crowd density maps. However, the past crowd density maps are often incomplete due to the miss-detection of pedestrians, and it is crucial to develop a robust crowd density forecasting model against the miss-detection. This paper presents a MAsked crowd density Completion framework for crowd density forecasting (CrowdMAC), which is simultaneously trained to forecast future crowd density maps from partially masked past crowd density maps (i.e., forecasting maps from past maps with miss-detection) while reconstructing the masked observation maps (i.e., imputing past maps with miss-detection). Additionally, we propose Temporal-Density-aware Masking (TDM), which non-uniformly masks tokens in the observed crowd density map, considering the sparsity of the crowd density maps and the informativeness of the subsequent frames for the forecasting task. Moreover, we introduce multi-task masking to enhance training efficiency. In the experiments, CrowdMAC achieves state-of-the-art performance on seven large-scale datasets, including SDD, ETH-UCY, inD, JRDB, VSCrowd, FDST, and croHD. We also demonstrate the robustness of the proposed method against both synthetic and realistic miss-detections.

Title: FedDM: Enhancing Communication Efficiency and Handling Data Heterogeneity in Federated Diffusion Models

Authors: Jayneel Vora, Nader Bouacida, Aditya Krishnan, Prasant Mohapatra
Subjects: cs.LG, cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2407.14730
Pdf URL: https://arxiv.org/pdf/2407.14730
Copy Paste: [[2407.14730]] FedDM: Enhancing Communication Efficiency and Handling Data Heterogeneity in Federated Diffusion Models(https://arxiv.org/abs/2407.14730)
Keywords: federate, diffusion
Abstract: We introduce FedDM, a novel training framework designed for the federated training of diffusion models. Our theoretical analysis establishes the convergence of diffusion models when trained in a federated setting, presenting the specific conditions under which this convergence is guaranteed. We propose a suite of training algorithms that leverage the U-Net architecture as the backbone for our diffusion models. These include a basic Federated Averaging variant, FedDM-vanilla, FedDM-prox to handle data heterogeneity among clients, and FedDM-quant, which incorporates a quantization module to reduce the model update size, thereby enhancing communication efficiency across the federated network. We evaluate our algorithms on FashionMNIST (28x28 resolution), CIFAR-10 (32x32 resolution), and CelebA (64x64 resolution) for DDPMs, as well as LSUN Church Outdoors (256x256 resolution) for LDMs, focusing exclusively on the imaging modality. Our evaluation results demonstrate that FedDM algorithms maintain high generation quality across image resolutions. At the same time, the use of quantized updates and proximal terms in the local training objective significantly enhances communication efficiency (up to 4x) and model convergence, particularly in non-IID data settings, at the cost of increased FID scores (up to 1.75x).

Title: Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL

Authors: Yunseon Choi, Sangmin Bae, Seonghyun Ban, Minchan Jeong, Chuheng Zhang, Lei Song, Li Zhao, Jiang Bian, Kee-Eung Kim
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2407.14733
Pdf URL: https://arxiv.org/pdf/2407.14733
Copy Paste: [[2407.14733]] Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL(https://arxiv.org/abs/2407.14733)
Keywords: interpretability
Abstract: With the advent of foundation models, prompt tuning has positioned itself as an important technique for directing model behaviors and eliciting desired responses. Prompt tuning regards selecting appropriate keywords included into the input, thereby adapting to the downstream task without adjusting or fine-tuning the model parameters. There is a wide range of work in prompt tuning, from approaches that directly harness the backpropagated gradient signals from the model, to those employing black-box optimization such as reinforcement learning (RL) methods. Our primary focus is on RLPrompt, which aims to find optimal prompt tokens leveraging soft Q-learning. While the results show promise, we have observed that the prompts frequently appear unnatural, which impedes their interpretability. We address this limitation by using sparse Tsallis entropy regularization, a principled approach to filtering out unlikely tokens from consideration. We extensively evaluate our approach across various tasks, including few-shot text classification, unsupervised text style transfer, and textual inversion from images. The results indicate a notable improvement over baselines, highlighting the efficacy of our approach in addressing the challenges of prompt tuning. Moreover, we show that the prompts discovered using our method are more natural and interpretable compared to those from other baselines.

Title: Flatness-aware Sequential Learning Generates Resilient Backdoors

Authors: Hoang Pham, The-Anh Ta, Anh Tran, Khoa D. Doan
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2407.14738
Pdf URL: https://arxiv.org/pdf/2407.14738
Copy Paste: [[2407.14738]] Flatness-aware Sequential Learning Generates Resilient Backdoors(https://arxiv.org/abs/2407.14738)
Keywords: security, defense, attack
Abstract: Recently, backdoor attacks have become an emerging threat to the security of machine learning models. From the adversary's perspective, the implanted backdoors should be resistant to defensive algorithms, but some recently proposed fine-tuning defenses can remove these backdoors with notable efficacy. This is mainly due to the catastrophic forgetting (CF) property of deep neural networks. This paper counters CF of backdoors by leveraging continual learning (CL) techniques. We begin by investigating the connectivity between a backdoored and fine-tuned model in the loss landscape. Our analysis confirms that fine-tuning defenses, especially the more advanced ones, can easily push a poisoned model out of the backdoor regions, making it forget all about the backdoors. Based on this finding, we re-formulate backdoor training through the lens of CL and propose a novel framework, named Sequential Backdoor Learning (SBL), that can generate resilient backdoors. This framework separates the backdoor poisoning process into two tasks: the first task learns a backdoored model, while the second task, based on the CL principles, moves it to a backdoored region resistant to fine-tuning. We additionally propose to seek flatter backdoor regions via a sharpness-aware minimizer in the framework, further strengthening the durability of the implanted backdoor. Finally, we demonstrate the effectiveness of our method through extensive empirical experiments on several benchmark datasets in the backdoor domain. The source code is available at this https URL

Title: Difflare: Removing Image Lens Flare with Latent Diffusion Model

Authors: Tianwen Zhou, Qihao Duan, Zitong Yu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2407.14746
Pdf URL: https://arxiv.org/pdf/2407.14746
Copy Paste: [[2407.14746]] Difflare: Removing Image Lens Flare with Latent Diffusion Model(https://arxiv.org/abs/2407.14746)
Keywords: extraction, diffusion, generative
Abstract: The recovery of high-quality images from images corrupted by lens flare presents a significant challenge in low-level vision. Contemporary deep learning methods frequently entail training a lens flare removing model from scratch. However, these methods, despite their noticeable success, fail to utilize the generative prior learned by pre-trained models, resulting in unsatisfactory performance in lens flare removal. Furthermore, there are only few works considering the physical priors relevant to flare removal. To address these issues, we introduce Difflare, a novel approach designed for lens flare removal. To leverage the generative prior learned by Pre-Trained Diffusion Models (PTDM), we introduce a trainable Structural Guidance Injection Module (SGIM) aimed at guiding the restoration process with PTDM. Towards more efficient training, we employ Difflare in the latent space. To address information loss resulting from latent compression and the stochastic sampling process of PTDM, we introduce an Adaptive Feature Fusion Module (AFFM), which incorporates the Luminance Gradient Prior (LGP) of lens flare to dynamically regulate feature extraction. Extensive experiments demonstrate that our proposed Difflare achieves state-of-the-art performance in real-world lens flare removal, restoring images corrupted by flare with improved fidelity and perceptual quality. The codes will be released soon.

Title: Enhancing Skin Disease Classification Leveraging Transformer-based Deep Learning Architectures and Explainable AI

Authors: Jayanth Mohan, Arrun Sivasubramanian, V Sowmya, Ravi Vinayakumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14757
Pdf URL: https://arxiv.org/pdf/2407.14757
Copy Paste: [[2407.14757]] Enhancing Skin Disease Classification Leveraging Transformer-based Deep Learning Architectures and Explainable AI(https://arxiv.org/abs/2407.14757)
Keywords: robust, extraction, transformer
Abstract: Skin diseases affect over a third of the global population, yet their impact is often underestimated. Automating skin disease classification to assist doctors with their prognosis might be difficult. Nevertheless, due to efficient feature extraction pipelines, deep learning techniques have shown much promise for various tasks, including dermatological disease identification. This study uses a skin disease dataset with 31 classes and compares it with all versions of Vision Transformers, Swin Transformers and DivoV2. The analysis is also extended to compare with benchmark convolution-based architecture presented in the literature. Transfer learning with ImageNet1k weights on the skin disease dataset contributes to a high test accuracy of 96.48\% and an F1-Score of 0.9727 using DinoV2, which is almost a 10\% improvement over this data's current benchmark results. The performance of DinoV2 was also compared for the HAM10000 and Dermnet datasets to test the model's robustness, and the trained model overcomes the benchmark results by a slight margin in test accuracy and in F1-Score on the 23 and 7 class datasets. The results are substantiated using explainable AI frameworks like GradCAM and SHAP, which provide precise image locations to map the disease, assisting dermatologists in early detection, prompt prognosis, and treatment.

Title: Data Augmentation in Graph Neural Networks: The Role of Generated Synthetic Graphs

Authors: Sumeyye Bas, Kiymet Kaya, Resul Tugay, Sule Gunduz Oguducu
Subjects: cs.LG, cs.AI, cs.DB, cs.IT
Abstract URL: https://arxiv.org/abs/2407.14765
Pdf URL: https://arxiv.org/pdf/2407.14765
Copy Paste: [[2407.14765]] Data Augmentation in Graph Neural Networks: The Role of Generated Synthetic Graphs(https://arxiv.org/abs/2407.14765)
Keywords: generative
Abstract: Graphs are crucial for representing interrelated data and aiding predictive modeling by capturing complex relationships. Achieving high-quality graph representation is important for identifying linked patterns, leading to improvements in Graph Neural Networks (GNNs) to better capture data structures. However, challenges such as data scarcity, high collection costs, and ethical concerns limit progress. As a result, generative models and data augmentation have become more and more popular. This study explores using generated graphs for data augmentation, comparing the performance of combining generated graphs with real graphs, and examining the effect of different quantities of generated graphs on graph classification tasks. The experiments show that balancing scalability and quality requires different generators based on graph size. Our results introduce a new approach to graph data augmentation, ensuring consistent labels and enhancing classification performance.

Title: Implementing Fairness: the view from a FairDream

Authors: Thomas Souverain, Johnathan Nguyen, Nicolas Meric, Paul Égré
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2407.14766
Pdf URL: https://arxiv.org/pdf/2407.14766
Copy Paste: [[2407.14766]] Implementing Fairness: the view from a FairDream(https://arxiv.org/abs/2407.14766)
Keywords: fair
Abstract: In this paper, we propose an experimental investigation of the problem of AI fairness in classification. We train an AI model and develop our own fairness package FairDream to detect inequalities and then to correct for them, using income prediction as a case study. Our experiments show that it is a property of FairDream to fulfill fairness objectives which are conditional on the ground truth (Equalized Odds), even when the algorithm is set the task of equalizing positives across groups (Demographic Parity). While this may be seen as an anomaly, we explain this property by comparing our approach with a closely related fairness method (GridSearch), which can enforce Demographic Parity at the expense of Equalized Odds. We grant that a fairness metric conditioned on true labels does not give a sufficient criterion to reach fairness, but we argue that it gives us at least a necessary condition to implement Demographic Parity cautiously. We also explain why neither Equal Calibration nor Equal Precision stand as relevant fairness criteria in classification. Addressing their limitations to warn the decision-maker for any disadvantaging rate, Equalized Odds avoids the peril of strict conservatism, while keeping away the utopia of a whole redistribution of resources through algorithms.

Title: Subgraph Clustering and Atom Learning for Improved Image Classification

Authors: Aryan Singh, Pepijn Van de Ven, Ciarán Eising, Patrick Denny
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14772
Pdf URL: https://arxiv.org/pdf/2407.14772
Copy Paste: [[2407.14772]] Subgraph Clustering and Atom Learning for Improved Image Classification(https://arxiv.org/abs/2407.14772)
Keywords: extraction
Abstract: In this study, we present the Graph Sub-Graph Network (GSN), a novel hybrid image classification model merging the strengths of Convolutional Neural Networks (CNNs) for feature extraction and Graph Neural Networks (GNNs) for structural modeling. GSN employs k-means clustering to group graph nodes into clusters, facilitating the creation of subgraphs. These subgraphs are then utilized to learn representative `atoms` for dictionary learning, enabling the identification of sparse, class-distinguishable features. This integrated approach is particularly relevant in domains like medical imaging, where discerning subtle feature differences is crucial for accurate classification. To evaluate the performance of our proposed GSN, we conducted experiments on benchmark datasets, including PascalVOC and HAM10000. Our results demonstrate the efficacy of our model in optimizing dictionary configurations across varied classes, which contributes to its effectiveness in medical classification tasks. This performance enhancement is primarily attributed to the integration of CNNs, GNNs, and graph learning techniques, which collectively improve the handling of datasets with limited labeled examples. Specifically, our experiments show that the model achieves a higher accuracy on benchmark datasets such as Pascal VOC and HAM10000 compared to conventional CNN approaches.

Title: On the Design and Analysis of LLM-Based Algorithms

Authors: Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2407.14788
Pdf URL: https://arxiv.org/pdf/2407.14788
Copy Paste: [[2407.14788]] On the Design and Analysis of LLM-Based Algorithms(https://arxiv.org/abs/2407.14788)
Keywords: large language model
Abstract: We initiate a formal investigation into the design and analysis of LLM-based algorithms, i.e. algorithms that contain one or multiple calls of large language models (LLMs) as sub-routines and critically rely on the capabilities of LLMs. While LLM-based algorithms, ranging from basic LLM calls with prompt engineering to complicated LLM-powered agent systems and compound AI systems, have achieved remarkable empirical success, the design and optimization of them have mostly relied on heuristics and trial-and-errors, which is largely due to a lack of formal and analytical study for these algorithms. To fill this gap, we start by identifying the computational-graph representation of LLM-based algorithms, the design principle of task decomposition, and some key abstractions, which then facilitate our formal analysis for the accuracy and efficiency of LLM-based algorithms, despite the black-box nature of LLMs. We further consider parallel decomposition for a case study, providing extensive analytical and empirical study for four concrete examples of this pattern. Our proposed framework holds promise for advancing LLM-based algorithms, by revealing the reasons behind curious empirical phenomena, guiding the choices of hyperparameters, predicting the empirical performance of algorithms, and inspiring new algorithm design. To promote further study of LLM-based algorithms, we release our source code at this https URL.

Title: FedPartWhole: Federated domain generalization via consistent part-whole hierarchies

Authors: Ahmed Radwan, Mohamed S. Shehata
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14792
Pdf URL: https://arxiv.org/pdf/2407.14792
Copy Paste: [[2407.14792]] FedPartWhole: Federated domain generalization via consistent part-whole hierarchies(https://arxiv.org/abs/2407.14792)
Keywords: privacy, federate
Abstract: Federated Domain Generalization (FedDG), aims to tackle the challenge of generalizing to unseen domains at test time while catering to the data privacy constraints that prevent centralized data storage from different domains originating at various clients. Existing approaches can be broadly categorized into four groups: domain alignment, data manipulation, learning strategies, and optimization of model aggregation weights. This paper proposes a novel approach to Federated Domain Generalization that tackles the problem from the perspective of the backbone model architecture. The core principle is that objects, even under substantial domain shifts and appearance variations, maintain a consistent hierarchical structure of parts and wholes. For instance, a photograph and a sketch of a dog share the same hierarchical organization, consisting of a head, body, limbs, and so on. The introduced architecture explicitly incorporates a feature representation for the image parse tree. To the best of our knowledge, this is the first work to tackle Federated Domain Generalization from a model architecture standpoint. Our approach outperforms a convolutional architecture of comparable size by over 12\%, despite utilizing fewer parameters. Additionally, it is inherently interpretable, contrary to the black-box nature of CNNs, which fosters trust in its predictions, a crucial asset in federated learning.

Title: PASSION: Towards Effective Incomplete Multi-Modal Medical Image Segmentation with Imbalanced Missing Rates

Authors: Junjie Shi, Caozhi Shang, Zhaobin Sun, Li Yu, Xin Yang, Zengqiang Yan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14796
Pdf URL: https://arxiv.org/pdf/2407.14796
Copy Paste: [[2407.14796]] PASSION: Towards Effective Incomplete Multi-Modal Medical Image Segmentation with Imbalanced Missing Rates(https://arxiv.org/abs/2407.14796)
Keywords: segmentation
Abstract: Incomplete multi-modal image segmentation is a fundamental task in medical imaging to refine deployment efficiency when only partial modalities are available. However, the common practice that complete-modality data is visible during model training is far from realistic, as modalities can have imbalanced missing rates in clinical scenarios. In this paper, we, for the first time, formulate such a challenging setting and propose Preference-Aware Self-diStillatION (PASSION) for incomplete multi-modal medical image segmentation under imbalanced missing rates. Specifically, we first construct pixel-wise and semantic-wise self-distillation to balance the optimization objective of each modality. Then, we define relative preference to evaluate the dominance of each modality during training, based on which to design task-wise and gradient-wise regularization to balance the convergence rates of different modalities. Experimental results on two publicly available multi-modal datasets demonstrate the superiority of PASSION against existing approaches for modality balancing. More importantly, PASSION is validated to work as a plug-and-play module for consistent performance improvement across different backbones. Code is available at this https URL.

Title: FairViT: Fair Vision Transformer via Adaptive Masking

Authors: Bowei Tian, Ruijie Du, Yanning Shen
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2407.14799
Pdf URL: https://arxiv.org/pdf/2407.14799
Copy Paste: [[2407.14799]] FairViT: Fair Vision Transformer via Adaptive Masking(https://arxiv.org/abs/2407.14799)
Keywords: fair, transformer
Abstract: Vision Transformer (ViT) has achieved excellent performance and demonstrated its promising potential in various computer vision tasks. The wide deployment of ViT in real-world tasks requires a thorough understanding of the societal impact of the model. However, most ViT-based works do not take fairness into account and it is unclear whether directly applying CNN-oriented debiased algorithm to ViT is feasible. Moreover, previous works typically sacrifice accuracy for fairness. Therefore, we aim to develop an algorithm that improves accuracy without sacrificing fairness. In this paper, we propose FairViT, a novel accurate and fair ViT framework. To this end, we introduce a novel distance loss and deploy adaptive fairness-aware masks on attention layers updating with model parameters. Experimental results show \sys can achieve accuracy better than other alternatives, even with competitive computational efficiency. Furthermore, \sys achieves appreciable fairness results.

Title: WiFaKey: Generating Cryptographic Keys from Face in the Wild

Authors: Xingbo Dong, Hui Zhang, Yen Lung Lai, Zhe Jin, Junduan Huang, Wenxiong Kang, Andrew Beng Jin Teoh
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2407.14804
Pdf URL: https://arxiv.org/pdf/2407.14804
Copy Paste: [[2407.14804]] WiFaKey: Generating Cryptographic Keys from Face in the Wild(https://arxiv.org/abs/2407.14804)
Keywords: security, privacy, robust, biometric
Abstract: Deriving a unique cryptographic key from biometric measurements is a challenging task due to the existing noise gap between the biometric measurements and error correction coding. Additionally, privacy and security concerns arise as biometric measurements are inherently linked to the user. Biocryptosystems represent a key branch of solutions aimed at addressing these issues. However, many existing bio-cryptosystems rely on handcrafted feature extractors and error correction codes (ECC), often leading to performance degradation. To address these challenges and improve the reliability of biometric measurements, we propose a novel biometric cryptosystem named WiFaKey, for generating cryptographic keys from face in unconstrained settings. Speciffcally, WiFaKey ffrst introduces an adaptive random masking-driven feature transformation pipeline, AdaMTrans. AdaMTrans effectively quantizes and binarizes realvalued features and incorporates an adaptive random masking scheme to align the bit error rate with error correction requirements, thereby mitigating the noise gap. Besides, WiFaKey incorporates a supervised learning-based neural decoding scheme called Neural-MS decoder, which delivers a more robust error correction performance with less iteration than non-learning decoders, thereby alleviating the performance degradation. We evaluated WiFaKey using widely adopted face feature extractors on six large unconstrained and two constrained datasets. On the LFW dataset, WiFaKey achieved an average Genuine Match Rate of 85.45% and 85.20% at a 0% False Match Rate for MagFace and AdaFace features, respectively. Our comprehensive comparative analysis shows a signiffcant performance improvement of WiFaKey. The source code of our work is available at this http URL.

Title: Decoupled Prompt-Adapter Tuning for Continual Activity Recognition

Authors: Di Fu, Thanh Vinh Vo, Haozhe Ma, Tze-Yun Leong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14811
Pdf URL: https://arxiv.org/pdf/2407.14811
Copy Paste: [[2407.14811]] Decoupled Prompt-Adapter Tuning for Continual Activity Recognition(https://arxiv.org/abs/2407.14811)
Keywords: security
Abstract: Action recognition technology plays a vital role in enhancing security through surveillance systems, enabling better patient monitoring in healthcare, providing in-depth performance analysis in sports, and facilitating seamless human-AI collaboration in domains such as manufacturing and assistive technologies. The dynamic nature of data in these areas underscores the need for models that can continuously adapt to new video data without losing previously acquired knowledge, highlighting the critical role of advanced continual action recognition. To address these challenges, we propose Decoupled Prompt-Adapter Tuning (DPAT), a novel framework that integrates adapters for capturing spatial-temporal information and learnable prompts for mitigating catastrophic forgetting through a decoupled training strategy. DPAT uniquely balances the generalization benefits of prompt tuning with the plasticity provided by adapters in pretrained vision models, effectively addressing the challenge of maintaining model performance amidst continuous data evolution without necessitating extensive finetuning. DPAT consistently achieves state-of-the-art performance across several challenging action recognition benchmarks, thus demonstrating the effectiveness of our model in the domain of continual action recognition.

Title: GaitMA: Pose-guided Multi-modal Feature Fusion for Gait Recognition

Authors: Fanxu Min, Shaoxiang Guo, Fan Hao, Junyu Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14812
Pdf URL: https://arxiv.org/pdf/2407.14812
Copy Paste: [[2407.14812]] GaitMA: Pose-guided Multi-modal Feature Fusion for Gait Recognition(https://arxiv.org/abs/2407.14812)
Keywords: robust, biometric, transformer
Abstract: Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns. Existing appearance-based methods utilize CNN or Transformer to extract spatial and temporal features from silhouettes, while model-based methods employ GCN to focus on the special topological structure of skeleton points. However, the quality of silhouettes is limited by complex occlusions, and skeletons lack dense semantic features of the human body. To tackle these problems, we propose a novel gait recognition framework, dubbed Gait Multi-model Aggregation Network (GaitMA), which effectively combines two modalities to obtain a more robust and comprehensive gait representation for recognition. First, skeletons are represented by joint/limb-based heatmaps, and features from silhouettes and skeletons are respectively extracted using two CNN-based feature extractors. Second, a co-attention alignment module is proposed to align the features by element-wise attention. Finally, we propose a mutual learning module, which achieves feature fusion through cross-attention, Wasserstein loss is further introduced to ensure the effective fusion of two modalities. Extensive experimental results demonstrate the superiority of our model on Gait3D, OU-MVLP, and CASIA-B.

Title: FMamba: Mamba based on Fast-attention for Multivariate Time-series Forecasting

Authors: Shusen Ma, Yu Kang, Peng Bai, Yun-Bo Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2407.14814
Pdf URL: https://arxiv.org/pdf/2407.14814
Copy Paste: [[2407.14814]] FMamba: Mamba based on Fast-attention for Multivariate Time-series Forecasting(https://arxiv.org/abs/2407.14814)
Keywords: extraction, transformer
Abstract: In multivariate time-series forecasting (MTSF), extracting the temporal correlations of the input sequences is crucial. While popular Transformer-based predictive models can perform well, their quadratic computational complexity results in inefficiency and high overhead. The recently emerged Mamba, a selective state space model, has shown promising results in many fields due to its strong temporal feature extraction capabilities and linear computational complexity. However, due to the unilateral nature of Mamba, channel-independent predictive models based on Mamba cannot attend to the relationships among all variables in the manner of Transformer-based models. To address this issue, we combine fast-attention with Mamba to introduce a novel framework named FMamba for MTSF. Technically, we first extract the temporal features of the input variables through an embedding layer, then compute the dependencies among input variables via the fast-attention module. Subsequently, we use Mamba to selectively deal with the input features and further extract the temporal dependencies of the variables through the multi-layer perceptron block (MLP-block). Finally, FMamba obtains the predictive results through the projector, a linear layer. Experimental results on eight public datasets demonstrate that FMamba can achieve state-of-the-art performance while maintaining low computational overhead.

Title: Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding

Authors: Jiangtao Zhang, Zongsheng Yue, Hui Wang, Qian Zhao, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14816
Pdf URL: https://arxiv.org/pdf/2407.14816
Copy Paste: [[2407.14816]] Blind Image Deconvolution by Generative-based Kernel Prior and Initializer via Latent Encoding(https://arxiv.org/abs/2407.14816)
Keywords: generative
Abstract: Blind image deconvolution (BID) is a classic yet challenging problem in the field of image processing. Recent advances in deep image prior (DIP) have motivated a series of DIP-based approaches, demonstrating remarkable success in BID. However, due to the high non-convexity of the inherent optimization process, these methods are notorious for their sensitivity to the initialized kernel. To alleviate this issue and further improve their performance, we propose a new framework for BID that better considers the prior modeling and the initialization for blur kernels, leveraging a deep generative model. The proposed approach pre-trains a generative adversarial network-based kernel generator that aptly characterizes the kernel priors and a kernel initializer that facilitates a well-informed initialization for the blur kernel through latent space encoding. With the pre-trained kernel generator and initializer, one can obtain a high-quality initialization of the blur kernel, and enable optimization within a compact latent kernel manifold. Such a framework results in an evident performance improvement over existing DIP-based BID methods. Extensive experiments on different datasets demonstrate the effectiveness of the proposed method.

Title: CrossDehaze: Scaling Up Image Dehazing with Cross-Data Vision Alignment and Augmentation

Authors: Yukai Shi, Zhipeng Weng, Yupei Lin, Cidan Shi, Xiaojun Yang, Liang Lin
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2407.14823
Pdf URL: https://arxiv.org/pdf/2407.14823
Copy Paste: [[2407.14823]] CrossDehaze: Scaling Up Image Dehazing with Cross-Data Vision Alignment and Augmentation(https://arxiv.org/abs/2407.14823)
Keywords: robust
Abstract: In recent years, as computer vision tasks have increasingly relied on high-quality image inputs, the task of image dehazing has received significant attention. Previously, many methods based on priors and deep learning have been proposed to address the task of image dehazing. Ignoring the domain gap between different data, former de-hazing methods usually adopt multiple datasets for explicit training, which often makes the methods themselves be violated. To address this problem, we propose a novel method of internal and external data augmentation to improve the existing dehazing methodology. By using cross-data external augmentor. The dataset inherits samples from different domains that are firmly aligned, making the model learn more robust and generalizable features. By using the internal data augmentation method, the model can fully exploit local information within the images, thereby obtaining more image details. To demonstrate the effectiveness of our proposed method, we conduct training on both the Natural Image Dataset (NID) and the Remote Sensing Image Dataset (RSID). Experimental results show that our method clearly resolves the domain gap in different dehazing datasets and presents a new pipeline for joint training in the dehazing task. Our approach significantly outperforms other advanced methods in dehazing and produces dehazed images that are closest to real haze-free images. The code will be available at: this https URL

Title: Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators

Authors: Harsh Lunia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14834
Pdf URL: https://arxiv.org/pdf/2407.14834
Copy Paste: [[2407.14834]] Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators(https://arxiv.org/abs/2407.14834)
Keywords: large language model
Abstract: Recent advancements have introduced multiple vision-language models (VLMs) demonstrating impressive commonsense reasoning across various domains. Despite their individual capabilities, the potential of synergizing these complementary VLMs remains underexplored. The Cola Framework addresses this by showcasing how a large language model (LLM) can efficiently coordinate multiple VLMs through natural language communication, leveraging their distinct strengths. We have verified this claim on the challenging A-OKVQA dataset, confirming the effectiveness of such coordination. Building on this, our study investigates whether the same methodology can be applied to surveillance videos for action recognition. Specifically, we explore if leveraging the combined knowledge base of VLMs and LLM can effectively deduce actions from a video when presented with only a few selectively important frames and minimal temporal information. Our experiments demonstrate that LLM, when coordinating different VLMs, can successfully recognize patterns and deduce actions in various scenarios despite the weak temporal signals. However, our findings suggest that to enhance this approach as a viable alternative solution, integrating a stronger temporal signal and exposing the models to slightly more frames would be beneficial.

Title: Retrieval Augmented Generation Integrated Large Language Models in Smart Contract Vulnerability Detection

Authors: Jeffy Yu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14838
Pdf URL: https://arxiv.org/pdf/2407.14838
Copy Paste: [[2407.14838]] Retrieval Augmented Generation Integrated Large Language Models in Smart Contract Vulnerability Detection(https://arxiv.org/abs/2407.14838)
Keywords: security, attack, large language model
Abstract: The rapid growth of Decentralized Finance (DeFi) has been accompanied by substantial financial losses due to smart contract vulnerabilities, underscoring the critical need for effective security auditing. With attacks becoming more frequent, the necessity and demand for auditing services has escalated. This especially creates a financial burden for independent developers and small businesses, who often have limited available funding for these services. Our study builds upon existing frameworks by integrating Retrieval-Augmented Generation (RAG) with large language models (LLMs), specifically employing GPT-4-1106 for its 128k token context window. We construct a vector store of 830 known vulnerable contracts, leveraging Pinecone for vector storage, OpenAI's text-embedding-ada-002 for embeddings, and LangChain to construct the RAG-LLM pipeline. Prompts were designed to provide a binary answer for vulnerability detection. We first test 52 smart contracts 40 times each against a provided vulnerability type, verifying the replicability and consistency of the RAG-LLM. Encouraging results were observed, with a 62.7% success rate in guided detection of vulnerabilities. Second, we challenge the model under a "blind" audit setup, without the vulnerability type provided in the prompt, wherein 219 contracts undergo 40 tests each. This setup evaluates the general vulnerability detection capabilities without hinted context assistance. Under these conditions, a 60.71% success rate was observed. While the results are promising, we still emphasize the need for human auditing at this time. We provide this study as a proof of concept for a cost-effective smart contract auditing process, moving towards democratic access to security.

Title: Text-based Talking Video Editing with Cascaded Conditional Diffusion

Authors: Bo Han, Heqing Zou, Haoyang Li, Guangcong Wang, Chng Eng Siong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14841
Pdf URL: https://arxiv.org/pdf/2407.14841
Copy Paste: [[2407.14841]] Text-based Talking Video Editing with Cascaded Conditional Diffusion(https://arxiv.org/abs/2407.14841)
Keywords: diffusion
Abstract: Text-based talking-head video editing aims to efficiently insert, delete, and substitute segments of talking videos through a user-friendly text editing approach. It is challenging because of \textbf{1)} generalizable talking-face representation, \textbf{2)} seamless audio-visual transitions, and \textbf{3)} identity-preserved talking faces. Previous works either require minutes of talking-face video training data and expensive test-time optimization for customized talking video editing or directly generate a video sequence without considering in-context information, leading to a poor generalizable representation, or incoherent transitions, or even inconsistent identity. In this paper, we propose an efficient cascaded conditional diffusion-based framework, which consists of two stages: audio to dense-landmark motion and motion to video. \textit{\textbf{In the first stage}}, we first propose a dynamic weighted in-context diffusion module to synthesize dense-landmark motions given an edited audio. \textit{\textbf{In the second stage}}, we introduce a warping-guided conditional diffusion module. The module first interpolates between the start and end frames of the editing interval to generate smooth intermediate frames. Then, with the help of the audio-to-dense motion images, these intermediate frames are warped to obtain coarse intermediate frames. Conditioned on the warped intermedia frames, a diffusion model is adopted to generate detailed and high-resolution target frames, which guarantees coherent and identity-preserved transitions. The cascaded conditional diffusion model decomposes the complex talking editing task into two flexible generation tasks, which provides a generalizable talking-face representation, seamless audio-visual transitions, and identity-preserved faces on a small dataset. Experiments show the effectiveness and superiority of the proposed method.

Title: Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models

Authors: Ze Yu Zhang, Arun Verma, Finale Doshi-Velez, Bryan Kian Hsiang Low
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2407.14845
Pdf URL: https://arxiv.org/pdf/2407.14845
Copy Paste: [[2407.14845]] Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models(https://arxiv.org/abs/2407.14845)
Keywords: large language model
Abstract: Large language models (LLMs) are widely used in decision-making, but their reliability, especially in critical tasks like healthcare, is not well-established. Therefore, understanding how LLMs reason and make decisions is crucial for their safe deployment. This paper investigates how the uncertainty of responses generated by LLMs relates to the information provided in the input prompt. Leveraging the insight that LLMs learn to infer latent concepts during pretraining, we propose a prompt-response concept model that explains how LLMs generate responses and helps understand the relationship between prompts and response uncertainty. We show that the uncertainty decreases as the prompt's informativeness increases, similar to epistemic uncertainty. Our detailed experimental results on real datasets validate our proposed model.

Title: CBCTLiTS: A Synthetic, Paired CBCT/CT Dataset For Segmentation And Style Transfer

Authors: Maximilian E. Tschuchnig, Philipp Steininger, Michael Gadermayr
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14853
Pdf URL: https://arxiv.org/pdf/2407.14853
Copy Paste: [[2407.14853]] CBCTLiTS: A Synthetic, Paired CBCT/CT Dataset For Segmentation And Style Transfer(https://arxiv.org/abs/2407.14853)
Keywords: segmentation
Abstract: Medical imaging is vital in computer assisted intervention. Particularly cone beam computed tomography (CBCT) with defacto real time and mobility capabilities plays an important role. However, CBCT images often suffer from artifacts, which pose challenges for accurate interpretation, motivating research in advanced algorithms for more effective use in clinical practice. In this work we present CBCTLiTS, a synthetically generated, labelled CBCT dataset for segmentation with paired and aligned, high quality computed tomography data. The CBCT data is provided in 5 different levels of quality, reaching from a large number of projections with high visual quality and mild artifacts to a small number of projections with severe artifacts. This allows thorough investigations with the quality as a degree of freedom. We also provide baselines for several possible research scenarios like uni- and multimodal segmentation, multitask learning and style transfer followed by segmentation of relatively simple, liver to complex liver tumor segmentation. CBCTLiTS is accesssible via this https URL.

Title: Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques

Authors: A. Verdone, A. Devoto, C. Sebastiani, J. Carmignani, M. D'Onofrio, S. Giagu, S. Scardapane, M. Panella
Subjects: cs.LG, hep-ex
Abstract URL: https://arxiv.org/abs/2407.14859
Pdf URL: https://arxiv.org/pdf/2407.14859
Copy Paste: [[2407.14859]] Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques(https://arxiv.org/abs/2407.14859)
Keywords: robust
Abstract: The experiments at the Large Hadron Collider at CERN generate vast amounts of complex data from high-energy particle collisions. This data presents significant challenges due to its volume and complex reconstruction, necessitating the use of advanced analysis techniques for analysis. Recent advancements in deep learning, particularly Graph Neural Networks, have shown promising results in addressing the challenges but remain computationally expensive. The study presented in this paper uses a simulated particle collision dataset to integrate influence analysis inside the graph classification pipeline aiming at improving the accuracy and efficiency of collision event prediction tasks. By using a Graph Neural Network for initial training, we applied a gradient-based data influence method to identify influential training samples and then we refined the dataset by removing non-contributory elements: the model trained on this new reduced dataset can achieve good performances at a reduced computational cost. The method is completely agnostic to the specific influence method: different influence modalities can be easily integrated into our methodology. Moreover, by analyzing the discarded elements we can provide further insights about the event classification task. The novelty of integrating data attribution techniques together with Graph Neural Networks in high-energy physics tasks can offer a robust solution for managing large-scale data problems, capturing critical patterns, and maximizing accuracy across several high-data demand domains.

Title: An Explainable Fast Deep Neural Network for Emotion Recognition

Authors: Francesco Di Luzio, Antonello Rosato, Massimo Panella
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14865
Pdf URL: https://arxiv.org/pdf/2407.14865
Copy Paste: [[2407.14865]] An Explainable Fast Deep Neural Network for Emotion Recognition(https://arxiv.org/abs/2407.14865)
Keywords: robust, explainability
Abstract: In the context of artificial intelligence, the inherent human attribute of engaging in logical reasoning to facilitate decision-making is mirrored by the concept of explainability, which pertains to the ability of a model to provide a clear and interpretable account of how it arrived at a particular outcome. This study explores explainability techniques for binary deep neural architectures in the framework of emotion classification through video analysis. We investigate the optimization of input features to binary classifiers for emotion recognition, with face landmarks detection using an improved version of the Integrated Gradients explainability method. The main contribution of this paper consists in the employment of an innovative explainable artificial intelligence algorithm to understand the crucial facial landmarks movements during emotional feeling, using this information also for improving the performances of deep learning-based emotion classifiers. By means of explainability, we can optimize the number and the position of the facial landmarks used as input features for facial emotion recognition, lowering the impact of noisy landmarks and thus increasing the accuracy of the developed models. In order to test the effectiveness of the proposed approach, we considered a set of deep binary models for emotion classification trained initially with a complete set of facial landmarks, which are progressively reduced based on a suitable optimization procedure. The obtained results prove the robustness of the proposed explainable approach in terms of understanding the relevance of the different facial points for the different emotions, also improving the classification accuracy and diminishing the computational cost.

Title: Dual High-Order Total Variation Model for Underwater Image Restoration

Authors: Yuemei Li, Guojia Hou, Peixian Zhuang, Zhenkuan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14868
Pdf URL: https://arxiv.org/pdf/2407.14868
Copy Paste: [[2407.14868]] Dual High-Order Total Variation Model for Underwater Image Restoration(https://arxiv.org/abs/2407.14868)
Keywords: robust
Abstract: Underwater images are typically characterized by color cast, haze, blurring, and uneven illumination due to the selective absorption and scattering when light propagates through the water, which limits their practical applications. Underwater image enhancement and restoration (UIER) is one crucial mode to improve the visual quality of underwater images. However, most existing UIER methods concentrate on enhancing contrast and dehazing, rarely pay attention to the local illumination differences within the image caused by illumination variations, thus introducing some undesirable artifacts and unnatural color. To address this issue, an effective variational framework is proposed based on an extended underwater image formation model (UIFM). Technically, dual high-order regularizations are successfully integrated into the variational model to acquire smoothed local ambient illuminance and structure-revealed reflectance in a unified manner. In our proposed framework, the weight factors-based color compensation is combined with the color balance to compensate for the attenuated color channels and remove the color cast. In particular, the local ambient illuminance with strong robustness is acquired by performing the local patch brightest pixel estimation and an improved gamma correction. Additionally, we design an iterative optimization algorithm relying on the alternating direction method of multipliers (ADMM) to accelerate the solution of the proposed variational model. Considerable experiments on three real-world underwater image datasets demonstrate that the proposed method outperforms several state-of-the-art methods with regard to visual quality and quantitative assessments. Moreover, the proposed method can also be extended to outdoor image dehazing, low-light image enhancement, and some high-level vision tasks. The code is available at this https URL.

Title: Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

Authors: Yanting Yang, Minghao Chen, Qibo Qiu, Jiahao Wu, Wenxiao Wang, Binbin Lin, Ziyu Guan, Xiaofei He
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2407.14872
Pdf URL: https://arxiv.org/pdf/2407.14872
Copy Paste: [[2407.14872]] Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts(https://arxiv.org/abs/2407.14872)
Keywords: robust
Abstract: For a general-purpose robot to operate in reality, executing a broad range of instructions across various environments is imperative. Central to the reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent advances in vision-language models, such as CLIP, have shown remarkable performance in the domain of deep learning, paving the way for open-domain visual recognition. However, collecting data on robots executing various language instructions across multiple environments remains a challenge. This paper aims to transfer video-language models with robust generalization into a generalizable language-conditioned reward function, only utilizing robot video data from a minimal amount of tasks in a singular environment. Unlike common robotic datasets used for training reward functions, human video-language datasets rarely contain trivial failure videos. To enhance the model's ability to distinguish between successful and failed robot executions, we cluster failure video features to enable the model to identify patterns within. For each cluster, we integrate a newly trained failure prompt into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.

Title: Seal: Advancing Speech Language Models to be Few-Shot Learners

Authors: Shuyu Lei, Lingen Liu, Jiaolong Yang, Yasen Jiao, Yuxiang Yang, Yushu Yang, Xiang Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.14875
Pdf URL: https://arxiv.org/pdf/2407.14875
Copy Paste: [[2407.14875]] Seal: Advancing Speech Language Models to be Few-Shot Learners(https://arxiv.org/abs/2407.14875)
Keywords: robust
Abstract: Existing auto-regressive language models have demonstrated a remarkable capability to perform a new task with just a few examples in prompt, without requiring any additional training. In order to extend this capability to a multi-modal setting (i.e. speech and language), this paper introduces the Seal model, an abbreviation for speech language model. It incorporates a novel alignment method, in which Kullback-Leibler divergence loss is performed to train a projector that bridges a frozen speech encoder with a frozen language model decoder. The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks. Additionally, consistency experiments are conducted to validate its robustness on different pre-trained language models.

Title: Thompson Sampling Itself is Differentially Private

Authors: Tingting Ou, Marco Avella Medina, Rachel Cummings
Subjects: cs.LG, cs.DS, stat.ML
Abstract URL: https://arxiv.org/abs/2407.14879
Pdf URL: https://arxiv.org/pdf/2407.14879
Copy Paste: [[2407.14879]] Thompson Sampling Itself is Differentially Private(https://arxiv.org/abs/2407.14879)
Keywords: privacy
Abstract: In this work we first show that the classical Thompson sampling algorithm for multi-arm bandits is differentially private as-is, without any modification. We provide per-round privacy guarantees as a function of problem parameters and show composition over $T$ rounds; since the algorithm is unchanged, existing $O(\sqrt{NT\log N})$ regret bounds still hold and there is no loss in performance due to privacy. We then show that simple modifications -- such as pre-pulling all arms a fixed number of times, increasing the sampling variance -- can provide tighter privacy guarantees. We again provide privacy guarantees that now depend on the new parameters introduced in the modification, which allows the analyst to tune the privacy guarantee as desired. We also provide a novel regret analysis for this new algorithm, and show how the new parameters also impact expected regret. Finally, we empirically validate and illustrate our theoretical findings in two parameter regimes and demonstrate that tuning the new parameters substantially improve the privacy-regret tradeoff.

Title: Reduced Effectiveness of Kolmogorov-Arnold Networks on Functions with Noise

Authors: Haoran Shen, Chen Zeng, Jiahui Wang, Qiao Wang
Subjects: cs.LG, cs.AI, math.NA
Abstract URL: https://arxiv.org/abs/2407.14882
Pdf URL: https://arxiv.org/pdf/2407.14882
Copy Paste: [[2407.14882]] Reduced Effectiveness of Kolmogorov-Arnold Networks on Functions with Noise(https://arxiv.org/abs/2407.14882)
Keywords: diffusion
Abstract: It has been observed that even a small amount of noise introduced into the dataset can significantly degrade the performance of KAN. In this brief note, we aim to quantitatively evaluate the performance when noise is added to the dataset. We propose an oversampling technique combined with denoising to alleviate the impact of noise. Specifically, we employ kernel filtering based on diffusion maps for pre-filtering the noisy data for training KAN network. Our experiments show that while adding i.i.d. noise with any fixed SNR, when we increase the amount of training data by a factor of $r$, the test-loss (RMSE) of KANs will exhibit a performance trend like $\text{test-loss} \sim \mathcal{O}(r^{-\frac{1}{2}})$ as $r\to +\infty$. We conclude that applying both oversampling and filtering strategies can reduce the detrimental effects of noise. Nevertheless, determining the optimal variance for the kernel filtering process is challenging, and enhancing the volume of training data substantially increases the associated costs, because the training dataset needs to be expanded multiple times in comparison to the initial clean data. As a result, the noise present in the data ultimately diminishes the effectiveness of Kolmogorov-Arnold networks.

Title: Latent Pollution Model: The Hidden Carbon Footprint in 3D Image Synthesis

Authors: Marvin Seyfarth, Salman Ul Hassan Dar, Sandy Engelhardt
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2407.14892
Pdf URL: https://arxiv.org/pdf/2407.14892
Copy Paste: [[2407.14892]] Latent Pollution Model: The Hidden Carbon Footprint in 3D Image Synthesis(https://arxiv.org/abs/2407.14892)
Keywords: diffusion, generative
Abstract: Contemporary developments in generative AI are rapidly transforming the field of medical AI. These developments have been predominantly driven by the availability of large datasets and high computing power, which have facilitated a significant increase in model capacity. Despite their considerable potential, these models demand substantially high power, leading to high carbon dioxide (CO2) emissions. Given the harm such models are causing to the environment, there has been little focus on the carbon footprints of such models. This study analyzes carbon emissions from 2D and 3D latent diffusion models (LDMs) during training and data generation phases, revealing a surprising finding: the synthesis of large images contributes most significantly to these emissions. We assess different scenarios including model sizes, image dimensions, distributed training, and data generation steps. Our findings reveal substantial carbon emissions from these models, with training 2D and 3D models comparable to driving a car for 10 km and 90 km, respectively. The process of data generation is even more significant, with CO2 emissions equivalent to driving 160 km for 2D models and driving for up to 3345 km for 3D synthesis. Additionally, we found that the location of the experiment can increase carbon emissions by up to 94 times, and even the time of year can influence emissions by up to 50%. These figures are alarming, considering they represent only a single training and data generation phase for each model. Our results emphasize the urgent need for developing environmentally sustainable strategies in generative AI.

Title: AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement

Authors: Yunlong Lin, Tian Ye, Sixiang Chen, Zhenqi Fu, Yingying Wang, Wenhao Chai, Zhaohu Xing, Lei Zhu, Xinghao Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14900
Pdf URL: https://arxiv.org/pdf/2407.14900
Copy Paste: [[2407.14900]] AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement(https://arxiv.org/abs/2407.14900)
Keywords: diffusion
Abstract: Existing low-light image enhancement (LIE) methods have achieved noteworthy success in solving synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) the collection of distorted/clean image pairs is often impractical and sometimes even unavailable, and 2) accurately modeling complex degradations presents a non-trivial problem. To overcome them, we propose the Attribute Guidance Diffusion framework (AGLLDiff), a training-free method for effective real-world LIE. Instead of specifically defining the degradation process, AGLLDiff shifts the paradigm and models the desired attributes, such as image exposure, structure and color of normal-light images. These attributes are readily available and impose no assumptions about the degradation process, which guides the diffusion sampling process to a reliable high-quality solution space. Extensive experiments demonstrate that our approach outperforms the current leading unsupervised LIE methods across benchmarks in terms of distortion-based and perceptual-based metrics, and it performs well even in sophisticated wild degradation.

Title: Self-supervised transformer-based pre-training method with General Plant Infection dataset

Authors: Zhengle Wang, Ruifeng Wang, Minjuan Wang, Tianyun Lai, Man Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14911
Pdf URL: https://arxiv.org/pdf/2407.14911
Copy Paste: [[2407.14911]] Self-supervised transformer-based pre-training method with General Plant Infection dataset(https://arxiv.org/abs/2407.14911)
Keywords: transformer
Abstract: Pest and disease classification is a challenging issue in agriculture. The performance of deep learning models is intricately linked to training data diversity and quantity, posing issues for plant pest and disease datasets that remain underdeveloped. This study addresses these challenges by constructing a comprehensive dataset and proposing an advanced network architecture that combines Contrastive Learning and Masked Image Modeling (MIM). The dataset comprises diverse plant species and pest categories, making it one of the largest and most varied in the field. The proposed network architecture demonstrates effectiveness in addressing plant pest and disease recognition tasks, achieving notable detection accuracy. This approach offers a viable solution for rapid, efficient, and cost-effective plant pest and disease detection, thereby reducing agricultural production costs. Our code and dataset will be publicly available to advance research in plant pest and disease recognition the GitHub repository at this https URL

Title: PolyR-CNN: R-CNN for end-to-end polygonal building outline extraction

Authors: Weiqin Jiao, Claudio Persello, George Vosselman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14912
Pdf URL: https://arxiv.org/pdf/2407.14912
Copy Paste: [[2407.14912]] PolyR-CNN: R-CNN for end-to-end polygonal building outline extraction(https://arxiv.org/abs/2407.14912)
Keywords: extraction
Abstract: Polygonal building outline extraction has been a research focus in recent years. Most existing methods have addressed this challenging task by decomposing it into several subtasks and employing carefully designed architectures. Despite their accuracy, such pipelines often introduce inefficiencies during training and inference. This paper presents an end-to-end framework, denoted as PolyR-CNN, which offers an efficient and fully integrated approach to predict vectorized building polygons and bounding boxes directly from remotely sensed images. Notably, PolyR-CNN leverages solely the features of the Region of Interest (RoI) for the prediction, thereby mitigating the necessity for complex designs. Furthermore, we propose a novel scheme with PolyR-CNN to extract detailed outline information from polygon vertex coordinates, termed vertex proposal feature, to guide the RoI features to predict more regular buildings. PolyR-CNN demonstrates the capacity to deal with buildings with holes through a simple post-processing method on the Inria dataset. Comprehensive experiments conducted on the CrowdAI dataset show that PolyR-CNN achieves competitive accuracy compared to state-of-the-art methods while significantly improving computational efficiency, i.e., achieving 79.2 Average Precision (AP), exhibiting a 15.9 AP gain and operating 2.5 times faster and four times lighter than the well-established end-to-end method PolyWorld. Replacing the backbone with a simple ResNet-50, PolyR-CNN maintains a 71.1 AP while running four times faster than PolyWorld.

Title: RoIPoly: Vectorized Building Outline Extraction Using Vertex and Logit Embeddings

Authors: Weiqin Jiao, Hao Cheng, Claudio Persello, George Vosselman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14920
Pdf URL: https://arxiv.org/pdf/2407.14920
Copy Paste: [[2407.14920]] RoIPoly: Vectorized Building Outline Extraction Using Vertex and Logit Embeddings(https://arxiv.org/abs/2407.14920)
Keywords: extraction
Abstract: Polygonal building outlines are crucial for geographic and cartographic applications. The existing approaches for outline extraction from aerial or satellite imagery are typically decomposed into subtasks, e.g., building masking and vectorization, or treat this task as a sequence-to-sequence prediction of ordered vertices. The former lacks efficiency, and the latter often generates redundant vertices, both resulting in suboptimal performance. To handle these issues, we propose a novel Region-of-Interest (RoI) query-based approach called RoIPoly. Specifically, we formulate each vertex as a query and constrain the query attention on the most relevant regions of a potential building, yielding reduced computational overhead and more efficient vertex level interaction. Moreover, we introduce a novel learnable logit embedding to facilitate vertex classification on the attention map; thus, no post-processing is needed for redundant vertex removal. We evaluated our method on the vectorized building outline extraction dataset CrowdAI and the 2D floorplan reconstruction dataset Structured3D. On the CrowdAI dataset, RoIPoly with a ResNet50 backbone outperforms existing methods with the same or better backbones on most MS-COCO metrics, especially on small buildings, and achieves competitive results in polygon quality and vertex redundancy without any post-processing. On the Structured3D dataset, our method achieves the second-best performance on most metrics among existing methods dedicated to 2D floorplan reconstruction, demonstrating our cross-domain generalization capability. The code will be released upon acceptance of this paper.

Title: RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

Authors: Xiaomeng Chu, Jiajun Deng, Guoliang You, Yifan Duan, Yao Li, Yanyong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14923
Pdf URL: https://arxiv.org/pdf/2407.14923
Copy Paste: [[2407.14923]] RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies(https://arxiv.org/abs/2407.14923)
Keywords: extraction
Abstract: The recent advances in query-based multi-camera 3D object detection are featured by initializing object queries in the 3D space, and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To this end, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map to sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, facilitating the projection of different queries onto different areas in the image to extract distinct features. Besides, we leverage the instance information of images to supplement the uniformly initialized object queries by further involving additional queries along the ray from 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird's eye view. Extensive experiments are conducted on the nuScenes dataset to validate our proposed ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.3% NDS, respectively. Our codes will be made available.

Title: POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation

Authors: Alexey Skrynnik, Anton Andreychuk, Anatolii Borzilov, Alexander Chernyavskiy, Konstantin Yakovlev, Aleksandr Panov
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2407.14931
Pdf URL: https://arxiv.org/pdf/2407.14931
Copy Paste: [[2407.14931]] POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation(https://arxiv.org/abs/2407.14931)
Keywords: fair
Abstract: Multi-agent reinforcement learning (MARL) has recently excelled in solving challenging cooperative and competitive multi-agent problems in various environments with, mostly, few agents and full observability. Moreover, a range of crucial robotics-related tasks, such as multi-robot navigation and obstacle avoidance, that have been conventionally approached with the classical non-learnable methods (e.g., heuristic search) is currently suggested to be solved by the learning-based or hybrid methods. Still, in this domain, it is hard, not to say impossible, to conduct a fair comparison between classical, learning-based, and hybrid approaches due to the lack of a unified framework that supports both learning and evaluation. To this end, we introduce POGEMA, a set of comprehensive tools that includes a fast environment for learning, a generator of problem instances, the collection of pre-defined ones, a visualization toolkit, and a benchmarking tool that allows automated evaluation. We introduce and specify an evaluation protocol defining a range of domain-related metrics computed on the basics of the primary evaluation indicators (such as success rate and path length), allowing a fair multi-fold comparison. The results of such a comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods, are presented.

Title: Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)

Authors: Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, NhatHai Phan
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2407.14937
Pdf URL: https://arxiv.org/pdf/2407.14937
Copy Paste: [[2407.14937]] Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)(https://arxiv.org/abs/2407.14937)
Keywords: secure, security, defense, attack, robust, large language model
Abstract: Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique for identifying vulnerabilities in real-world LLM implementations. This paper presents a detailed threat model and provides a systematization of knowledge (SoK) of red-teaming attacks on LLMs. We develop a taxonomy of attacks based on the stages of the LLM development and deployment process and extract various insights from previous research. In addition, we compile methods for defense and practical red-teaming strategies for practitioners. By delineating prominent attack motifs and shedding light on various entry points, this paper provides a framework for improving the security and robustness of LLM-based systems.

Title: From Ad Identifiers to Global Privacy Control: The Status Quo and Future of Opting Out of Ad Tracking on Android

Authors: Sebastian Zimmeck, Nishant Aggarwal, Zachary Liu, Konrad Kollnig
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2407.14938
Pdf URL: https://arxiv.org/pdf/2407.14938
Copy Paste: [[2407.14938]] From Ad Identifiers to Global Privacy Control: The Status Quo and Future of Opting Out of Ad Tracking on Android(https://arxiv.org/abs/2407.14938)
Keywords: privacy
Abstract: Apps and their integrated third party libraries often collect a variety of data from people to show them personalized ads. This practice is often privacy-invasive. Since 2013, Google has therefore allowed users to limit ad tracking on Android via system settings. Further, under the 2018 California Consumer Privacy Act (CCPA), apps must honor opt-outs from ad tracking under the Global Privacy Control (GPC). The efficacy of these two methods to limit ad tracking has not been studied in prior work. Our legal and technical analysis details how the GPC applies to mobile apps and how it could be integrated directly into Android, thereby developing a reference design for GPC on Android. Our empirical analysis of 1,896 top-ranked Android apps shows that both the Android system-level opt-out and the GPC signal rarely restrict ad tracking. In our view, deleting the AdID and opting out under the CCPA has the same meaning. Thus, the current AdID setting and APIs should be evolved towards GPC and integrated into Android's Privacy Sandbox.

Title: Automatic Generation of Fashion Images using Prompting in Generative Machine Learning Models

Authors: Georgia Argyrou, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14944
Pdf URL: https://arxiv.org/pdf/2407.14944
Copy Paste: [[2407.14944]] Automatic Generation of Fashion Images using Prompting in Generative Machine Learning Models(https://arxiv.org/abs/2407.14944)
Keywords: diffusion, generative, large language model
Abstract: The advent of artificial intelligence has contributed in a groundbreaking transformation of the fashion industry, redefining creativity and innovation in unprecedented ways. This work investigates methodologies for generating tailored fashion descriptions using two distinct Large Language Models and a Stable Diffusion model for fashion image creation. Emphasizing adaptability in AI-driven fashion creativity, we depart from traditional approaches and focus on prompting techniques, such as zero-shot and few-shot learning, as well as Chain-of-Thought (CoT), which results in a variety of colors and textures, enhancing the diversity of the outputs. Central to our methodology is Retrieval-Augmented Generation (RAG), enriching models with insights from fashion sources to ensure contemporary representations. Evaluation combines quantitative metrics such as CLIPscore with qualitative human judgment, highlighting strengths in creativity, coherence, and aesthetic appeal across diverse styles. Among the participants, RAG and few-shot learning techniques are preferred for their ability to produce more relevant and appealing fashion descriptions. Our code is provided at this https URL.

Title: Addressing Data Heterogeneity in Federated Learning of Cox Proportional Hazards Models

Authors: Navid Seidi, Satyaki Roy, Sajal K. Das, Ardhendu Tripathy
Subjects: cs.LG, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2407.14960
Pdf URL: https://arxiv.org/pdf/2407.14960
Copy Paste: [[2407.14960]] Addressing Data Heterogeneity in Federated Learning of Cox Proportional Hazards Models(https://arxiv.org/abs/2407.14960)
Keywords: privacy, federate
Abstract: The diversity in disease profiles and therapeutic approaches between hospitals and health professionals underscores the need for patient-centric personalized strategies in healthcare. Alongside this, similarities in disease progression across patients can be utilized to improve prediction models in survival analysis. The need for patient privacy and the utility of prediction models can be simultaneously addressed in the framework of Federated Learning (FL). This paper outlines an approach in the domain of federated survival analysis, specifically the Cox Proportional Hazards (CoxPH) model, with a specific focus on mitigating data heterogeneity and elevating model performance. We present an FL approach that employs feature-based clustering to enhance model accuracy across synthetic datasets and real-world applications, including the Surveillance, Epidemiology, and End Results (SEER) database. Furthermore, we consider an event-based reporting strategy that provides a dynamic approach to model adaptation by responding to local data changes. Our experiments show the efficacy of our approach and discuss future directions for a practical application of FL in healthcare.

Title: Recent Advances in Generative AI and Large Language Models: Current Status, Challenges, and Perspectives

Authors: Desta Haileselassie Hagos, Rick Battle, Danda B. Rawat
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14962
Pdf URL: https://arxiv.org/pdf/2407.14962
Copy Paste: [[2407.14962]] Recent Advances in Generative AI and Large Language Models: Current Status, Challenges, and Perspectives(https://arxiv.org/abs/2407.14962)
Keywords: generative, large language model
Abstract: The emergence of Generative Artificial Intelligence (AI) and Large Language Models (LLMs) has marked a new era of Natural Language Processing (NLP), introducing unprecedented capabilities that are revolutionizing various domains. This paper explores the current state of these cutting-edge technologies, demonstrating their remarkable advancements and wide-ranging applications. Our paper contributes to providing a holistic perspective on the technical foundations, practical applications, and emerging challenges within the evolving landscape of Generative AI and LLMs. We believe that understanding the generative capabilities of AI systems and the specific context of LLMs is crucial for researchers, practitioners, and policymakers to collaboratively shape the responsible and ethical integration of these technologies into various domains. Furthermore, we identify and address main research gaps, providing valuable insights to guide future research endeavors within the AI research community.

Title: Base and Exponent Prediction in Mathematical Expressions using Multi-Output CNN

Authors: Md Laraib Salam, Akash S Balsaraf, Gaurav Gupta
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14967
Pdf URL: https://arxiv.org/pdf/2407.14967
Copy Paste: [[2407.14967]] Base and Exponent Prediction in Mathematical Expressions using Multi-Output CNN(https://arxiv.org/abs/2407.14967)
Keywords: robust
Abstract: The use of neural networks and deep learning techniques in image processing has significantly advanced the field, enabling highly accurate recognition results. However, achieving high recognition rates often necessitates complex network models, which can be challenging to train and require substantial computational resources. This research presents a simplified yet effective approach to predicting both the base and exponent from images of mathematical expressions using a multi-output Convolutional Neural Network (CNN). The model is trained on 10,900 synthetically generated images containing exponent expressions, incorporating random noise, font size variations, and blur intensity to simulate real-world conditions. The proposed CNN model demonstrates robust performance with efficient training time. The experimental results indicate that the model achieves high accuracy in predicting the base and exponent values, proving the efficacy of this approach in handling noisy and varied input images.

Title: Technical report: Improving the properties of molecules generated by LIMO

Authors: Vineet Thumuluri, Peter Eckmann, Michael K. Gilson, Rose Yu
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2407.14968
Pdf URL: https://arxiv.org/pdf/2407.14968
Copy Paste: [[2407.14968]] Technical report: Improving the properties of molecules generated by LIMO(https://arxiv.org/abs/2407.14968)
Keywords: transformer
Abstract: This technical report investigates variants of the Latent Inceptionism on Molecules (LIMO) framework to improve the properties of generated molecules. We conduct ablative studies of molecular representation, decoder model, and surrogate model training scheme. The experiments suggest that an autogressive Transformer decoder with GroupSELFIES achieves the best average properties for the random generation task.

Title: Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Authors: Md Zarif Hossain, Ahmed Imteaj
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14971
Pdf URL: https://arxiv.org/pdf/2407.14971
Copy Paste: [[2407.14971]] Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models(https://arxiv.org/abs/2407.14971)
Keywords: secure, attack, robust
Abstract: Vision-language models (VLMs) have achieved significant strides in recent times specially in multimodal tasks, yet they remain susceptible to adversarial attacks on their vision components. To address this, we propose Sim-CLIP, an unsupervised adversarial fine-tuning method that enhances the robustness of the widely-used CLIP vision encoder against such attacks while maintaining semantic richness and specificity. By employing a Siamese architecture with cosine similarity loss, Sim-CLIP learns semantically meaningful and attack-resilient visual representations without requiring large batch sizes or momentum encoders. Our results demonstrate that VLMs enhanced with Sim-CLIP's fine-tuned CLIP encoder exhibit significantly enhanced robustness against adversarial attacks, while preserving semantic meaning of the perturbed images. Notably, Sim-CLIP does not require additional training or fine-tuning of the VLM itself; replacing the original vision encoder with our fine-tuned Sim-CLIP suffices to provide robustness. This work underscores the significance of reinforcing foundational models like CLIP to safeguard the reliability of downstream VLM applications, paving the way for more secure and effective multimodal systems.

Title: ARoFace: Alignment Robustness to Improve Low-Quality Face Recognition

Authors: Mohammad Saeed Ebrahimi Saadabadi, Sahar Rahimi Malakshan, Ali Dabouei, Nasser M. Nasrabadi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14972
Pdf URL: https://arxiv.org/pdf/2407.14972
Copy Paste: [[2407.14972]] ARoFace: Alignment Robustness to Improve Low-Quality Face Recognition(https://arxiv.org/abs/2407.14972)
Keywords: robust
Abstract: Aiming to enhance Face Recognition (FR) on Low-Quality (LQ) inputs, recent studies suggest incorporating synthetic LQ samples into training. Although promising, the quality factors that are considered in these works are general rather than FR-specific, \eg, atmospheric turbulence, resolution, \etc. Motivated by the observation of the vulnerability of current FR models to even small Face Alignment Errors (FAE) in LQ images, we present a simple yet effective method that considers FAE as another quality factor that is tailored to FR. We seek to improve LQ FR by enhancing FR models' robustness to FAE. To this aim, we formalize the problem as a combination of differentiable spatial transformations and adversarial data augmentation in FR. We perturb the alignment of the training samples using a controllable spatial transformation and enrich the training with samples expressing FAE. We demonstrate the benefits of the proposed method by conducting evaluations on IJB-B, IJB-C, IJB-S (+4.3\% Rank1), and TinyFace (+2.63\%). \href{this https URL}{this https URL}

Title: Out of spuriousity: Improving robustness to spurious correlations without group annotations

Authors: Phuong Quynh Le, Jörg Schlötterer, Christin Seifert
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14974
Pdf URL: https://arxiv.org/pdf/2407.14974
Copy Paste: [[2407.14974]] Out of spuriousity: Improving robustness to spurious correlations without group annotations(https://arxiv.org/abs/2407.14974)
Keywords: robust
Abstract: Machine learning models are known to learn spurious correlations, i.e., features having strong relations with class labels but no causal relation. Relying on those correlations leads to poor performance in the data groups without these correlations and poor generalization ability. To improve the robustness of machine learning models to spurious correlations, we propose an approach to extract a subnetwork from a fully trained network that does not rely on spurious correlations. The subnetwork is found by the assumption that data points with the same spurious attribute will be close to each other in the representation space when training with ERM, then we employ supervised contrastive loss in a novel way to force models to unlearn the spurious connections. The increase in the worst-group performance of our approach contributes to strengthening the hypothesis that there exists a subnetwork in a fully trained dense network that is responsible for using only invariant features in classification tasks, therefore erasing the influence of spurious features even in the setup of multi spurious attributes and no prior knowledge of attributes labels.

Title: RGB2Point: 3D Point Cloud Generation from Single RGB Images

Authors: Jae Joong Lee, Bedrich Benes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.14979
Pdf URL: https://arxiv.org/pdf/2407.14979
Copy Paste: [[2407.14979]] RGB2Point: 3D Point Cloud Generation from Single RGB Images(https://arxiv.org/abs/2407.14979)
Keywords: diffusion, transformer
Abstract: We introduce RGB2Point, an unposed single-view RGB image to a 3D point cloud generation based on Transformer. RGB2Point takes an input image of an object and generates a dense 3D point cloud. Contrary to prior works based on CNN layers and diffusion denoising approaches, we use pre-trained Transformer layers that are fast and generate high-quality point clouds with consistent quality over available categories. Our generated point clouds demonstrate high quality on a real-world dataset, as evidenced by improved Chamfer distance (51.15%) and Earth Mover's distance (45.96%) metrics compared to the current state-of-the-art. Additionally, our approach shows a better quality on a synthetic dataset, achieving better Chamfer distance (39.26%), Earth Mover's distance (26.95%), and F-score (47.16%). Moreover, our method produces 63.1% more consistent high-quality results across various object categories compared to prior works. Furthermore, RGB2Point is computationally efficient, requiring only 2.3GB of VRAM to reconstruct a 3D point cloud from a single RGB image, and our implementation generates the results 15,133x faster than a SOTA diffusion-based model.

Title: GreenStableYolo: Optimizing Inference Time and Image Quality of Text-to-Image Generation

Authors: Jingzhi Gong, Sisi Li, Giordano d'Aloisio, Zishuo Ding, Yulong Ye, William B. Langdon, Federica Sarro
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.14982
Pdf URL: https://arxiv.org/pdf/2407.14982
Copy Paste: [[2407.14982]] GreenStableYolo: Optimizing Inference Time and Image Quality of Text-to-Image Generation(https://arxiv.org/abs/2407.14982)
Keywords: diffusion
Abstract: Tuning the parameters and prompts for improving AI-based text-to-image generation has remained a substantial yet unaddressed challenge. Hence we introduce GreenStableYolo, which improves the parameters and prompts for Stable Diffusion to both reduce GPU inference time and increase image generation quality using NSGA-II and Yolo. Our experiments show that despite a relatively slight trade-off (18%) in image quality compared to StableYolo (which only considers image quality), GreenStableYolo achieves a substantial reduction in inference time (266% less) and a 526% higher hypervolume, thereby advancing the state-of-the-art for text-to-image generation.

Title: Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Authors: Antonis Antoniades, Xinyi Wang, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.14985
Pdf URL: https://arxiv.org/pdf/2407.14985
Copy Paste: [[2407.14985]] Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data(https://arxiv.org/abs/2407.14985)
Keywords: large language model
Abstract: Despite the proven utility of large language models (LLMs) in real-world applications, there remains a lack of understanding regarding how they leverage their large-scale pretraining text corpora to achieve such capabilities. In this work, we investigate the interplay between generalization and memorization in pretrained LLMs at scale, through a comprehensive $n$-gram analysis of their training data. Our experiments focus on three general task types: translation, question-answering, and multiple-choice reasoning. With various sizes of open-source LLMs and their pretraining corpora, we observe that as the model size increases, the task-relevant $n$-gram pair data becomes increasingly important, leading to improved task performance, decreased memorization, stronger generalization, and emergent abilities. Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data, and point the way to larger-scale analyses that could further improve our understanding of these models.

Title: All Against Some: Efficient Integration of Large Language Models for Message Passing in Graph Neural Networks

Authors: Ajay Jaiswal, Nurendra Choudhary, Ravinarayana Adkathimar, Muthu P. Alagappan, Gaurush Hiranandani, Ying Ding, Zhangyang Wang, Edward W Huang, Karthik Subbian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2407.14996
Pdf URL: https://arxiv.org/pdf/2407.14996
Copy Paste: [[2407.14996]] All Against Some: Efficient Integration of Large Language Models for Message Passing in Graph Neural Networks(https://arxiv.org/abs/2407.14996)
Keywords: large language model
Abstract: Graph Neural Networks (GNNs) have attracted immense attention in the past decade due to their numerous real-world applications built around graph-structured data. On the other hand, Large Language Models (LLMs) with extensive pretrained knowledge and powerful semantic comprehension abilities have recently shown a remarkable ability to benefit applications using vision and text data. In this paper, we investigate how LLMs can be leveraged in a computationally efficient fashion to benefit rich graph-structured data, a modality relatively unexplored in LLM literature. Prior works in this area exploit LLMs to augment every node features in an ad-hoc fashion (not scalable for large graphs), use natural language to describe the complex structural information of graphs, or perform computationally expensive finetuning of LLMs in conjunction with GNNs. We propose E-LLaGNN (Efficient LLMs augmented GNNs), a framework with an on-demand LLM service that enriches message passing procedure of graph learning by enhancing a limited fraction of nodes from the graph. More specifically, E-LLaGNN relies on sampling high-quality neighborhoods using LLMs, followed by on-demand neighborhood feature enhancement using diverse prompts from our prompt catalog, and finally information aggregation using message passing from conventional GNN architectures. We explore several heuristics-based active node selection strategies to limit the computational and memory footprint of LLMs when handling millions of nodes. Through extensive experiments & ablation on popular graph benchmarks of varying scales (Cora, PubMed, ArXiv, & Products), we illustrate the effectiveness of our E-LLaGNN framework and reveal many interesting capabilities such as improved gradient flow in deep GNNs, LLM-free inference ability etc.

Title: Requiem for a drone: a machine-learning based framework for stealthy attacks against unmanned autonomous vehicles

Authors: Kyo Hyun Kim, Denizhan Kara, Vineetha Paruchuri, Sibin Mohan, Greg Kimberly, Jae Kim, Josh Eckhardt
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2407.15003
Pdf URL: https://arxiv.org/pdf/2407.15003
Copy Paste: [[2407.15003]] Requiem for a drone: a machine-learning based framework for stealthy attacks against unmanned autonomous vehicles(https://arxiv.org/abs/2407.15003)
Keywords: attack, steal
Abstract: There is a space of uncertainty in the modeling of vehicular dynamics of autonomous systems due to noise in sensor readings, environmental factors or modeling errors. We present Requiem, a software-only, blackbox approach that exploits this space in a stealthy manner causing target systems, e.g., unmanned aerial vehicles (UAVs), to significantly deviate from their mission parameters. Our system achieves this by modifying sensor values, all while avoiding detection by onboard anomaly detectors (hence, "stealthy"). The Requiem framework uses a combination of multiple deep learning models (that we refer to as "surrogates" and "spoofers") coupled with extensive, realistic simulations on a software-in-the-loop quadrotor UAV system. Requiem makes no assumptions about either the (types of) sensors or the onboard state estimation algorithm(s) -- it works so long as the latter is "learnable". We demonstrate the effectiveness of our system using various attacks across multiple missions as well as multiple sets of statistical analyses. We show that Requiem successfully exploits the modeling errors (i.e., causes significant deviations from planned mission parameters) while remaining stealthy (no detection even after {tens of meters of deviations}) and are generalizable (Requiem has potential to work across different attacks and sensor types).

Title: Knowledge Mechanisms in Large Language Models: A Survey and Perspective

Authors: Mengru Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, Peng Wang, Xiang Chen, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15017
Pdf URL: https://arxiv.org/pdf/2407.15017
Copy Paste: [[2407.15017]] Knowledge Mechanisms in Large Language Models: A Survey and Perspective(https://arxiv.org/abs/2407.15017)
Keywords: large language model
Abstract: Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.

Title: Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions

Authors: Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hannaneh Hajishirzi, Ashish Sabharwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15018
Pdf URL: https://arxiv.org/pdf/2407.15018
Copy Paste: [[2407.15018]] Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions(https://arxiv.org/abs/2407.15018)
Keywords: transformer
Abstract: Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have quite a range of performance, particularly when the task format is diversified slightly (such as by shuffling answer choice order). In this work we ask: how do successful models perform formatted MCQA? We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information for predicting the correct answer. We find that prediction of a specific answer symbol is causally attributed to a single middle layer, and specifically its multi-head self-attention mechanism. We show that subsequent layers increase the probability of the predicted answer symbol in vocabulary space, and that this probability increase is associated with a sparse set of attention heads with unique roles. We additionally uncover differences in how different models adjust to alternative symbols. Finally, we demonstrate that a synthetic task can disentangle sources of model error to pinpoint when a model has learned formatted MCQA, and show that an inability to separate answer symbol tokens in vocabulary space is a property of models unable to perform formatted MCQA tasks.

Title: Enhancing Incremental Summarization with Structured Representations

Authors: EunJeong Hwang, Yichao Zhou, James Bradley Wendt, Beliz Gunel, Nguyen Vo, Jing Xie, Sandeep Tata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15021
Pdf URL: https://arxiv.org/pdf/2407.15021
Copy Paste: [[2407.15021]] Enhancing Incremental Summarization with Structured Representations(https://arxiv.org/abs/2407.15021)
Keywords: large language model
Abstract: Large language models (LLMs) often struggle with processing extensive input contexts, which can lead to redundant, inaccurate, or incoherent summaries. Recent methods have used unstructured memory to incrementally process these contexts, but they still suffer from information overload due to the volume of unstructured data handled. In our study, we introduce structured knowledge representations ($GU_{json}$), which significantly improve summarization performance by 40% and 14% across two public datasets. Most notably, we propose the Chain-of-Key strategy ($CoK_{json}$) that dynamically updates or augments these representations with new information, rather than recreating the structured memory for each new source. This method further enhances performance by 7% and 4% on the datasets.

Title: ViT LoS V2X: Vision Transformers for Environment-aware LoS Blockage Prediction for 6G Vehicular Networks

Authors: Ghazi Gharsallah, Georges Kaddoum
Subjects: cs.CV, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2407.15023
Pdf URL: https://arxiv.org/pdf/2407.15023
Copy Paste: [[2407.15023]] ViT LoS V2X: Vision Transformers for Environment-aware LoS Blockage Prediction for 6G Vehicular Networks(https://arxiv.org/abs/2407.15023)
Keywords: transformer
Abstract: As wireless communication technology progresses towards the sixth generation (6G), high-frequency millimeter-wave (mmWave) communication has emerged as a promising candidate for enabling vehicular networks. It offers high data rates and low-latency communication. However, obstacles such as buildings, trees, and other vehicles can cause signal attenuation and blockage, leading to communication failures that can result in fatal accidents or traffic congestion. Predicting blockages is crucial for ensuring reliable and efficient communications. Furthermore, the advent of 6G technology is anticipated to integrate advanced sensing capabilities, utilizing a variety of sensor types. These sensors, ranging from traditional RF sensors to cameras and Lidar sensors, are expected to provide access to rich multimodal data, thereby enriching communication systems with a wealth of additional contextual information. Leveraging this multimodal data becomes essential for making precise network management decisions, including the crucial task of blockage detection. In this paper, we propose a Deep Learning (DL)-based approach that combines Convolutional Neural Networks (CNNs) and customized Vision Transformers (ViTs) to effectively extract essential information from multimodal data and predict blockages in vehicular networks. Our method capitalizes on the synergistic strengths of CNNs and ViTs to extract features from time-series multimodal data, which include images and beam vectors. To capture temporal dependencies between the extracted features and the blockage state at future time steps, we employ a Gated Recurrent Unit (GRU)-based architecture. Our results show that the proposed approach achieves high accuracy and outperforms state-of-the-art solutions, achieving more than $95\%$ accurate predictions.

Title: MedSAGa: Few-shot Memory Efficient Medical Image Segmentation using Gradient Low-Rank Projection in SAM

Authors: Navyansh Mahla, Annie D'souza, Shubh Gupta, Bhavik Kanekar, Kshitij Sharad Jadhav
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15042
Pdf URL: https://arxiv.org/pdf/2407.15042
Copy Paste: [[2407.15042]] MedSAGa: Few-shot Memory Efficient Medical Image Segmentation using Gradient Low-Rank Projection in SAM(https://arxiv.org/abs/2407.15042)
Keywords: segmentation
Abstract: The application of large-scale models in medical image segmentation demands substantial quantities of meticulously annotated data curated by experts along with high computational resources, both of which are challenges in resource-poor settings. In this study, we present the Medical Segment Anything Model with Galore MedSAGa where we adopt the Segment Anything Model (SAM) to achieve memory-efficient, few-shot medical image segmentation by applying Gradient Low-Rank Projection GaLore to the parameters of the image encoder of SAM. Meanwhile, the weights of the prompt encoder and mask decoder undergo full parameter fine-tuning using standard optimizers. We further assess MedSAGa's few-shot learning capabilities, reporting on its memory efficiency and segmentation performance across multiple standard medical image segmentation datasets. We compare it with several baseline models, including LoRA fine-tuned SAM (SAMed) and DAE-Former. Experiments across multiple datasets and these baseline models with different number of images for fine tuning demonstrated that the GPU memory consumption of MedSAGa is significantly less than that of the baseline models, achieving an average memory efficiency of 66% more than current state-of-the-art (SOTA) models for medical image segmentation. The combination of substantially lower memory requirements and comparable to SOTA results in few-shot learning for medical image segmentation positions MedSAGa as an optimal solution for deployment in resource-constrained settings.

Title: Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts

Authors: Yi Liu, Chengjun Cai, Xiaoli Zhang, Xingliang Yuan, Cong Wang
Subjects: cs.LG, cs.AI, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2407.15050
Pdf URL: https://arxiv.org/pdf/2407.15050
Copy Paste: [[2407.15050]] Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts(https://arxiv.org/abs/2407.15050)
Keywords: security, attack, large language model
Abstract: Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs). Despite offering new possibilities for LLM applications, these advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content. While LLMs have undergone extensive security evaluations with the aid of red teaming frameworks, VLMs currently lack a well-developed one. To fill this gap, we introduce Arondight, a standardized red team framework tailored specifically for VLMs. Arondight is dedicated to resolving issues related to the absence of visual modality and inadequate diversity encountered when transitioning existing red teaming methodologies from LLMs to VLMs. Our framework features an automated multi-modal jailbreak attack, wherein visual jailbreak prompts are produced by a red team VLM, and textual prompts are generated by a red team LLM guided by a reinforcement learning agent. To enhance the comprehensiveness of VLM security evaluation, we integrate entropy bonuses and novelty reward metrics. These elements incentivize the RL agent to guide the red team LLM in creating a wider array of diverse and previously unseen test cases. Our evaluation of ten cutting-edge VLMs exposes significant security vulnerabilities, particularly in generating toxic images and aligning multi-modal prompts. In particular, our Arondight achieves an average attack success rate of 84.5\% on GPT-4 in all fourteen prohibited scenarios defined by OpenAI in terms of generating toxic text. For a clearer comparison, we also categorize existing VLMs based on their safety levels and provide corresponding reinforcement recommendations. Our multimodal prompt dataset and red team code will be released after ethics committee approval. CONTENT WARNING: THIS PAPER CONTAINS HARMFUL MODEL RESPONSES.

Title: Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval

Authors: Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiaoyong Wei, Chang Wen Chen, Qing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15051
Pdf URL: https://arxiv.org/pdf/2407.15051
Copy Paste: [[2407.15051]] Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval(https://arxiv.org/abs/2407.15051)
Keywords: large language model
Abstract: In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinders their direct application to continuous outputs like salience scores and inter-frame embeddings that capture inter-frame relations. To overcome these limitations, we propose utilizing LLM encoders instead of decoders. Through a feasibility study, we demonstrate that LLM encoders effectively refine inter-concept relations in multimodal embeddings, even without being trained on textual embeddings. We also show that the refinement capability of LLM encoders can be transferred to other embeddings, such as BLIP and T5, as long as these embeddings exhibit similar inter-concept similarity patterns to CLIP embeddings. We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module. Through experimental validation, we demonstrate the effectiveness of our proposed methods by achieving state-of-the-art performance in VMR. The source code can be accessed at this https URL.

Title: AGORA: Open More and Trust Less in Binary Verification Service

Authors: Hongbo Chen, Quan Zhou, Sen Yang, Xing Han, Fan Zhang, Danfeng Zhang, Xiaofeng Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2407.15062
Pdf URL: https://arxiv.org/pdf/2407.15062
Copy Paste: [[2407.15062]] AGORA: Open More and Trust Less in Binary Verification Service(https://arxiv.org/abs/2407.15062)
Keywords: secure, security
Abstract: Binary verification plays a pivotal role in software security, yet building a verification service that is both open and trustworthy poses a formidable challenge. In this paper, we introduce a novel binary verification service, AGORA, scrupulously designed to overcome the challenge. At the heart of this approach lies a strategic insight: certain tasks can be delegated to untrusted entities, while the corresponding validators are securely housed within the trusted computing base (TCB). AGORA can validate untrusted assertions generated for versatile policies. Through a novel blockchain-based bounty task manager, it also utilizes crowdsourcing to remove trust in theorem provers. These synergistic techniques successfully ameliorate the TCB size burden associated with two procedures: binary analysis and theorem proving. The design of AGORA allows untrusted parties to participate in these complex processes. Moreover, based on running the optimized TCB within trusted execution environments and recording the verification process on a blockchain, the public can audit the correctness of verification results. By implementing verification workflows for software-based fault isolation policy and side-channel mitigation, our evaluation demonstrates the efficacy of AGORA.

Title: Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization

Authors: Jiajun Hu, Jian Zhang, Lei Qi, Yinghuan Shi, Yang Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15085
Pdf URL: https://arxiv.org/pdf/2407.15085
Copy Paste: [[2407.15085]] Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization(https://arxiv.org/abs/2407.15085)
Keywords: transformer
Abstract: Domain generalization (DG) aims to avoid the performance degradation of the model when the distribution shift between the limited training data and unseen test data occurs. Recently, foundation models with enormous parameters have been pre-trained with huge datasets, demonstrating strong generalization ability and showing promising direction for solving the DG problem. However, fully Fine-Tuning (FT) the foundation models results in unsatisfactory out-of-distribution accuracy due to the destroyed pre-trained generalized features. Recently, Parameter-Efficient Fine-Tuning (PEFT) alleviates the above problem by fine-tuning a small portion of the model parameters while keeping the rest frozen, which achieves better generalization performance compared to FT. Nevertheless, PEFT still suffers from the issue of overfitting to the training domains. To address the above issue, we propose Parameter-Efficient Group with Orthogonal regularization (PEGO) for vision transformers, which effectively preserves the generalization ability of the pre-trained network and learns more diverse knowledge compared with conventional PEFT. Specifically, we inject a group of trainable Low-Rank Adaptation (LoRA) modules into the pre-trained model and propose an orthogonal regularization loss to enhance the generalization ability of the model. Our framework achieves SOTA performance on five DG benchmarks, while only requiring training a small number of parameters without adding additional testing cost.

Title: Navigation Instruction Generation with BEV Perception and Large Language Models

Authors: Sheng Fan, Rui Liu, Wenguan Wang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15087
Pdf URL: https://arxiv.org/pdf/2407.15087
Copy Paste: [[2407.15087]] Navigation Instruction Generation with BEV Perception and Large Language Models(https://arxiv.org/abs/2407.15087)
Keywords: large language model
Abstract: Navigation instruction generation, which requires embodied agents to describe the navigation routes, has been of great interest in robotics and human-computer interaction. Existing studies directly map the sequence of 2D perspective observations to route descriptions. Though straightforward, they overlook the geometric information and object semantics of the 3D environment. To address these challenges, we propose BEVInstructor, which incorporates Bird's Eye View (BEV) features into Multi-Modal Large Language Models (MLLMs) for instruction generation. Specifically, BEVInstructor constructs a PerspectiveBEVVisual Encoder for the comprehension of 3D environments through fusing BEV and perspective features. To leverage the powerful language capabilities of MLLMs, the fused representations are used as visual prompts for MLLMs, and perspective-BEV prompt tuning is proposed for parameter-efficient updating. Based on the perspective-BEV prompts, BEVInstructor further adopts an instance-guided iterative refinement pipeline, which improves the instructions in a progressive manner. BEVInstructor achieves impressive performance across diverse datasets (i.e., R2R, REVERIE, and UrbanWalk).

Title: SeqMIA: Sequential-Metric Based Membership Inference Attack

Authors: Hao Li, Zheng Li, Siyuan Wu, Chengrui Hu, Yutong Ye, Min Zhang, Dengguo Feng, Yang Zhang
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15098
Pdf URL: https://arxiv.org/pdf/2407.15098
Copy Paste: [[2407.15098]] SeqMIA: Sequential-Metric Based Membership Inference Attack(https://arxiv.org/abs/2407.15098)
Keywords: defense, attack, membership infer
Abstract: Most existing membership inference attacks (MIAs) utilize metrics (e.g., loss) calculated on the model's final state, while recent advanced attacks leverage metrics computed at various stages, including both intermediate and final stages, throughout the model training. Nevertheless, these attacks often process multiple intermediate states of the metric independently, ignoring their time-dependent patterns. Consequently, they struggle to effectively distinguish between members and non-members who exhibit similar metric values, particularly resulting in a high false-positive rate. In this study, we delve deeper into the new membership signals in the black-box scenario. We identify a new, more integrated membership signal: the Pattern of Metric Sequence, derived from the various stages of model training. We contend that current signals provide only partial perspectives of this new signal: the new one encompasses both the model's multiple intermediate and final states, with a greater emphasis on temporal patterns among them. Building upon this signal, we introduce a novel attack method called Sequential-metric based Membership Inference Attack (SeqMIA). Specifically, we utilize knowledge distillation to obtain a set of distilled models representing various stages of the target model's training. We then assess multiple metrics on these distilled models in chronological order, creating distilled metric sequence. We finally integrate distilled multi-metric sequences as a sequential multiformat and employ an attention-based RNN attack model for inference. Empirical results show SeqMIA outperforms all baselines, especially can achieve an order of magnitude improvement in terms of TPR @ 0.1% FPR. Furthermore, we delve into the reasons why this signal contributes to SeqMIA's high attack performance, and assess various defense mechanisms against SeqMIA.

Title: A General Framework for Data-Use Auditing of ML Models

Authors: Zonghao Huang, Neil Zhenqiang Gong, Michael K. Reiter
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15100
Pdf URL: https://arxiv.org/pdf/2407.15100
Copy Paste: [[2407.15100]] A General Framework for Data-Use Auditing of ML Models(https://arxiv.org/abs/2407.15100)
Keywords: membership infer
Abstract: Auditing the use of data in training machine-learning (ML) models is an increasingly pressing challenge, as myriad ML practitioners routinely leverage the effort of content creators to train models without their permission. In this paper, we propose a general method to audit an ML model for the use of a data-owner's data in training, without prior knowledge of the ML task for which the data might be used. Our method leverages any existing black-box membership inference method, together with a sequential hypothesis test of our own design, to detect data use with a quantifiable, tunable false-detection rate. We show the effectiveness of our proposed framework by applying it to audit data use in two types of ML models, namely image classifiers and foundation models.

Title: D$^4$-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

Authors: Zhaotong Yang, Zicheng Jiang, Xinzhe Li, Huiyu Zhou, Junyu Dong, Huaidong Zhang, Yong Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15111
Pdf URL: https://arxiv.org/pdf/2407.15111
Copy Paste: [[2407.15111]] D$^4$-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On(https://arxiv.org/abs/2407.15111)
Keywords: diffusion
Abstract: In this paper, we introduce D$^4$-VTON, an innovative solution for image-based virtual try-on. We address challenges from previous studies, such as semantic inconsistencies before and after garment warping, and reliance on static, annotation-driven clothing parsers. Additionally, we tackle the complexities in diffusion-based VTON models when handling simultaneous tasks like inpainting and denoising. Our approach utilizes two key technologies: Firstly, Dynamic Semantics Disentangling Modules (DSDMs) extract abstract semantic information from garments to create distinct local flows, improving precise garment warping in a self-discovered manner. Secondly, by integrating a Differential Information Tracking Path (DITP), we establish a novel diffusion-based VTON paradigm. This path captures differential information between incomplete try-on inputs and their complete versions, enabling the network to handle multiple degradations independently, thereby minimizing learning ambiguities and achieving realistic results with minimal overhead. Extensive experiments demonstrate that D$^4$-VTON significantly outperforms existing methods in both quantitative metrics and qualitative evaluations, demonstrating its capability in generating realistic images and ensuring semantic consistency.

Title: DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer

Authors: Jinfeng Wei, Xiaofeng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15130
Pdf URL: https://arxiv.org/pdf/2407.15130
Copy Paste: [[2407.15130]] DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer(https://arxiv.org/abs/2407.15130)
Keywords: large language model
Abstract: In this work, we introduce DOPRA, a novel approach designed to mitigate hallucinations in multi-modal large language models (MLLMs). Unlike existing solutions that typically involve costly supplementary training data or the integration of external knowledge sources, DOPRA innovatively addresses hallucinations by decoding specific weighted layer penalties and redistribution, offering an economical and effective solution without additional resources. DOPRA is grounded in unique insights into the intrinsic mechanisms controlling hallucinations within MLLMs, especially the models' tendency to over-rely on a subset of summary tokens in the self-attention matrix, neglecting critical image-related information. This phenomenon is particularly pronounced in certain strata. To counteract this over-reliance, DOPRA employs a strategy of weighted overlay penalties and redistribution in specific layers, such as the 12th layer, during the decoding process. Furthermore, DOPRA includes a retrospective allocation process that re-examines the sequence of generated tokens, allowing the algorithm to reallocate token selection to better align with the actual image content, thereby reducing the incidence of hallucinatory descriptions in auto-generated captions. Overall, DOPRA represents a significant step forward in improving the output quality of MLLMs by systematically reducing hallucinations through targeted adjustments during the decoding process.

Title: Proximal Policy Distillation

Authors: Giacomo Spigler
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15134
Pdf URL: https://arxiv.org/pdf/2407.15134
Copy Paste: [[2407.15134]] Proximal Policy Distillation(https://arxiv.org/abs/2407.15134)
Keywords: robust
Abstract: We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (ATARI, Mujoco, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: `sb3-distill'.

Title: A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

Authors: Gokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas, Özen Nergis Dolcerocca
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15136
Pdf URL: https://arxiv.org/pdf/2407.15136
Copy Paste: [[2407.15136]] A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts(https://arxiv.org/abs/2407.15136)
Keywords: large language model
Abstract: This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available^1.

Title: D$^4$M: Dataset Distillation via Disentangled Diffusion Model

Authors: Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, Bowen Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15138
Pdf URL: https://arxiv.org/pdf/2407.15138
Copy Paste: [[2407.15138]] D$^4$M: Dataset Distillation via Disentangled Diffusion Model(https://arxiv.org/abs/2407.15138)
Keywords: robust, diffusion
Abstract: Dataset distillation offers a lightweight synthetic dataset for fast network training with promising test accuracy. To imitate the performance of the original dataset, most approaches employ bi-level optimization and the distillation space relies on the matching architecture. Nevertheless, these approaches either suffer significant computational costs on large-scale datasets or experience performance decline on cross-architectures. We advocate for designing an economical dataset distillation framework that is independent of the matching architectures. With empirical observations, we argue that constraining the consistency of the real and synthetic image spaces will enhance the cross-architecture generalization. Motivated by this, we introduce Dataset Distillation via Disentangled Diffusion Model (D$^4$M), an efficient framework for dataset distillation. Compared to architecture-dependent methods, D$^4$M employs latent diffusion model to guarantee consistency and incorporates label information into category prototypes. The distilled datasets are versatile, eliminating the need for repeated generation of distinct datasets for various architectures. Through comprehensive experiments, D$^4$M demonstrates superior performance and robust generalization, surpassing the SOTA methods across most aspects.

Title: Rethinking Feature Backbone Fine-tuning for Remote Sensing Object Detection

Authors: Yechan Kim, JongHyun Park, SooYeon Kim, Moongu Jeon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15143
Pdf URL: https://arxiv.org/pdf/2407.15143
Copy Paste: [[2407.15143]] Rethinking Feature Backbone Fine-tuning for Remote Sensing Object Detection(https://arxiv.org/abs/2407.15143)
Keywords: extraction, transformer
Abstract: Recently, numerous methods have achieved impressive performance in remote sensing object detection, relying on convolution or transformer architectures. Such detectors typically have a feature backbone to extract useful features from raw input images. For the remote sensing domain, a common practice among current detectors is to initialize the backbone with pre-training on ImageNet consisting of natural scenes. Fine-tuning the backbone is typically required to generate features suitable for remote-sensing images. However, this could hinder the extraction of basic visual features in long-term training, thus restricting performance improvement. To mitigate this issue, we propose a novel method named DBF (Dynamic Backbone Freezing) for feature backbone fine-tuning on remote sensing object detection. Our method aims to handle the dilemma of whether the backbone should extract low-level generic features or possess specific knowledge of the remote sensing domain, by introducing a module called 'Freezing Scheduler' to dynamically manage the update of backbone features during training. Extensive experiments on DOTA and DIOR-R show that our approach enables more accurate model learning while substantially reducing computational costs. Our method can be seamlessly adopted without additional effort due to its straightforward design.

Title: SNNGX: Securing Spiking Neural Networks with Genetic XOR Encryption on RRAM-based Neuromorphic Accelerator

Authors: Kwunhang Wong, Songqi Wang, Wei Huang, Xinyuan Zhang, Yangu He, Karl M.H. Lai, Yuzhong Jiao, Ning Lin, Xiaojuan Qi, Xiaoming Chen, Zhongrui Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2407.15152
Pdf URL: https://arxiv.org/pdf/2407.15152
Copy Paste: [[2407.15152]] SNNGX: Securing Spiking Neural Networks with Genetic XOR Encryption on RRAM-based Neuromorphic Accelerator(https://arxiv.org/abs/2407.15152)
Keywords: secure, protect, attack, steal
Abstract: Biologically plausible Spiking Neural Networks (SNNs), characterized by spike sparsity, are growing tremendous attention over intellectual edge devices and critical bio-medical applications as compared to artificial neural networks (ANNs). However, there is a considerable risk from malicious attempts to extract white-box information (i.e., weights) from SNNs, as attackers could exploit well-trained SNNs for profit and white-box adversarial concerns. There is a dire need for intellectual property (IP) protective measures. In this paper, we present a novel secure software-hardware co-designed RRAM-based neuromorphic accelerator for protecting the IP of SNNs. Software-wise, we design a tailored genetic algorithm with classic XOR encryption to target the least number of weights that need encryption. From a hardware perspective, we develop a low-energy decryption module, meticulously designed to provide zero decryption latency. Extensive results from various datasets, including NMNIST, DVSGesture, EEGMMIDB, Braille Letter, and SHD, demonstrate that our proposed method effectively secures SNNs by encrypting a minimal fraction of stealthy weights, only 0.00005% to 0.016% weight bits. Additionally, it achieves a substantial reduction in energy consumption, ranging from x59 to x6780, and significantly lowers decryption latency, ranging from x175 to x4250. Moreover, our method requires as little as one sample per class in dataset for encryption and addresses hessian/gradient-based search insensitive problems. This strategy offers a highly efficient and flexible solution for securing SNNs in diverse applications.

Title: Anchored Diffusion for Video Face Reenactment

Authors: Idan Kligvasser, Regev Cohen, George Leifman, Ehud Rivlin, Michael Elad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15153
Pdf URL: https://arxiv.org/pdf/2407.15153
Copy Paste: [[2407.15153]] Anchored Diffusion for Video Face Reenactment(https://arxiv.org/abs/2407.15153)
Keywords: diffusion, transformer
Abstract: Video generation has drawn significant interest recently, pushing the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing it to capture both short and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, ensuring consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of creating a video from a source image that replicates the facial expressions and movements from a driving video. Through comprehensive experiments, we show our approach outperforms current techniques in producing longer consistent high-quality videos while offering editing capabilities.

Title: Fine-grained Gender Control in Machine Translation with Large Language Models

Authors: Minwoo Lee, Hyukhun Koh, Minsung Kim, Kyomin Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15154
Pdf URL: https://arxiv.org/pdf/2407.15154
Copy Paste: [[2407.15154]] Fine-grained Gender Control in Machine Translation with Large Language Models(https://arxiv.org/abs/2407.15154)
Keywords: large language model
Abstract: In machine translation, the problem of ambiguously gendered input has been pointed out, where the gender of an entity is not available in the source sentence. To address this ambiguity issue, the task of controlled translation that takes the gender of the ambiguous entity as additional input have been proposed. However, most existing works have only considered a simplified setup of one target gender for input. In this paper, we tackle controlled translation in a more realistic setting of inputs with multiple entities and propose Gender-of-Entity (GoE) prompting method for LLMs. Our proposed method instructs the model with fine-grained entity-level gender information to translate with correct gender inflections. By utilizing four evaluation benchmarks, we investigate the controlled translation capability of LLMs in multiple dimensions and find that LLMs reach state-of-the-art performance in controlled translation. Furthermore, we discover an emergence of gender interference phenomenon when controlling the gender of multiple entities. Finally, we address the limitations of existing gender accuracy evaluation metrics and propose leveraging LLMs as an evaluator for gender inflection in machine translation.

Title: Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification

Authors: Yunyi Xuan, Weijie Chen, Shicai Yang, Di Xie, Luojun Lin, Yueting Zhuang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2407.15155
Pdf URL: https://arxiv.org/pdf/2407.15155
Copy Paste: [[2407.15155]] Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification(https://arxiv.org/abs/2407.15155)
Keywords: data-free
Abstract: Data-Free Knowledge Distillation (DFKD) has shown great potential in creating a compact student model while alleviating the dependency on real training data by synthesizing surrogate data. However, prior arts are seldom discussed under distribution shifts, which may be vulnerable in real-world applications. Recent Vision-Language Foundation Models, e.g., CLIP, have demonstrated remarkable performance in zero-shot out-of-distribution generalization, yet consuming heavy computation resources. In this paper, we discuss the extension of DFKD to Vision-Language Foundation Models without access to the billion-level image-text datasets. The objective is to customize a student model for distribution-agnostic downstream tasks with given category concepts, inheriting the out-of-distribution generalization capability from the pre-trained foundation models. In order to avoid generalization degradation, the primary challenge of this task lies in synthesizing diverse surrogate images driven by text prompts. Since not only category concepts but also style information are encoded in text prompts, we propose three novel Prompt Diversification methods to encourage image synthesis with diverse styles, namely Mix-Prompt, Random-Prompt, and Contrastive-Prompt. Experiments on out-of-distribution generalization datasets demonstrate the effectiveness of the proposed methods, with Contrastive-Prompt performing the best.

Title: HERGen: Elevating Radiology Report Generation with Longitudinal Data

Authors: Fuying Wang, Shenghui Du, Lequan Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15158
Pdf URL: https://arxiv.org/pdf/2407.15158
Copy Paste: [[2407.15158]] HERGen: Elevating Radiology Report Generation with Longitudinal Data(https://arxiv.org/abs/2407.15158)
Keywords: transformer
Abstract: Radiology reports provide detailed descriptions of medical imaging integrated with patients' medical histories, while report writing is traditionally labor-intensive, increasing radiologists' workload and the risk of diagnostic errors. Recent efforts in automating this process seek to mitigate these issues by enhancing accuracy and clinical efficiency. Emerging research in automating this process promises to alleviate these challenges by reducing errors and streamlining clinical workflows. However, existing automated approaches are based on a single timestamp and often neglect the critical temporal aspect of patients' imaging histories, which is essential for accurate longitudinal analysis. To address this gap, we propose a novel History Enhanced Radiology Report Generation (HERGen) framework that employs a employs a group causal transformer to efficiently integrate longitudinal data across patient visits. Our approach not only allows for comprehensive analysis of varied historical data but also improves the quality of generated reports through an auxiliary contrastive objective that aligns image sequences with their corresponding reports. More importantly, we introduce a curriculum learning-based strategy to adeptly handle the inherent complexity of longitudinal radiology data and thus stabilize the optimization of our framework. The extensive evaluations across three datasets demonstrate that our framework surpasses existing methods in generating accurate radiology reports and effectively predicting disease progression from medical images.

Title: When Can Transformers Count to n?

Authors: Gilad Yehudai, Haim Kaplan, Asma Ghandeharioun, Mor Geva, Amir Globerson
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15160
Pdf URL: https://arxiv.org/pdf/2407.15160
Copy Paste: [[2407.15160]] When Can Transformers Count to n?(https://arxiv.org/abs/2407.15160)
Keywords: transformer, large language model
Abstract: Large language models based on the transformer architectures can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks, that involve counting how many times a token in the vocabulary have appeared in a string. We show that if the dimension of the transformer state is linear in the context length, this task can be solved. However, the solution we propose does not scale beyond this limit, and we provide theoretical arguments for why it is likely impossible for a size limited transformer to implement this task. Our empirical results demonstrate the same phase-transition in performance, as anticipated by the theoretical argument. Our results demonstrate the importance of understanding how transformers can solve simple tasks.

Title: Adversarial Circuit Evaluation

Authors: Niels uit de Bos, Adrià Garriga-Alonso
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2407.15166
Pdf URL: https://arxiv.org/pdf/2407.15166
Copy Paste: [[2407.15166]] Adversarial Circuit Evaluation(https://arxiv.org/abs/2407.15166)
Keywords: robust
Abstract: Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.

Title: Semi-Supervised Pipe Video Temporal Defect Interval Localization

Authors: Zhu Huang, Gang Pan, Chao Kang, YaoZhi Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15170
Pdf URL: https://arxiv.org/pdf/2407.15170
Copy Paste: [[2407.15170]] Semi-Supervised Pipe Video Temporal Defect Interval Localization(https://arxiv.org/abs/2407.15170)
Keywords: segmentation
Abstract: In sewer pipe Closed-Circuit Television (CCTV) inspection, accurate temporal defect localization is essential for effective defect classification, detection, segmentation and quantification. Industry standards typically do not require time-interval annotations, even though they are more informative than time-point annotations for defect localization, resulting in additional annotation costs when fully supervised methods are used. Additionally, differences in scene types and camera motion patterns between pipe inspections and Temporal Action Localization (TAL) hinder the effective transfer of point-supervised TAL methods. Therefore, this study introduces a Semi-supervised multi-Prototype-based method incorporating visual Odometry for enhanced attention guidance (PipeSPO). PipeSPO fully leverages unlabeled data through unsupervised pretext tasks and utilizes time-point annotated data with a weakly supervised multi-prototype-based method, relying on visual odometry features to capture camera pose information. Experiments on real-world datasets demonstrate that PipeSPO achieves 41.89% average precision across Intersection over Union (IoU) thresholds of 0.1-0.7, improving by 8.14% over current state-of-the-art methods.

Title: Assessing Sample Quality via the Latent Space of Generative Models

Authors: Jingyi Xu, Hieu Le, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15171
Pdf URL: https://arxiv.org/pdf/2407.15171
Copy Paste: [[2407.15171]] Assessing Sample Quality via the Latent Space of Generative Models(https://arxiv.org/abs/2407.15171)
Keywords: robust, diffusion, generative
Abstract: Advances in generative models increase the need for sample quality assessment. To do so, previous methods rely on a pre-trained feature extractor to embed the generated samples and real samples into a common space for comparison. However, different feature extractors might lead to inconsistent assessment outcomes. Moreover, these methods are not applicable for domains where a robust, universal feature extractor does not yet exist, such as medical images or 3D assets. In this paper, we propose to directly examine the latent space of the trained generative model to infer generated sample quality. This is feasible because the quality a generated sample directly relates to the amount of training data resembling it, and we can infer this information by examining the density of the latent space. Accordingly, we use a latent density score function to quantify sample quality. We show that the proposed score correlates highly with the sample quality for various generative models including VAEs, GANs and Latent Diffusion Models. Compared with previous quality assessment methods, our method has the following advantages: 1) pre-generation quality estimation with reduced computational cost, 2) generalizability to various domains and modalities, and 3) applicability to latent-based image editing and generation methods. Extensive experiments demonstrate that our proposed methods can benefit downstream tasks such as few-shot image classification and latent face image editing. Code is available at this https URL.

Title: TADA: Temporal Adversarial Data Augmentation for Time Series Data

Authors: Byeong Tak Lee, Joon-myoung Kwon, Yong-Yeon Jo
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2407.15174
Pdf URL: https://arxiv.org/pdf/2407.15174
Copy Paste: [[2407.15174]] TADA: Temporal Adversarial Data Augmentation for Time Series Data(https://arxiv.org/abs/2407.15174)
Keywords: robust
Abstract: Domain generalization involves training machine learning models to perform robustly on unseen samples from out-of-distribution datasets. Adversarial Data Augmentation (ADA) is a commonly used approach that enhances model adaptability by incorporating synthetic samples, designed to simulate potential unseen samples. While ADA effectively addresses amplitude-related distribution shifts, it falls short in managing temporal shifts, which are essential for time series data. To address this limitation, we propose the Temporal Adversarial Data Augmentation for time teries Data (TADA), which incorporates a time warping technique specifically targeting temporal shifts. Recognizing the challenge of non-differentiability in traditional time warping, we make it differentiable by leveraging phase shifts in the frequency domain. Our evaluations across diverse domains demonstrate that TADA significantly outperforms existing ADA variants, enhancing model performance across time series datasets with varied distributions.

Title: Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope

Authors: Xiaoran Liu, Qipeng Guo, Yuerong Song, Zhigeng Liu, Kai Lv, Hang Yan, Linlin Li, Qun Liu, Xipeng Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15176
Pdf URL: https://arxiv.org/pdf/2407.15176
Copy Paste: [[2407.15176]] Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope(https://arxiv.org/abs/2407.15176)
Keywords: large language model
Abstract: The maximum supported context length is a critical bottleneck limiting the practical application of the Large Language Model (LLM). Although existing length extrapolation methods can extend the context of LLMs to millions of tokens, these methods all have an explicit upper bound. In this work, we propose LongCache, a training-free approach that enables LLM to support an infinite context with finite context scope, through full-context cache selection and training-free integration. This effectively frees LLMs from the length extrapolation issue. We validate LongCache on the LongBench and L-Eval and demonstrate its performance is on par with traditional full-attention mechanisms. Furthermore, we have applied LongCache on mainstream LLMs, including LLaMA3 and Mistral-v0.3, enabling them to support context lengths of at least 400K in Needle-In-A-Haystack tests. We will improve the efficiency of LongCache by GPU-aware optimization soon.

Title: A Survey on Employing Large Language Models for Text-to-SQL Tasks

Authors: Liang Shi, Zhengju Tang, Zhi Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15186
Pdf URL: https://arxiv.org/pdf/2407.15186
Copy Paste: [[2407.15186]] A Survey on Employing Large Language Models for Text-to-SQL Tasks(https://arxiv.org/abs/2407.15186)
Keywords: large language model
Abstract: The increasing volume of data stored in relational databases has led to the need for efficient querying and utilization of this data in various sectors. However, writing SQL queries requires specialized knowledge, which poses a challenge for non-professional users trying to access and query databases. Text-to-SQL parsing solves this issue by converting natural language queries into SQL queries, thus making database access more accessible for non-expert users. To take advantage of the recent developments in Large Language Models (LLMs), a range of new methods have emerged, with a primary focus on prompt engineering and fine-tuning. This survey provides a comprehensive overview of LLMs in text-to-SQL tasks, discussing benchmark datasets, prompt engineering, fine-tuning methods, and future research directions. We hope this review will enable readers to gain a broader understanding of the recent advances in this field and offer some insights into its future trajectory.

Title: HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions

Authors: Haiyang Zhou, Xinhua Cheng, Wangbo Yu, Yonghong Tian, Li Yuan
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2407.15187
Pdf URL: https://arxiv.org/pdf/2407.15187
Copy Paste: [[2407.15187]] HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions(https://arxiv.org/abs/2407.15187)
Keywords: robust, diffusion, generative
Abstract: 3D scene generation is in high demand across various domains, including virtual reality, gaming, and the film industry. Owing to the powerful generative capabilities of text-to-image diffusion models that provide reliable priors, the creation of 3D scenes using only text prompts has become viable, thereby significantly advancing researches in text-driven 3D scene generation. In order to obtain multiple-view supervision from 2D diffusion models, prevailing methods typically employ the diffusion model to generate an initial local image, followed by iteratively outpainting the local image using diffusion models to gradually generate scenes. Nevertheless, these outpainting-based approaches prone to produce global inconsistent scene generation results without high degree of completeness, restricting their broader applications. To tackle these problems, we introduce HoloDreamer, a framework that first generates high-definition panorama as a holistic initialization of the full 3D scene, then leverage 3D Gaussian Splatting (3D-GS) to quickly reconstruct the 3D scene, thereby facilitating the creation of view-consistent and fully enclosed 3D scenes. Specifically, we propose Stylized Equirectangular Panorama Generation, a pipeline that combines multiple diffusion models to enable stylized and detailed equirectangular panorama generation from complex text prompts. Subsequently, Enhanced Two-Stage Panorama Reconstruction is introduced, conducting a two-stage optimization of 3D-GS to inpaint the missing region and enhance the integrity of the scene. Comprehensive experiments demonstrated that our method outperforms prior works in terms of overall visual consistency and harmony as well as reconstruction quality and rendering robustness when generating fully enclosed scenes.

Title: HyperbolicLR: Epoch insensitive learning rate scheduler

Authors: Tae-Geun Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15200
Pdf URL: https://arxiv.org/pdf/2407.15200
Copy Paste: [[2407.15200]] HyperbolicLR: Epoch insensitive learning rate scheduler(https://arxiv.org/abs/2407.15200)
Keywords: robust
Abstract: This study proposes two novel learning rate schedulers: the Hyperbolic Learning Rate Scheduler (HyperbolicLR) and the Exponential Hyperbolic Learning Rate Scheduler (ExpHyperbolicLR). These schedulers attempt to address the inconsistent learning curves often observed in conventional schedulers when adjusting the number of epochs. By leveraging the asymptotic behavior of hyperbolic curves, the proposed schedulers maintain more consistent learning curves across varying epoch settings. The HyperbolicLR algorithm directly applies this property to the epoch-learning rate space, while the ExpHyperbolicLR maps this concept onto the exponential space of epochs and learning rates. To evaluate the performance of these schedulers, first we found the optimal hyperparameters for each scheduler on a small number of epochs, fixed these values, and compared their performance as the number of epochs increased. Our experimental results on various deep learning tasks and architectures demonstrate that both HyperbolicLR and ExpHyperbolicLR maintain more consistent performance improvements compared to conventional schedulers as the number of epochs increases. These findings suggest that our hyperbolic-based learning rate schedulers offer a more robust and efficient approach to training deep neural networks, especially in scenarios where computational resources or time constraints limit extensive hyperparameter searches.

Title: When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

Authors: Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez
Subjects: cs.CL, cs.AI, cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15211
Pdf URL: https://arxiv.org/pdf/2407.15211
Copy Paste: [[2407.15211]] When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?(https://arxiv.org/abs/2407.15211)
Keywords: attack, robust
Abstract: The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of "highly-similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.

Title: Efficient Visual Transformer by Learnable Token Merging

Authors: Yancheng Wang, Yingzhen Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15219
Pdf URL: https://arxiv.org/pdf/2407.15219
Copy Paste: [[2407.15219]] Efficient Visual Transformer by Learnable Token Merging(https://arxiv.org/abs/2407.15219)
Keywords: transformer
Abstract: Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. LTM-Transformer is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of the visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT-S/16, and Swin-T, with LTM-Transformer blocks, leading to LTM-Transformer networks with different backbones. The LTM-Transformer is motivated by reduction of Information Bottleneck, and a novel and separable variational upper bound for the IB loss is derived. The architecture of mask module in our LTM blocks which generate the token merging mask is designed to reduce the derived upper bound for the IB loss. Extensive results on computer vision tasks evidence that LTM-Transformer renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers. The code of the LTM-Transformer is available at \url{this https URL}.

Title: PUFFLE: Balancing Privacy, Utility, and Fairness in Federated Learning

Authors: Luca Corbucci, Mikko A Heikkila, David Solans Noguero, Anna Monreale, Nicolas Kourtellis
Subjects: cs.LG, cs.AI, cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2407.15224
Pdf URL: https://arxiv.org/pdf/2407.15224
Copy Paste: [[2407.15224]] PUFFLE: Balancing Privacy, Utility, and Fairness in Federated Learning(https://arxiv.org/abs/2407.15224)
Keywords: privacy, federate, fair
Abstract: Training and deploying Machine Learning models that simultaneously adhere to principles of fairness and privacy while ensuring good utility poses a significant challenge. The interplay between these three factors of trustworthiness is frequently underestimated and remains insufficiently explored. Consequently, many efforts focus on ensuring only two of these factors, neglecting one in the process. The decentralization of the datasets and the variations in distributions among the clients exacerbate the complexity of achieving this ethical trade-off in the context of Federated Learning (FL). For the first time in FL literature, we address these three factors of trustworthiness. We introduce PUFFLE, a high-level parameterised approach that can help in the exploration of the balance between utility, privacy, and fairness in FL scenarios. We prove that PUFFLE can be effective across diverse datasets, models, and data distributions, reducing the model unfairness up to 75%, with a maximum reduction in the utility of 17% in the worst-case scenario, while maintaining strict privacy guarantees during the FL training.

Title: The Hitchhiker's Guide to Human Alignment with *PO

Authors: Kian Ahrabian, Xihui Lin, Barun Patra, Vishrav Chaudhary, Alon Benhaim, Jay Pujara, Xia Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15229
Pdf URL: https://arxiv.org/pdf/2407.15229
Copy Paste: [[2407.15229]] The Hitchhiker's Guide to Human Alignment with *PO(https://arxiv.org/abs/2407.15229)
Keywords: robust, large language model
Abstract: With the growing utilization of large language models (LLMs) across domains, alignment towards human preferences has become one of the most critical aspects of training models. At the forefront of state-of-the-art human alignment methods are preference optimization methods (*PO). However, prior research has often concentrated on identifying the best-performing method, typically involving a grid search over hyperparameters, which can be impractical for general practitioners. In this paper, we aim to identify the algorithm that, while being performant, is simultaneously more robust to varying hyperparameters, thereby increasing the likelihood of achieving better results. We focus on a realistic out-of-distribution (OOD) scenario that mirrors real-world applications of human alignment, offering practical insights into the strengths and weaknesses of these methods. Furthermore, to better understand the shortcomings of generations from the different methods, we analyze the model generations through the lens of KL divergence of the SFT model and the response length statistics. Our analysis reveals that the widely adopted DPO method consistently produces lengthy responses of inferior quality that are very close to the SFT responses. Motivated by these findings, we propose an embarrassingly simple extension to the DPO algorithm, LN-DPO, resulting in more concise responses without sacrificing quality compared to the policy obtained by vanilla DPO.

Title: CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model

Authors: Yu Li, Yifan Chen, Gongye Liu, Jie Wu, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15233
Pdf URL: https://arxiv.org/pdf/2407.15233
Copy Paste: [[2407.15233]] CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model(https://arxiv.org/abs/2407.15233)
Keywords: diffusion, transformer
Abstract: Layout generation is the foundation task of intelligent design, which requires the integration of visual aesthetics and harmonious expression of content delivery. However, existing methods still face challenges in generating precise and visually appealing layouts, including blocking, overlap, or spatial misalignment between layouts, which are closely related to the spatial structure of graphic layouts. We find that these methods overly focus on content information and lack constraints on layout spatial structure, resulting in an imbalance of learning content-aware and graphic-aware features. To tackle this issue, we propose Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model (CGB-DM). Specifically, we first design a regulator that balances the predicted content and graphic weight, overcoming the tendency of paying more attention to the content on canvas. Secondly, we introduce a graphic constraint of saliency bounding box to further enhance the alignment of geometric features between layout representations and images. In addition, we adapt a transformer-based diffusion model as the backbone, whose powerful generation capability ensures the quality in layout generation. Extensive experimental results indicate that our method has achieved state-of-the-art performance in both quantitative and qualitative evaluations. Our model framework can also be expanded to other graphic design fields.

Title: TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data

Authors: Jipeng Zhang, Yaxuan Qin, Renjie Pi, Weizhong Zhang, Rui Pan, Tong Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15235
Pdf URL: https://arxiv.org/pdf/2407.15235
Copy Paste: [[2407.15235]] TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data(https://arxiv.org/abs/2407.15235)
Keywords: large language model
Abstract: Instruction tuning has achieved unprecedented success in NLP, turning large language models into versatile chatbots. However, the increasing variety and volume of instruction datasets demand significant computational resources. To address this, it is essential to extract a small and highly informative subset (i.e., Coreset) that achieves comparable performance to the full dataset. Achieving this goal poses non-trivial challenges: 1) data selection requires accurate data representations that reflect the training samples' quality, 2) considering the diverse nature of instruction datasets, and 3) ensuring the efficiency of the coreset selection algorithm for large models. To address these challenges, we propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS). Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection. Experimental results show that our algorithm, selecting only 5% of the data, surpasses other unsupervised methods and achieves performance close to that of the full dataset.

Title: Variational Potential Flow: A Novel Probabilistic Framework for Energy-Based Generative Modelling

Authors: Junn Yong Loo, Michelle Adeline, Arghya Pal, Vishnu Monn Baskaran, Chee-Ming Ting, Raphael C.-W. Phan
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2407.15238
Pdf URL: https://arxiv.org/pdf/2407.15238
Copy Paste: [[2407.15238]] Variational Potential Flow: A Novel Probabilistic Framework for Energy-Based Generative Modelling(https://arxiv.org/abs/2407.15238)
Keywords: generative
Abstract: Energy based models (EBMs) are appealing for their generality and simplicity in data likelihood modeling, but have conventionally been difficult to train due to the unstable and time-consuming implicit MCMC sampling during contrastive divergence training. In this paper, we present a novel energy-based generative framework, Variational Potential Flow (VAPO), that entirely dispenses with implicit MCMC sampling and does not rely on complementary latent models or cooperative training. The VAPO framework aims to learn a potential energy function whose gradient (flow) guides the prior samples, so that their density evolution closely follows an approximate data likelihood homotopy. An energy loss function is then formulated to minimize the Kullback-Leibler divergence between density evolution of the flow-driven prior and the data likelihood homotopy. Images can be generated after training the potential energy, by initializing the samples from Gaussian prior and solving the ODE governing the potential flow on a fixed time interval using generic ODE solvers. Experiment results show that the proposed VAPO framework is capable of generating realistic images on various image datasets. In particular, our proposed framework achieves competitive FID scores for unconditional image generation on the CIFAR-10 and CelebA datasets.

Title: BIGbench: A Unified Benchmark for Social Bias in Text-to-Image Generative Models Based on Multi-modal LLM

Authors: Hanjun Luo, Haoyu Huang, Ziye Deng, Xuecheng Liu, Ruizhe Chen, Zuozhu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15240
Pdf URL: https://arxiv.org/pdf/2407.15240
Copy Paste: [[2407.15240]] BIGbench: A Unified Benchmark for Social Bias in Text-to-Image Generative Models Based on Multi-modal LLM(https://arxiv.org/abs/2407.15240)
Keywords: protect, generative, large language model
Abstract: Text-to-Image (T2I) generative models are becoming more crucial in terms of their ability to generate complex and high-quality images, which also raises concerns about the social biases in their outputs, especially in human generation. Sociological research has established systematic classifications of bias; however, existing research of T2I models often conflates different types of bias, hindering the progress of these methods. In this paper, we introduce BIGbench, a unified benchmark for Biases of Image Generation with a well-designed dataset. In contrast to existing benchmarks, BIGbench classifies and evaluates complex biases into four dimensions: manifestation of bias, visibility of bias, acquired attributes, and protected attributes. Additionally, BIGbench applies advanced multi-modal large language models (MLLM), achieving fully automated evaluation while maintaining high accuracy. We apply BIGbench to evaluate eight recent general T2I models and three debiased methods. We also conduct human evaluation, whose results demonstrated the effectiveness of BIGbench in aligning images and identifying various biases. Besides, our study also revealed new research directions about biases, including the side-effect of irrelevant protected attributes and distillation. Our dataset and benchmark is openly accessible to the research community to ensure the reproducibility.

Title: XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models

Authors: Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Navid Nobani, Andrea Seveso
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15248
Pdf URL: https://arxiv.org/pdf/2407.15248
Copy Paste: [[2407.15248]] XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models(https://arxiv.org/abs/2407.15248)
Keywords: interpretability, large language model
Abstract: In this survey, we address the key challenges in Large Language Models (LLM) research, focusing on the importance of interpretability. Driven by increasing interest from AI and business sectors, we highlight the need for transparency in LLMs. We examine the dual paths in current LLM research and eXplainable Artificial Intelligence (XAI): enhancing performance through XAI and the emerging focus on model interpretability. Our paper advocates for a balanced approach that values interpretability equally with functional advancements. Recognizing the rapid development in LLM research, our survey includes both peer-reviewed and preprint (arXiv) papers, offering a comprehensive overview of XAI's role in LLM research. We conclude by urging the research community to advance both LLM and XAI fields together.

Title: An Adaptive System for Wearable Devices to Detect Stress Using Physiological Signals

Authors: Gelei Xu, Ruiyang Qin, Zhi Zheng, Yiyu Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15252
Pdf URL: https://arxiv.org/pdf/2407.15252
Copy Paste: [[2407.15252]] An Adaptive System for Wearable Devices to Detect Stress Using Physiological Signals(https://arxiv.org/abs/2407.15252)
Keywords: protect
Abstract: Timely stress detection is crucial for protecting vulnerable groups from long-term detrimental effects by enabling early intervention. Wearable devices, by collecting real-time physiological signals, offer a solution for accurate stress detection accommodating individual differences. This position paper introduces an adaptive framework for personalized stress detection using PPG and EDA signals. Unlike traditional methods that rely on a generalized model, which may suffer performance drops when applied to new users due to domain shifts, this framework aims to provide each user with a personalized model for higher stress detection accuracy. The framework involves three stages: developing a generalized model offline with an initial dataset, adapting the model to the user's unlabeled data, and fine-tuning it with a small set of labeled data obtained through user interaction. This approach not only offers a foundation for mobile applications that provide personalized stress detection and intervention but also has the potential to address a wider range of mental health issues beyond stress detection using physiological signals.

Title: Weakly SSM : On the Viability of Weakly Supervised Segmentations for Statistical Shape Modeling

Authors: Janmesh Ukey, Tushar Kataria, Shireen Y. Elhabian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15260
Pdf URL: https://arxiv.org/pdf/2407.15260
Copy Paste: [[2407.15260]] Weakly SSM : On the Viability of Weakly Supervised Segmentations for Statistical Shape Modeling(https://arxiv.org/abs/2407.15260)
Keywords: segmentation
Abstract: Statistical Shape Models (SSMs) excel at identifying population level anatomical variations, which is at the core of various clinical and biomedical applications, including morphology-based diagnostics and surgical planning. However, the effectiveness of SSM is often constrained by the necessity for expert-driven manual segmentation, a process that is both time-intensive and expensive, thereby restricting their broader application and utility. Recent deep learning approaches enable the direct estimation of Statistical Shape Models (SSMs) from unsegmented images. While these models can predict SSMs without segmentation during deployment, they do not address the challenge of acquiring the manual annotations needed for training, particularly in resource-limited settings. Semi-supervised and foundation models for anatomy segmentation can mitigate the annotation burden. Yet, despite the abundance of available approaches, there are no established guidelines to inform end-users on their effectiveness for the downstream task of constructing SSMs. In this study, we systematically evaluate the potential of weakly supervised methods as viable alternatives to manual segmentation's for building SSMs. We establish a new performance benchmark by employing various semi-supervised and foundational model methods for anatomy segmentation under low annotation settings, utilizing the predicted segmentation's for the task of SSM. We compare the modes of shape variation and use quantitative metrics to compare against a shape model derived from a manually annotated dataset. Our results indicate that some methods produce noisy segmentation, which is very unfavorable for SSM tasks, while others can capture the correct modes of variations in the population cohort with 60-80\% reduction in required manual annotation.

Title: A Learning-Based Attack Framework to Break SOTA Poisoning Defenses in Federated Learning

Authors: Yuxin Yang (1 and 2), Qiang Li (1), Chenfei Nie (1), Yuan Hong (3), Meng Pang (4), Binghui Wang (2) ((1) College of Computer Science and Technology, Jilin University, (2) Illinois Institute of Technology, (3) University of Connecticut, (4) Nanchang University)
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2407.15267
Pdf URL: https://arxiv.org/pdf/2407.15267
Copy Paste: [[2407.15267]] A Learning-Based Attack Framework to Break SOTA Poisoning Defenses in Federated Learning(https://arxiv.org/abs/2407.15267)
Keywords: privacy, protect, defense, attack, robust, federate
Abstract: Federated Learning (FL) is a novel client-server distributed learning framework that can protect data privacy. However, recent works show that FL is vulnerable to poisoning attacks. Many defenses with robust aggregators (AGRs) are proposed to mitigate the issue, but they are all broken by advanced attacks. Very recently, some renewed robust AGRs are designed, typically with novel clipping or/and filtering strate-gies, and they show promising defense performance against the advanced poisoning attacks. In this paper, we show that these novel robust AGRs are also vulnerable to carefully designed poisoning attacks. Specifically, we observe that breaking these robust AGRs reduces to bypassing the clipping or/and filtering of malicious clients, and propose an optimization-based attack framework to leverage this observation. Under the framework, we then design the customized attack against each robust AGR. Extensive experiments on multiple datasets and threat models verify our proposed optimization-based attack can break the SOTA AGRs. We hence call for novel defenses against poisoning attacks to FL. Code is available at: this https URL BreakSTOAPoisoningDefenses.

Title: MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Authors: Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15272
Pdf URL: https://arxiv.org/pdf/2407.15272
Copy Paste: [[2407.15272]] MIBench: Evaluating Multimodal Large Language Models over Multiple Images(https://arxiv.org/abs/2407.15272)
Keywords: large language model
Abstract: Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs when handling realistic multiple images remain underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper, we propose a new benchmark MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source MLLMs and close-source MLLMs on the proposed MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data in MIBench is available at this https URL.

Title: Minimizing the Number of Roles in Bottom-Up Role-Mining using Maximal Biclique Enumeration

Authors: Mahesh Tripunitara
Subjects: cs.CR, cs.DS
Abstract URL: https://arxiv.org/abs/2407.15278
Pdf URL: https://arxiv.org/pdf/2407.15278
Copy Paste: [[2407.15278]] Minimizing the Number of Roles in Bottom-Up Role-Mining using Maximal Biclique Enumeration(https://arxiv.org/abs/2407.15278)
Keywords: security
Abstract: Bottom-up role-mining is the determination of a set of roles given as input a set of users and the permissions those users possess. It is well-established in the research literature, and in practice, as an important problem in information security. A natural objective that has been explored in prior work is for the set of roles to be of minimum size. We address this problem for practical inputs while reconciling foundations, specifically, that the problem is \cnph. We first observe that an approach from prior work that exploits a sufficient condition for an efficient algorithm, while a useful first step, does not scale to more recently proposed benchmark inputs. We propose a new technique: the enumeration of maximal bicliques. We point out that the number of maximal bicliques provides a natural measure of the hardness of an input. We leverage the enumeration of maximal bicliques in two different ways. Our first approach addresses more than half the benchmark inputs to yield exact results. The other approach is needed for hard instances; in it, we identify and adopt as roles those that correspond to large maximal bicliques. We have implemented all our algorithms and carried out an extensive empirical assessment, which suggests that our approaches are promising. Our code is available publicly as open-source.

Title: SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking

Authors: Kuan-Yen Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15281
Pdf URL: https://arxiv.org/pdf/2407.15281
Copy Paste: [[2407.15281]] SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking(https://arxiv.org/abs/2407.15281)
Keywords: secure, large language model
Abstract: Understanding rich dialogues often requires NLP systems to access relevant commonsense persona knowledge, but retrieving this knowledge is challenging due to complex contexts and the implicit nature of commonsense. This paper presents our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge, addressing the critical need for integrating persona and commonsense knowledge in open-domain dialogue systems. We introduce SynCPKL Pipeline, a pipeline that leverages Large Language Models to generate high-quality synthetic datasets for training commonsense persona knowledge linkers. To demonstrate the efficacy of our approach, we present SynCPKL, a new dataset specifically designed for this task. Our experiments validate the effectiveness of SynCPKL for training commonsense persona knowledge linkers. Additionally, our top-performing model, Derberta-SynCPKL, secured first place in the CPKL challenge by a 16% improvement in F1 score. We released both SynCPKL and Derberta-SynCPKL at this https URL.

Title: Point Transformer V3 Extreme: 1st Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation

Authors: Xiaoyang Wu, Xiang Xu, Lingdong Kong, Liang Pan, Ziwei Liu, Tong He, Wanli Ouyang, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15282
Pdf URL: https://arxiv.org/pdf/2407.15282
Copy Paste: [[2407.15282]] Point Transformer V3 Extreme: 1st Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation(https://arxiv.org/abs/2407.15282)
Keywords: secure, transformer, segmentation
Abstract: In this technical report, we detail our first-place solution for the 2024 Waymo Open Dataset Challenge's semantic segmentation track. We significantly enhanced the performance of Point Transformer V3 on the Waymo benchmark by implementing cutting-edge, plug-and-play training and inference technologies. Notably, our advanced version, Point Transformer V3 Extreme, leverages multi-frame training and a no-clipping-point policy, achieving substantial gains over the original PTv3 performance. Additionally, employing a straightforward model ensemble strategy further boosted our results. This approach secured us the top position on the Waymo Open Dataset semantic segmentation leaderboard, markedly outperforming other entries.

Title: Enhancing Hardware Fault Tolerance in Machines with Reinforcement Learning Policy Gradient Algorithms

Authors: Sheila Schoepp, Mehran Taghian, Shotaro Miwa, Yoshihiro Mitsuka, Shadan Golestan, Osmar Zaïane
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2407.15283
Pdf URL: https://arxiv.org/pdf/2407.15283
Copy Paste: [[2407.15283]] Enhancing Hardware Fault Tolerance in Machines with Reinforcement Learning Policy Gradient Algorithms(https://arxiv.org/abs/2407.15283)
Keywords: robust
Abstract: Industry is rapidly moving towards fully autonomous and interconnected systems that can detect and adapt to changing conditions, including machine hardware faults. Traditional methods for adding hardware fault tolerance to machines involve duplicating components and algorithmically reconfiguring a machine's processes when a fault occurs. However, the growing interest in reinforcement learning-based robotic control offers a new perspective on achieving hardware fault tolerance. However, limited research has explored the potential of these approaches for hardware fault tolerance in machines. This paper investigates the potential of two state-of-the-art reinforcement learning algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), to enhance hardware fault tolerance into machines. We assess the performance of these algorithms in two OpenAI Gym simulated environments, Ant-v2 and FetchReach-v1. Robot models in these environments are subjected to six simulated hardware faults. Additionally, we conduct an ablation study to determine the optimal method for transferring an agent's knowledge, acquired through learning in a normal (pre-fault) environment, to a (post-)fault environment in a continual learning setting. Our results demonstrate that reinforcement learning-based approaches can enhance hardware fault tolerance in simulated machines, with adaptation occurring within minutes. Specifically, PPO exhibits the fastest adaptation when retaining the knowledge within its models, while SAC performs best when discarding all acquired knowledge. Overall, this study highlights the potential of reinforcement learning-based approaches, such as PPO and SAC, for hardware fault tolerance in machines. These findings pave the way for the development of robust and adaptive machines capable of effectively operating in real-world scenarios.

Title: Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

Authors: Guangliang Liu, Haitao Mao, Jiliang Tang, Kristen Marie Johnson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15286
Pdf URL: https://arxiv.org/pdf/2407.15286
Copy Paste: [[2407.15286]] Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis(https://arxiv.org/abs/2407.15286)
Keywords: large language model
Abstract: Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs. However, the process of how injecting self-correction instructions can modify the behavior of LLMs remains under-explored. In this paper, we explore the effectiveness of moral self-correction by answering three research questions: (1) In what scenarios does moral self-correction work? (2) What are the internal mechanisms of LLMs, e.g., hidden states, that are influenced by moral self-correction instructions? (3) Is intrinsic moral self-correction actually superficial? We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly reducing the immorality stored in hidden states. Through empirical investigation with tasks of language generation and multi-choice question answering, we conclude: (i) LLMs exhibit good performance across both tasks, and self-correction instructions are particularly beneficial when the correct answer is already top-ranked; (ii) The morality levels in intermediate hidden states are strong indicators as to whether one instruction would be more effective than another; (iii) Based on our analysis of intermediate hidden states and task case studies of self-correction behaviors, we are first to propose the hypothesis that intrinsic moral self-correction is in fact superficial.

Title: Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

Authors: Kwanyong Park, Kuniaki Saito, Donghyun Kim
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15296
Pdf URL: https://arxiv.org/pdf/2407.15296
Copy Paste: [[2407.15296]] Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection(https://arxiv.org/abs/2407.15296)
Keywords: generative
Abstract: Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse language queries. Traditional approaches attempt to improve VL models using hard negative synthetic text, but their effectiveness is limited. In this paper, we harness the exceptional compositional understanding capabilities of generative foundational models. We introduce a novel method for structured synthetic data generation aimed at enhancing the compositional understanding of VL models in language-based object detection. Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains. By leveraging these synthetic triplets, we transform 'weaker' VL models into 'stronger' models in terms of compositional understanding, a process we call "Weak-to-Strong Compositional Learning" (WSCL). To achieve this, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost in the Omnilabel benchmark by up to +5AP and the D3 benchmark by +6.9AP upon existing baselines.

Title: FMDNN: A Fuzzy-guided Multi-granular Deep Neural Network for Histopathological Image Classification

Authors: Weiping Ding, Tianyi Zhou, Jiashuang Huang, Shu Jiang, Tao Hou, Chin-Teng Lin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15312
Pdf URL: https://arxiv.org/pdf/2407.15312
Copy Paste: [[2407.15312]] FMDNN: A Fuzzy-guided Multi-granular Deep Neural Network for Histopathological Image Classification(https://arxiv.org/abs/2407.15312)
Keywords: robust, extraction, interpretability
Abstract: Histopathological image classification constitutes a pivotal task in computer-aided diagnostics. The precise identification and categorization of histopathological images are of paramount significance for early disease detection and treatment. In the diagnostic process of pathologists, a multi-tiered approach is typically employed to assess abnormalities in cell regions at different magnifications. However, feature extraction is often performed at a single granularity, overlooking the multi-granular characteristics of cells. To address this issue, we propose the Fuzzy-guided Multi-granularity Deep Neural Network (FMDNN). Inspired by the multi-granular diagnostic approach of pathologists, we perform feature extraction on cell structures at coarse, medium, and fine granularity, enabling the model to fully harness the information in histopathological images. We incorporate the theory of fuzzy logic to address the challenge of redundant key information arising during multi-granular feature extraction. Cell features are described from different perspectives using multiple fuzzy membership functions, which are fused to create universal fuzzy features. A fuzzy-guided cross-attention module guides universal fuzzy features toward multi-granular features. We propagate these features through an encoder to all patch tokens, aiming to achieve enhanced classification accuracy and robustness. In experiments on multiple public datasets, our model exhibits a significant improvement in accuracy over commonly used classification methods for histopathological image classification and shows commendable interpretability.

Title: Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models

Authors: Xiao Liu, Xiaoliu Guan, Yu Wu, Jiaxu Miao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15328
Pdf URL: https://arxiv.org/pdf/2407.15328
Copy Paste: [[2407.15328]] Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models(https://arxiv.org/abs/2407.15328)
Keywords: privacy, diffusion
Abstract: Diffusion models, known for their tremendous ability to generate novel and high-quality samples, have recently raised concerns due to their data memorization behavior, which poses privacy risks. Recent approaches for memory mitigation either only focused on the text modality problem in cross-modal generation tasks or utilized data augmentation strategies. In this paper, we propose a novel training framework for diffusion models from the perspective of visual modality, which is more generic and fundamental for mitigating memorization. To facilitate ``forgetting'' of stored information in diffusion model parameters, we propose an iterative ensemble training strategy by splitting the data into multiple shards for training multiple models and intermittently aggregating these model parameters. Moreover, practical analysis of losses illustrates that the training loss for easily memorable images tends to be obviously lower. Thus, we propose an anti-gradient control method to exclude the sample with a lower loss value from the current mini-batch to avoid memorizing. Extensive experiments and analysis on \crnote{four} datasets are conducted to illustrate the effectiveness of our method, and results show that our method successfully reduces memory capacity while even improving the performance slightly. Moreover, to save the computing cost, we successfully apply our method to fine-tune the well-trained diffusion models by limited epochs, demonstrating the applicability of our method. Code is available in this https URL.

Title: Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection

Authors: Yiran Yang, Xu Gao, Tong Wang, Xin Hao, Yifeng Shi, Xiao Tan, Xiaoqing Ye, Jingdong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15334
Pdf URL: https://arxiv.org/pdf/2407.15334
Copy Paste: [[2407.15334]] Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection(https://arxiv.org/abs/2407.15334)
Keywords: robust
Abstract: Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems. However, these sensors often exhibit heterogeneous natures, resulting in distributional modality gaps that present significant challenges for fusion. To address this, a robust fusion technique is crucial, particularly for enhancing 3D object detection. In this paper, we introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations to enhance the fusion process. Specifically, we propose a triphase domain aligning module. This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences. Additionally, we explore improved representation acquisition methods for dynamic fusion, which includes modal interaction and specialty enhancement. Finally, an adaptive learning technique that merges the semantics and geometry information for dynamical instance optimization. Extensive experiments in the nuScenes dataset present competitive performance with state-of-the-art approaches. Our code will be released in the future.

Title: Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models

Authors: Wenbin An, Feng Tian, Jiahao Nie, Wenkai Shi, Haonan Lin, Yan Chen, QianYing Wang, Yaqiang Wu, Guang Dai, Ping Chen
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2407.15346
Pdf URL: https://arxiv.org/pdf/2407.15346
Copy Paste: [[2407.15346]] Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models(https://arxiv.org/abs/2407.15346)
Keywords: large language model
Abstract: Knowledge-based Visual Question Answering (KVQA) requires both image and world knowledge to answer questions. Current methods first retrieve knowledge from the image and external knowledge base with the original complex question, then generate answers with Large Language Models (LLMs). However, since the original question contains complex elements that require knowledge from different sources, acquiring different kinds of knowledge in a coupled manner may confuse models and hinder them from retrieving precise knowledge. Furthermore, the ``forward-only'' answering process fails to explicitly capture the knowledge needs of LLMs, which can further hurt answering quality. To cope with the above limitations, we propose DKA: Disentangled Knowledge Acquisition from LLM feedback, a training-free framework that disentangles knowledge acquisition to avoid confusion and uses LLM's feedback to specify the required knowledge. Specifically, DKA requires LLMs to specify what knowledge they need to answer the question and decompose the original complex question into two simple sub-questions: Image-based sub-question and Knowledge-based sub-question. Then we use the two sub-questions to retrieve knowledge from the image and knowledge base, respectively. In this way, two knowledge acquisition models can focus on the content that corresponds to them and avoid disturbance of irrelevant elements in the original complex question, which can help to provide more precise knowledge and better align the knowledge needs of LLMs to yield correct answers. Experiments on benchmark datasets show that DKA significantly outperforms SOTA models. To facilitate future research, our data and code are available at \url{this https URL}.

Title: RoadPainter: Points Are Ideal Navigators for Topology transformER

Authors: Zhongxing Ma, Shuang Liang, Yongkun Wen, Weixin Lu, Guowei Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15349
Pdf URL: https://arxiv.org/pdf/2407.15349
Copy Paste: [[2407.15349]] RoadPainter: Points Are Ideal Navigators for Topology transformER(https://arxiv.org/abs/2407.15349)
Keywords: transformer
Abstract: Topology reasoning aims to provide a precise understanding of road scenes, enabling autonomous systems to identify safe and efficient routes. In this paper, we present RoadPainter, an innovative approach for detecting and reasoning the topology of lane centerlines using multi-view images. The core concept behind RoadPainter is to extract a set of points from each centerline mask to improve the accuracy of centerline prediction. We start by implementing a transformer decoder that integrates a hybrid attention mechanism and a real-virtual separation strategy to predict coarse lane centerlines and establish topological associations. Then, we generate centerline instance masks guided by the centerline points from the transformer decoder. Moreover, we derive an additional set of points from each mask and combine them with previously detected centerline points for further refinement. Additionally, we introduce an optional module that incorporates a Standard Definition (SD) map to further optimize centerline detection and enhance topological reasoning performance. Experimental evaluations on the OpenLane-V2 dataset demonstrate the state-of-the-art performance of RoadPainter.

Title: LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation

Authors: Jiaxing Zhang, Jiayi Liu, Dongsheng Luo, Jennifer Neville, Hua Wei
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2407.15351
Pdf URL: https://arxiv.org/pdf/2407.15351
Copy Paste: [[2407.15351]] LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation(https://arxiv.org/abs/2407.15351)
Keywords: interpretability, large language model
Abstract: Recent studies seek to provide Graph Neural Network (GNN) interpretability via multiple unsupervised learning models. Due to the scarcity of datasets, current methods easily suffer from learning bias. To solve this problem, we embed a Large Language Model (LLM) as knowledge into the GNN explanation network to avoid the learning bias problem. We inject LLM as a Bayesian Inference (BI) module to mitigate learning bias. The efficacy of the BI module has been proven both theoretically and experimentally. We conduct experiments on both synthetic and real-world datasets. The innovation of our work lies in two parts: 1. We provide a novel view of the possibility of an LLM functioning as a Bayesian inference to improve the performance of existing algorithms; 2. We are the first to discuss the learning bias issues in the GNN explanation problem.

Title: MAVEN-Fact: A Large-scale Event Factuality Detection Dataset

Authors: Chunyang Li, Hao Peng, Xiaozhi Wang, Yunjia Qi, Lei Hou, Bin Xu, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15352
Pdf URL: https://arxiv.org/pdf/2407.15352
Copy Paste: [[2407.15352]] MAVEN-Fact: A Large-scale Event Factuality Detection Dataset(https://arxiv.org/abs/2407.15352)
Keywords: large language model
Abstract: Event Factuality Detection (EFD) task determines the factuality of textual events, i.e., classifying whether an event is a fact, possibility, or impossibility, which is essential for faithfully understanding and utilizing event knowledge. However, due to the lack of high-quality large-scale data, event factuality detection is under-explored in event understanding research, which limits the development of EFD community. To address these issues and provide faithful event understanding, we introduce MAVEN-Fact, a large-scale and high-quality EFD dataset based on the MAVEN dataset. MAVEN-Fact includes factuality annotations of 112,276 events, making it the largest EFD dataset. Extensive experiments demonstrate that MAVEN-Fact is challenging for both conventional fine-tuned models and large language models (LLMs). Thanks to the comprehensive annotations of event arguments and relations in MAVEN, MAVEN-Fact also supports some further analyses and we find that adopting event arguments and relations helps in event factuality detection for fine-tuned models but does not benefit LLMs. Furthermore, we preliminarily study an application case of event factuality detection and find it helps in mitigating event-related hallucination in LLMs. Our dataset and codes can be obtained from \url{this https URL}

Title: Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA

Authors: Yuan Pu, Zhuolun He, Tairu Qiu, Haoyuan Wu, Bei Yu
Subjects: cs.CL, cs.AR
Abstract URL: https://arxiv.org/abs/2407.15353
Pdf URL: https://arxiv.org/pdf/2407.15353
Copy Paste: [[2407.15353]] Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA(https://arxiv.org/abs/2407.15353)
Keywords: generative
Abstract: Retrieval augmented generation (RAG) enhances the accuracy and reliability of generative AI models by sourcing factual information from external databases, which is extensively employed in document-grounded question-answering (QA) tasks. Off-the-shelf RAG flows are well pretrained on general-purpose documents, yet they encounter significant challenges when being applied to knowledge-intensive vertical domains, such as electronic design automation (EDA). This paper addresses such issue by proposing a customized RAG framework along with three domain-specific techniques for EDA tool documentation QA, including a contrastive learning scheme for text embedding model fine-tuning, a reranker distilled from proprietary LLM, and a generative LLM fine-tuned with high-quality domain corpus. Furthermore, we have developed and released a documentation QA evaluation benchmark, ORD-QA, for OpenROAD, an advanced RTL-to-GDSII design platform. Experimental results demonstrate that our proposed RAG flow and techniques have achieved superior performance on ORD-QA as well as on a commercial tool, compared with state-of-the-arts. The ORD-QA benchmark and the training dataset for our customized RAG flow are open-source at this https URL.

Title: Attention Beats Linear for Fast Implicit Neural Representation Generation

Authors: Shuyi Zhang, Ke Liu, Jingjun Gu, Xiaoxu Cai, Zhihua Wang, Jiajun Bu, Haishuai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15355
Pdf URL: https://arxiv.org/pdf/2407.15355
Copy Paste: [[2407.15355]] Attention Beats Linear for Fast Implicit Neural Representation Generation(https://arxiv.org/abs/2407.15355)
Keywords: transformer
Abstract: Implicit Neural Representation (INR) has gained increasing popularity as a data representation method, serving as a prerequisite for innovative generation models. Unlike gradient-based methods, which exhibit lower efficiency in inference, the adoption of hyper-network for generating parameters in Multi-Layer Perceptrons (MLP), responsible for executing INR functions, has surfaced as a promising and efficient alternative. However, as a global continuous function, MLP is challenging in modeling highly discontinuous signals, resulting in slow convergence during the training phase and inaccurate reconstruction performance. Moreover, MLP requires massive representation parameters, which implies inefficiencies in data representation. In this paper, we propose a novel Attention-based Localized INR (ANR) composed of a localized attention layer (LAL) and a global MLP that integrates coordinate features with data features and converts them to meaningful outputs. Subsequently, we design an instance representation framework that delivers a transformer-like hyper-network to represent data instances as a compact representation vector. With instance-specific representation vector and instance-agnostic ANR parameters, the target signals are well reconstructed as a continuous function. We further address aliasing artifacts with variational coordinates when obtaining the super-resolution inference results. Extensive experimentation across four datasets showcases the notable efficacy of our ANR method, e.g. enhancing the PSNR value from 37.95dB to 47.25dB on the CelebA dataset. Code is released at this https URL.

Title: X-Recon: Learning-based Patient-specific High-Resolution CT Reconstruction from Orthogonal X-Ray Images

Authors: Yunpeng Wang, Kang Wang, Yaoyao Zhuo, Weiya Shi, Fei Shan, Lei Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15356
Pdf URL: https://arxiv.org/pdf/2407.15356
Copy Paste: [[2407.15356]] X-Recon: Learning-based Patient-specific High-Resolution CT Reconstruction from Orthogonal X-Ray Images(https://arxiv.org/abs/2407.15356)
Keywords: transformer, generative, segmentation
Abstract: Rapid and accurate diagnosis of pneumothorax, utilizing chest X-ray and computed tomography (CT), is crucial for assisted diagnosis. Chest X-ray is commonly used for initial localization of pneumothorax, while CT ensures accurate quantification. However, CT scans involve high radiation doses and can be costly. To achieve precise quantitative diagnosis while minimizing radiation exposure, we proposed X-Recon, a CT ultra-sparse reconstruction network based on ortho-lateral chest X-ray images. X-Recon integrates generative adversarial networks (GANs), including a generator with a multi-scale fusion rendering module and a discriminator enhanced by 3D coordinate convolutional layers, designed to facilitate CT reconstruction. To improve precision, a projective spatial transformer is utilized to incorporate multi-angle projection loss. Additionally, we proposed PTX-Seg, a zero-shot pneumothorax segmentation algorithm, combining image processing techniques with deep-learning models for the segmentation of air-accumulated regions and lung structures. Experiments on a large-scale dataset demonstrate its superiority over existing approaches. X-Recon achieved a significantly higher reconstruction resolution with a higher average spatial resolution and a lower average slice thickness. The reconstruction metrics achieved state-of-the-art performance in terms of several metrics including peak signal-to-noise ratio. The zero-shot segmentation algorithm, PTX-Seg, also demonstrated high segmentation precision for the air-accumulated region, the left lung, and the right lung. Moreover, the consistency analysis for the pneumothorax chest occupancy ratio between reconstructed CT and original CT obtained a high correlation coefficient. Code will be available at: this https URL

Title: Dissecting Multiplication in Transformers: Insights into LLMs

Authors: Luyu Qiu, Jianing Li, Chi Su, Chen Jason Zhang, Lei Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15360
Pdf URL: https://arxiv.org/pdf/2407.15360
Copy Paste: [[2407.15360]] Dissecting Multiplication in Transformers: Insights into LLMs(https://arxiv.org/abs/2407.15360)
Keywords: interpretability, transformer, large language model
Abstract: Transformer-based large language models have achieved remarkable performance across various natural language processing tasks. However, they often struggle with seemingly easy tasks like arithmetic despite their vast capabilities. This stark disparity raise human's concerns about their safe and ethical use, hinder their widespread this http URL this paper, we focus on a typical arithmetic task, integer multiplication, to explore and explain the imperfection of transformers in this domain. We provide comprehensive analysis of a vanilla transformer trained to perform n-digit integer multiplication. Our observations indicate that the model decomposes multiplication task into multiple parallel subtasks, sequentially optimizing each subtask for each digit to complete the final multiplication. Based on observation and analysis, we infer the reasons of transformers deficiencies in multiplication tasks lies in their difficulty in calculating successive carryovers and caching intermediate results, and confirmed this inference through experiments. Guided by these findings, we propose improvements to enhance transformers performance on multiplication tasks. These enhancements are validated through rigorous testing and mathematical modeling, not only enhance transformer's interpretability, but also improve its performance, e.g., we achieve over 99.9% accuracy on 5-digit integer multiplication with a tiny transformer, outperform LLMs GPT-4. Our method contributes to the broader fields of model understanding and interpretability, paving the way for analyzing more complex tasks and Transformer models. This work underscores the importance of explainable AI, helping to build trust in large language models and promoting their adoption in critical applications.

Title: Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias

Authors: Rongwu Xu, Zi'an Zhou, Tianwei Zhang, Zehan Qi, Su Yao, Ke Xu, Wei Xu, Han Qiu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2407.15366
Pdf URL: https://arxiv.org/pdf/2407.15366
Copy Paste: [[2407.15366]] Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias(https://arxiv.org/abs/2407.15366)
Keywords: large language model
Abstract: The common toxicity and societal bias in contents generated by large language models (LLMs) necessitate strategies to reduce harm. Present solutions often demand white-box access to the model or substantial training, which is impractical for cutting-edge commercial LLMs. Moreover, prevailing prompting methods depend on external tool feedback and fail to simultaneously lessen toxicity and bias. Motivated by social psychology principles, we propose a novel strategy named \textbf{perspective-taking prompting (\textsc{PeT})} that inspires LLMs to integrate diverse human perspectives and self-regulate their responses. This self-correction mechanism can significantly diminish toxicity (up to $89\%$) and bias (up to $73\%$) in LLMs' responses. Rigorous evaluations and ablation studies are conducted on two commercial LLMs (ChatGPT and GLM) and three open-source LLMs, revealing \textsc{PeT}'s superiority in producing less harmful responses, outperforming five strong baselines.

Title: Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data

Authors: Junha Song, Tae Soo Kim, Junha Kim, Gunhee Nam, Thijs Kooi, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15383
Pdf URL: https://arxiv.org/pdf/2407.15383
Copy Paste: [[2407.15383]] Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data(https://arxiv.org/abs/2407.15383)
Keywords: segmentation
Abstract: This paper aims to adapt the source model to the target environment, leveraging small user feedback (i.e., labeled target data) readily available in real-world applications. We find that existing semi-supervised domain adaptation (SemiSDA) methods often suffer from poorly improved adaptation performance when directly utilizing such feedback data, as shown in Figure 1. We analyze this phenomenon via a novel concept called Negatively Biased Feedback (NBF), which stems from the observation that user feedback is more likely for data points where the model produces incorrect predictions. To leverage this feedback while avoiding the issue, we propose a scalable adapting approach, Retrieval Latent Defending. This approach helps existing SemiSDA methods to adapt the model with a balanced supervised signal by utilizing latent defending samples throughout the adaptation process. We demonstrate the problem caused by NBF and the efficacy of our approach across various benchmarks, including image classification, semantic segmentation, and a real-world medical imaging application. Our extensive experiments reveal that integrating our approach with multiple state-of-the-art SemiSDA methods leads to significant performance improvements.

Title: Towards Robust Vision Transformer via Masked Adaptive Ensemble

Authors: Fudong Lin, Jiadong Lou, Xu Yuan, Nian-Feng Tzeng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15385
Pdf URL: https://arxiv.org/pdf/2407.15385
Copy Paste: [[2407.15385]] Towards Robust Vision Transformer via Masked Adaptive Ensemble(https://arxiv.org/abs/2407.15385)
Keywords: attack, robust, transformer
Abstract: Adversarial training (AT) can help improve the robustness of Vision Transformers (ViT) against adversarial attacks by intentionally injecting adversarial examples into the training data. However, this way of adversarial injection inevitably incurs standard accuracy degradation to some extent, thereby calling for a trade-off between standard accuracy and robustness. Besides, the prominent AT solutions are still vulnerable to adaptive attacks. To tackle such shortcomings, this paper proposes a novel ViT architecture, including a detector and a classifier bridged by our newly developed adaptive ensemble. Specifically, we empirically discover that detecting adversarial examples can benefit from the Guided Backpropagation technique. Driven by this discovery, a novel Multi-head Self-Attention (MSA) mechanism is introduced to enhance our detector to sniff adversarial examples. Then, a classifier with two encoders is employed for extracting visual representations respectively from clean images and adversarial examples, with our adaptive ensemble to adaptively adjust the proportion of visual representations from the two encoders for accurate classification. This design enables our ViT architecture to achieve a better trade-off between standard accuracy and robustness. Besides, our adaptive ensemble technique allows us to mask off a random subset of image patches within input data, boosting our ViT's robustness against adaptive attacks, while maintaining high standard accuracy. Experimental results exhibit that our ViT architecture, on CIFAR-10, achieves the best standard accuracy and adversarial robustness of 90.3% and 49.8%, respectively.

Title: Poisoning with A Pill: Circumventing Detection in Federated Learning

Authors: Hanxi Guo, Hao Wang, Tao Song, Tianhang Zheng, Yang Hua, Haibing Guan, Xiangyu Zhang
Subjects: cs.LG, cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2407.15389
Pdf URL: https://arxiv.org/pdf/2407.15389
Copy Paste: [[2407.15389]] Poisoning with A Pill: Circumventing Detection in Federated Learning(https://arxiv.org/abs/2407.15389)
Keywords: security, privacy, protect, defense, attack, steal, federate
Abstract: Without direct access to the client's data, federated learning (FL) is well-known for its unique strength in data privacy protection among existing distributed machine learning techniques. However, its distributive and iterative nature makes FL inherently vulnerable to various poisoning attacks. To counteract these threats, extensive defenses have been proposed to filter out malicious clients, using various detection metrics. Based on our analysis of existing attacks and defenses, we find that there is a lack of attention to model redundancy. In neural networks, various model parameters contribute differently to the model's performance. However, existing attacks in FL manipulate all the model update parameters with the same strategy, making them easily detectable by common defenses. Meanwhile, the defenses also tend to analyze the overall statistical features of the entire model updates, leaving room for sophisticated attacks. Based on these observations, this paper proposes a generic and attack-agnostic augmentation approach designed to enhance the effectiveness and stealthiness of existing FL poisoning attacks against detection in FL, pointing out the inherent flaws of existing defenses and exposing the necessity of fine-grained FL security. Specifically, we employ a three-stage methodology that strategically constructs, generates, and injects poison (generated by existing attacks) into a pill (a tiny subnet with a novel structure) during the FL training, named as pill construction, pill poisoning, and pill injection accordingly. Extensive experimental results show that FL poisoning attacks enhanced by our method can bypass all the popular defenses, and can gain an up to 7x error rate increase, as well as on average a more than 2x error rate increase on both IID and non-IID data, in both cross-silo and cross-device FL systems.

Title: ALLaM: Large Language Models for Arabic and English

Authors: M Saiful Bari, Yazeed Alnumay, Norah A. Alzahrani, Nouf M. Alotaibi, Hisham A. Alyahya, Sultan AlRashed, Faisal A. Mirza, Shaykhah Z. Alsubaie, Hassan A. Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman Alsubaihi, Maryam Al Mansour, Majed Alrubaian, Ali Alammari, Zaki Alawami, Abdulmohsen Al-Thubaity, Ahmed Abdelali, Jeril Kuriakose, Abdalghani Abujabal, Nora Al-Twairesh, Areeb Alowisheq, Haidar Khan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15390
Pdf URL: https://arxiv.org/pdf/2407.15390
Copy Paste: [[2407.15390]] ALLaM: Large Language Models for Arabic and English(https://arxiv.org/abs/2407.15390)
Keywords: large language model
Abstract: We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve both in Arabic and English from their base aligned models.

Title: Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

Authors: Xiao Liu, Liangzhi Li, Tong Xiang, Fuying Ye, Lu Wei, Wangyue Li, Noa Garcia
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2407.15399
Pdf URL: https://arxiv.org/pdf/2407.15399
Copy Paste: [[2407.15399]] Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models(https://arxiv.org/abs/2407.15399)
Keywords: attack, large language model
Abstract: With the development of large language models (LLMs) like ChatGPT, both their vast applications and potential vulnerabilities have come to the forefront. While developers have integrated multiple safety mechanisms to mitigate their misuse, a risk remains, particularly when models encounter adversarial inputs. This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from LLMs. We delineate three pivotal strategies: (i) decomposing malicious questions into seemingly innocent sub-questions; (ii) rewriting overtly malicious questions into more covert, benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting models for illustrative examples. Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses. Through our experiments conducted on GPT-3.5-turbo, GPT-4, and Llama2, our method has demonstrated a marked efficacy compared to conventional attack methods. In summary, this work introduces a novel attack method that outperforms previous approaches, raising an important question: How to discern whether the ultimate intent in a dialogue is malicious?

Title: Tackling Selfish Clients in Federated Learning

Authors: Andrea Augello, Ashish Gupta, Giuseppe Lo Re, Sajal K. Das
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2407.15402
Pdf URL: https://arxiv.org/pdf/2407.15402
Copy Paste: [[2407.15402]] Tackling Selfish Clients in Federated Learning(https://arxiv.org/abs/2407.15402)
Keywords: robust, federate
Abstract: Federated Learning (FL) is a distributed machine learning paradigm facilitating participants to collaboratively train a model without revealing their local data. However, when FL is deployed into the wild, some intelligent clients can deliberately deviate from the standard training process to make the global model inclined toward their local model, thereby prioritizing their local data distribution. We refer to this novel category of misbehaving clients as selfish. In this paper, we propose a Robust aggregation strategy for FL server to mitigate the effect of Selfishness (in short RFL-Self). RFL-Self incorporates an innovative method to recover (or estimate) the true updates of selfish clients from the received ones, leveraging robust statistics (median of norms) of the updates at every round. By including the recovered updates in aggregation, our strategy offers strong robustness against selfishness. Our experimental results, obtained on MNIST and CIFAR-10 datasets, demonstrate that just 2% of clients behaving selfishly can decrease the accuracy by up to 36%, and RFL-Self can mitigate that effect without degrading the global model performance.

Title: A Solution toward Transparent and Practical AI Regulation: Privacy Nutrition Labels for Open-source Generative AI-based Applications

Authors: Meixue Si, Shidong Pan, Dianshu Liao, Xiaoyu Sun, Zhen Tao, Wenchang Shi, Zhenchang Xing
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2407.15407
Pdf URL: https://arxiv.org/pdf/2407.15407
Copy Paste: [[2407.15407]] A Solution toward Transparent and Practical AI Regulation: Privacy Nutrition Labels for Open-source Generative AI-based Applications(https://arxiv.org/abs/2407.15407)
Keywords: privacy, generative
Abstract: The rapid development and widespread adoption of Generative Artificial Intelligence-based (GAI) applications have greatly enriched our daily lives, benefiting people by enhancing creativity, personalizing experiences, improving accessibility, and fostering innovation and efficiency across various domains. However, along with the development of GAI applications, concerns have been raised about transparency in their privacy practices. Traditional privacy policies often fail to effectively communicate essential privacy information due to their complexity and length, and open-source community developers often neglect privacy practices even more. Only 12.2% of examined open-source GAI apps provide a privacy policy. To address this, we propose a regulation-driven GAI Privacy Label and introduce Repo2Label, a novel framework for automatically generating these labels based on code repositories. Our user study indicates a common endorsement of the proposed GAI privacy label format. Additionally, Repo2Label achieves a precision of 0.81, recall of 0.88, and F1-score of 0.84 based on the benchmark dataset, significantly outperforming the developer self-declared privacy notices. We also discuss the common regulatory (in)compliance of open-source GAI apps, comparison with other privacy notices, and broader impacts to different stakeholders. Our findings suggest that Repo2Label could serve as a significant tool for bolstering the privacy transparency of GAI apps and make them more practical and responsible.

Title: Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

Authors: Kent Fujiwara, Mikihiro Tanaka, Qing Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15408
Pdf URL: https://arxiv.org/pdf/2407.15408
Copy Paste: [[2407.15408]] Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models(https://arxiv.org/abs/2407.15408)
Keywords: robust
Abstract: With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Methods have been proposed to convert human motion and texts into features to achieve accurate correspondence between them. Despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on the temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the chronological understanding of the models. We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task for motion-language models to retrieve the more likely text from the ground truth and its chronologically shuffled version. CAR reveals many cases where current motion-language models fail to distinguish the event chronology of human motion, despite their impressive performance in terms of conventional evaluation metrics. To achieve better temporal alignment between text and motion, we further propose to use these texts with shuffled sequence of events as negative samples during training to reinforce the motion-language models. We conduct experiments on text-motion retrieval and text-to-motion generation using the reinforced motion-language models, which demonstrate improved performance over conventional approaches, indicating the necessity to consider temporal elements in motion-language alignment.

Title: Weights Shuffling for Improving DPSGD in Transformer-based Models

Authors: Jungang Yang, Zhe Ji, Liyao Xiang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2407.15414
Pdf URL: https://arxiv.org/pdf/2407.15414
Copy Paste: [[2407.15414]] Weights Shuffling for Improving DPSGD in Transformer-based Models(https://arxiv.org/abs/2407.15414)
Keywords: privacy, transformer
Abstract: Differential Privacy (DP) mechanisms, especially in high-dimensional settings, often face the challenge of maintaining privacy without compromising the data utility. This work introduces an innovative shuffling mechanism in Differentially-Private Stochastic Gradient Descent (DPSGD) to enhance the utility of large models at the same privacy guarantee of the unshuffled case. Specifically, we reveal that random shuffling brings additional randomness to the trajectory of gradient descent while not impacting the model accuracy by the permutation invariance property -- the model can be equivalently computed in both forward and backward propagations under permutation. We show that permutation indeed improves the privacy guarantee of DPSGD in theory, but tracking the exact privacy loss on shuffled model is particularly challenging. Hence we exploit the approximation on sum of lognormal distributions to derive the condition for the shuffled DPSGD to meet the DP guarantee. Auditing results show that our condition offers a DP guarantee quite close to the audited privacy level, demonstrating our approach an effective estimation in practice. Experimental results have verified our theoretical derivation and illustrate that our mechanism improves the accuracy of DPSGD over the state-of-the-art baselines on a variety of models and tasks.

Title: LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

Authors: Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15415
Pdf URL: https://arxiv.org/pdf/2407.15415
Copy Paste: [[2407.15415]] LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models(https://arxiv.org/abs/2407.15415)
Keywords: large language model
Abstract: We introduces LLaST, a framework for building high-performance Large Language model based Speech-to-text Translation systems. We address the limitations of end-to-end speech translation(E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. Our approach demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of the LLM-based speech translation framework. We release the data, code and models in this https URL.

Title: Local All-Pair Correspondence for Point Tracking

Authors: Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, Joon-Young Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15420
Pdf URL: https://arxiv.org/pdf/2407.15420
Copy Paste: [[2407.15420]] Local All-Pair Correspondence for Point Tracking(https://arxiv.org/abs/2407.15420)
Keywords: robust, transformer
Abstract: We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches in this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image, which often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that utilizes all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences, with bidirectional correspondence and matching smoothness significantly enhancing robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency, and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than the current state-of-the-art.

Title: Planning behavior in a recurrent neural network that plays Sokoban

Authors: Adrià Garriga-Alonso, Mohammad Taufeeque, Adam Gleave
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15421
Pdf URL: https://arxiv.org/pdf/2407.15421
Copy Paste: [[2407.15421]] Planning behavior in a recurrent neural network that plays Sokoban(https://arxiv.org/abs/2407.15421)
Keywords: interpretability
Abstract: To predict how advanced neural networks generalize to novel situations, it is essential to understand how they reason. Guez et al. (2019, "An investigation of model-free planning") trained a recurrent neural network (RNN) to play Sokoban with model-free reinforcement learning. They found that adding extra computation steps to the start of episodes at test time improves the RNN's success rate. We further investigate this phenomenon, finding that it rapidly emerges early on in training and then slowly fades, but only for comparatively easier levels. The RNN also often takes redundant actions at episode starts, and these are reduced by adding extra computation steps. Our results suggest that the RNN learns to take time to think by `pacing', despite the per-step penalties, indicating that training incentivizes planning capabilities. The small size (1.29M parameters) and interesting behavior of this model make it an excellent model organism for mechanistic interpretability.

Title: Bidirectional skip-frame prediction for video anomaly detection with intra-domain disparity-driven attention

Authors: Jiahao Lyu, Minghua Zhao, Jing Hu, Runtao Xi, Xuewen Huang, Shuangli Du, Cheng Shi, Tian Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15424
Pdf URL: https://arxiv.org/pdf/2407.15424
Copy Paste: [[2407.15424]] Bidirectional skip-frame prediction for video anomaly detection with intra-domain disparity-driven attention(https://arxiv.org/abs/2407.15424)
Keywords: extraction
Abstract: With the widespread deployment of video surveillance devices and the demand for intelligent system development, video anomaly detection (VAD) has become an important part of constructing intelligent surveillance systems. Expanding the discriminative boundary between normal and abnormal events to enhance performance is the common goal and challenge of VAD. To address this problem, we propose a Bidirectional Skip-frame Prediction (BiSP) network based on a dual-stream autoencoder, from the perspective of learning the intra-domain disparity between different features. The BiSP skips frames in the training phase to achieve the forward and backward frame prediction respectively, and in the testing phase, it utilizes bidirectional consecutive frames to co-predict the same intermediate frames, thus expanding the degree of disparity between normal and abnormal events. The BiSP designs the variance channel attention and context spatial attention from the perspectives of movement patterns and object scales, respectively, thus ensuring the maximization of the disparity between normal and abnormal in the feature extraction and delivery with different dimensions. Extensive experiments from four benchmark datasets demonstrate the effectiveness of the proposed BiSP, which substantially outperforms state-of-the-art competing methods.

Title: Empirical Capacity Model for Self-Attention Neural Networks

Authors: Aki Härmä, Marcin Pietrasik, Anna Wilbik
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2407.15425
Pdf URL: https://arxiv.org/pdf/2407.15425
Copy Paste: [[2407.15425]] Empirical Capacity Model for Self-Attention Neural Networks(https://arxiv.org/abs/2407.15425)
Keywords: transformer
Abstract: Large pretrained self-attention neural networks, or transformers, have been very successful in various tasks recently. The performance of a model on a given task depends on its ability to memorize and generalize the training data. Large transformer models, which may have billions of parameters, in theory have a huge capacity to memorize content. However, the current algorithms for the optimization fall short of the theoretical capacity, and the capacity is also highly dependent on the content. In this paper, we focus on the memory capacity of these models obtained using common training algorithms and synthetic training data. Based on the results, we derive an empirical capacity model (ECM) for a generic transformer. The ECM can be used to design task-specific transformer models with an optimal number of parameters in cases where the target memorization capability of the task can be defined.

Title: Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training

Authors: Ye Lin Tun, Chu Myaet Thwal, Minh N. H. Nguyen, Choong Seon Hong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2407.15426
Pdf URL: https://arxiv.org/pdf/2407.15426
Copy Paste: [[2407.15426]] Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training(https://arxiv.org/abs/2407.15426)
Keywords: privacy, federate
Abstract: Combining different data modalities enables deep neural networks to tackle complex tasks more effectively, making multimodal learning increasingly popular. To harness multimodal data closer to end users, it is essential to integrate multimodal learning with privacy-preserving training approaches such as federated learning (FL). However, compared to conventional unimodal learning, multimodal setting requires dedicated encoders for each modality, resulting in larger and more complex models that demand significant resources. This presents a substantial challenge for FL clients operating with limited computational resources and communication bandwidth. To address these challenges, we introduce LW-FedMML, a layer-wise federated multimodal learning approach, which decomposes the training process into multiple steps. Each step focuses on training only a portion of the model, thereby significantly reducing the memory and computational requirements. Moreover, FL clients only need to exchange the trained model portion with the central server, lowering the resulting communication cost. We conduct extensive experiments across various FL scenarios and multimodal learning setups to validate the effectiveness of our proposed method. The results demonstrate that LW-FedMML can compete with conventional end-to-end federated multimodal learning (FedMML) while significantly reducing the resource burden on FL clients. Specifically, LW-FedMML reduces memory usage by up to $2.7\times$, computational operations (FLOPs) by $2.4\times$, and total communication cost by $2.3\times$. We also introduce a progressive training approach called Prog-FedMML. While it offers lesser resource efficiency than LW-FedMML, Prog-FedMML has the potential to surpass the performance of end-to-end FedMML, making it a viable option for scenarios with fewer resource constraints.

Title: YOLO-pdd: A Novel Multi-scale PCB Defect Detection Method Using Deep Representations with Sequential Images

Authors: Bowen Liu, Dongjie Chen, Xiao Qi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15427
Pdf URL: https://arxiv.org/pdf/2407.15427
Copy Paste: [[2407.15427]] YOLO-pdd: A Novel Multi-scale PCB Defect Detection Method Using Deep Representations with Sequential Images(https://arxiv.org/abs/2407.15427)
Keywords: robust, extraction
Abstract: With the rapid growth of the PCB manufacturing industry, there is an increasing demand for computer vision inspection to detect defects during production. Improving the accuracy and generalization of PCB defect detection models remains a significant challenge. This paper proposes a high-precision, robust, and real-time end-to-end method for PCB defect detection based on deep Convolutional Neural Networks (CNN). Traditional methods often suffer from low accuracy and limited applicability. We propose a novel approach combining YOLOv5 and multiscale modules for hierarchical residual-like connections. In PCB defect detection, noise can confuse the background and small targets. The YOLOv5 model provides a strong foundation with its real-time processing and accurate object detection capabilities. The multi-scale module extends traditional approaches by incorporating hierarchical residual-like connections within a single block, enabling multiscale feature extraction. This plug-and-play module significantly enhances performance by extracting features at multiple scales and levels, which are useful for identifying defects of varying sizes and complexities. Our multi-scale architecture integrates feature extraction, defect localization, and classification into a unified network. Experiments on a large-scale PCB dataset demonstrate significant improvements in precision, recall, and F1-score compared to existing methods. This work advances computer vision inspection for PCB defect detection, providing a reliable solution for high-precision, robust, real-time, and domain-adaptive defect detection in the PCB manufacturing industry.

Title: Decoding BACnet Packets: A Large Language Model Approach for Packet Interpretation

Authors: Rashi Sharma, Hiroyuki Okada, Tatsumi Oba, Karthikk Subramanian, Naoto Yanai, Sugiri Pranata
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15428
Pdf URL: https://arxiv.org/pdf/2407.15428
Copy Paste: [[2407.15428]] Decoding BACnet Packets: A Large Language Model Approach for Packet Interpretation(https://arxiv.org/abs/2407.15428)
Keywords: security, large language model
Abstract: The Industrial Control System (ICS) environment encompasses a wide range of intricate communication protocols, posing substantial challenges for Security Operations Center (SOC) analysts tasked with monitoring, interpreting, and addressing network activities and security incidents. Conventional monitoring tools and techniques often struggle to provide a clear understanding of the nature and intent of ICS-specific communications. To enhance comprehension, we propose a software solution powered by a Large Language Model (LLM). This solution currently focused on BACnet protocol, processes a packet file data and extracts context by using a mapping database, and contemporary context retrieval methods for Retrieval Augmented Generation (RAG). The processed packet information, combined with the extracted context, serves as input to the LLM, which generates a concise packet file summary for the user. The software delivers a clear, coherent, and easily understandable summary of network activities, enabling SOC analysts to better assess the current state of the control system.

Title: Learning at a Glance: Towards Interpretable Data-limited Continual Semantic Segmentation via Semantic-Invariance Modelling

Authors: Bo Yuan, Danpei Zhao, Zhenwei Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15429
Pdf URL: https://arxiv.org/pdf/2407.15429
Copy Paste: [[2407.15429]] Learning at a Glance: Towards Interpretable Data-limited Continual Semantic Segmentation via Semantic-Invariance Modelling(https://arxiv.org/abs/2407.15429)
Keywords: robust, interpretability, segmentation
Abstract: Continual semantic segmentation (CSS) based on incremental learning (IL) is a great endeavour in developing human-like segmentation models. However, current CSS approaches encounter challenges in the trade-off between preserving old knowledge and learning new ones, where they still need large-scale annotated data for incremental training and lack interpretability. In this paper, we present Learning at a Glance (LAG), an efficient, robust, human-like and interpretable approach for CSS. Specifically, LAG is a simple and model-agnostic architecture, yet it achieves competitive CSS efficiency with limited incremental data. Inspired by human-like recognition patterns, we propose a semantic-invariance modelling approach via semantic features decoupling that simultaneously reconciles solid knowledge inheritance and new-term learning. Concretely, the proposed decoupling manner includes two ways, i.e., channel-wise decoupling and spatial-level neuron-relevant semantic consistency. Our approach preserves semantic-invariant knowledge as solid prototypes to alleviate catastrophic forgetting, while also constraining sample-specific contents through an asymmetric contrastive learning method to enhance model robustness during IL steps. Experimental results in multiple datasets validate the effectiveness of the proposed method. Furthermore, we introduce a novel CSS protocol that better reflects realistic data-limited CSS settings, and LAG achieves superior performance under multiple data-limited conditions.

Title: Enhancement of 3D Gaussian Splatting using Raw Mesh for Photorealistic Recreation of Architectures

Authors: Ruizhe Wang, Chunliang Hua, Tomakayev Shingys, Mengyuan Niu, Qingxin Yang, Lizhong Gao, Yi Zheng, Junyan Yang, Qiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15435
Pdf URL: https://arxiv.org/pdf/2407.15435
Copy Paste: [[2407.15435]] Enhancement of 3D Gaussian Splatting using Raw Mesh for Photorealistic Recreation of Architectures(https://arxiv.org/abs/2407.15435)
Keywords: protect
Abstract: The photorealistic reconstruction and rendering of architectural scenes have extensive applications in industries such as film, games, and transportation. It also plays an important role in urban planning, architectural design, and the city's promotion, especially in protecting historical and cultural relics. The 3D Gaussian Splatting, due to better performance over NeRF, has become a mainstream technology in 3D reconstruction. Its only input is a set of images but it relies heavily on geometric parameters computed by the SfM process. At the same time, there is an existing abundance of raw 3D models, that could inform the structural perception of certain buildings but cannot be applied. In this paper, we propose a straightforward method to harness these raw 3D models to guide 3D Gaussians in capturing the basic shape of the building and improve the visual quality of textures and details when photos are captured non-systematically. This exploration opens up new possibilities for improving the effectiveness of 3D reconstruction techniques in the field of architectural design.

Title: Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays

Authors: Ziqun Chen, Kechao Cai, Zhuoyue Chen, Jinbei Zhang, John C.S. Lui
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2407.15439
Pdf URL: https://arxiv.org/pdf/2407.15439
Copy Paste: [[2407.15439]] Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays(https://arxiv.org/abs/2407.15439)
Keywords: fair
Abstract: We study the stochastic combinatorial semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints. This is motivated by applications such as crowdsourcing, and online advertising, where immediate feedback is not immediately available and fairness among different choices (or arms) is crucial. We consider two types of unrestricted feedback delays: reward-independent delays where the feedback delays are independent of the rewards, and reward-dependent delays where the feedback delays are correlated with the rewards. Furthermore, we introduce merit-based fairness constraints to ensure a fair selection of the arms. We define the reward regret and the fairness regret and present new bandit algorithms to select arms under unrestricted feedback delays based on their merits. We prove that our algorithms all achieve sublinear expected reward regret and expected fairness regret, with a dependence on the quantiles of the delay distribution. We also conduct extensive experiments using synthetic and real-world data and show that our algorithms can fairly select arms with different feedback delays.

Title: Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned

Authors: Song Wang, Xun Wang, Jie Mei, Yujia Xie, Sean Muarray, Zhang Li, Lingfeng Wu, Si-Qing Chen, Wayne Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15441
Pdf URL: https://arxiv.org/pdf/2407.15441
Copy Paste: [[2407.15441]] Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned(https://arxiv.org/abs/2407.15441)
Keywords: large language model
Abstract: Hallucination, a phenomenon where large language models (LLMs) produce output that is factually incorrect or unrelated to the input, is a major challenge for LLM applications that require accuracy and dependability. In this paper, we introduce a reliable and high-speed production system aimed at detecting and rectifying the hallucination issue within LLMs. Our system encompasses named entity recognition (NER), natural language inference (NLI), span-based detection (SBD), and an intricate decision tree-based process to reliably detect a wide range of hallucinations in LLM responses. Furthermore, our team has crafted a rewriting mechanism that maintains an optimal mix of precision, response time, and cost-effectiveness. We detail the core elements of our framework and underscore the paramount challenges tied to response time, availability, and performance metrics, which are crucial for real-world deployment of these technologies. Our extensive evaluation, utilizing offline data and live production traffic, confirms the efficacy of our proposed framework and service.

Title: Text2Place: Affordance-aware Text Guided Human Placement

Authors: Rishubh Parihar, Harsh Gupta, Sachidanand VS, R. Venkatesh Babu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15446
Pdf URL: https://arxiv.org/pdf/2407.15446
Copy Paste: [[2407.15446]] Text2Place: Affordance-aware Text Guided Human Placement(https://arxiv.org/abs/2407.15446)
Keywords: generative
Abstract: For a given scene, humans can easily reason for the locations and pose to place objects. Designing a computational model to reason about these affordances poses a significant challenge, mirroring the intuitive reasoning abilities of humans. This work tackles the problem of realistic human insertion in a given background scene termed as \textbf{Semantic Human Placement}. This task is extremely challenging given the diverse backgrounds, scale, and pose of the generated person and, finally, the identity preservation of the person. We divide the problem into the following two stages \textbf{i)} learning \textit{semantic masks} using text guidance for localizing regions in the image to place humans and \textbf{ii)} subject-conditioned inpainting to place a given subject adhering to the scene affordance within the \textit{semantic masks}. For learning semantic masks, we leverage rich object-scene priors learned from the text-to-image generative models and optimize a novel parameterization of the semantic mask, eliminating the need for large-scale training. To the best of our knowledge, we are the first ones to provide an effective solution for realistic human placements in diverse real-world scenes. The proposed method can generate highly realistic scene compositions while preserving the background and subject identity. Further, we present results for several downstream tasks - scene hallucination from a single or multiple generated persons and text-based attribute editing. With extensive comparisons against strong baselines, we show the superiority of our method in realistic human placement.

Title: SIGMA:Sinkhorn-Guided Masked Video Modeling

Authors: Mohammadreza Salehi, Michael Dorkenwald, Fida Mohammad Thoker, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15447
Pdf URL: https://arxiv.org/pdf/2407.15447
Copy Paste: [[2407.15447]] SIGMA:Sinkhorn-Guided Masked Video Modeling(https://arxiv.org/abs/2407.15447)
Keywords: robust
Abstract: Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts cluster assignment of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations improving upon state-of-the-art methods. Our project website with code is available at: this https URL.

Title: Text-to-Battery Recipe: A language modeling-based protocol for automatic battery recipe extraction and retrieval

Authors: Daeun Lee, Jaewoong Choi, Hiroshi Mizuseki, Byungju Lee
Subjects: cs.CL, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2407.15459
Pdf URL: https://arxiv.org/pdf/2407.15459
Copy Paste: [[2407.15459]] Text-to-Battery Recipe: A language modeling-based protocol for automatic battery recipe extraction and retrieval(https://arxiv.org/abs/2407.15459)
Keywords: extraction
Abstract: Recent studies have increasingly applied natural language processing (NLP) to automatically extract experimental research data from the extensive battery materials literature. Despite the complex process involved in battery manufacturing -- from material synthesis to cell assembly -- there has been no comprehensive study systematically organizing this information. In response, we propose a language modeling-based protocol, Text-to-Battery Recipe (T2BR), for the automatic extraction of end-to-end battery recipes, validated using a case study on batteries containing LiFePO4 cathode material. We report machine learning-based paper filtering models, screening 2,174 relevant papers from the keyword-based search results, and unsupervised topic models to identify 2,876 paragraphs related to cathode synthesis and 2,958 paragraphs related to cell assembly. Then, focusing on the two topics, two deep learning-based named entity recognition models are developed to extract a total of 30 entities -- including precursors, active materials, and synthesis methods -- achieving F1 scores of 88.18% and 94.61%. The accurate extraction of entities enables the systematic generation of 165 end-toend recipes of LiFePO4 batteries. Our protocol and results offer valuable insights into specific trends, such as associations between precursor materials and synthesis methods, or combinations between different precursor materials. We anticipate that our findings will serve as a foundational knowledge base for facilitating battery-recipe information retrieval. The proposed protocol will significantly accelerate the review of battery material literature and catalyze innovations in battery design and development.

Title: The Diversity Bonus: Learning from Dissimilar Distributed Clients in Personalized Federated Learning

Authors: Xinghao Wu, Xuefeng Liu, Jianwei Niu, Guogang Zhu, Shaojie Tang, Xiaotian Li, Jiannong Cao
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2407.15464
Pdf URL: https://arxiv.org/pdf/2407.15464
Copy Paste: [[2407.15464]] The Diversity Bonus: Learning from Dissimilar Distributed Clients in Personalized Federated Learning(https://arxiv.org/abs/2407.15464)
Keywords: federate
Abstract: Personalized Federated Learning (PFL) is a commonly used framework that allows clients to collaboratively train their personalized models. PFL is particularly useful for handling situations where data from different clients are not independent and identically distributed (non-IID). Previous research in PFL implicitly assumes that clients can gain more benefits from those with similar data distributions. Correspondingly, methods such as personalized weight aggregation are developed to assign higher weights to similar clients during training. We pose a question: can a client benefit from other clients with dissimilar data distributions and if so, how? This question is particularly relevant in scenarios with a high degree of non-IID, where clients have widely different data distributions, and learning from only similar clients will lose knowledge from many other clients. We note that when dealing with clients with similar data distributions, methods such as personalized weight aggregation tend to enforce their models to be close in the parameter space. It is reasonable to conjecture that a client can benefit from dissimilar clients if we allow their models to depart from each other. Based on this idea, we propose DiversiFed which allows each client to learn from clients with diversified data distribution in personalized federated learning. DiversiFed pushes personalized models of clients with dissimilar data distributions apart in the parameter space while pulling together those with similar distributions. In addition, to achieve the above effect without using prior knowledge of data distribution, we design a loss function that leverages the model similarity to determine the degree of attraction and repulsion between any two models. Experiments on several datasets show that DiversiFed can benefit from dissimilar clients and thus outperform the state-of-the-art methods.

Title: Learning deep illumination-robust features from multispectral filter array images

Authors: Anis Amziane
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15472
Pdf URL: https://arxiv.org/pdf/2407.15472
Copy Paste: [[2407.15472]] Learning deep illumination-robust features from multispectral filter array images(https://arxiv.org/abs/2407.15472)
Keywords: robust
Abstract: Multispectral (MS) snapshot cameras equipped with a MS filter array (MSFA), capture multiple spectral bands in a single shot, resulting in a raw mosaic image where each pixel holds only one channel value. The fully-defined MS image is estimated from the raw one through $\textit{demosaicing}$, which inevitably introduces spatio-spectral artifacts. Moreover, training on fully-defined MS images can be computationally intensive, particularly with deep neural networks (DNNs), and may result in features lacking discrimination power due to suboptimal learning of spatio-spectral interactions. Furthermore, outdoor MS image acquisition occurs under varying lighting conditions, leading to illumination-dependent features. This paper presents an original approach to learn discriminant and illumination-robust features directly from raw images. It involves: $\textit{raw spectral constancy}$ to mitigate the impact of illumination, $\textit{MSFA-preserving}$ transformations suited for raw image augmentation to train DNNs on diverse raw textures, and $\textit{raw-mixing}$ to capture discriminant spatio-spectral interactions in raw images. Experiments on MS image classification show that our approach outperforms both handcrafted and recent deep learning-based methods, while also requiring significantly less computational effort.~The source code will be available.

Title: In-Context Learning Improves Compositional Understanding of Vision-Language Models

Authors: Matteo Nulli, Anesa Ibrahimi, Avik Pal, Hoshe Lee, Ivona Najdenkoska
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15487
Pdf URL: https://arxiv.org/pdf/2407.15487
Copy Paste: [[2407.15487]] In-Context Learning Improves Compositional Understanding of Vision-Language Models(https://arxiv.org/abs/2407.15487)
Keywords: generative
Abstract: Vision-Language Models (VLMs) have shown remarkable capabilities in a large number of downstream tasks. Nonetheless, compositional image understanding remains a rather difficult task due to the object bias present in training data. In this work, we investigate the reasons for such a lack of capability by performing an extensive bench-marking of compositional understanding in VLMs. We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses. Furthermore, we leverage In-Context Learning (ICL) as a way to improve the ability of VLMs to perform more complex reasoning and understanding given an image. Our extensive experiments demonstrate that our proposed approach outperforms baseline models across multiple compositional understanding datasets.

Title: DiffX: Guide Your Layout to Cross-Modal Generative Modeling

Authors: Zeyu Wang, Jingyu Lin, Yifei Qian, Yi Huang, Shicen Tian, Bosong Chai, Juncan Deng, Lan Du, Cunjian Chen, Yufei Guo, Kejie Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15488
Pdf URL: https://arxiv.org/pdf/2407.15488
Copy Paste: [[2407.15488]] DiffX: Guide Your Layout to Cross-Modal Generative Modeling(https://arxiv.org/abs/2407.15488)
Keywords: robust, diffusion, generative
Abstract: Diffusion models have made significant strides in text-driven and layout-driven image generation. However, most diffusion models are limited to visible RGB image generation. In fact, human perception of the world is enriched by diverse viewpoints, including chromatic contrast, thermal illumination, and depth information. In this paper, we introduce a novel diffusion model for general layout-guided cross-modal "RGB+X" generation, called DiffX. We firstly construct the cross-modal image datasets with text descriptions using the LLaVA model for image captioning, supplemented by manual corrections. Notably, DiffX presents a simple yet effective cross-modal generative modeling pipeline, which conducts diffusion and denoising processes in the modality-shared latent space, facilitated by our Dual-Path Variational AutoEncoder (DP-VAE). Furthermore, we incorporate the gated cross-attention mechanism to connect the layout and text conditions, leveraging Long-CLIP for embedding long captions to enhance user guidance. Through extensive experiments, DiffX demonstrates robustness and flexibility in cross-modal generation across three RGB+X datasets: FLIR, MFNet, and COME15K, guided by various layout types. It also shows the potential for adaptive generation of "RGB+X+Y" or more diverse modalities. Our code and processed image captions are available at this https URL.

Title: Fast computation of 2-isogenies in dimension 4 and cryptographic applications

Authors: Pierrick Dartois
Subjects: cs.CR, math.AG
Abstract URL: https://arxiv.org/abs/2407.15492
Pdf URL: https://arxiv.org/pdf/2407.15492
Copy Paste: [[2407.15492]] Fast computation of 2-isogenies in dimension 4 and cryptographic applications(https://arxiv.org/abs/2407.15492)
Keywords: attack
Abstract: Dimension 4 isogenies have first been introduced in cryptography for the cryptanalysis of Supersingular Isogeny Diffie-Hellman (SIDH) and have been used constructively in several schemes, including SQIsignHD, a derivative of SQIsign isogeny based signature scheme. Unlike in dimensions 2 and 3, we can no longer rely on the Jacobian model and its derivatives to compute isogenies. In dimension 4 (and higher), we can only use theta-models. Previous works by Romain Cosset, David Lubicz and Damien Robert have focused on the computation of $\ell$-isogenies in theta-models of level $n$ coprime to $\ell$ (which requires to use $n^g$ coordinates in dimension $g$). For cryptographic applications, we need to compute chains of $2$-isogenies, requiring to use $\geq 3^g$ coordinates in dimension $g$ with state of the art algorithms. In this paper, we present algorithms to compute chains of $2$-isogenies between abelian varieties of dimension $g\geq 1$ with theta-coordinates of level $n=2$, generalizing a previous work by Pierrick Dartois, Luciano Maino, Giacomo Pope and Damien Robert in dimension $g=2$. We propose an implementation of these algorithms in dimension $g=4$ to compute endomorphisms of elliptic curve products derived from Kani's lemma with applications to SQIsignHD and SIDH cryptanalysis. We are now able to run a complete key recovery attack on SIDH when the endomorphism ring of the starting curve is unknown within a few seconds on a laptop for all NIST SIKE parameters.

Title: Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction

Authors: Dingyao Yu, Yang An, Wei Ye, Xiongfeng Xiao, Shaoguang Mao, Tao Ge, Shikun Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15498
Pdf URL: https://arxiv.org/pdf/2407.15498
Copy Paste: [[2407.15498]] Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction(https://arxiv.org/abs/2407.15498)
Keywords: robust
Abstract: Chinese Spelling Correction (CSC) commonly lacks large-scale high-quality corpora, due to the labor-intensive labeling of spelling errors in real-life human writing or typing scenarios. Two data augmentation methods are widely adopted: (1) \textit{Random Replacement} with the guidance of confusion sets and (2) \textit{OCR/ASR-based Generation} that simulates character misusing. However, both methods inevitably introduce noisy data (e.g., false spelling errors), potentially leading to over-correction. By carefully analyzing the two types of corpora, we find that though the latter achieves more robust generalization performance, the former yields better-calibrated CSC models. We then provide a theoretical analysis of this empirical observation, based on which a corpus refining strategy is proposed. Specifically, OCR/ASR-based data samples are fed into a well-calibrated CSC model trained on random replacement-based corpora and then filtered based on prediction confidence. By learning a simple BERT-based model on the refined OCR/ASR-based corpus, we set up impressive state-of-the-art performance on three widely-used benchmarks, while significantly alleviating over-correction (e.g., lowering false positive predictions).

Title: TextureCrop: Enhancing Synthetic Image Detection through Texture-based Cropping

Authors: Despina Konstantinidou, Christos Koutlis, Symeon Papadopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15500
Pdf URL: https://arxiv.org/pdf/2407.15500
Copy Paste: [[2407.15500]] TextureCrop: Enhancing Synthetic Image Detection through Texture-based Cropping(https://arxiv.org/abs/2407.15500)
Keywords: generative
Abstract: Generative AI technologies produce hyper-realistic imagery that can be used for nefarious purposes such as producing misleading or harmful content, among others. This makes Synthetic Image Detection (SID) an essential tool for defending against AI-generated harmful content. Current SID methods typically resize input images to a fixed resolution or perform center-cropping due to computational concerns, leading to challenges in effectively detecting artifacts in high-resolution images. To this end, we propose TextureCrop, a novel image pre-processing technique. By focusing on high-frequency image parts where generation artifacts are prevalent, TextureCrop effectively enhances SID accuracy while maintaining manageable memory requirements. Experimental results demonstrate a consistent improvement in AUC across various detectors by 5.7% compared to center cropping and by 14% compared to resizing, across high-resolution images from the Forensynths and Synthbuster datasets.

Title: WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation

Authors: Zirui Shao, Feiyu Gao, Hangdi Xing, Zepeng Zhu, Zhi Yu, Jiajun Bu, Qi Zheng, Cong Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15502
Pdf URL: https://arxiv.org/pdf/2407.15502
Copy Paste: [[2407.15502]] WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation(https://arxiv.org/abs/2407.15502)
Keywords: generative
Abstract: In the era of content creation revolution propelled by advancements in generative models, the field of web design remains unexplored despite its critical role in modern digital communication. The web design process is complex and often time-consuming, especially for those with limited expertise. In this paper, we introduce Web Rendering Parameters Generation (WebRPG), a new task that aims at automating the generation for visual presentation of web pages based on their HTML code. WebRPG would contribute to a faster web development workflow. Since there is no existing benchmark available, we develop a new dataset for WebRPG through an automated pipeline. Moreover, we present baseline models, utilizing VAE to manage numerous elements and rendering parameters, along with custom HTML embedding for capturing essential semantic and hierarchical information from HTML. Extensive experiments, including customized quantitative evaluations for this specific task, are conducted to evaluate the quality of the generated results.

Title: Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models

Authors: Adway Girish, Alliot Nagle, Marco Bondaschi, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim
Subjects: cs.LG, cs.CL, cs.IT
Abstract URL: https://arxiv.org/abs/2407.15504
Pdf URL: https://arxiv.org/pdf/2407.15504
Copy Paste: [[2407.15504]] Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models(https://arxiv.org/abs/2407.15504)
Keywords: large language model
Abstract: We formalize the problem of prompt compression for large language models (LLMs) and present a framework to unify token-level prompt compression methods which create hard prompts for black-box models. We derive the distortion-rate function for this setup as a linear program, and provide an efficient algorithm to compute this fundamental limit via the dual of the linear program. Using the distortion-rate function as the baseline, we study the performance of existing compression schemes on a synthetic dataset consisting of prompts generated from a Markov chain, natural language queries, and their respective answers. Our empirical analysis demonstrates the criticality of query-aware prompt compression, where the compressor has knowledge of the downstream task/query for the black-box LLM. We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy, and propose a query-aware, variable-rate adaptation of a prior work to close the gap. We extend our experiments to a small natural language dataset to further confirm our findings on our synthetic dataset.

Title: SpotDiffusion: A Fast Approach For Seamless Panorama Generation Over Time

Authors: Stanislav Frolov, Brian B. Moser, Andreas Dengel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15507
Pdf URL: https://arxiv.org/pdf/2407.15507
Copy Paste: [[2407.15507]] SpotDiffusion: A Fast Approach For Seamless Panorama Generation Over Time(https://arxiv.org/abs/2407.15507)
Keywords: diffusion, generative
Abstract: Generating high-resolution images with generative models has recently been made widely accessible by leveraging diffusion models pre-trained on large-scale datasets. Various techniques, such as MultiDiffusion and SyncDiffusion, have further pushed image generation beyond training resolutions, i.e., from square images to panorama, by merging multiple overlapping diffusion paths or employing gradient descent to maintain perceptual coherence. However, these methods suffer from significant computational inefficiencies due to generating and averaging numerous predictions, which is required in practice to produce high-quality and seamless images. This work addresses this limitation and presents a novel approach that eliminates the need to generate and average numerous overlapping denoising predictions. Our method shifts non-overlapping denoising windows over time, ensuring that seams in one timestep are corrected in the next. This results in coherent, high-resolution images with fewer overall steps. We demonstrate the effectiveness of our approach through qualitative and quantitative evaluations, comparing it with MultiDiffusion, SyncDiffusion, and StitchDiffusion. Our method offers several key benefits, including improved computational efficiency and faster inference times while producing comparable or better image quality.

Title: Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners

Authors: Yifei Gao, Jie Ou, Lei Wang, Fanhua Shang, Jaji Wu, Jun Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15508
Pdf URL: https://arxiv.org/pdf/2407.15508
Copy Paste: [[2407.15508]] Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners(https://arxiv.org/abs/2407.15508)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) showcase remarkable performance and robust deductive capabilities, yet their expansive size complicates deployment and raises environmental concerns due to substantial resource consumption. The recent development of a quantization technique known as Learnable Singular-value Increment (LSI) has addressed some of these quantization challenges. Leveraging insights from LSI and our extensive research, we have developed innovative methods that enhance the performance of quantized LLMs, particularly in low-bit settings. Our methods consistently deliver state-of-the-art results across various quantization scenarios and offer deep theoretical insights into the quantization process, elucidating the potential of quantized models for widespread application.

Title: Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation

Authors: Francisco Mena, Diego Arenas, Andreas Dengel
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2407.15512
Pdf URL: https://arxiv.org/pdf/2407.15512
Copy Paste: [[2407.15512]] Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation(https://arxiv.org/abs/2407.15512)
Keywords: robust
Abstract: Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.

Title: Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

Authors: Georgy Tyukin, Gbetondji J-S Dovonon, Jean Kaddour, Pasquale Minervini
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2407.15516
Pdf URL: https://arxiv.org/pdf/2407.15516
Copy Paste: [[2407.15516]] Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models(https://arxiv.org/abs/2407.15516)
Keywords: large language model
Abstract: The inference demand for LLMs has skyrocketed in recent months, and serving models with low latencies remains challenging due to the quadratic input length complexity of the attention layers. In this work, we investigate the effect of dropping MLP and attention layers at inference time on the performance of Llama-v2 models. We find that dropping dreeper attention layers only marginally decreases performance but leads to the best speedups alongside dropping entire layers. For example, removing 33\% of attention layers in a 13B Llama2 model results in a 1.8\% drop in average performance over the OpenLLM benchmark. We also observe that skipping layers except the latter layers reduces performances for more layers skipped, except for skipping the attention layers.

Title: Towards Efficient Transferable Preemptive Adversarial Defense

Authors: Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Isao Echizen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2407.15524
Pdf URL: https://arxiv.org/pdf/2407.15524
Copy Paste: [[2407.15524]] Towards Efficient Transferable Preemptive Adversarial Defense(https://arxiv.org/abs/2407.15524)
Keywords: protect, defense, attack
Abstract: Deep learning technology has brought convenience and advanced developments but has become untrustworthy because of its sensitivity to inconspicuous perturbations (i.e., adversarial attacks). Attackers utilize this sensitivity to slightly manipulate transmitted messages. To defend against such attacks, we have devised a strategy for "attacking" the message before it is attacked. This strategy, dubbed Fast Preemption, provides an efficient transferable preemptive defense by using different models for labeling inputs and learning crucial features. A forward-backward cascade learning algorithm is used to compute protective perturbations, starting with forward propagation optimization to achieve rapid convergence, followed by iterative backward propagation learning to alleviate overfitting. This strategy offers state-of-the-art transferability and protection across various systems. With the running of only three steps, our Fast Preemption framework outperforms benchmark training-time, test-time, and preemptive adversarial defenses. We have also devised the first to our knowledge effective white-box adaptive reversion attack and demonstrate that the protection added by our defense strategy is irreversible unless the backbone model, algorithm, and settings are fully compromised. This work provides a new direction to developing active defenses against adversarial attacks.

Title: Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks

Authors: Eugenio Lomurno, Matteo Matteucci
Subjects: cs.CV, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15526
Pdf URL: https://arxiv.org/pdf/2407.15526
Copy Paste: [[2407.15526]] Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks(https://arxiv.org/abs/2407.15526)
Keywords: privacy, attack, membership infer, generative
Abstract: Generative artificial intelligence has transformed the generation of synthetic data, providing innovative solutions to challenges like data scarcity and privacy, which are particularly critical in fields such as medicine. However, the effective use of this synthetic data to train high-performance models remains a significant challenge. This paper addresses this issue by introducing Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers. At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information provided to classifiers through a synthetic dataset regeneration and soft labelling mechanism. The KR pipeline has been tested on a variety of datasets, with a focus on six highly heterogeneous medical image datasets, ranging from retinal images to organ scans. The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases. Furthermore, the resulting models show almost complete immunity to Membership Inference Attacks, manifesting privacy properties missing in models trained with conventional techniques.

Title: Inverted Activations

Authors: Georgii Novikov, Ivan Oseledets
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2407.15545
Pdf URL: https://arxiv.org/pdf/2407.15545
Copy Paste: [[2407.15545]] Inverted Activations(https://arxiv.org/abs/2407.15545)
Keywords: transformer
Abstract: The scaling of neural networks with increasing data and model sizes necessitates more efficient deep learning algorithms. This paper addresses the memory footprint challenge in neural network training by proposing a modification to the handling of activation tensors in pointwise nonlinearity layers. Traditionally, these layers save the entire input tensor for the backward pass, leading to substantial memory use. Our method involves saving the output tensor instead, reducing the memory required when the subsequent layer also saves its input tensor. This approach is particularly beneficial for transformer-based architectures like GPT, BERT, Mistral, and Llama. Application of our method involves taken an inverse function of nonlinearity. To the best of our knowledge, that can not be done analitically and instead we buid an accurate approximations using simpler functions. Experimental results confirm that our method significantly reduces memory usage without affecting training accuracy. The implementation is available at this https URL.

Title: Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Authors: Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2407.15549
Pdf URL: https://arxiv.org/pdf/2407.15549
Copy Paste: [[2407.15549]] Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs(https://arxiv.org/abs/2407.15549)
Keywords: attack, robust, interpretability, large language model
Abstract: Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of `jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.

Title: SETTP: Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning

Authors: Chunzhen Jin, Yongfeng Huang, Yaqi Wang, Peng Cao, Osmar Zaiane
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15556
Pdf URL: https://arxiv.org/pdf/2407.15556
Copy Paste: [[2407.15556]] SETTP: Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning(https://arxiv.org/abs/2407.15556)
Keywords: extraction
Abstract: Text style transfer, an important research direction in natural language processing, aims to adapt the text to various preferences but often faces challenges with limited resources. In this work, we introduce a novel method termed Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning (SETTP) for effective style transfer in low-resource scenarios. First, SETTP learns source style-level prompts containing fundamental style characteristics from high-resource style transfer. During training, the source style-level prompts are transferred through an attention module to derive a target style-level prompt for beneficial knowledge provision in low-resource style transfer. Additionally, we propose instance-level prompts obtained by clustering the target resources based on the semantic content to reduce semantic bias. We also propose an automated evaluation approach of style similarity based on alignment with human evaluations using ChatGPT-4. Our experiments across three resourceful styles show that SETTP requires only 1/20th of the data volume to achieve performance comparable to state-of-the-art methods. In tasks involving scarce data like writing style and role style, SETTP outperforms previous methods by 16.24\%.

Title: A New Theoretical Perspective on Data Heterogeneity in Federated Optimization

Authors: Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji
Subjects: cs.LG, cs.DC, cs.IT, math.OC
Abstract URL: https://arxiv.org/abs/2407.15567
Pdf URL: https://arxiv.org/pdf/2407.15567
Copy Paste: [[2407.15567]] A New Theoretical Perspective on Data Heterogeneity in Federated Optimization(https://arxiv.org/abs/2407.15567)
Keywords: federate
Abstract: In federated learning (FL), data heterogeneity is the main reason that existing theoretical analyses are pessimistic about the convergence rate. In particular, for many FL algorithms, the convergence rate grows dramatically when the number of local updates becomes large, especially when the product of the gradient divergence and local Lipschitz constant is large. However, empirical studies can show that more local updates can improve the convergence rate even when these two parameters are large, which is inconsistent with the theoretical findings. This paper aims to bridge this gap between theoretical understanding and practical performance by providing a theoretical analysis from a new perspective on data heterogeneity. In particular, we propose a new and weaker assumption compared to the local Lipschitz gradient assumption, named the heterogeneity-driven pseudo-Lipschitz assumption. We show that this and the gradient divergence assumptions can jointly characterize the effect of data heterogeneity. By deriving a convergence upper bound for FedAvg and its extensions, we show that, compared to the existing works, local Lipschitz constant is replaced by the much smaller heterogeneity-driven pseudo-Lipschitz constant and the corresponding convergence upper bound can be significantly reduced for the same number of local updates, although its order stays the same. In addition, when the local objective function is quadratic, more insights on the impact of data heterogeneity can be obtained using the heterogeneity-driven pseudo-Lipschitz constant. For example, we can identify a region where FedAvg can outperform mini-batch SGD even when the gradient divergence can be arbitrarily large. Our findings are validated using experiments.

Title: An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought

Authors: Yuetong Zhao, Hongyu Cao, Xianyu Zhao, Zhijian Ou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15569
Pdf URL: https://arxiv.org/pdf/2407.15569
Copy Paste: [[2407.15569]] An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought(https://arxiv.org/abs/2407.15569)
Keywords: extraction, generative
Abstract: Since the launch of ChatGPT at the end of 2022, generative dialogue models represented by ChatGPT have quickly become essential tools in daily life. As user expectations increase, enhancing the capability of generative dialogue models to solve complex problems has become a focal point of current research. This paper delves into the effectiveness of the RAFT (Retrieval Augmented Fine-Tuning) method in improving the performance of Generative dialogue models. RAFT combines chain-of-thought with model supervised fine-tuning (SFT) and retrieval augmented generation (RAG), which significantly enhanced the model's information extraction and logical reasoning abilities. We evaluated the RAFT method across multiple datasets and analysed its performance in various reasoning tasks, including long-form QA and short-form QA tasks, tasks in both Chinese and English, and supportive and comparison reasoning tasks. Notably, it addresses the gaps in previous research regarding long-form QA tasks and Chinese datasets. Moreover, we also evaluate the benefit of the chain-of-thought (CoT) in the RAFT method. This work offers valuable insights for studies focused on enhancing the performance of generative dialogue models.

Title: Unsupervised Robust Cross-Lingual Entity Alignment via Joint Modeling of Entity and Relation Texts

Authors: Soojin Yoon, Sungho Ko, Tongyoung Kim, SeongKu Kang, Jinyoung Yeo, Dongha Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15588
Pdf URL: https://arxiv.org/pdf/2407.15588
Copy Paste: [[2407.15588]] Unsupervised Robust Cross-Lingual Entity Alignment via Joint Modeling of Entity and Relation Texts(https://arxiv.org/abs/2407.15588)
Keywords: robust
Abstract: Cross-lingual entity alignment (EA) enables the integration of multiple knowledge graphs (KGs) across different languages, providing users with seamless access to diverse and comprehensive knowledge.Existing methods, mostly supervised, face challenges in obtaining labeled entity pairs. To address this, recent studies have shifted towards a self-supervised and unsupervised frameworks. Despite their effectiveness, these approaches have limitations: (1) they mainly focus on entity features, neglecting the semantic information of relations, (2) they assume isomorphism between source and target graphs, leading to noise and reduced alignment accuracy, and (3) they are susceptible to noise in the textual features, especially when encountering inconsistent translations or Out-Of-Vocabulary (OOV) problems. In this paper, we propose ERAlign, an unsupervised and robust cross-lingual EA framework that jointly performs Entity-level and Relation-level Alignment using semantic textual features of relations and entities. Its refinement process iteratively enhances results by fusing entity-level and relation-level alignments based on neighbor triple matching. The additional verification process examines the entities' neighbor triples as the linearized text. This \textit{Align-and-Verify} pipeline that rigorously assesses alignment results, achieving near-perfect alignment even in the presence of noisy textual features of entities. Our extensive experiments demonstrate that robustness and general applicability of \proposed improved the accuracy and effectiveness of EA tasks, contributing significantly to knowledge-oriented applications.

Title: Discrete Flow Matching

Authors: Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, Yaron Lipman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15595
Pdf URL: https://arxiv.org/pdf/2407.15595
Copy Paste: [[2407.15595]] Discrete Flow Matching(https://arxiv.org/abs/2407.15595)
Keywords: diffusion, generative
Abstract: Despite Flow Matching and diffusion models having emerged as powerful generative paradigms for continuous variables such as images and videos, their application to high-dimensional discrete data, such as language, is still limited. In this work, we present Discrete Flow Matching, a novel discrete flow paradigm designed specifically for generating discrete data. Discrete Flow Matching offers several key contributions: (i) it works with a general family of probability paths interpolating between source and target distributions; (ii) it allows for a generic formula for sampling from these probability paths using learned posteriors such as the probability denoiser ($x$-prediction) and noise-prediction ($\epsilon$-prediction); (iii) practically, focusing on specific probability paths defined with different schedulers considerably improves generative perplexity compared to previous discrete diffusion and flow models; and (iv) by scaling Discrete Flow Matching models up to 1.7B parameters, we reach 6.7% Pass@1 and 13.4% Pass@10 on HumanEval and 6.7% Pass@1 and 20.6% Pass@10 on 1-shot MBPP coding benchmarks. Our approach is capable of generating high-quality discrete data in a non-autoregressive fashion, significantly closing the gap between autoregressive models and discrete flow models.

Title: Semi-Supervised Learning for Anomaly Detection in Blockchain-based Supply Chains

Authors: Do Hai Son, Bui Duc Manh, Tran Viet Khoa, Nguyen Linh Trung, Dinh Thai Hoang, Hoang Trong Minh, Yibeltal Alem, Le Quang Minh
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2407.15603
Pdf URL: https://arxiv.org/pdf/2407.15603
Copy Paste: [[2407.15603]] Semi-Supervised Learning for Anomaly Detection in Blockchain-based Supply Chains(https://arxiv.org/abs/2407.15603)
Keywords: attack
Abstract: Blockchain-based supply chain (BSC) systems have tremendously been developed recently and can play an important role in our society in the future. In this study, we develop an anomaly detection model for BSC systems. Our proposed model can detect cyber-attacks at various levels, including the network layer, consensus layer, and beyond, by analyzing only the traffic data at the network layer. To do this, we first build a BSC system at our laboratory to perform experiments and collect datasets. We then propose a novel semi-supervised DAE-MLP (Deep AutoEncoder-Multilayer Perceptron) that combines the advantages of supervised and unsupervised learning to detect anomalies in BSC systems. The experimental results demonstrate the effectiveness of our model for anomaly detection within BSCs, achieving a detection accuracy of 96.5%. Moreover, DAE-MLP can effectively detect new attacks by improving the F1-score up to 33.1% after updating the MLP component.

Title: StylusAI: Stylistic Adaptation for Robust German Handwritten Text Generation

Authors: Nauman Riaz, Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15608
Pdf URL: https://arxiv.org/pdf/2407.15608
Copy Paste: [[2407.15608]] StylusAI: Stylistic Adaptation for Robust German Handwritten Text Generation(https://arxiv.org/abs/2407.15608)
Keywords: robust, diffusion
Abstract: In this study, we introduce StylusAI, a novel architecture leveraging diffusion models in the domain of handwriting style generation. StylusAI is specifically designed to adapt and integrate the stylistic nuances of one language's handwriting into another, particularly focusing on blending English handwriting styles into the context of the German writing system. This approach enables the generation of German text in English handwriting styles and German handwriting styles into English, enriching machine-generated handwriting diversity while ensuring that the generated text remains legible across both languages. To support the development and evaluation of StylusAI, we present the \lq{Deutscher Handschriften-Datensatz}\rq~(DHSD), a comprehensive dataset encompassing 37 distinct handwriting styles within the German language. This dataset provides a fundamental resource for training and benchmarking in the realm of handwritten text generation. Our results demonstrate that StylusAI not only introduces a new method for style adaptation in handwritten text generation but also surpasses existing models in generating handwriting samples that improve both text quality and stylistic fidelity, evidenced by its performance on the IAM database and our newly proposed DHSD. Thus, StylusAI represents a significant advancement in the field of handwriting style generation, offering promising avenues for future research and applications in cross-linguistic style adaptation for languages with similar scripts.

Title: RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation

Authors: Soroosh Tayebi Arasteh, Mahshad Lotfinia, Keno Bressem, Robert Siepmann, Dyke Ferber, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15621
Pdf URL: https://arxiv.org/pdf/2407.15621
Copy Paste: [[2407.15621]] RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation(https://arxiv.org/abs/2407.15621)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have advanced the field of artificial intelligence (AI) in medicine. However LLMs often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. RadioRAG is evaluated using a dedicated radiologic question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions, for which the correct gold-standard answers were available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved context-specific information from this http URL in real-time and incorporated them into its reply. RadioRAG consistently improved diagnostic accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It matched or exceeded question answering without RAG across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in its effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.

Title: Reinforcement Learning Meets Visual Odometry

Authors: Nico Messikommer, Giovanni Cioffi, Mathias Gehrig, Davide Scaramuzza
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2407.15626
Pdf URL: https://arxiv.org/pdf/2407.15626
Copy Paste: [[2407.15626]] Reinforcement Learning Meets Visual Odometry(https://arxiv.org/abs/2407.15626)
Keywords: robust
Abstract: Visual Odometry (VO) is essential to downstream mobile robotics and augmented/virtual reality tasks. Despite recent advances, existing VO methods still rely on heuristic design choices that require several weeks of hyperparameter tuning by human experts, hindering generalizability and robustness. We address these challenges by reframing VO as a sequential decision-making task and applying Reinforcement Learning (RL) to adapt the VO process dynamically. Our approach introduces a neural network, operating as an agent within the VO pipeline, to make decisions such as keyframe and grid-size selection based on real-time conditions. Our method minimizes reliance on heuristic choices using a reward function based on pose error, runtime, and other metrics to guide the system. Our RL framework treats the VO system and the image sequence as an environment, with the agent receiving observations from keypoints, map statistics, and prior poses. Experimental results using classical VO methods and public benchmarks demonstrate improvements in accuracy and robustness, validating the generalizability of our RL-enhanced VO approach to different scenarios. We believe this paradigm shift advances VO technology by eliminating the need for time-intensive parameter tuning of heuristics.

Title: Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models

Authors: Xin Ma, Yaohui Wang, Gengyu Jia, Xinyuan Chen, Yuan-Fang Li, Cunjian Chen, Yu Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15642
Pdf URL: https://arxiv.org/pdf/2407.15642
Copy Paste: [[2407.15642]] Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models(https://arxiv.org/abs/2407.15642)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by textual prompts still remains challenging. In this paper, we introduce Cinemo, a novel image animation approach towards achieving better motion controllability, as well as stronger temporal consistency and smoothness. In general, we propose three effective strategies at the training and inference stages of Cinemo to accomplish our goal. At the training stage, Cinemo focuses on learning the distribution of motion residuals, rather than directly predicting subsequent via a motion diffusion model. Additionally, a structural similarity index-based strategy is proposed to enable Cinemo to have better controllability of motion intensity. At the inference stage, a noise refinement technique based on discrete cosine transformation is introduced to mitigate sudden motion changes. Such three strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results. Compared to previous methods, Cinemo offers simpler and more precise user controllability. Extensive experiments against several state-of-the-art methods, including both commercial tools and research approaches, across multiple metrics, demonstrate the effectiveness and superiority of our proposed approach.

Title: SS-SFR: Synthetic Scenes Spatial Frequency Response on Virtual KITTI and Degraded Automotive Simulations for Object Detection

Authors: Daniel Jakab, Alexander Braun, Cathaoir Agnew, Reenu Mohandas, Brian Michael Deegan, Dara Molloy, Enda Ward, Tony Scanlan, Ciarán Eising
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15646
Pdf URL: https://arxiv.org/pdf/2407.15646
Copy Paste: [[2407.15646]] SS-SFR: Synthetic Scenes Spatial Frequency Response on Virtual KITTI and Degraded Automotive Simulations for Object Detection(https://arxiv.org/abs/2407.15646)
Keywords: robust
Abstract: Automotive simulation can potentially compensate for a lack of training data in computer vision applications. However, there has been little to no image quality evaluation of automotive simulation and the impact of optical degradations on simulation is little explored. In this work, we investigate Virtual KITTI and the impact of applying variations of Gaussian blur on image sharpness. Furthermore, we consider object detection, a common computer vision application on three different state-of-the-art models, thus allowing us to characterize the relationship between object detection and sharpness. It was found that while image sharpness (MTF50) degrades from an average of 0.245cy/px to approximately 0.119cy/px; object detection performance stays largely robust within 0.58\%(Faster RCNN), 1.45\%(YOLOF) and 1.93\%(DETR) across all respective held-out test sets.

Title: TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly

Authors: Mengqi Guo, Chen Li, Yuyang Zhao, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15648
Pdf URL: https://arxiv.org/pdf/2407.15648
Copy Paste: [[2407.15648]] TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly(https://arxiv.org/abs/2407.15648)
Keywords: transformer
Abstract: Inferring step-wise actions to assemble 3D objects with primitive bricks from images is a challenging task due to complex constraints and the vast number of possible combinations. Recent studies have demonstrated promising results on sequential LEGO brick assembly through the utilization of LEGO-Graph modeling to predict sequential actions. However, existing approaches are class-specific and require significant computational and 3D annotation resources. In this work, we first propose a computationally efficient breadth-first search (BFS) LEGO-Tree structure to model the sequential assembly actions by considering connections between consecutive layers. Based on the LEGO-Tree structure, we then design a class-agnostic tree-transformer framework to predict the sequential assembly actions from the input multi-view images. A major challenge of the sequential brick assembly task is that the step-wise action labels are costly and tedious to obtain in practice. We mitigate this problem by leveraging synthetic-to-real transfer learning. Specifically, our model is first pre-trained on synthetic data with full supervision from the available action labels. We then circumvent the requirement for action labels in the real data by proposing an action-to-silhouette projection that replaces action labels with input image silhouettes for self-supervision. Without any annotation on the real data, our model outperforms existing methods with 3D supervision by 7.8% and 11.3% in mIoU on the MNIST and ModelNet Construction datasets, respectively.

Title: Evaluation of Reinforcement Learning for Autonomous Penetration Testing using A3C, Q-learning and DQN

Authors: Norman Becker, Daniel Reti, Evridiki V. Ntagiou, Marcus Wallum, Hans D. Schotten
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15656
Pdf URL: https://arxiv.org/pdf/2407.15656
Copy Paste: [[2407.15656]] Evaluation of Reinforcement Learning for Autonomous Penetration Testing using A3C, Q-learning and DQN(https://arxiv.org/abs/2407.15656)
Keywords: security, attack
Abstract: Penetration testing is the process of searching for security weaknesses by simulating an attack. It is usually performed by experienced professionals, where scanning and attack tools are applied. By automating the execution of such tools, the need for human interaction and decision-making could be reduced. In this work, a Network Attack Simulator (NASim) was used as an environment to train reinforcement learning agents to solve three predefined security scenarios. These scenarios cover techniques of exploitation, post-exploitation and wiretapping. A large hyperparameter grid search was performed to find the best hyperparameter combinations. The algorithms Q-learning, DQN and A3C were used, whereby A3C was able to solve all scenarios and achieve generalization. In addition, A3C could solve these scenarios with fewer actions than the baseline automated penetration testing. Although the training was performed on rather small scenarios and with small state and action spaces for the agents, the results show that a penetration test can successfully be performed by the RL agent.

Title: DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving

Authors: Jiahang Tu, Wei Ji, Hanbin Zhao, Chao Zhang, Roger Zimmermann, Hui Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15661
Pdf URL: https://arxiv.org/pdf/2407.15661
Copy Paste: [[2407.15661]] DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving(https://arxiv.org/abs/2407.15661)
Keywords: diffusion, transformer, generative
Abstract: In autonomous driving, deep models have shown remarkable performance across various visual perception tasks with the demand of high-quality and huge-diversity training datasets. Such datasets are expected to cover various driving scenarios with adverse weather, lighting conditions and diverse moving objects. However, manually collecting these data presents huge challenges and expensive cost. With the rapid development of large generative models, we propose DriveDiTFit, a novel method for efficiently generating autonomous Driving data by Fine-tuning pre-trained Diffusion Transformers (DiTs). Specifically, DriveDiTFit utilizes a gap-driven modulation technique to carefully select and efficiently fine-tune a few parameters in DiTs according to the discrepancy between the pre-trained source data and the target driving data. Additionally, DriveDiTFit develops an effective weather and lighting condition embedding module to ensure diversity in the generated data, which is initialized by a nearest-semantic-similarity initialization approach. Through progressive tuning scheme to refined the process of detail generation in early diffusion process and enlarging the weights corresponding to small objects in training loss, DriveDiTFit ensures high-quality generation of small moving objects in the generated data. Extensive experiments conducted on driving datasets confirm that our method could efficiently produce diverse real driving data. The source codes will be available at this https URL.

Title: MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics

Authors: Alexander Melekhin, Dmitry Yudin, Ilia Petryashin, Vitaly Bezuglyj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15663
Pdf URL: https://arxiv.org/pdf/2407.15663
Copy Paste: [[2407.15663]] MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics(https://arxiv.org/abs/2407.15663)
Keywords: segmentation
Abstract: Place recognition is a challenging task in computer vision, crucial for enabling autonomous vehicles and robots to navigate previously visited environments. While significant progress has been made in learnable multimodal methods that combine onboard camera images and LiDAR point clouds, the full potential of these methods remains largely unexplored in localization applications. In this paper, we study the impact of leveraging a multi-camera setup and integrating diverse data sources for multimodal place recognition, incorporating explicit visual semantics and text descriptions. Our proposed method named MSSPlace utilizes images from multiple cameras, LiDAR point clouds, semantic segmentation masks, and text annotations to generate comprehensive place descriptors. We employ a late fusion approach to integrate these modalities, providing a unified representation. Through extensive experiments on the Oxford RobotCar and NCLT datasets, we systematically analyze the impact of each data source on the overall quality of place descriptors. Our experiments demonstrate that combining data from multiple sensors significantly improves place recognition model performance compared to single modality approaches and leads to state-of-the-art quality. We also show that separate usage of visual or textual semantics (which are more compact representations of sensory data) can achieve promising results in place recognition. The code for our method is publicly available: this https URL

Title: HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

Authors: Zhecan Wang, Garrett Bingham, Adams Yu, Quoc Le, Thang Luong, Golnaz Ghiasi
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2407.15680
Pdf URL: https://arxiv.org/pdf/2407.15680
Copy Paste: [[2407.15680]] HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning(https://arxiv.org/abs/2407.15680)
Keywords: large language model
Abstract: Hallucination has been a major problem for large language models and remains a critical challenge when it comes to multimodality in which vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, apart from real ones, to enable dataset creation at scale. With over 7.7K examples spanning across a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results discover that benchmarking with generated images is highly correlated (r=0.97) with real images. Last but not least, we propose a novel Auto-Eval mechanism that is highly correlated with human raters (r=0.99) for evaluating VLMs. In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.

Title: Enhancing Transferability of Targeted Adversarial Examples: A Self-Universal Perspective

Authors: Bowen Peng, Li Liu, Tianpeng Liu, Zhen Liu, Yongxiang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15683
Pdf URL: https://arxiv.org/pdf/2407.15683
Copy Paste: [[2407.15683]] Enhancing Transferability of Targeted Adversarial Examples: A Self-Universal Perspective(https://arxiv.org/abs/2407.15683)
Keywords: attack, generative
Abstract: Transfer-based targeted adversarial attacks against black-box deep neural networks (DNNs) have been proven to be significantly more challenging than untargeted ones. The impressive transferability of current SOTA, the generative methods, comes at the cost of requiring massive amounts of additional data and time-consuming training for each targeted label. This results in limited efficiency and flexibility, significantly hindering their deployment in practical applications. In this paper, we offer a self-universal perspective that unveils the great yet underexplored potential of input transformations in pursuing this goal. Specifically, transformations universalize gradient-based attacks with intrinsic but overlooked semantics inherent within individual images, exhibiting similar scalability and comparable results to time-consuming learning over massive additional data from diverse classes. We also contribute a surprising empirical insight that one of the most fundamental transformations, simple image scaling, is highly effective, scalable, sufficient, and necessary in enhancing targeted transferability. We further augment simple scaling with orthogonal transformations and block-wise applicability, resulting in the Simple, faSt, Self-universal yet Strong Scale Transformation (S$^4$ST) for self-universal TTA. On the ImageNet-Compatible benchmark dataset, our method achieves a 19.8% improvement in the average targeted transfer success rate against various challenging victim models over existing SOTA transformation methods while only consuming 36% time for attacking. It also outperforms resource-intensive attacks by a large margin in various challenging settings.

Title: AI-Driven Fast and Early Detection of IoT Botnet Threats: A Comprehensive Network Traffic Analysis Approach

Authors: Abdelaziz Amara korba, Aleddine Diaf, Yacine Ghamri-Doudane
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15688
Pdf URL: https://arxiv.org/pdf/2407.15688
Copy Paste: [[2407.15688]] AI-Driven Fast and Early Detection of IoT Botnet Threats: A Comprehensive Network Traffic Analysis Approach(https://arxiv.org/abs/2407.15688)
Keywords: attack, steal
Abstract: In the rapidly evolving landscape of cyber threats targeting the Internet of Things (IoT) ecosystem, and in light of the surge in botnet-driven Distributed Denial of Service (DDoS) and brute force attacks, this study focuses on the early detection of IoT bots. It specifically addresses the detection of stealth bot communication that precedes and orchestrates attacks. This study proposes a comprehensive methodology for analyzing IoT network traffic, including considerations for both unidirectional and bidirectional flow, as well as packet formats. It explores a wide spectrum of network features critical for representing network traffic and characterizing benign IoT traffic patterns effectively. Moreover, it delves into the modeling of traffic using various semi-supervised learning techniques. Through extensive experimentation with the IoT-23 dataset - a comprehensive collection featuring diverse botnet types and traffic scenarios - we have demonstrated the feasibility of detecting botnet traffic corresponding to different operations and types of bots, specifically focusing on stealth command and control (C2) communications. The results obtained have demonstrated the feasibility of identifying C2 communication with a 100% success rate through packet-based methods and 94% via flow based approaches, with a false positive rate of 1.53%.

Title: Counter Turing Test ($CT^2$): Investigating AI-Generated Text Detection for Hindi -- Ranking LLMs based on Hindi AI Detectability Index ($ADI_{hi}$)

Authors: Ishan Kavathekar, Anku Rani, Ashmit Chamoli, Ponnurangam Kumaraguru, Amit Sheth, Amitava Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15694
Pdf URL: https://arxiv.org/pdf/2407.15694
Copy Paste: [[2407.15694]] Counter Turing Test ($CT^2$): Investigating AI-Generated Text Detection for Hindi -- Ranking LLMs based on Hindi AI Detectability Index ($ADI_{hi}$)(https://arxiv.org/abs/2407.15694)
Keywords: large language model
Abstract: The widespread adoption of large language models (LLMs) and awareness around multilingual LLMs have raised concerns regarding the potential risks and repercussions linked to the misapplication of AI-generated text, necessitating increased vigilance. While these models are primarily trained for English, their extensive training on vast datasets covering almost the entire web, equips them with capabilities to perform well in numerous other languages. AI-Generated Text Detection (AGTD) has emerged as a topic that has already received immediate attention in research, with some initial methods having been proposed, soon followed by the emergence of techniques to bypass detection. In this paper, we report our investigation on AGTD for an indic language Hindi. Our major contributions are in four folds: i) examined 26 LLMs to evaluate their proficiency in generating Hindi text, ii) introducing the AI-generated news article in Hindi ($AG_{hi}$) dataset, iii) evaluated the effectiveness of five recently proposed AGTD techniques: ConDA, J-Guard, RADAR, RAIDAR and Intrinsic Dimension Estimation for detecting AI-generated Hindi text, iv) proposed Hindi AI Detectability Index ($ADI_{hi}$) which shows a spectrum to understand the evolving landscape of eloquence of AI-generated text in Hindi. We will make the codes and datasets available to encourage further research.

Title: A Life-long Learning Intrusion Detection System for 6G-Enabled IoV

Authors: Abdelaziz Amara korba, Souad Sebaa, Malik Mabrouki, Yacine Ghamri-Doudane, Karima Benatchba
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15700
Pdf URL: https://arxiv.org/pdf/2407.15700
Copy Paste: [[2407.15700]] A Life-long Learning Intrusion Detection System for 6G-Enabled IoV(https://arxiv.org/abs/2407.15700)
Keywords: security, attack, robust, federate
Abstract: The introduction of 6G technology into the Internet of Vehicles (IoV) promises to revolutionize connectivity with ultra-high data rates and seamless network coverage. However, this technological leap also brings significant challenges, particularly for the dynamic and diverse IoV landscape, which must meet the rigorous reliability and security requirements of 6G networks. Furthermore, integrating 6G will likely increase the IoV's susceptibility to a spectrum of emerging cyber threats. Therefore, it is crucial for security mechanisms to dynamically adapt and learn new attack patterns, keeping pace with the rapid evolution and diversification of these threats - a capability currently lacking in existing systems. This paper presents a novel intrusion detection system leveraging the paradigm of life-long (or continual) learning. Our methodology combines class-incremental learning with federated learning, an approach ideally suited to the distributed nature of the IoV. This strategy effectively harnesses the collective intelligence of Connected and Automated Vehicles (CAVs) and edge computing capabilities to train the detection system. To the best of our knowledge, this study is the first to synergize class-incremental learning with federated learning specifically for cyber attack detection. Through comprehensive experiments on a recent network traffic dataset, our system has exhibited a robust adaptability in learning new cyber attack patterns, while effectively retaining knowledge of previously encountered ones. Additionally, it has proven to maintain high accuracy and a low false positive rate.

Title: Estimating Probability Densities with Transformer and Denoising Diffusion

Authors: Henry W. Leung, Jo Bovy, Joshua S. Speagle
Subjects: cs.LG, astro-ph.IM, stat.ML
Abstract URL: https://arxiv.org/abs/2407.15703
Pdf URL: https://arxiv.org/pdf/2407.15703
Copy Paste: [[2407.15703]] Estimating Probability Densities with Transformer and Denoising Diffusion(https://arxiv.org/abs/2407.15703)
Keywords: diffusion, transformer
Abstract: Transformers are often the go-to architecture to build foundation models that ingest a large amount of training data. But these models do not estimate the probability density distribution when trained on regression problems, yet obtaining full probabilistic outputs is crucial to many fields of science, where the probability distribution of the answer can be non-Gaussian and multimodal. In this work, we demonstrate that training a probabilistic model using a denoising diffusion head on top of the Transformer provides reasonable probability density estimation even for high-dimensional inputs. The combined Transformer+Denoising Diffusion model allows conditioning the output probability density on arbitrary combinations of inputs and it is thus a highly flexible density function emulator of all possible input/output combinations. We illustrate our Transformer+Denoising Diffusion model by training it on a large dataset of astronomical observations and measured labels of stars within our Galaxy and we apply it to a variety of inference tasks to show that the model can infer labels accurately with reasonable distributions.

Title: Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Authors: Jinfu Liu, Chen Chen, Mengyuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15706
Pdf URL: https://arxiv.org/pdf/2407.15706
Copy Paste: [[2407.15706]] Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition(https://arxiv.org/abs/2407.15706)
Keywords: robust, large language model
Abstract: Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: this https URL.

Title: SwinSF: Image Reconstruction from Spatial-Temporal Spike Streams

Authors: Liangyan Jiang, Chuang Zhu, Yanxu Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15708
Pdf URL: https://arxiv.org/pdf/2407.15708
Copy Paste: [[2407.15708]] SwinSF: Image Reconstruction from Spatial-Temporal Spike Streams(https://arxiv.org/abs/2407.15708)
Keywords: robust, extraction
Abstract: The spike camera, with its high temporal resolution, low latency, and high dynamic range, addresses high-speed imaging challenges like motion blur. It captures photons at each pixel independently, creating binary spike streams rich in temporal information but challenging for image reconstruction. Current algorithms, both traditional and deep learning-based, still need to be improved in the utilization of the rich temporal detail and the restoration of the details of the reconstructed image. To overcome this, we introduce Swin Spikeformer (SwinSF), a novel model for dynamic scene reconstruction from spike streams. SwinSF is composed of Spike Feature Extraction, Spatial-Temporal Feature Extraction, and Final Reconstruction Module. It combines shifted window self-attention and proposed temporal spike attention, ensuring a comprehensive feature extraction that encapsulates both spatial and temporal dynamics, leading to a more robust and accurate reconstruction of spike streams. Furthermore, we build a new synthesized dataset for spike image reconstruction which matches the resolution of the latest spike camera, ensuring its relevance and applicability to the latest developments in spike camera imaging. Experimental results demonstrate that the proposed network SwinSF sets a new benchmark, achieving state-of-the-art performance across a series of datasets, including both real-world and synthesized data across various resolutions. Our codes and proposed dataset will be available soon.

Title: Mamba meets crack segmentation

Authors: Zhili He, Yu-Hsing Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15714
Pdf URL: https://arxiv.org/pdf/2407.15714
Copy Paste: [[2407.15714]] Mamba meets crack segmentation(https://arxiv.org/abs/2407.15714)
Keywords: interpretability, transformer, segmentation
Abstract: Cracks pose safety risks to infrastructure and cannot be overlooked. The prevailing structures in existing crack segmentation networks predominantly consist of CNNs or Transformers. However, CNNs exhibit a deficiency in global modeling capability, hindering the representation to entire crack features. Transformers can capture long-range dependencies but suffer from high and quadratic complexity. Recently, Mamba has garnered extensive attention due to its linear spatial and computational complexity and its powerful global perception. This study explores the representation capabilities of Mamba to crack features. Specifically, this paper uncovers the connection between Mamba and the attention mechanism, providing a profound insight, an attention perspective, into interpreting Mamba and devising a novel Mamba module following the principles of attention blocks, namely CrackMamba. We compare CrackMamba with the most prominent visual Mamba modules, Vim and Vmamba, on two datasets comprising asphalt pavement and concrete pavement cracks, and steel cracks, respectively. The quantitative results show that CrackMamba stands out as the sole Mamba block consistently enhancing the baseline model's performance across all evaluation measures, while reducing its parameters and computational costs. Moreover, this paper substantiates that Mamba can achieve global receptive fields through both theoretical analysis and visual interpretability. The discoveries of this study offer a dual contribution. First, as a plug-and-play and simple yet effective Mamba module, CrackMamba exhibits immense potential for integration into various crack segmentation models. Second, the proposed innovative Mamba design concept, integrating Mamba with the attention mechanism, holds significant reference value for all Mamba-based computer vision models, not limited to crack segmentation networks, as investigated in this study.

Title: Harmonizing Flows: Leveraging normalizing flows for unsupervised and source-free MRI harmonization

Authors: Farzad Beizaee, Gregory A. Lodygensky, Chris L. Adamson, Deanne K. Thompso, Jeanie L.Y. Cheon, Alicia J. Spittl. Peter J. Anderso, Christian Desrosier, Jose Dolz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15717
Pdf URL: https://arxiv.org/pdf/2407.15717
Copy Paste: [[2407.15717]] Harmonizing Flows: Leveraging normalizing flows for unsupervised and source-free MRI harmonization(https://arxiv.org/abs/2407.15717)
Keywords: segmentation
Abstract: Lack of standardization and various intrinsic parameters for magnetic resonance (MR) image acquisition results in heterogeneous images across different sites and devices, which adversely affects the generalization of deep neural networks. To alleviate this issue, this work proposes a novel unsupervised harmonization framework that leverages normalizing flows to align MR images, thereby emulating the distribution of a source domain. The proposed strategy comprises three key steps. Initially, a normalizing flow network is trained to capture the distribution characteristics of the source domain. Then, we train a shallow harmonizer network to reconstruct images from the source domain via their augmented counterparts. Finally, during inference, the harmonizer network is updated to ensure that the output images conform to the learned source domain distribution, as modeled by the normalizing flow network. Our approach, which is unsupervised, source-free, and task-agnostic is assessed in the context of both adults and neonatal cross-domain brain MRI segmentation, as well as neonatal brain age estimation, demonstrating its generalizability across tasks and population demographics. The results underscore its superior performance compared to existing methodologies. The code is available at this https URL

Title: GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI

Authors: Zhaojie Fang, Shenghao Zhu, Yifei Chen, Binfeng Zou, Fan Jia, Linwei Qiu, Chang Liu, Yiyu Huang, Xiang Feng, Feiwei Qin, Changmiao Wang, Yeru Wang, Jin Fan, Changbiao Chu, Wan-Zhen Wu, Hu Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15719
Pdf URL: https://arxiv.org/pdf/2407.15719
Copy Paste: [[2407.15719]] GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI(https://arxiv.org/abs/2407.15719)
Keywords: extraction, interpretability, generative
Abstract: Alzheimer's Disease (AD) is an irreversible neurodegenerative disorder that often progresses from Mild Cognitive Impairment (MCI), leading to memory loss and significantly impacting patients' lives. Clinical trials indicate that early targeted interventions for MCI patients can potentially slow or halt the development and progression of AD. Previous research has shown that accurate medical classification requires the inclusion of extensive multimodal data, such as assessment scales and various neuroimaging techniques like Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET). However, consistently tracking the diagnosis of the same individual over time and simultaneously collecting multimodal data poses significant challenges. To address this issue, we introduce GFE-Mamba, a classifier based on Generative Feature Extraction (GFE). This classifier effectively integrates data from assessment scales, MRI, and PET, enabling deeper multimodal fusion. It efficiently extracts both long and short sequence information and incorporates additional information beyond the pixel space. This approach not only improves classification accuracy but also enhances the interpretability and stability of the model. We constructed datasets of over 3000 samples based on the Alzheimer's Disease Neuroimaging Initiative (ADNI) for a two-step training process. Our experimental results demonstrate that the GFE-Mamba model is effective in predicting the conversion from MCI to AD and outperforms several state-of-the-art methods. Our source code and ADNI dataset processing code are available at this https URL.

Title: Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability

Authors: Zhuoyan Xu, Zhenmei Shi, Yingyu Liang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15720
Pdf URL: https://arxiv.org/pdf/2407.15720
Copy Paste: [[2407.15720]] Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability(https://arxiv.org/abs/2407.15720)
Keywords: large language model
Abstract: Large language models (LLMs) have emerged as powerful tools for many AI problems and exhibit remarkable in-context learning (ICL) capabilities. Compositional ability, solving unseen complex tasks that combine two or more simple tasks, is an essential reasoning ability for Artificial General Intelligence. Despite LLM's tremendous success, how they approach composite tasks, especially those not encountered during the pretraining phase, remains an open question and largely ununderstood. In this study, we delve into the ICL capabilities of LLMs on composite tasks, with only simple tasks as in-context examples. We develop a test suite of composite tasks that include linguistic and logical challenges and perform empirical studies across different LLM families. We observe that models exhibit divergent behaviors: (1) For simpler composite tasks that apply distinct mapping mechanisms to different input segments, the models demonstrate decent compositional ability, while scaling up the model enhances this ability; (2) for more complex composite tasks that involving reasoning multiple steps, where each step represent one task, models typically underperform, and scaling up generally provide no improvements. We offer theoretical analysis in a simplified setting, explaining that models exhibit compositional capability when the task handles different input parts separately. We believe our work sheds new light on the capabilities of LLMs in solving composite tasks regarding the nature of the tasks and model scale. Our dataset and code are available at {\url{this https URL}}.

Title: DStruct2Design: Data and Benchmarks for Data Structure Driven Generative Floor Plan Design

Authors: Zhi Hao Luo, Luis Lara, Ge Ya Luo, Florian Golemo, Christopher Beckham, Christopher Pal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15723
Pdf URL: https://arxiv.org/pdf/2407.15723
Copy Paste: [[2407.15723]] DStruct2Design: Data and Benchmarks for Data Structure Driven Generative Floor Plan Design(https://arxiv.org/abs/2407.15723)
Keywords: generative, large language model
Abstract: Text conditioned generative models for images have yielded impressive results. Text conditioned floorplan generation as a special type of raster image generation task also received particular attention. However there are many use cases in floorpla generation where numerical properties of the generated result are more important than the aesthetics. For instance, one might want to specify sizes for certain rooms in a floorplan and compare the generated floorplan with given specifications Current approaches, datasets and commonly used evaluations do not support these kinds of constraints. As such, an attractive strategy is to generate an intermediate data structure that contains numerical properties of a floorplan which can be used to generate the final floorplan image. To explore this setting we (1) construct a new dataset for this data-structure to data-structure formulation of floorplan generation using two popular image based floorplan datasets RPLAN and ProcTHOR-10k, and provide the tools to convert further procedurally generated ProcTHOR floorplan data into our format. (2) We explore the task of floorplan generation given a partial or complete set of constraints and we design a series of metrics and benchmarks to enable evaluating how well samples generated from models respect the constraints. (3) We create multiple baselines by finetuning a large language model (LLM), Llama3, and demonstrate the feasibility of using floorplan data structure conditioned LLMs for the problem of floorplan generation respecting numerical constraints. We hope that our new datasets and benchmarks will encourage further research on different ways to improve the performance of LLMs and other generative modelling techniques for generating designs where quantitative constraints are only partially specified, but must be respected.

Title: OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context

Authors: Steffen Kleinle, Jakob Prange, Annemarie Friedrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15736
Pdf URL: https://arxiv.org/pdf/2407.15736
Copy Paste: [[2407.15736]] OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context(https://arxiv.org/abs/2407.15736)
Keywords: large language model
Abstract: When immigrating to a new country, it is easy to feel overwhelmed by the need to obtain information on financial support, housing, schooling, language courses, and other issues. If relocation is rushed or even forced, the necessity for high-quality answers to such questions is all the more urgent. Official immigration counselors are usually overbooked, and online systems could guide newcomers to the requested information or a suitable counseling service. To this end, we present OMoS-QA, a dataset of German and English questions paired with relevant trustworthy documents and manually annotated answers, specifically tailored to this scenario. Questions are automatically generated with an open-source large language model (LLM) and answer sentences are selected by crowd workers with high agreement. With our data, we conduct a comparison of 5 pretrained LLMs on the task of extractive question answering (QA) in German and English. Across all models and both languages, we find high precision and low-to-mid recall in selecting answer sentences, which is a favorable trade-off to avoid misleading users. This performance even holds up when the question language does not match the document language. When it comes to identifying unanswerable questions given a context, there are larger differences between the two languages.

Title: Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond

Authors: Silvio Galesso, Philipp Schröppel, Hssan Driss, Thomas Brox
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15739
Pdf URL: https://arxiv.org/pdf/2407.15739
Copy Paste: [[2407.15739]] Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond(https://arxiv.org/abs/2407.15739)
Keywords: robust, diffusion, segmentation
Abstract: In recent years, research on out-of-distribution (OoD) detection for semantic segmentation has mainly focused on road scenes -- a domain with a constrained amount of semantic diversity. In this work, we challenge this constraint and extend the domain of this task to general natural images. To this end, we introduce: 1. the ADE-OoD benchmark, which is based on the ADE20k dataset and includes images from diverse domains with a high semantic diversity, and 2. a novel approach that uses Diffusion score matching for OoD detection (DOoD) and is robust to the increased semantic diversity. ADE-OoD features indoor and outdoor images, defines 150 semantic categories as in-distribution, and contains a variety of OoD objects. For DOoD, we train a diffusion model with an MLP architecture on semantic in-distribution embeddings and build on the score matching interpretation to compute pixel-wise OoD scores at inference time. On common road scene OoD benchmarks, DOoD performs on par or better than the state of the art, without using outliers for training or making assumptions about the data domain. On ADE-OoD, DOoD outperforms previous approaches, but leaves much room for future improvements.

Title: MoRSE: Bridging the Gap in Cybersecurity Expertise with Retrieval Augmented Generation

Authors: Marco Simoni, Andrea Saracino, Vinod P., Mauro Conti
Subjects: cs.CR, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2407.15748
Pdf URL: https://arxiv.org/pdf/2407.15748
Copy Paste: [[2407.15748]] MoRSE: Bridging the Gap in Cybersecurity Expertise with Retrieval Augmented Generation(https://arxiv.org/abs/2407.15748)
Keywords: security, large language model
Abstract: In this paper, we introduce MoRSE (Mixture of RAGs Security Experts), the first specialised AI chatbot for cybersecurity. MoRSE aims to provide comprehensive and complete knowledge about cybersecurity. MoRSE uses two RAG (Retrieval Augmented Generation) systems designed to retrieve and organize information from multidimensional cybersecurity contexts. MoRSE differs from traditional RAGs by using parallel retrievers that work together to retrieve semantically related information in different formats and structures. Unlike traditional Large Language Models (LLMs) that rely on Parametric Knowledge Bases, MoRSE retrieves relevant documents from Non-Parametric Knowledge Bases in response to user queries. Subsequently, MoRSE uses this information to generate accurate answers. In addition, MoRSE benefits from real-time updates to its knowledge bases, enabling continuous knowledge enrichment without retraining. We have evaluated the effectiveness of MoRSE against other state-of-the-art LLMs, evaluating the system on 600 cybersecurity specific questions. The experimental evaluation has shown that the improvement in terms of relevance and correctness of the answer is more than 10\% compared to known solutions such as GPT-4 and Mixtral 7x8.

Title: Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis

Authors: Brian K. S. Isaac-Medina, Yona Falinie A. Gaus, Neelanjan Bhowmik, Toby P. Breckon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15763
Pdf URL: https://arxiv.org/pdf/2407.15763
Copy Paste: [[2407.15763]] Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis(https://arxiv.org/abs/2407.15763)
Keywords: security
Abstract: Object detection is a pivotal task in computer vision that has received significant attention in previous years. Nonetheless, the capability of a detector to localise objects out of the training distribution remains unexplored. Whilst recent approaches in object-level out-of-distribution (OoD) detection heavily rely on class labels, such approaches contradict truly open-world scenarios where the class distribution is often unknown. In this context, anomaly detection focuses on detecting unseen instances rather than classifying detections as OoD. This work aims to bridge this gap by leveraging an open-world object detector and an OoD detector via virtual outlier synthesis. This is achieved by using the detector backbone features to first learn object pseudo-classes via self-supervision. These pseudo-classes serve as the basis for class-conditional virtual outlier sampling of anomalous features that are classified by an OoD head. Our approach empowers our overall object detector architecture to learn anomaly-aware feature representations without relying on class labels, hence enabling truly open-world object anomaly detection. Empirical validation of our approach demonstrates its effectiveness across diverse datasets encompassing various imaging modalities (visible, infrared, and X-ray). Moreover, our method establishes state-of-the-art performance on object-level anomaly detection, achieving an average recall score improvement of over 5.4% for natural images and 23.5% for a security X-ray dataset compared to the current approaches. In addition, our method detects anomalies in datasets where current approaches fail. Code available at this https URL.

Title: Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels

Authors: Zhuorui Ye, Stephanie Milani, Geoffrey J. Gordon, Fei Fang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15786
Pdf URL: https://arxiv.org/pdf/2407.15786
Copy Paste: [[2407.15786]] Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels(https://arxiv.org/abs/2407.15786)
Keywords: interpretability
Abstract: Recent advances in reinforcement learning (RL) have predominantly leveraged neural network-based policies for decision-making, yet these models often lack interpretability, posing challenges for stakeholder comprehension and trust. Concept bottleneck models offer an interpretable alternative by integrating human-understandable concepts into neural networks. However, a significant limitation in prior work is the assumption that human annotations for these concepts are readily available during training, necessitating continuous real-time input from human annotators. To overcome this limitation, we introduce a novel training scheme that enables RL algorithms to efficiently learn a concept-based policy by only querying humans to label a small set of data, or in the extreme case, without any human labels. Our algorithm, LICORICE, involves three main contributions: interleaving concept learning and RL training, using a concept ensembles to actively select informative data points for labeling, and decorrelating the concept data with a simple strategy. We show how LICORICE reduces manual labeling efforts to to 500 or fewer concept labels in three environments. Finally, we present an initial study to explore how we can use powerful vision-language models to infer concepts from raw visual inputs without explicit labels at minimal cost to performance.

Title: Extracting Structured Insights from Financial News: An Augmented LLM Driven Approach

Authors: Rian Dolphin, Joe Dursun, Jonathan Chow, Jarrett Blankenship, Katie Adams, Quinton Pike
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2407.15788
Pdf URL: https://arxiv.org/pdf/2407.15788
Copy Paste: [[2407.15788]] Extracting Structured Insights from Financial News: An Augmented LLM Driven Approach(https://arxiv.org/abs/2407.15788)
Keywords: robust, extraction, generative, large language model
Abstract: Financial news plays a crucial role in decision-making processes across the financial sector, yet the efficient processing of this information into a structured format remains challenging. This paper presents a novel approach to financial news processing that leverages Large Language Models (LLMs) to overcome limitations that previously prevented the extraction of structured data from unstructured financial news. We introduce a system that extracts relevant company tickers from raw news article content, performs sentiment analysis at the company level, and generates summaries, all without relying on pre-structured data feeds. Our methodology combines the generative capabilities of LLMs, and recent prompting techniques, with a robust validation framework that uses a tailored string similarity approach. Evaluation on a dataset of 5530 financial news articles demonstrates the effectiveness of our approach, with 90% of articles not missing any tickers compared with current data providers, and 22% of articles having additional relevant tickers. In addition to this paper, the methodology has been implemented at scale with the resulting processed data made available through a live API endpoint, which is updated in real-time with the latest news. To the best of our knowledge, we are the first data provider to offer granular, per-company sentiment analysis from news articles, enhancing the depth of information available to market participants. We also release the evaluation dataset of 5530 processed articles as a static file, which we hope will facilitate further research leveraging financial news.

Title: RADA: Robust and Accurate Feature Learning with Domain Adaptation

Authors: Jingtai He, Gehao Zhang, Tingting Liu, Songlin Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15791
Pdf URL: https://arxiv.org/pdf/2407.15791
Copy Paste: [[2407.15791]] RADA: Robust and Accurate Feature Learning with Domain Adaptation(https://arxiv.org/abs/2407.15791)
Keywords: robust, extraction, transformer
Abstract: Recent advancements in keypoint detection and descriptor extraction have shown impressive performance in local feature learning tasks. However, existing methods generally exhibit suboptimal performance under extreme conditions such as significant appearance changes and domain shifts. In this study, we introduce a multi-level feature aggregation network that incorporates two pivotal components to facilitate the learning of robust and accurate features with domain adaptation. First, we employ domain adaptation supervision to align high-level feature distributions across different domains to achieve invariant domain representations. Second, we propose a Transformer-based booster that enhances descriptor robustness by integrating visual and geometric information through wave position encoding concepts, effectively handling complex conditions. To ensure the accuracy and robustness of features, we adopt a hierarchical architecture to capture comprehensive information and apply meticulous targeted supervision to keypoint detection, descriptor extraction, and their coupled processing. Extensive experiments demonstrate that our method, RADA, achieves excellent results in image matching, camera pose estimation, and visual localization tasks.

Title: Robust Mixture Learning when Outliers Overwhelm Small Groups

Authors: Daniil Dmitriev, Rares-Darius Buhai, Stefan Tiegel, Alexander Wolters, Gleb Novikov, Amartya Sanyal, David Steurer, Fanny Yang
Subjects: cs.LG, cs.DS, stat.ML
Abstract URL: https://arxiv.org/abs/2407.15792
Pdf URL: https://arxiv.org/pdf/2407.15792
Copy Paste: [[2407.15792]] Robust Mixture Learning when Outliers Overwhelm Small Groups(https://arxiv.org/abs/2407.15792)
Keywords: robust
Abstract: We study the problem of estimating the means of well-separated mixtures when an adversary may add arbitrary outliers. While strong guarantees are available when the outlier fraction is significantly smaller than the minimum mixing weight, much less is known when outliers may crowd out low-weight clusters - a setting we refer to as list-decodable mixture learning (LD-ML). In this case, adversarial outliers can simulate additional spurious mixture components. Hence, if all means of the mixture must be recovered up to a small error in the output list, the list size needs to be larger than the number of (true) components. We propose an algorithm that obtains order-optimal error guarantees for each mixture mean with a minimal list-size overhead, significantly improving upon list-decodable mean estimation, the only existing method that is applicable for LD-ML. Although improvements are observed even when the mixture is non-separated, our algorithm achieves particularly strong guarantees when the mixture is separated: it can leverage the mixture structure to partially cluster the samples before carefully iterating a base learner for list-decodable mean estimation at different scales.

Title: CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

Authors: Emanuele Frascaroli, Aniello Panariello, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15793
Pdf URL: https://arxiv.org/pdf/2407.15793
Copy Paste: [[2407.15793]] CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning(https://arxiv.org/abs/2407.15793)
Keywords: transformer, generative
Abstract: With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, large pre-trained models have become a common strategy to enhance performance in Continual Learning scenarios. This led to the development of numerous prompting strategies to effectively fine-tune transformer-based models without succumbing to catastrophic forgetting. However, these methods struggle to specialize the model on domains significantly deviating from the pre-training and preserving its zero-shot capabilities. In this work, we propose Continual Generative training for Incremental prompt-Learning, a novel approach to mitigate forgetting while adapting a VLM, which exploits generative replay to align prompts to tasks. We also introduce a new metric to evaluate zero-shot capabilities within CL benchmarks. Through extensive experiments on different domains, we demonstrate the effectiveness of our framework in adapting to new tasks while improving zero-shot capabilities. Further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at this https URL.

Title: Disentangling spatio-temporal knowledge for weakly supervised object detection and segmentation in surgical video

Authors: Guiqiu Liao, Matjaz Jogan, Sai Koushik, Eric Eaton, Daniel A. Hashimoto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15794
Pdf URL: https://arxiv.org/pdf/2407.15794
Copy Paste: [[2407.15794]] Disentangling spatio-temporal knowledge for weakly supervised object detection and segmentation in surgical video(https://arxiv.org/abs/2407.15794)
Keywords: segmentation
Abstract: Weakly supervised video object segmentation (WSVOS) enables the identification of segmentation maps without requiring an extensive training dataset of object masks, relying instead on coarse video labels indicating object presence. Current state-of-the-art methods either require multiple independent stages of processing that employ motion cues or, in the case of end-to-end trainable networks, lack in segmentation accuracy, in part due to the difficulty of learning segmentation maps from videos with transient object presence. This limits the application of WSVOS for semantic annotation of surgical videos where multiple surgical tools frequently move in and out of the field of view, a problem that is more difficult than typically encountered in WSVOS. This paper introduces Video Spatio-Temporal Disentanglement Networks (VDST-Net), a framework to disentangle spatiotemporal information using semi-decoupled knowledge distillation to predict high-quality class activation maps (CAMs). A teacher network designed to resolve temporal conflicts when specifics about object location and timing in the video are not provided works with a student network that integrates information over time by leveraging temporal dependencies. We demonstrate the efficacy of our framework on a public reference dataset and on a more challenging surgical video dataset where objects are, on average, present in less than 60\% of annotated frames. Our method outperforms state-of-the-art techniques and generates superior segmentation masks under video-level weak supervision.

Title: MILAN: Milli-Annotations for Lidar Semantic Segmentation

Authors: Nermin Samet, Gilles Puy, Oriane Siméoni, Renaud Marlet
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15797
Pdf URL: https://arxiv.org/pdf/2407.15797
Copy Paste: [[2407.15797]] MILAN: Milli-Annotations for Lidar Semantic Segmentation(https://arxiv.org/abs/2407.15797)
Keywords: segmentation
Abstract: Annotating lidar point clouds for autonomous driving is a notoriously expensive and time-consuming task. In this work, we show that the quality of recent self-supervised lidar scan representations allows a great reduction of the annotation cost. Our method has two main steps. First, we show that self-supervised representations allow a simple and direct selection of highly informative lidar scans to annotate: training a network on these selected scans leads to much better results than a random selection of scans and, more interestingly, to results on par with selections made by SOTA active learning methods. In a second step, we leverage the same self-supervised representations to cluster points in our selected scans. Asking the annotator to classify each cluster, with a single click per cluster, then permits us to close the gap with fully-annotated training sets, while only requiring one thousandth of the point labels.

Title: Robust Facial Reactions Generation: An Emotion-Aware Framework with Modality Compensation

Authors: Guanyu Hu, Jie Wei, Siyang Song, Dimitrios Kollias, Xinyu Yang, Zhonglin Sun, Odysseus Kaloidas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15798
Pdf URL: https://arxiv.org/pdf/2407.15798
Copy Paste: [[2407.15798]] Robust Facial Reactions Generation: An Emotion-Aware Framework with Modality Compensation(https://arxiv.org/abs/2407.15798)
Keywords: robust
Abstract: The objective of the Multiple Appropriate Facial Reaction Generation (MAFRG) task is to produce contextually appropriate and diverse listener facial behavioural responses based on the multimodal behavioural data of the conversational partner (i.e., the speaker). Current methodologies typically assume continuous availability of speech and facial modality data, neglecting real-world scenarios where these data may be intermittently unavailable, which often results in model failures. Furthermore, despite utilising advanced deep learning models to extract information from the speaker's multimodal inputs, these models fail to adequately leverage the speaker's emotional context, which is vital for eliciting appropriate facial reactions from human listeners. To address these limitations, we propose an Emotion-aware Modality Compensatory (EMC) framework. This versatile solution can be seamlessly integrated into existing models, thereby preserving their advantages while significantly enhancing performance and robustness in scenarios with missing modalities. Our framework ensures resilience when faced with missing modality data through the Compensatory Modality Alignment (CMA) module. It also generates more appropriate emotion-aware reactions via the Emotion-aware Attention (EA) module, which incorporates the speaker's emotional information throughout the entire encoding and decoding process. Experimental results demonstrate that our framework improves the appropriateness metric FRCorr by an average of 57.2\% compared to the original model structure. In scenarios where speech modality data is missing, the performance of appropriate generation shows an improvement, and when facial data is missing, it only exhibits minimal degradation.

Title: Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems

Authors: Siddharth D Jaiswal, Animesh Ganai, Abhisek Dash, Saptarshi Ghosh, Animesh Mukherjee
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2407.15810
Pdf URL: https://arxiv.org/pdf/2407.15810
Copy Paste: [[2407.15810]] Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems(https://arxiv.org/abs/2407.15810)
Keywords: robust
Abstract: Facial Recognition Systems (FRSs) are being developed and deployed globally at unprecedented rates. Most platforms are designed in a limited set of countries but deployed in worldwide, without adequate checkpoints. This is especially problematic for Global South countries which lack strong legislation to safeguard persons facing disparate performance of these systems. A combination of unavailability of datasets, lack of understanding of FRS functionality and low-resource bias mitigation measures accentuate the problem. In this work, we propose a new face dataset composed of 6,579 unique male and female sportspersons from eight countries around the world. More than 50% of the dataset comprises individuals from the Global South countries and is demographically diverse. To aid adversarial audits and robust model training, each image has four adversarial variants, totaling over 40,000 images. We also benchmark five popular FRSs, both commercial and open-source, for the task of gender prediction (and country prediction for one of the open-source models as an example of red-teaming). Experiments on industrial FRSs reveal accuracies ranging from 98.2%--38.1%, with a large disparity between males and females in the Global South (max difference of 38.5%). Biases are also observed in all FRSs between females of the Global North and South (max difference of ~50%). Grad-CAM analysis identifies the nose, forehead and mouth as the regions of interest on one of the open-source FRSs. Utilizing this insight, we design simple, low-resource bias mitigation solutions using few-shot and novel contrastive learning techniques significantly improving the accuracy with disparity between males and females reducing from 50% to 1.5% in one of the settings. In the red-teaming experiment with the open-source Deepface model, contrastive learning proves more effective than simple fine-tuning.

Title: Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

Authors: Vikash Sehwag, Xianghao Kong, Jingtao Li, Michael Spranger, Lingjuan Lyu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2407.15811
Pdf URL: https://arxiv.org/pdf/2407.15811
Copy Paste: [[2407.15811]] Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget(https://arxiv.org/abs/2407.15811)
Keywords: diffusion, transformer, generative
Abstract: As scaling laws in generative AI push performance, they also simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to address this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose to randomly mask up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation with masking, making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only \$1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118$\times$ lower cost than stable diffusion models and 14$\times$ lower cost than the current state-of-the-art approach that costs \$28,400. We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.

Title: Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Authors: Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15819
Pdf URL: https://arxiv.org/pdf/2407.15819
Copy Paste: [[2407.15819]] Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight(https://arxiv.org/abs/2407.15819)
Keywords: large language model
Abstract: This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spacial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-train acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performances, competitive to existing approaches in a series of benchmarks.

Title: dMel: Speech Tokenization made Simple

Authors: He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, Navdeep Jaitly
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2407.15835
Pdf URL: https://arxiv.org/pdf/2407.15835
Copy Paste: [[2407.15835]] dMel: Speech Tokenization made Simple(https://arxiv.org/abs/2407.15835)
Keywords: transformer, large language model
Abstract: Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated complicated speech tokenization methods to discretize continuous speech signals so that language modeling techniques can be applied to speech data. However, existing approaches either model semantic tokens, potentially losing acoustic information, or model acoustic tokens, risking the loss of semantic information. Having multiple token types also complicates the architecture and requires additional pretraining. Here we show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel), that performs better than other existing speech tokenization methods. Using a transformer decoder-only architecture for speech-text modeling, we comprehensively evaluate different speech tokenization methods on speech recognition (ASR), speech synthesis (TTS). Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.

Title: MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Authors: Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, Yu Qiao, Jifeng Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15838
Pdf URL: https://arxiv.org/pdf/2407.15838
Copy Paste: [[2407.15838]] MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity(https://arxiv.org/abs/2407.15838)
Keywords: large language model
Abstract: Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets include the following limitations: (1) Instruction annotation quality: despite existing VLLMs exhibiting strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instructions and image diversity: the limited range of instruction types and the lack of diversity in image data may impact the model's ability to generate diversified and closer to real-world scenarios outputs. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset MMInstruct, which consists of 973K instructions from 24 domains. There are four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering and Short Visual Question Answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experiment validation and ablation experiments, we demonstrate that MMInstruct could significantly improve the performance of VLLMs, e.g., the model fine-tuning on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data shall be available at this https URL.

Title: SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Authors: Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15841
Pdf URL: https://arxiv.org/pdf/2407.15841
Copy Paste: [[2407.15841]] SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models(https://arxiv.org/abs/2407.15841)
Keywords: large language model
Abstract: We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture the detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details along the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets.

Title: Artist: Aesthetically Controllable Text-Driven Stylization without Training

Authors: Ruixiang Jiang, Changwen Chen
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2407.15842
Pdf URL: https://arxiv.org/pdf/2407.15842
Copy Paste: [[2407.15842]] Artist: Aesthetically Controllable Text-Driven Stylization without Training(https://arxiv.org/abs/2407.15842)
Keywords: diffusion
Abstract: Diffusion models entangle content and style generation during the denoising process, leading to undesired content modification when directly applied to stylization tasks. Existing methods struggle to effectively control the diffusion model to meet the aesthetic-level requirements for stylization. In this paper, we introduce \textbf{Artist}, a training-free approach that aesthetically controls the content and style generation of a pretrained diffusion model for text-driven stylization. Our key insight is to disentangle the denoising of content and style into separate diffusion processes while sharing information between them. We propose simple yet effective content and style control methods that suppress style-irrelevant content generation, resulting in harmonious stylization results. Extensive experiments demonstrate that our method excels at achieving aesthetic-level stylization requirements, preserving intricate details in the content image and aligning well with the style prompt. Furthermore, we showcase the highly controllability of the stylization strength from various perspectives. Code will be released, project home page: this https URL

Title: CarFormer: Self-Driving with Learned Object-Centric Representations

Authors: Shadi Hamdan, Fatma Güney
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2407.15843
Pdf URL: https://arxiv.org/pdf/2407.15843
Copy Paste: [[2407.15843]] CarFormer: Self-Driving with Learned Object-Centric Representations(https://arxiv.org/abs/2407.15843)
Keywords: transformer
Abstract: The choice of representation plays a key role in self-driving. Bird's eye view (BEV) representations have shown remarkable performance in recent years. In this paper, we propose to learn object-centric representations in BEV to distill a complex scene into more actionable information for self-driving. We first learn to place objects into slots with a slot attention model on BEV sequences. Based on these object-centric representations, we then train a transformer to learn to drive as well as reason about the future of other vehicles. We found that object-centric slot representations outperform both scene-level and object-level approaches that use the exact attributes of objects. Slot representations naturally incorporate information about objects from their spatial and temporal context such as position, heading, and speed without explicitly providing it. Our model with slots achieves an increased completion rate of the provided routes and, consequently, a higher driving score, with a lower variance across multiple runs, affirming slots as a reliable alternative in object-centric approaches. Additionally, we validate our model's performance as a world model through forecasting experiments, demonstrating its capability to predict future slot representations accurately. The code and the pre-trained models can be found at this https URL.

Title: Reconstructing Training Data From Real World Models Trained with Transfer Learning

Authors: Yakir Oz, Gilad Yehudai, Gal Vardi, Itai Antebi, Michal Irani, Niv Haim
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2407.15845
Pdf URL: https://arxiv.org/pdf/2407.15845
Copy Paste: [[2407.15845]] Reconstructing Training Data From Real World Models Trained with Transfer Learning(https://arxiv.org/abs/2407.15845)
Keywords: privacy
Abstract: Current methods for reconstructing training data from trained classifiers are restricted to very small models, limited training set sizes, and low-resolution images. Such restrictions hinder their applicability to real-world scenarios. In this paper, we present a novel approach enabling data reconstruction in realistic settings for models trained on high-resolution images. Our method adapts the reconstruction scheme of arXiv:2206.07758 to real-world scenarios -- specifically, targeting models trained via transfer learning over image embeddings of large pre-trained models like DINO-ViT and CLIP. Our work employs data reconstruction in the embedding space rather than in the image space, showcasing its applicability beyond visual data. Moreover, we introduce a novel clustering-based method to identify good reconstructions from thousands of candidates. This significantly improves on previous works that relied on knowledge of the training set to identify good reconstructed images. Our findings shed light on a potential privacy risk for data leakage from models trained using transfer learning.

Title: LLMmap: Fingerprinting For Large Language Models

Authors: Dario Pasquini, Evgenios M. Kornaropoulos, Giuseppe Ateniese
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2407.15847
Pdf URL: https://arxiv.org/pdf/2407.15847
Copy Paste: [[2407.15847]] LLMmap: Fingerprinting For Large Language Models(https://arxiv.org/abs/2407.15847)
Keywords: attack, robust, large language model
Abstract: We introduce LLMmap, a first-generation fingerprinting attack targeted at LLM-integrated applications. LLMmap employs an active fingerprinting approach, sending carefully crafted queries to the application and analyzing the responses to identify the specific LLM model in use. With as few as 8 interactions, LLMmap can accurately identify LLMs with over 95% accuracy. More importantly, LLMmap is designed to be robust across different application layers, allowing it to identify LLMs operating under various system prompts, stochastic sampling hyperparameters, and even complex generation frameworks such as RAG or Chain-of-Thought.

Title: AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Authors: Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2407.15850
Pdf URL: https://arxiv.org/pdf/2407.15850
Copy Paste: [[2407.15850]] AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description(https://arxiv.org/abs/2407.15850)
Keywords: large language model
Abstract: Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising a LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.