2024-03-19

Title: VISREAS: Complex Visual Reasoning with Unanswerable Questions

Authors: Syeda Nahida Akter, Sangwu Lee, Yingshan Chang, Yonatan Bisk, Eric Nyberg
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10534
Pdf URL: https://arxiv.org/pdf/2403.10534
Copy Paste: [[2403.10534]] VISREAS: Complex Visual Reasoning with Unanswerable Questions(https://arxiv.org/abs/2403.10534)
Keywords: generative
Abstract: Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should address the discrepancies in the query and convey them to the users rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, that consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VISREAS contains 2.07M semantically diverse queries generated automatically using Visual Genome scene graphs. The unique feature of this task, validating question answerability with respect to an image before answering, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION that reasons by producing and executing pseudocode without any external modules to generate the answer. LOGIC2VISION outperforms generative models in VISREAS (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain in performance against the classification models.

Title: Semi-Supervised Learning for Anomaly Traffic Detection via Bidirectional Normalizing Flows

Authors: Zhangxuan Dang, Yu Zheng, Xinglin Lin, Chunlei Peng, Qiuyu Chen, Xinbo Gao
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2403.10550
Pdf URL: https://arxiv.org/pdf/2403.10550
Copy Paste: [[2403.10550]] Semi-Supervised Learning for Anomaly Traffic Detection via Bidirectional Normalizing Flows(https://arxiv.org/abs/2403.10550)
Keywords: security
Abstract: With the rapid development of the Internet, various types of anomaly traffic are threatening network security. We consider the problem of anomaly network traffic detection and propose a three-stage anomaly detection framework using only normal traffic. Our framework can generate pseudo anomaly samples without prior knowledge of anomalies to achieve the detection of anomaly data. Firstly, we employ a reconstruction method to learn the deep representation of normal samples. Secondly, these representations are normalized to a standard normal distribution using a bidirectional flow module. To simulate anomaly samples, we add noise to the normalized representations, which are then passed through the generation direction of the bidirectional flow module. Finally, a simple classifier is trained to differentiate the normal samples and pseudo anomaly samples in the latent space. During inference, our framework requires only two modules to detect anomalous samples, leading to a considerable reduction in model size. According to the experiments, our method achieves state-of-the-art results on the common benchmarking datasets of anomaly network traffic detection. The code is available at https://github.com/ZxuanDang/ATD-via-Flows.git
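
The pseudo-anomaly step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a per-dimension affine map stands in for the learned bidirectional flow, and the toy data, the `noise_scale` value, and the function names `normalize` and `generate` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for learned representations of normal traffic (hypothetical data).
features = rng.normal(loc=3.0, scale=2.0, size=(500, 4))

# A per-dimension affine map is the simplest invertible "flow". The paper's
# bidirectional flow module is a learned network, not this closed form.
mean = features.mean(axis=0)
std = features.std(axis=0)

def normalize(x):
    """Normalizing direction: representation -> approximately N(0, I)."""
    return (x - mean) / std

def generate(z):
    """Generation direction: latent -> representation space."""
    return z * std + mean

# Perturb the normalized representations, then map them back.
noise_scale = 3.0  # assumed hyperparameter, not taken from the paper
z_normal = normalize(features)
z_noisy = z_normal + rng.normal(scale=noise_scale, size=z_normal.shape)
pseudo_anomalies = generate(z_noisy)

# Labels a downstream classifier could be trained on (0 = normal, 1 = pseudo anomaly).
labels = np.concatenate([np.zeros(len(features)), np.ones(len(pseudo_anomalies))])
```

Because the map is invertible, `generate(normalize(x))` recovers `x`, so only the injected noise moves samples away from the normal data.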

Title: Training Self-localization Models for Unseen Unfamiliar Places via Teacher-to-Student Data-Free Knowledge Transfer

Authors: Kenta Tsukahara, Kanji Tanaka, Daiki Iwata
Subjects: cs.LG, cs.AI, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.10552
Pdf URL: https://arxiv.org/pdf/2403.10552
Copy Paste: [[2403.10552]] Training Self-localization Models for Unseen Unfamiliar Places via Teacher-to-Student Data-Free Knowledge Transfer(https://arxiv.org/abs/2403.10552)
Keywords: privacy, data-free
Abstract: A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available in the target workspace. However, this does not always hold when a robot travels in a general open-world. This study introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot ("student") can ask the other robots it meets at unfamiliar places ("teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and thereafter used for continual learning of the student model. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, such that it can handle various types of open-set teachers, including uncooperative, untrainable (e.g., image retrieval engines), and blackbox teachers (i.e., data privacy). Rather than relying on the availability of private data of teachers as in existing methods, we propose to exploit an assumption that holds universally in self-localization tasks: "The teacher model is a self-localization system" and to reuse the self-localization system of a teacher as a sole accessible communication channel. We particularly focus on designing an excellent student/questioner whose interactions with teachers can yield effective question-and-answer sequences that can be used as pseudo-training datasets for the student self-localization model. When applied to a generic recursive knowledge distillation scenario, our approach exhibited stable and consistent performance improvement.

Title: Learning to Watermark LLM-generated Text via Reinforcement Learning

Authors: Xiaojun Xu, Yuanshun Yao, Yang Liu
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2403.10553
Pdf URL: https://arxiv.org/pdf/2403.10553
Copy Paste: [[2403.10553]] Learning to Watermark LLM-generated Text via Reinforcement Learning(https://arxiv.org/abs/2403.10553)
Keywords: attack, robust, watermark
Abstract: We study how to watermark LLM outputs, i.e. embedding algorithmically detectable signals into LLM-generated text to track misuse. Unlike the current mainstream methods that work with a fixed LLM, we expand the watermark design space by including the LLM tuning stage in the watermark pipeline. While prior works focus on token-level watermarks that embed signals into the output, we design a model-level watermark that embeds signals into the LLM weights, and such signals can be detected by a paired detector. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to detect the generated watermarked text and (2) tunes the LLM to generate text easily detectable by the detector while keeping its normal utility. We empirically show that our watermarks are more accurate, robust, and adaptable (to new attacks). Our approach also allows watermarked model open-sourcing. In addition, if used together with alignment, the extra overhead introduced is low - only training an extra reward model (i.e. our detector). We hope our work can bring more effort into studying a broader watermark design that is not limited to working with a fixed LLM. The code is open-sourced at https://github.com/xiaojunxu/learning-to-watermark-llm

Title: Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models

Authors: Kang Gu, Md Rafi Ur Rashid, Najrin Sultana, Shagufta Mehnaz
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.10557
Pdf URL: https://arxiv.org/pdf/2403.10557
Copy Paste: [[2403.10557]] Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models(https://arxiv.org/abs/2403.10557)
Keywords: privacy, robust, large language model
Abstract: With the rapid development of Large Language Models (LLMs), we have witnessed intense competition among the major LLM products like ChatGPT, LLaMa, and Gemini. However, various issues (e.g. privacy leakage and copyright violation) of the training corpus still remain underexplored. For example, the Times sued OpenAI and Microsoft for infringing on its copyrights by using millions of its articles for training. From the perspective of LLM practitioners, handling such unintended privacy violations can be challenging. Previous work addressed the "unlearning" problem of LLMs using gradient information, but these methods mostly introduced significant overheads like data preprocessing or lacked robustness. In this paper, contrasting with the methods based on first-order information, we revisit the unlearning problem from the perspective of second-order information (Hessian). Our unlearning algorithms, which are inspired by classic Newton update, are not only data-agnostic/model-agnostic but also proven to be robust in terms of utility preservation or privacy guarantee. Through a comprehensive evaluation with four NLP datasets as well as a case study on real-world datasets, our methods consistently show superiority over the first-order methods.
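
To make the "classic Newton update" concrete, here is a small sketch on ridge regression rather than an LLM. It is not the paper's algorithm: the synthetic data, the regularization strength `lam`, and the choice to forget the first five rows are assumptions. For a quadratic loss, a single Newton step on the retained-data objective lands exactly on the retrained solution, which is what the final comparison checks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1  # assumed ridge penalty

def fit_ridge(X, y):
    """Closed-form minimizer of 0.5*||X w - y||^2 + 0.5*lam*||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta_full = fit_ridge(X, y)

# "Forget" the first five rows and keep the rest.
X_keep, y_keep = X[5:], y[5:]

# Gradient and Hessian of the retained-data loss, evaluated at theta_full.
grad = X_keep.T @ (X_keep @ theta_full - y_keep) + lam * theta_full
hessian = X_keep.T @ X_keep + lam * np.eye(X.shape[1])

# One Newton step: theta <- theta - H^{-1} grad.
theta_unlearned = theta_full - np.linalg.solve(hessian, grad)

# Reference: retraining from scratch without the forgotten rows.
theta_retrained = fit_ridge(X_keep, y_keep)
print(np.allclose(theta_unlearned, theta_retrained))
```

For non-quadratic losses such as those of LLMs the step is only approximate, and forming or inverting the full Hessian is generally infeasible, so practical methods rely on approximations.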

Title: Adaptive Hybrid Masking Strategy for Privacy-Preserving Face Recognition Against Model Inversion Attack

Authors: Yuanqing Huang, Yinggui Wang, Jianshu Li, Le Yang, Kai Song, Lei Wang
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10558
Pdf URL: https://arxiv.org/pdf/2403.10558
Copy Paste: [[2403.10558]] Adaptive Hybrid Masking Strategy for Privacy-Preserving Face Recognition Against Model Inversion Attack(https://arxiv.org/abs/2403.10558)
Keywords: privacy, protect, defense, attack
Abstract: The utilization of personal sensitive data in training face recognition (FR) models poses significant privacy concerns, as adversaries can employ model inversion attacks (MIA) to infer the original training data. Existing defense methods, such as data augmentation and differential privacy, have been employed to mitigate this issue. However, these methods often fail to strike an optimal balance between privacy and accuracy. To address this limitation, this paper introduces an adaptive hybrid masking algorithm against MIA. Specifically, face images are masked in the frequency domain using an adaptive MixUp strategy. Unlike the traditional MixUp algorithm, which is predominantly used for data augmentation, our modified approach incorporates frequency domain mixing. Previous studies have shown that increasing the number of images mixed in MixUp can enhance privacy preservation but at the expense of reduced face recognition accuracy. To overcome this trade-off, we develop an enhanced adaptive MixUp strategy based on reinforcement learning, which enables us to mix a larger number of images while maintaining satisfactory recognition accuracy. To optimize privacy protection, we propose maximizing the reward function (i.e., the loss function of the FR system) during the training of the strategy network, while the loss function of the FR network is minimized when training the FR network itself. The strategy network and the face recognition network can be viewed as antagonistic entities in the training process, ultimately reaching a more balanced trade-off. Experimental results demonstrate that our proposed hybrid masking scheme outperforms existing defense algorithms in terms of privacy preservation and recognition accuracy against MIA.
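
The idea of mixing several images in the frequency domain can be sketched as follows. This is an illustration only: the abstract does not specify the transform or how weights are chosen, so the 2-D FFT, the fixed weights, and the function name `frequency_mixup` are assumptions, and the reinforcement-learning strategy network is not modeled.

```python
import numpy as np

def frequency_mixup(images, weights):
    """Mix K same-sized grayscale images in the 2-D Fourier domain.

    images:  array of shape (K, H, W)
    weights: K non-negative mixing coefficients (normalized to sum to 1)
    Returns the mixed complex spectrum of shape (H, W).
    """
    images = np.asarray(images, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    spectra = np.fft.fft2(images, axes=(-2, -1))
    # Weighted sum over the K spectra.
    return np.tensordot(weights, spectra, axes=1)

rng = np.random.default_rng(0)
faces = rng.random((3, 8, 8))  # hypothetical stand-ins for face images
mixed_spectrum = frequency_mixup(faces, [0.5, 0.3, 0.2])
print(mixed_spectrum.shape)
```

With scalar weights the mix is linear, so inverting the spectrum reproduces an ordinary pixel-space MixUp; per-frequency weights or masks would be needed for the two domains to differ.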

Title: Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI

Authors: Dong Shu, Zhouyao Zhu
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2403.10559
Pdf URL: https://arxiv.org/pdf/2403.10559
Copy Paste: [[2403.10559]] Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI(https://arxiv.org/abs/2403.10559)
Keywords: generative
Abstract: This report investigates the history and impact of Generative Models and Connected and Automated Vehicles (CAVs), two groundbreaking forces pushing progress in technology and transportation. By focusing on the application of generative models within the context of CAVs, the study aims to unravel how this integration could enhance predictive modeling, simulation accuracy, and decision-making processes in autonomous vehicles. This thesis discusses the benefits and challenges of integrating generative models and CAV technology in transportation. It aims to highlight the progress made, the remaining obstacles, and the potential for advancements in safety and innovation.

Title: Counter-Samples: A Stateless Strategy to Neutralize Black Box Adversarial Attacks

Authors: Roey Bokobza, Yisroel Mirsky
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10562
Pdf URL: https://arxiv.org/pdf/2403.10562
Copy Paste: [[2403.10562]] Counter-Samples: A Stateless Strategy to Neutralize Black Box Adversarial Attacks(https://arxiv.org/abs/2403.10562)
Keywords: attack, robust
Abstract: Our paper presents a novel defence against black box attacks, where attackers use the victim model as an oracle to craft their adversarial examples. Unlike traditional preprocessing defences that rely on sanitizing input samples, our stateless strategy counters the attack process itself. For every query we evaluate a counter-sample instead, where the counter-sample is the original sample optimized against the attacker's objective. By countering every black box query with a targeted white box optimization, our strategy effectively introduces an asymmetry to the game to the defender's advantage. This defence not only effectively misleads the attacker's search for an adversarial example, it also preserves the model's accuracy on legitimate inputs and is generic to multiple types of attacks. We demonstrate that our approach is remarkably effective against state-of-the-art black box attacks and outperforms existing defences for both the CIFAR-10 and ImageNet datasets. Additionally, we also show that the proposed defence is robust against strong adversaries as well.

Title: Cooling-Guide Diffusion Model for Battery Cell Arrangement

Authors: Nicholas Sung, Liu Zheng, Pingfeng Wang, Faez Ahmed
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10566
Pdf URL: https://arxiv.org/pdf/2403.10566
Copy Paste: [[2403.10566]] Cooling-Guide Diffusion Model for Battery Cell Arrangement(https://arxiv.org/abs/2403.10566)
Keywords: diffusion, generative
Abstract: Our study introduces a Generative AI method that employs a cooling-guided diffusion model to optimize the layout of battery cells, a crucial step for enhancing the cooling performance and efficiency of battery thermal management systems. Traditional design processes, which rely heavily on iterative optimization and extensive guesswork, are notoriously slow and inefficient, often leading to suboptimal solutions. In contrast, our innovative method uses a parametric denoising diffusion probabilistic model (DDPM) with classifier and cooling guidance to generate optimized cell layouts with enhanced cooling paths, significantly lowering the maximum temperature of the cells. By incorporating position-based classifier guidance, we ensure the feasibility of generated layouts. Meanwhile, cooling guidance directly optimizes cooling-efficiency, making our approach uniquely effective. When compared to two advanced models, the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) and the Conditional Tabular GAN (CTGAN), our cooling-guided diffusion model notably outperforms both. It is five times more effective than TabDDPM and sixty-six times better than CTGAN across key metrics such as feasibility, diversity, and cooling efficiency. This research marks a significant leap forward in the field, aiming to optimize battery cell layouts for superior cooling efficiency, thus setting the stage for the development of more effective and dependable battery thermal management systems.

Title: Symbiotic Game and Foundation Models for Cyber Deception Operations in Strategic Cyber Warfare

Authors: Tao Li, Quanyan Zhu
Subjects: cs.CR, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2403.10570
Pdf URL: https://arxiv.org/pdf/2403.10570
Copy Paste: [[2403.10570]] Symbiotic Game and Foundation Models for Cyber Deception Operations in Strategic Cyber Warfare(https://arxiv.org/abs/2403.10570)
Keywords: security, defense, attack
Abstract: We are currently facing unprecedented cyber warfare with the rapid evolution of tactics, increasing asymmetry of intelligence, and the growing accessibility of hacking tools. In this landscape, cyber deception emerges as a critical component of our defense strategy against increasingly sophisticated attacks. This chapter aims to highlight the pivotal role of game-theoretic models and foundation models (FMs) in analyzing, designing, and implementing cyber deception tactics. Game models (GMs) serve as a foundational framework for modeling diverse adversarial interactions, allowing us to encapsulate both adversarial knowledge and domain-specific insights. Meanwhile, FMs serve as the building blocks for creating tailored machine learning models suited to given applications. By leveraging the synergy between GMs and FMs, we can advance proactive and automated cyber defense mechanisms by not only securing our networks against attacks but also enhancing their resilience against well-planned operations. This chapter discusses the games at the tactical, operational, and strategic levels of warfare, delves into the symbiotic relationship between these methodologies, and explores relevant applications where such a framework can make a substantial impact in cybersecurity. The chapter discusses the promising direction of the multi-agent neurosymbolic conjectural learning (MANSCOL), which allows the defender to predict adversarial behaviors, design adaptive defensive deception tactics, and synthesize knowledge for the operational level synthesis and adaptation. FMs serve as pivotal tools across various functions for MANSCOL, including reinforcement learning, knowledge assimilation, formation of conjectures, and contextual representation. This chapter concludes with a discussion of the challenges associated with FMs and their application in the domain of cybersecurity.

Title: Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers

Authors: Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10574
Pdf URL: https://arxiv.org/pdf/2403.10574
Copy Paste: [[2403.10574]] Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers(https://arxiv.org/abs/2403.10574)
Keywords: robust, transformer
Abstract: Rich spatio-temporal information is crucial to capture the complicated target appearance variations in visual tracking. However, most top-performing tracking algorithms rely on many hand-crafted components for spatio-temporal information aggregation. Consequently, the spatio-temporal information is far from being fully explored. To alleviate this issue, we propose an adaptive tracker with spatio-temporal transformers (named AQATrack), which adopts simple autoregressive queries to effectively learn spatio-temporal information without many hand-designed components. Firstly, we introduce a set of learnable and autoregressive queries to capture the instantaneous target appearance changes in a sliding window fashion. Then, we design a novel attention mechanism for the interaction of existing queries to generate a new query in the current frame. Finally, based on the initial target template and learnt autoregressive queries, a spatio-temporal information fusion module (STM) is designed for spatio-temporal information aggregation to locate a target object. Benefiting from the STM, we can effectively combine the static appearance and instantaneous changes to guide robust tracking. Extensive experiments show that our method significantly improves the tracker's performance on six popular tracking benchmarks: LaSOT, LaSOText, TrackingNet, GOT-10k, TNL2K, and UAV123.

Title: Ignore Me But Don't Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain

Authors: Eugene Jang, Jian Cui, Dayeon Yim, Youngjin Jin, Jin-Woo Chung, Seungwon Shin, Yongjae Lee
Subjects: cs.CR, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10576
Pdf URL: https://arxiv.org/pdf/2403.10576
Copy Paste: [[2403.10576]] Ignore Me But Don't Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain(https://arxiv.org/abs/2403.10576)
Keywords: security
Abstract: Cybersecurity information is often technically complex and relayed through unstructured text, making automation of cyber threat intelligence highly challenging. For such text domains that involve high levels of expertise, pretraining on in-domain corpora has been a popular method for language models to obtain domain expertise. However, cybersecurity texts often contain non-linguistic elements (such as URLs and hash values) that could be unsuitable for the established pretraining methodologies. Previous work in other domains has removed or filtered such text as noise, but the effectiveness of these methods has not been investigated, especially in the cybersecurity domain. We propose different pretraining methodologies and evaluate their effectiveness through downstream tasks and probing tasks. Our proposed strategy (selective MLM and jointly training NLE token classification) outperforms the commonly taken approach of replacing non-linguistic elements (NLEs). We use our domain-customized methodology to train CyBERTuned, a cybersecurity domain language model that outperforms other cybersecurity PLMs on most tasks.

Title: From Algorithms to Outcomes: Reviewing AI's Role in Non-Muscle-Invasive Bladder Cancer Recurrence Prediction

Authors: Saram Abbas, Dr Rishad Shafik, Prof Naeem Soomro, Prof Rakesh Heer, Dr Kabita Adhikari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10586
Pdf URL: https://arxiv.org/pdf/2403.10586
Copy Paste: [[2403.10586]] From Algorithms to Outcomes: Reviewing AI's Role in Non-Muscle-Invasive Bladder Cancer Recurrence Prediction(https://arxiv.org/abs/2403.10586)
Keywords: robust, interpretability
Abstract: Bladder cancer, the leading urinary tract cancer, is responsible for 15 deaths daily in the UK. This cancer predominantly manifests as non-muscle-invasive bladder cancer (NMIBC), characterised by tumours not yet penetrating the muscle layer of the bladder wall. NMIBC is plagued by a very high recurrence rate of 70-80% and hence requires the costliest treatments. Current tools for predicting recurrence use scoring systems that overestimate risk and have poor accuracy. Inaccurate and delayed prediction of recurrence significantly elevates the likelihood of mortality. Accurate prediction of recurrence is hence vital for cost-effective management and treatment planning. Machine learning (ML) techniques have emerged as a promising approach for predicting NMIBC recurrence by leveraging molecular and clinical data. This review provides a comprehensive analysis of ML approaches for predicting NMIBC recurrence. Our systematic evaluation demonstrates the potential of diverse ML algorithms and markers, including radiomic, clinical, histopathological, genomic, and biochemical data in enhancing recurrence prediction and personalised patient management. We summarise various prediction tasks, data modalities, and ML models used, highlighting their performance, limitations, and future directions of incorporating cost-effectiveness. Challenges related to generalisability and interpretability of artificial intelligence models are discussed, emphasising the need for collaborative efforts and robust datasets.

Title: Neural Erosion: Emulating Controlled Neurodegeneration and Aging in AI Systems

Authors: Antonios Alexos, Yu-Dai Tsai, Ian Domingo, Maryam Pishgar, Pierre Baldi
Subjects: cs.CL, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2403.10596
Pdf URL: https://arxiv.org/pdf/2403.10596
Copy Paste: [[2403.10596]] Neural Erosion: Emulating Controlled Neurodegeneration and Aging in AI Systems(https://arxiv.org/abs/2403.10596)
Keywords: generative, large language model
Abstract: Creating controlled methods to simulate neurodegeneration in artificial intelligence (AI) is crucial for applications that emulate brain function decline and cognitive disorders. We use IQ tests performed by Large Language Models (LLMs), specifically LLaMA 2, to introduce the concept of "neural erosion." This deliberate erosion involves ablating synapses or neurons, or adding Gaussian noise during or after training, resulting in a controlled progressive decline in the LLMs' performance. We are able to describe the neurodegeneration in the IQ tests and show that the LLM first loses its mathematical abilities and then its linguistic abilities, while further losing its ability to understand the questions. To the best of our knowledge, this is the first work that models neurodegeneration with text data, compared to other works that operate in the computer vision domain. Finally, we draw similarities between our study and cognitive decline clinical studies involving test subjects. We find that with the application of neurodegenerative methods, LLMs lose abstract thinking abilities, followed by mathematical degradation, and ultimately, a loss in linguistic ability, responding to prompts incoherently. These findings are in accordance with human studies.

Title: SurvRNC: Learning Ordered Representations for Survival Prediction using Rank-N-Contrast

Authors: Numan Saeed, Muhammad Ridzuan, Fadillah Adamsyah Maani, Hussain Alasmawi, Karthik Nandakumar, Mohammad Yaqub
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10603
Pdf URL: https://arxiv.org/pdf/2403.10603
Copy Paste: [[2403.10603]] SurvRNC: Learning Ordered Representations for Survival Prediction using Rank-N-Contrast(https://arxiv.org/abs/2403.10603)
Keywords: segmentation
Abstract: Predicting the likelihood of survival is of paramount importance for individuals diagnosed with cancer as it provides invaluable information regarding prognosis at an early stage. This knowledge enables the formulation of effective treatment plans that lead to improved patient outcomes. In the past few years, deep learning models have provided a feasible solution for assessing medical images, electronic health records, and genomic data to estimate cancer risk scores. However, these models often fall short of their potential because they struggle to learn regression-aware feature representations. In this study, we propose the Survival Rank-N Contrast (SurvRNC) method, which introduces a loss function as a regularizer to obtain an ordered representation based on the survival times. This function can handle censored data and can be incorporated into any survival model to ensure that the learned representation is ordinal. The model was extensively evaluated on the HEad & NeCK TumOR (HECKTOR) segmentation and outcome-prediction task dataset. We demonstrate that using the SurvRNC method for training can achieve higher performance on different deep survival models. Additionally, it outperforms state-of-the-art methods by 3.6% on the concordance index. The code is publicly available at https://github.com/numanai/SurvRNC
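
The concordance index mentioned above is a standard ranking metric for survival models. The sketch below follows the usual Harrell-style definition and is not taken from the paper's code: a pair is comparable when the subject with the shorter time had an observed event, pairs with tied times are skipped for simplicity, and the function name and toy values are assumptions.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs whose predicted risks are correctly ordered.

    times:  observed follow-up times
    events: 1 if the event was observed, 0 if the subject was censored
    risks:  predicted risk scores (higher = expected to fail sooner)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example: the third subject is censored at time 6.
score = concordance_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.4, 0.5, 0.1],
)
print(score)  # 0.8 (4 of 5 comparable pairs are ordered correctly)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking.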

Title: LightIt: Illumination Modeling and Control for Diffusion Models

Authors: Peter Kocsis (1), Julien Philip (2), Kalyan Sunkavalli (2), Matthias Nießner (1), Yannick Hold-Geoffroy (2) ((1) Technical University of Munich, (2) Adobe Research)
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10615
Pdf URL: https://arxiv.org/pdf/2403.10615
Copy Paste: [[2403.10615]] LightIt: Illumination Modeling and Control for Diffusion Models(https://arxiv.org/abs/2403.10615)
Keywords: diffusion, generative
Abstract: We introduce LightIt, a method for explicit illumination control for image generation. Recent generative methods lack lighting control, which is crucial to numerous artistic aspects of image generation such as setting the overall mood or cinematic appearance. To overcome these limitations, we propose to condition the generation on shading and normal maps. We model the lighting with single bounce shading, which includes cast shadows. We first train a shading estimation module to generate a dataset of real-world images and shading pairs. Then, we train a control network using the estimated shading and normals as input. Our method demonstrates high-quality image generation and lighting control in numerous scenes. Additionally, we use our generated dataset to train an identity-preserving relighting model, conditioned on an image and a target shading. Our method is the first that enables the generation of images with controllable, consistent lighting and performs on par with specialized relighting state-of-the-art methods.

Title: DiPaCo: Distributed Path Composition

Authors: Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2403.10616
Pdf URL: https://arxiv.org/pdf/2403.10616
Copy Paste: [[2403.10616]] DiPaCo: Distributed Path Composition(https://arxiv.org/abs/2403.10616)
Keywords: robust, transformer
Abstract: Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high bandwidth communication between devices working in parallel. In this work, we propose a co-designed modular architecture and training approach for ML models, dubbed DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes computation by paths through a set of shared modules. Together with a Local-SGD inspired optimization (DiLoCo) that keeps modules in sync with drastically reduced communication, our approach facilitates training across poorly connected and heterogeneous workers, with a design that ensures robustness to worker failures and preemptions. At inference time, only a single path needs to be executed for each input, without the need for any model compression. We consider this approach as a first prototype towards a new paradigm of large-scale learning, one that is less synchronous and more modular. Our experiments on the widely used C4 benchmark show that, for the same amount of training steps but less wall-clock time, DiPaCo exceeds the performance of a 1 billion-parameter dense transformer language model by choosing one of 256 possible paths, each with a size of 150 million parameters.

Title: Leveraging CLIP for Inferring Sensitive Information and Improving Model Fairness

Authors: Miao Zhang, Rumi Chunara
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10624
Pdf URL: https://arxiv.org/pdf/2403.10624
Copy Paste: [[2403.10624]] Leveraging CLIP for Inferring Sensitive Information and Improving Model Fairness(https://arxiv.org/abs/2403.10624)
Keywords: fair
Abstract: Performance disparities across sub-populations are known to exist in deep learning-based vision recognition models, but previous work has largely addressed such fairness concerns assuming knowledge of sensitive attribute labels. To overcome this reliance, previous strategies have involved separate learning structures to expose and adjust for disparities. In this work, we explore a new paradigm that does not require sensitive attribute labels, and evades the need for extra training by leveraging the vision-language model, CLIP, as a rich knowledge source to infer sensitive information. We present sample clustering based on similarity derived from image and attribute-specified language embeddings and assess their correspondence to true attribute distribution. We train a target model by re-sampling and augmenting under-performing clusters. Extensive experiments on multiple benchmark bias datasets show clear fairness gains of the model over existing baselines, which indicate that CLIP can extract discriminative sensitive information prompted by language, and can be used to promote model fairness.

Title: MeDSLIP: Medical Dual-Stream Language-Image Pre-training for Fine-grained Alignment

Authors: Wenrui Fan, Mohammod Naimul Islam Suvon, Shuo Zhou, Xianyuan Liu, Samer Alabed, Venet Osmani, Andrew Swift, Chen Chen, Haiping Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10635
Pdf URL: https://arxiv.org/pdf/2403.10635
Copy Paste: [[2403.10635]] MeDSLIP: Medical Dual-Stream Language-Image Pre-training for Fine-grained Alignment(https://arxiv.org/abs/2403.10635)
Keywords: segmentation
Abstract: Vision-language pre-training (VLP) models have shown significant advancements in the medical domain. Yet, most VLP models align raw reports to images at a very coarse level, without modeling fine-grained relationships between anatomical and pathological concepts outlined in reports and the corresponding semantic counterparts in images. To address this problem, we propose a Medical Dual-Stream Language-Image Pre-training (MeDSLIP) framework. Specifically, MeDSLIP establishes vision-language fine-grained alignments via disentangling visual and textual representations into anatomy-relevant and pathology-relevant streams. Moreover, a novel vision-language Prototypical Contrastive Learning (ProtoCL) method is adopted in MeDSLIP to enhance the alignment within the anatomical and pathological streams. MeDSLIP further employs cross-stream Intra-image Contrastive Learning (ICL) to ensure the consistent coexistence of paired anatomical and pathological concepts within the same image. Such a cross-stream regularization encourages the model to exploit the synchrony between two streams for a more comprehensive representation learning. MeDSLIP is evaluated under zero-shot and supervised fine-tuning settings on three public datasets: NIH CXR14, RSNA Pneumonia, and SIIM-ACR Pneumothorax. Under these settings, MeDSLIP outperforms six leading CNN-based models on classification, grounding, and segmentation tasks.

Title: A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks

Authors: Beatrice Casey, Joanna C. S. Santos, George Perry
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2403.10646
Pdf URL: https://arxiv.org/pdf/2403.10646
Copy Paste: [[2403.10646]] A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks(https://arxiv.org/abs/2403.10646)
Keywords: security
Abstract: Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what's not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.

Title: PALM: Pushing Adaptive Learning Rate Mechanisms for Continual Test-Time Adaptation

Authors: Sarthak Kumar Maharana, Baoming Zhang, Yunhui Guo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10650
Pdf URL: https://arxiv.org/pdf/2403.10650
Copy Paste: [[2403.10650]] PALM: Pushing Adaptive Learning Rate Mechanisms for Continual Test-Time Adaptation(https://arxiv.org/abs/2403.10650)
Keywords: robust
Abstract: Real-world vision models in dynamic environments face rapid shifts in domain distributions, leading to decreased recognition performance. Continual test-time adaptation (CTTA) directly adjusts a pre-trained source discriminative model to these changing domains using test data. A highly effective CTTA method involves applying layer-wise adaptive learning rates, and selectively adapting pre-trained layers. However, it suffers from the poor estimation of domain shift and the inaccuracies arising from the pseudo-labels. In this work, we aim to overcome these limitations by identifying layers through the quantification of model prediction uncertainty without relying on pseudo-labels. We utilize the magnitude of gradients as a metric, calculated by backpropagating the KL divergence between the softmax output and a uniform distribution, to select layers for further adaptation. Subsequently, for the parameters exclusively belonging to these selected layers, with the remaining ones frozen, we evaluate their sensitivity in order to approximate the domain shift, followed by adjusting their learning rates accordingly. Overall, this approach leads to a more robust and stable optimization than prior approaches. We conduct extensive image classification experiments on CIFAR-10C, CIFAR-100C, and ImageNet-C and demonstrate the efficacy of our method against standard benchmarks and prior methods.
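
The uncertainty signal described above can be sketched in a few lines of Python. The sketch computes the KL divergence between a softmax output and the uniform distribution, along with its analytic gradient with respect to the logits. Two simplifications are assumptions of this example rather than details from the abstract: the direction KL(p || u), and taking the gradient at the logits instead of backpropagating to per-layer parameters.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_uniform(probs):
    """KL(p || u) for uniform u over len(probs) classes."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def kl_grad_wrt_logits(probs):
    """Gradient of KL(p || u) w.r.t. logits: p_j * (log p_j - sum_i p_i log p_i)."""
    neg_entropy = sum(p * math.log(p) for p in probs if p > 0.0)
    return [p * (math.log(p) - neg_entropy) if p > 0.0 else 0.0 for p in probs]


def grad_norm(grad):
    """Euclidean norm, used here as the gradient-magnitude score."""
    return math.sqrt(sum(g * g for g in grad))


uncertain = softmax([0.0, 0.0, 0.0, 0.0])  # maximally uncertain prediction
confident = softmax([5.0, 0.0, 0.0])       # peaked prediction
```

A uniform prediction gives zero divergence and zero gradient, while a peaked prediction gives a positive divergence; in the method summarized above, gradient magnitudes of this kind are what drive layer selection.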

Title: Improving Fairness in Credit Lending Models using Subgroup Threshold Optimization

Authors: Cecilia Ying, Stephen Thomas
Subjects: cs.LG, q-fin.RM
Abstract URL: https://arxiv.org/abs/2403.10652
Pdf URL: https://arxiv.org/pdf/2403.10652
Copy Paste: [[2403.10652]] Improving Fairness in Credit Lending Models using Subgroup Threshold Optimization(https://arxiv.org/abs/2403.10652)
Keywords: fair
Abstract: In an effort to improve the accuracy of credit lending decisions, many financial institutions are now using predictions from machine learning models. While such predictions enjoy many advantages, recent research has shown that the predictions have the potential to be biased and unfair towards certain subgroups of the population. To combat this, several techniques have been introduced to help remove the bias and improve the overall fairness of the predictions. We introduce a new fairness technique, called Subgroup Threshold Optimizer (STO), that does not require any alterations to the input training data, nor does it require any changes to the underlying machine learning algorithm, and thus can be used with any existing machine learning pipeline. STO works by optimizing the classification thresholds for individual subgroups in order to minimize the overall discrimination score between them. Our experiments on a real-world credit lending dataset show that STO can reduce gender discrimination by over 90%.
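
The idea of per-subgroup thresholds can be illustrated with a small Python sketch. The abstract does not define the discrimination score, so this example assumes it is the gap in positive-prediction rates between two subgroups, holds one subgroup at a default threshold, and grid-searches the other subgroup's threshold. The function names and the scores are hypothetical.

```python
def positive_rate(scores, threshold):
    """Fraction of model scores at or above the decision threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


def tune_threshold(reference_scores, group_scores, reference_threshold, grid):
    """Pick the group threshold whose positive rate is closest to the reference rate."""
    target = positive_rate(reference_scores, reference_threshold)
    best_t, best_gap = None, float("inf")
    for t in grid:
        gap = abs(positive_rate(group_scores, t) - target)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t, best_gap


# Hypothetical model scores for two subgroups.
group_a = [0.2, 0.4, 0.6, 0.8]
group_b = [0.1, 0.15, 0.35, 0.45]
grid = [i / 10 for i in range(1, 10)]

gap_before = abs(positive_rate(group_a, 0.5) - positive_rate(group_b, 0.5))
threshold_b, gap_after = tune_threshold(group_a, group_b, 0.5, grid)
```

Only the decision thresholds change here; the training data and the model that produced the scores are untouched, which matches the property the abstract emphasizes.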

Title: Towards Practical Fabrication Stage Attacks Using Interrupt-Resilient Hardware Trojans

Authors: Athanasios Moschos, Fabian Monrose, Angelos D. Keromytis
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.10659
Pdf URL: https://arxiv.org/pdf/2403.10659
Copy Paste: [[2403.10659]] Towards Practical Fabrication Stage Attacks Using Interrupt-Resilient Hardware Trojans(https://arxiv.org/abs/2403.10659)
Keywords: attack
Abstract: We introduce a new class of hardware trojans called interrupt-resilient trojans (IRTs). Our work is motivated by the observation that hardware trojan attacks on CPUs, even under favorable attack scenarios (e.g., an attacker with local system access), are affected by unpredictability due to non-deterministic context switching events. As we confirm experimentally, these events can lead to race conditions between trigger signals and the CPU events targeted by the trojan payloads (e.g., a CPU memory access), thus affecting the reliability of the attacks. Our work shows that interrupt-resilient trojans can successfully address the problem of non-deterministic triggering in CPUs, thereby providing high reliability guarantees in the implementation of sophisticated hardware trojan attacks. Specifically, we successfully utilize IRTs in different attack scenarios against a Linux-capable CPU design and showcase its resilience against context-switching events. More importantly, we show that our design allows for seamless integration during fabrication stage attacks. We evaluate different strategies for the implementation of our attacks on a tape-out ready high-speed RISC-V microarchitecture in a 28nm commercial technology process and successfully implement them with an average overhead delay of only 20 picoseconds, while leaving the sign-off characteristics of the layout intact. In doing so, we challenge the common wisdom regarding the low flexibility of late supply chain stages (e.g., fabrication) for the insertion of powerful trojans. To promote further research on microprocessor trojans, we open-source our designs and provide the accompanying supporting software logic.

Title: SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images

Authors: Pardis Taghavi, Reza Langari, Gaurav Pandey
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10662
Pdf URL: https://arxiv.org/pdf/2403.10662
Copy Paste: [[2403.10662]] SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images(https://arxiv.org/abs/2403.10662)
Keywords: segmentation
Abstract: This research paper presents an innovative multi-task learning framework that allows concurrent depth estimation and semantic segmentation using a single camera. The proposed approach is based on a shared encoder-decoder architecture, which integrates various techniques to improve the accuracy of the depth estimation and semantic segmentation task without compromising computational efficiency. Additionally, the paper incorporates an adversarial training component, employing a Wasserstein GAN framework with a critic network, to refine the model's predictions. The framework is thoroughly evaluated on two datasets - the outdoor Cityscapes dataset and the indoor NYU Depth V2 dataset - and it outperforms existing state-of-the-art methods in both segmentation and depth estimation tasks. We also conducted ablation studies to analyze the contributions of different components, including pre-training strategies, the inclusion of critics, the use of logarithmic depth scaling, and advanced image augmentations, to provide a better understanding of the proposed framework. The accompanying source code is accessible at https://github.com/PardisTaghavi/SwinMTL.

Title: Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data

Authors: Yuxuan Li, Sarthak Kumar Maharana, Yunhui Guo
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10663
Pdf URL: https://arxiv.org/pdf/2403.10663
Copy Paste: [[2403.10663]] Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data(https://arxiv.org/abs/2403.10663)
Keywords: protect, attack, steal, extraction, watermark
Abstract: With the increasing prevalence of Machine Learning as a Service (MLaaS) platforms, there is a growing focus on deep neural network (DNN) watermarking techniques. These methods are used to facilitate the verification of ownership for a target DNN model to protect intellectual property. One of the most widely employed watermarking techniques involves embedding a trigger set into the source model. Unfortunately, existing methodologies based on trigger sets are still susceptible to functionality-stealing attacks, potentially enabling adversaries to steal the functionality of the source model without a reliable means of verifying ownership. In this paper, we first examine trigger set-based watermarking methods from a novel feature learning perspective. Specifically, we demonstrate that by selecting data exhibiting multiple features, also referred to as multi-view data, it becomes feasible to effectively defend against functionality-stealing attacks. Based on this perspective, we introduce a novel watermarking technique based on Multi-view dATa, called MAT, for efficiently embedding watermarks within DNNs. This approach involves constructing a trigger set with multi-view data and incorporating a simple feature-based regularization method for training the source model. We validate our method across various benchmarks and demonstrate its efficacy in defending against model extraction attacks, surpassing relevant baselines by a significant margin.

Title: GS-Pose: Cascaded Framework for Generalizable Segmentation-based 6D Object Pose Estimation

Authors: Dingding Cai, Janne Heikkilä, Esa Rahtu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10683
Pdf URL: https://arxiv.org/pdf/2403.10683
Copy Paste: [[2403.10683]] GS-Pose: Cascaded Framework for Generalizable Segmentation-based 6D Object Pose Estimation(https://arxiv.org/abs/2403.10683)
Keywords: segmentation
Abstract: This paper introduces GS-Pose, an end-to-end framework for locating and estimating the 6D pose of objects. GS-Pose begins with a set of posed RGB images of a previously unseen object and builds three distinct representations stored in a database. At inference, GS-Pose operates sequentially by locating the object in the input image, estimating its initial 6D pose using a retrieval approach, and refining the pose with a render-and-compare method. The key insight is the application of the appropriate object representation at each stage of the process. In particular, for the refinement step, we utilize 3D Gaussian splatting, a novel differentiable rendering technique that offers high rendering speed and relatively low optimization time. Off-the-shelf toolchains and commodity hardware, such as mobile phones, can be used to capture new objects to be added to the database. Extensive evaluations on the LINEMOD and OnePose-LowTexture datasets demonstrate excellent performance, establishing the new state-of-the-art. Project page: https://dingdingcai.github.io/gs-pose.

Title: MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

Authors: Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10691
Pdf URL: https://arxiv.org/pdf/2403.10691
Copy Paste: [[2403.10691]] MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling(https://arxiv.org/abs/2403.10691)
Keywords: fair
Abstract: A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.

Title: On the low-shot transferability of [V]-Mamba

Authors: Diganta Misra, Jay Gala, Antonio Orvieto
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10696
Pdf URL: https://arxiv.org/pdf/2403.10696
Copy Paste: [[2403.10696]] On the low-shot transferability of [V]-Mamba(https://arxiv.org/abs/2403.10696)
Keywords: transformer
Abstract: The strength of modern large-scale neural networks lies in their ability to efficiently adapt to new tasks with few examples. Although extensive research has investigated the transferability of Vision Transformers (ViTs) to various downstream tasks under diverse constraints, this study shifts focus to explore the transfer learning potential of [V]-Mamba. We compare its performance with ViTs across different few-shot data budgets and efficient transfer methods. Our analysis yields three key insights into [V]-Mamba's few-shot transfer performance: (a) [V]-Mamba demonstrates superior or equivalent few-shot learning capabilities compared to ViTs when utilizing linear probing (LP) for transfer, (b) Conversely, [V]-Mamba exhibits weaker or similar few-shot learning performance compared to ViTs when employing visual prompting (VP) as the transfer method, and (c) We observe a weak positive correlation between the performance gap in transfer via LP and VP and the scale of the [V]-Mamba model. This preliminary analysis lays the foundation for more comprehensive studies aimed at furthering our understanding of the capabilities of [V]-Mamba variants and their distinctions from ViTs.

Title: Robust Influence-based Training Methods for Noisy Brain MRI

Authors: Minh-Hao Van, Alycia N. Carey, Xintao Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10698
Pdf URL: https://arxiv.org/pdf/2403.10698
Copy Paste: [[2403.10698]] Robust Influence-based Training Methods for Noisy Brain MRI(https://arxiv.org/abs/2403.10698)
Keywords: robust
Abstract: Correctly classifying brain tumors is imperative to the prompt and accurate treatment of a patient. While several classification algorithms based on classical image processing or deep learning methods have been proposed to rapidly classify tumors in MR images, most assume the unrealistic setting of noise-free training data. In this work, we study a difficult but realistic setting of training a deep learning model on noisy MR images to classify brain tumors. We propose two training methods that are robust to noisy MRI training data, Influence-based Sample Reweighing (ISR) and Influence-based Sample Perturbation (ISP), which are based on influence functions from robust statistics. Using the influence functions, in ISR, we adaptively reweigh training examples according to how helpful/harmful they are to the training process, while in ISP, we craft and inject helpful perturbation proportional to the influence score. Both ISR and ISP harden the classification model against noisy training data without significantly affecting the generalization ability of the model on test data. We conduct empirical evaluations over a common brain tumor dataset and compare ISR and ISP to three baselines. Our empirical results show that ISR and ISP can efficiently train deep learning models robust against noisy training data.

Title: IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation

Authors: Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Daniel Aliaga
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10701
Pdf URL: https://arxiv.org/pdf/2403.10701
Copy Paste: [[2403.10701]] IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation(https://arxiv.org/abs/2403.10701)
Keywords: diffusion, generative
Abstract: Generative object compositing emerges as a promising new avenue for compositional image editing. However, the requirement of object identity preservation poses a significant challenge, limiting practical usage of most existing methods. In response, this paper introduces IMPRINT, a novel diffusion-based generative model trained with a two-stage learning framework that decouples learning of identity preservation from that of compositing. The first stage is targeted for context-agnostic, identity-preserving pretraining of the object encoder, enabling the encoder to learn an embedding that is both view-invariant and conducive to enhanced detail preservation. The subsequent stage leverages this representation to learn seamless harmonization of the object composited to the background. In addition, IMPRINT incorporates a shape-guidance mechanism offering user-directed control over the compositing process. Extensive experiments demonstrate that IMPRINT significantly outperforms existing methods and various baselines on identity preservation and composition quality.

Title: PERL: Parameter Efficient Reinforcement Learning from Human Feedback

Authors: Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Roman Komarytsia, Christiane Ahlheim, Yonghao Zhu, Simral Chaudhary, Bowen Li, Saravanan Ganesh, Bill Byrne, Jessica Hoffmann, Hassan Mansoor, Wei Li, Abhinav Rastogi, Lucas Dixon
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.10704
Pdf URL: https://arxiv.org/pdf/2403.10704
Copy Paste: [[2403.10704]] PERL: Parameter Efficient Reinforcement Learning from Human Feedback(https://arxiv.org/abs/2403.10704)
Keywords: large language model
Abstract: Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method to align Pretrained Large Language Models (LLMs) with human preferences. But training models with RLHF is computationally expensive, and an overall complex process. In this work, we study RLHF where the underlying models are trained using the parameter efficient method of Low-Rank Adaptation (LoRA) introduced by Hu et al. [2021]. We investigate the setup of "Parameter Efficient Reinforcement Learning" (PERL), in which we perform reward model training and reinforcement learning using LoRA. We compare PERL to conventional fine-tuning (full-tuning) across various configurations for 7 benchmarks, including 2 novel datasets, of reward modeling and reinforcement learning. We find that PERL performs on par with the conventional RLHF setting, while training faster, and with less memory. This enables the high performance of RLHF, while reducing the computational burden that limits its adoption as an alignment technique for Large Language Models. We also release 2 novel thumbs up/down preference datasets: "Taskmaster Coffee", and "Taskmaster Ticketing" to promote research around RLHF.
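
The LoRA update that PERL relies on can be illustrated with a short Python sketch: a frozen weight matrix W is adapted as W + (alpha / r) * B A, where B and A are low-rank factors of rank r and only they are trained. The pure-Python matrices and the function names below are illustrative assumptions, not code from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, with r the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d, k, r):
    """Trainable parameters for B (d x r) plus A (r x k)."""
    return d * r + r * k


w = [[1.0, 2.0], [3.0, 4.0]]  # frozen pretrained weight (toy 2 x 2)
a = [[0.5, -0.5]]             # rank-1 factor A (1 x 2)
b_init = [[0.0], [0.0]]       # B starts at zero, so the update starts at zero
b_trained = [[1.0], [2.0]]    # hypothetical values after some training

at_init = lora_weight(w, b_init, a, alpha=2.0)
adapted = lora_weight(w, b_trained, a, alpha=2.0)
```

For a 4096 x 4096 layer with r = 8, `trainable_params(4096, 4096, 8)` gives 65,536 trainable values against 16,777,216 in the full matrix, which is the kind of saving behind the lower memory use reported above.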

Title: Uncovering Latent Themes of Messaging on Social Media by Integrating LLMs: A Case Study on Climate Campaigns

Authors: Tunazzina Islam, Dan Goldwasser
Subjects: cs.CL, cs.AI, cs.CY, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2403.10707
Pdf URL: https://arxiv.org/pdf/2403.10707
Copy Paste: [[2403.10707]] Uncovering Latent Themes of Messaging on Social Media by Integrating LLMs: A Case Study on Climate Campaigns(https://arxiv.org/abs/2403.10707)
Keywords: large language model
Abstract: This paper introduces a novel approach to uncovering and analyzing themes in social media messaging. Recognizing the limitations of traditional topic-level analysis, which tends to capture only the overarching patterns, this study emphasizes the need for a finer-grained, theme-focused exploration. Conventional methods of theme discovery, involving manual processes and a human-in-the-loop approach, are valuable but face challenges in scalability, consistency, and resource intensity in terms of time and cost. To address these challenges, we propose a machine-in-the-loop approach that leverages the advanced capabilities of Large Language Models (LLMs). This approach allows for a deeper investigation into the thematic aspects of social media discourse, enabling us to uncover a diverse array of themes, each with unique characteristics and relevance, thereby offering a comprehensive understanding of the nuances present within broader topics. Furthermore, this method efficiently maps the text and the newly discovered themes, enhancing our understanding of the thematic nuances in social media messaging. We employ climate campaigns as a case study and demonstrate that our methodology yields more accurate and interpretable results compared to traditional topic models. Our results not only demonstrate the effectiveness of our approach in uncovering latent themes but also illuminate how these themes are tailored for demographic targeting in social media contexts. Additionally, our work sheds light on the dynamic nature of social media, revealing the shifts in the thematic focus of messaging in response to real-world events.

Title: Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency

Authors: Soumyadeep Pal, Yuguang Yao, Ren Wang, Bingquan Shen, Sijia Liu
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2403.10717
Pdf URL: https://arxiv.org/pdf/2403.10717
Copy Paste: [[2403.10717]] Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction Consistency(https://arxiv.org/abs/2403.10717)
Keywords: defense, attack
Abstract: Modern machine learning (ML) systems demand substantial training data, often resorting to external sources. Nevertheless, this practice renders them vulnerable to backdoor poisoning attacks. Prior backdoor defense strategies have primarily focused on the identification of backdoored models or poisoned data characteristics, typically operating under the assumption of access to clean data. In this work, we delve into a relatively underexplored challenge: the automatic identification of backdoor data within a poisoned dataset, all under realistic conditions, i.e., without the need for additional clean data or without manually defining a threshold for backdoor detection. We draw inspiration from the scaled prediction consistency (SPC) technique, which exploits the prediction invariance of poisoned data to an input scaling factor. Based on this, we pose the backdoor data identification problem as a hierarchical data splitting optimization problem, leveraging a novel SPC-based loss function as the primary optimization objective. Our innovation unfolds in several key aspects. First, we revisit the vanilla SPC method, unveiling its limitations in addressing the proposed backdoor identification problem. Subsequently, we develop a bi-level optimization-based approach to precisely identify backdoor data by minimizing the advanced SPC loss. Finally, we demonstrate the efficacy of our proposal against a spectrum of backdoor attacks, encompassing basic label-corrupted attacks as well as more sophisticated clean-label attacks, evaluated across various benchmark datasets. Experiment results show that our approach often surpasses the performance of current baselines in identifying backdoor data points, resulting in about 4%-36% improvement in average AUROC. Codes are available at https://github.com/OPTML-Group/BackdoorMSPC.
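
The vanilla SPC idea can be sketched in Python as the fraction of scaling factors under which a model's prediction stays the same. The scale set, the clipping of pixel values to [0, 1], and the toy classifier with a hard-coded trigger are all assumptions made for illustration; none of this is the paper's optimized method or its released code.

```python
def spc_score(predict, x, scales=(2, 3, 4, 5)):
    """Fraction of scaled copies of x that keep the original predicted label."""
    base = predict(x)
    agree = 0
    for n in scales:
        scaled = [min(1.0, n * v) for v in x]  # scale pixels, clip to [0, 1]
        if predict(scaled) == base:
            agree += 1
    return agree / len(scales)


def toy_model(x):
    """Hypothetical backdoored classifier over a 4-pixel 'image'."""
    if x[0] >= 0.9:  # saturated first pixel acts as the trigger
        return 7     # attacker's target label
    return 0 if sum(x) / len(x) < 0.5 else 1


clean = [0.1, 0.2, 0.3, 0.2]
poisoned = [1.0, 0.2, 0.3, 0.2]

clean_score = spc_score(toy_model, clean)
poisoned_score = spc_score(toy_model, poisoned)
```

In this toy setting the triggered input keeps its label under every scale while the clean input does not, which is the invariance that SPC exploits; the abstract's contribution is to replace a hand-set cut-off on such scores with a bi-level optimization.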

Title: Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation

Authors: Anton Pelykh, Ozge Mercanoglu Sincan, Richard Bowden
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10731
Pdf URL: https://arxiv.org/pdf/2403.10731
Copy Paste: [[2403.10731]] Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation(https://arxiv.org/abs/2403.10731)
Keywords: diffusion, segmentation
Abstract: Recent years have seen significant progress in human image generation, particularly with the advancements in diffusion models. However, existing diffusion methods encounter challenges when producing consistent hand anatomy and the generated images often lack precise control over the hand pose. To address this limitation, we introduce a novel approach to pose-conditioned human image generation, dividing the process into two stages: hand generation and subsequent body out-painting around the hands. We propose training the hand generator in a multi-task setting to produce both hand images and their corresponding segmentation masks, and employ the trained model in the first stage of generation. An adapted ControlNet model is then used in the second stage to outpaint the body around the generated hands, producing the final result. A novel blending technique is introduced to preserve the hand details during the second stage that combines the results of both stages in a coherent way. This involves sequential expansion of the out-painted region while fusing the latent representations, to ensure a seamless and cohesive synthesis of the final image. Experimental evaluations demonstrate the superiority of our proposed method over state-of-the-art techniques, in both pose accuracy and image quality, as validated on the HaGRID dataset. Our approach not only enhances the quality of the generated hands but also offers improved control over hand pose, advancing the capabilities of pose-conditioned human image generation. The source code of the proposed approach is available at https://github.com/apelykh/hand-to-diffusion.

Title: Leveraging Synthetic Data for Generalizable and Fair Facial Action Unit Detection

Authors: Liupei Lu, Yufeng Yin, Yuming Gu, Yizhen Wu, Pratusha Prasad, Yajie Zhao, Mohammad Soleymani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10737
Pdf URL: https://arxiv.org/pdf/2403.10737
Copy Paste: [[2403.10737]] Leveraging Synthetic Data for Generalizable and Fair Facial Action Unit Detection(https://arxiv.org/abs/2403.10737)
Keywords: fair
Abstract: Facial action unit (AU) detection is a fundamental block for objective facial expression analysis. Supervised learning approaches require a large amount of manual labeling which is costly. The limited labeled data are also not diverse in terms of gender which can affect model fairness. In this paper, we propose to use synthetically generated data and multi-source domain adaptation (MSDA) to address the problems of the scarcity of labeled data and the diversity of subjects. Specifically, we propose to generate a diverse dataset through synthetic facial expression re-targeting by transferring the expressions from real faces to synthetic avatars. Then, we use MSDA to transfer the AU detection knowledge from a real dataset and the synthetic dataset to a target dataset. Instead of aligning the overall distributions of different domains, we propose Paired Moment Matching (PM2) to align the features of the paired real and synthetic data with the same facial expression. To further improve gender fairness, PM2 matches the features of the real data with a female and a male synthetic image. Our results indicate that synthetic data and the proposed model improve both AU detection performance and fairness across genders, demonstrating its potential to solve AU detection in-the-wild.

Title: Depression Detection on Social Media with Large Language Models

Authors: Xiaochong Lan, Yiming Cheng, Li Sheng, Chen Gao, Yong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10750
Pdf URL: https://arxiv.org/pdf/2403.10750
Copy Paste: [[2403.10750]] Depression Detection on Social Media with Large Language Models(https://arxiv.org/abs/2403.10750)
Keywords: explainability, large language model
Abstract: Depression harms. However, due to a lack of mental health awareness and fear of stigma, many patients do not actively seek diagnosis and treatment, leading to detrimental outcomes. Depression detection aims to determine whether an individual suffers from depression by analyzing their history of posts on social media, which can significantly aid in early detection and intervention. It mainly faces two key challenges: 1) it requires professional medical knowledge, and 2) it necessitates both high accuracy and explainability. To address them, we propose a novel depression detection system called DORIS, combining medical knowledge and the recent advances in large language models (LLMs). Specifically, to tackle the first challenge, we propose an LLM-based solution to first annotate whether high-risk texts meet medical diagnostic criteria. Further, we retrieve texts with high emotional intensity and summarize critical information from the historical mood records of users, so-called mood courses. To tackle the second challenge, we combine LLM and traditional classifiers to integrate medical knowledge-guided features, for which the model can also explain its prediction results, achieving both high accuracy and explainability. Extensive experimental results on benchmarking datasets show that, compared to the current best baseline, our approach improves by 0.036 in AUPRC, which can be considered significant, demonstrating the effectiveness of our approach and its high value as an NLP application.

Title: Rules still work for Open Information Extraction

Authors: Jialin Hua, Liangqing Luo, Weiying Ping, Yan Liao, Chunhai Tao, Xuewen Lub
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.10758
Pdf URL: https://arxiv.org/pdf/2403.10758
Copy Paste: [[2403.10758]] Rules still work for Open Information Extraction(https://arxiv.org/abs/2403.10758)
Keywords: extraction
Abstract: Open information extraction (OIE) aims to extract surface relations and their corresponding arguments from natural language text, irrespective of domain. This paper presents an innovative OIE model, APRCOIE, tailored for Chinese text. Diverging from previous models, our model generates extraction patterns autonomously. The model defines a new pattern form for Chinese OIE and proposes an automated pattern generation methodology. In that way, the model can handle a wide array of complex and diverse Chinese grammatical phenomena. We design a preliminary filter based on tensor computing to conduct the extraction procedure efficiently. To train the model, we manually annotated a large-scale Chinese OIE dataset. In the comparative evaluation, we demonstrate that APRCOIE outperforms state-of-the-art Chinese OIE models and significantly expands the boundaries of achievable OIE performance. The code of APRCOIE and the annotated dataset are released on GitHub (https://github.com/jialin666/APRCOIE_v1)

Title: ODE Discovery for Longitudinal Heterogeneous Treatment Effects Inference

Authors: Krzysztof Kacprzyk, Samuel Holt, Jeroen Berrevoets, Zhaozhi Qian, Mihaela van der Schaar
Subjects: cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2403.10766
Pdf URL: https://arxiv.org/pdf/2403.10766
Copy Paste: [[2403.10766]] ODE Discovery for Longitudinal Heterogeneous Treatment Effects Inference(https://arxiv.org/abs/2403.10766)
Keywords: interpretability
Abstract: Inferring unbiased treatment effects has received widespread attention in the machine learning community. In recent years, our community has proposed numerous solutions in standard settings, high-dimensional treatment settings, and even longitudinal settings. While very diverse, these solutions have mostly relied on neural networks for inference and simultaneous correction of assignment bias. New approaches typically build on top of previous approaches by proposing new (or refined) architectures and learning algorithms. However, the end result -- a neural-network-based inference machine -- remains unchallenged. In this paper, we introduce a different type of solution in the longitudinal setting: a closed-form ordinary differential equation (ODE). While we still rely on continuous optimization to learn an ODE, the resulting inference machine is no longer a neural network. Doing so yields several advantages such as interpretability, irregular sampling, and a different set of identification assumptions. Above all, we consider the introduction of a completely new type of solution to be our most important contribution as it may spark entirely new innovations in treatment effects in general. We facilitate this by formulating our contribution as a framework that can transform any ODE discovery method into a treatment effects method.

Title: Detecting Bias in Large Language Models: Fine-tuned KcBERT

Authors: J. K. Lee, T. M. Chung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.10774
Pdf URL: https://arxiv.org/pdf/2403.10774
Copy Paste: [[2403.10774]] Detecting Bias in Large Language Models: Fine-tuned KcBERT(https://arxiv.org/abs/2403.10774)
Keywords: transformer, large language model
Abstract: The rapid advancement of large language models (LLMs) has enabled natural language processing capabilities similar to those of humans, and LLMs are being widely utilized across various societal domains such as education and healthcare. While the versatility of these models has increased, they have the potential to generate subjective and normative language, leading to discriminatory treatment or outcomes among social groups, especially due to online offensive language. In this paper, we define such harm as societal bias and assess ethnic, gender, and racial biases in a model fine-tuned with Korean comments using Bidirectional Encoder Representations from Transformers (KcBERT) and KOLD data through template-based Masked Language Modeling (MLM). To quantitatively evaluate biases, we employ LPBS and CBS metrics. Compared to KcBERT, the fine-tuned model shows a reduction in ethnic bias but demonstrates significant changes in gender and racial biases. Based on these results, we propose two methods to mitigate societal bias. Firstly, a data balancing approach during the pre-training phase adjusts the uniformity of data by aligning the distribution of the occurrences of specific words and converting surrounding harmful words into non-harmful words. Secondly, during the in-training phase, we apply Debiasing Regularization by adjusting dropout and regularization, confirming a decrease in training loss. Our contribution lies in demonstrating that societal bias exists in Korean language models due to language-dependent characteristics.

Title: HCF-Net: Hierarchical Context Fusion Network for Infrared Small Object Detection

Authors: Shibiao Xu, ShuChen Zheng, Wenhao Xu, Rongtao Xu, Changwei Wang, Jiguang Zhang, Xiaoqiang Teng, Ao Li, Li Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10778
Pdf URL: https://arxiv.org/pdf/2403.10778
Copy Paste: [[2403.10778]] HCF-Net: Hierarchical Context Fusion Network for Infrared Small Object Detection(https://arxiv.org/abs/2403.10778)
Keywords: extraction
Abstract: Infrared small object detection is an important computer vision task involving the recognition and localization of tiny objects in infrared images, which usually contain only a few pixels. However, it encounters difficulties due to the diminutive size of the objects and the generally complex backgrounds in infrared images. In this paper, we propose a deep learning method, HCF-Net, that significantly improves infrared small object detection performance through multiple practical modules. Specifically, it includes the parallelized patch-aware attention (PPA) module, dimension-aware selective integration (DASI) module, and multi-dilated channel refiner (MDCR) module. The PPA module uses a multi-branch feature extraction strategy to capture feature information at different scales and levels. The DASI module enables adaptive channel selection and fusion. The MDCR module captures spatial features of different receptive field ranges through multiple depth-separable convolutional layers. Extensive experimental results on the SIRST infrared single-frame image dataset show that the proposed HCF-Net performs well, surpassing other traditional and deep learning models. Code is available at https://github.com/zhengshuchen/HCFNet.

Title: LLM-based Conversational AI Therapist for Daily Functioning Screening and Psychotherapeutic Intervention via Everyday Smart Devices

Authors: Jingping Nie, Hanya Shao, Yuang Fan, Qijia Shao, Haoxuan You, Matthias Preindl, Xiaofan Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.10779
Pdf URL: https://arxiv.org/pdf/2403.10779
Copy Paste: [[2403.10779]] LLM-based Conversational AI Therapist for Daily Functioning Screening and Psychotherapeutic Intervention via Everyday Smart Devices(https://arxiv.org/abs/2403.10779)
Keywords: large language model
Abstract: Despite the global mental health crisis, access to screenings, professionals, and treatments remains limited. In collaboration with licensed psychotherapists, we propose a Conversational AI Therapist with psychotherapeutic Interventions (CaiTI), a platform that leverages large language models (LLMs) and smart devices to enable better mental health self-care. CaiTI can screen the day-to-day functioning using natural and psychotherapeutic conversations. CaiTI leverages reinforcement learning to provide personalized conversation flow. CaiTI can accurately understand and interpret user responses. When the user needs further attention during the conversation, CaiTI can provide conversational psychotherapeutic interventions, including cognitive behavioral therapy (CBT) and motivational interviewing (MI). Leveraging the datasets prepared by the licensed psychotherapists, we experiment and microbenchmark various LLMs' performance in tasks along CaiTI's conversation flow and discuss their strengths and weaknesses. With the psychotherapists, we implement CaiTI and conduct 14-day and 24-week studies. The study results, validated by therapists, demonstrate that CaiTI can converse with users naturally, accurately understand and interpret user responses, and provide psychotherapeutic interventions appropriately and effectively. We showcase the potential of CaiTI LLMs to assist the mental therapy diagnosis and treatment and improve day-to-day functioning screening and precautionary psychotherapeutic intervention systems.

Title: Segment Any Object Model (SAOM): Real-to-Simulation Fine-Tuning Strategy for Multi-Class Multi-Instance Segmentation

Authors: Mariia Khan, Yue Qiu, Yuren Cong, Jumana Abu-Khalaf, David Suter, Bodo Rosenhahn
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10780
Pdf URL: https://arxiv.org/pdf/2403.10780
Copy Paste: [[2403.10780]] Segment Any Object Model (SAOM): Real-to-Simulation Fine-Tuning Strategy for Multi-Class Multi-Instance Segmentation(https://arxiv.org/abs/2403.10780)
Keywords: segmentation
Abstract: Multi-class multi-instance segmentation is the task of identifying masks for multiple object classes and multiple instances of the same class within an image. The foundational Segment Anything Model (SAM) is designed for promptable multi-class multi-instance segmentation but tends to output part or sub-part masks in the "everything" mode for various real-world applications. Whole object segmentation masks play a crucial role for indoor scene understanding, especially in robotics applications. We propose a new domain-invariant Real-to-Simulation (Real-Sim) fine-tuning strategy for SAM. We use object images and ground truth data collected from the Ai2Thor simulator during fine-tuning (real-to-sim). To allow our Segment Any Object Model (SAOM) to work in the "everything" mode, we propose the novel nearest neighbour assignment method, updating point embeddings for each ground-truth mask. SAOM is evaluated on our own dataset collected from the Ai2Thor simulator. SAOM significantly improves on SAM, with a 28% increase in mIoU and a 25% increase in mAcc for 54 frequently-seen indoor object classes. Moreover, our Real-to-Simulation fine-tuning strategy demonstrates promising generalization performance in real environments without being trained on the real-world data (sim-to-real). The dataset and the code will be released after publication.

Title: StableGarment: Garment-Centric Generation via Stable Diffusion

Authors: Rui Wang, Hailong Guo, Jiaming Liu, Huaxia Li, Haibo Zhao, Xu Tang, Yao Hu, Hao Tang, Peipei Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10783
Pdf URL: https://arxiv.org/pdf/2403.10783
Copy Paste: [[2403.10783]] StableGarment: Garment-Centric Generation via Stable Diffusion(https://arxiv.org/abs/2403.10783)
Keywords: robust, diffusion
Abstract: In this paper, we introduce StableGarment, a unified framework to tackle garment-centric (GC) generation tasks, including GC text-to-image, controllable GC text-to-image, stylized GC text-to-image, and robust virtual try-on. The main challenge lies in retaining the intricate textures of the garment while maintaining the flexibility of pre-trained Stable Diffusion. Our solution involves the development of a garment encoder, a trainable copy of the denoising UNet equipped with additive self-attention (ASA) layers. These ASA layers are specifically devised to transfer detailed garment textures, also facilitating the integration of stylized base models for the creation of stylized images. Furthermore, the incorporation of a dedicated try-on ControlNet enables StableGarment to execute virtual try-on tasks with precision. We also build a novel data engine that produces high-quality synthesized data to preserve the model's ability to follow prompts. Extensive experiments demonstrate that our approach delivers state-of-the-art (SOTA) results among existing virtual try-on methods and exhibits high flexibility with broad potential applications in various garment-centric image generation.

Title: Time Series Representation Learning with Supervised Contrastive Temporal Transformer

Authors: Yuansan Liu, Sudanthi Wijewickrema, Christofer Bester, Stephen O'Leary, James Bailey
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10787
Pdf URL: https://arxiv.org/pdf/2403.10787
Copy Paste: [[2403.10787]] Time Series Representation Learning with Supervised Contrastive Temporal Transformer(https://arxiv.org/abs/2403.10787)
Keywords: transformer
Abstract: Finding effective representations for time series data is a useful but challenging task. Several works utilize self-supervised or unsupervised learning methods to address this. However, there still remains the open question of how to leverage available label information for better representations. To answer this question, we exploit pre-existing techniques in time series and representation learning domains and develop a simple, yet novel fusion model, called Supervised COntrastive Temporal Transformer (SCOTT). We first investigate suitable augmentation methods for various types of time series data to assist with learning change-invariant representations. Secondly, we combine Transformer and Temporal Convolutional Networks in a simple way to efficiently learn both global and local features. Finally, we simplify Supervised Contrastive Loss for representation learning of labelled time series data. We preliminarily evaluate SCOTT on a downstream task, Time Series Classification, using 45 datasets from the UCR archive. The results show that with the representations learnt by SCOTT, even a weak classifier can perform similar to or better than existing state-of-the-art models (best performance on 23/45 datasets and highest rank against 9 baseline models). Afterwards, we investigate SCOTT's ability to address a real-world task, online Change Point Detection (CPD), on two datasets: a human activity dataset and a surgical patient dataset. We show that the model performs with high reliability and efficiency on the online CPD problem (~98% and ~97% area under precision-recall curve respectively). Furthermore, we demonstrate the model's potential in tackling early detection and show it performs best compared to other candidates.
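
For readers unfamiliar with the loss named in this abstract, the following is a minimal sketch of the standard supervised contrastive loss in plain Python. It is not the simplified variant the authors describe, whose details the abstract does not give. The function name `supcon_loss`, the default temperature of 0.1, and the toy embeddings are assumptions made for illustration.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss (illustrative, not the SCOTT variant).

    embeddings: list of non-zero feature vectors (lists of floats)
    labels: list of class labels, one per embedding
    """
    # L2-normalise each embedding so dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    total, anchors = 0.0, 0
    n = len(normed)
    for i in range(n):
        # Temperature-scaled similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Positives are the other samples that share the anchor's label.
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy data (hypothetical): two tight clusters in 2-D.
toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = supcon_loss(toy, [0, 0, 1, 1])     # labels agree with the clusters
misaligned = supcon_loss(toy, [0, 1, 0, 1])  # labels cut across the clusters
```

The loss is small when samples with the same label sit close together in the embedding space and grows when labels cut across clusters, which is the property a label-aware representation learner exploits.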

Title: From Words to Routes: Applying Large Language Models to Vehicle Routing

Authors: Zhehui Huang, Guangyao Shi, Gaurav S. Sukhatme
Subjects: cs.CL, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2403.10795
Pdf URL: https://arxiv.org/pdf/2403.10795
Copy Paste: [[2403.10795]] From Words to Routes: Applying Large Language Models to Vehicle Routing(https://arxiv.org/abs/2403.10795)
Keywords: large language model
Abstract: LLMs have shown impressive progress in robotics (e.g., manipulation and navigation) with natural language task descriptions. The success of LLMs in these tasks leads us to wonder: What is the ability of LLMs to solve vehicle routing problems (VRPs) with natural language task descriptions? In this work, we study this question in three steps. First, we construct a dataset with 21 types of single- or multi-vehicle routing problems. Second, we evaluate the performance of LLMs across four basic prompt paradigms of text-to-code generation, each involving different types of text input. We find that the basic prompt paradigm, which generates code directly from natural language task descriptions, performs the best for GPT-4, achieving 56% feasibility, 40% optimality, and 53% efficiency. Third, based on the observation that LLMs may not be able to provide correct solutions at the initial attempt, we propose a framework that enables LLMs to refine solutions through self-reflection, including self-debugging and self-verification. With GPT-4, our proposed framework achieves a 16% increase in feasibility, a 7% increase in optimality, and a 15% increase in efficiency. Moreover, we examine the sensitivity of GPT-4 to task descriptions, specifically focusing on how its performance changes when certain details are omitted from the task descriptions, yet the core meaning is preserved. Our findings reveal that such omissions lead to a notable decrease in performance: 4% in feasibility, 4% in optimality, and 5% in efficiency. Website: https://sites.google.com/view/words-to-routes/

Title: Unsupervised Collaborative Metric Learning with Mixed-Scale Groups for General Object Retrieval

Authors: Shichao Kan, Yuhai Deng, Yixiong Liang, Lihui Cen, Zhe Qu, Yigang Cen, Zhihai He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10798
Pdf URL: https://arxiv.org/pdf/2403.10798
Copy Paste: [[2403.10798]] Unsupervised Collaborative Metric Learning with Mixed-Scale Groups for General Object Retrieval(https://arxiv.org/abs/2403.10798)
Keywords: robust
Abstract: The task of searching for visual objects in a large image dataset is difficult because it requires efficient matching and accurate localization of objects that can vary in size. Although the segment anything model (SAM) offers a potential solution for extracting object spatial context, learning embeddings for local objects remains a challenging problem. This paper presents a novel unsupervised deep metric learning approach, termed unsupervised collaborative metric learning with mixed-scale groups (MS-UGCML), devised to learn embeddings for objects of varying scales. Following this, a benchmark of challenges is assembled by utilizing COCO 2017 and VOC 2007 datasets to facilitate the training and evaluation of general object retrieval models. Finally, we conduct comprehensive ablation studies and discuss the complexities faced within the domain of general object retrieval. Our object retrieval evaluations span a range of datasets, including BelgaLogos, Visual Genome, LVIS, in addition to a challenging evaluation set that we have individually assembled for open-vocabulary evaluation. These comprehensive evaluations effectively highlight the robustness of our unsupervised MS-UGCML approach, with an object level and image level mAPs improvement of up to 6.69% and 10.03%, respectively. The code is publicly available at https://github.com/dengyuhai/MS-UGCML.

Title: Efficient Pruning of Large Language Model with Adaptive Estimation Fusion

Authors: Jun Liu, Chao Wu, Changdi Yang, Hao Tang, Haoye Dong, Zhenglun Kong, Geng Yuan, Wei Niu, Dong Huang, Yanzhi Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10799
Pdf URL: https://arxiv.org/pdf/2403.10799
Copy Paste: [[2403.10799]] Efficient Pruning of Large Language Model with Adaptive Estimation Fusion(https://arxiv.org/abs/2403.10799)
Keywords: generative, large language model
Abstract: Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.

Title: Model Reprogramming Outperforms Fine-tuning on Out-of-distribution Data in Text-Image Encoders

Authors: Andrew Geng, Pin-Yu Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.10800
Pdf URL: https://arxiv.org/pdf/2403.10800
Copy Paste: [[2403.10800]] Model Reprogramming Outperforms Fine-tuning on Out-of-distribution Data in Text-Image Encoders(https://arxiv.org/abs/2403.10800)
Keywords: robust
Abstract: When evaluating the performance of a pre-trained model transferred to a downstream task, it is imperative to assess not only the in-distribution (ID) accuracy of the downstream model but also its capacity to generalize and identify out-of-distribution (OOD) samples. In this paper, we unveil the hidden costs associated with intrusive fine-tuning techniques. Specifically, we demonstrate that commonly used fine-tuning methods not only distort the representations necessary for generalizing to covariate-shifted OOD samples (OOD generalization) but also distort the representations necessary for detecting semantically-shifted OOD samples (OOD detection). To address these challenges, we introduce a new model reprogramming approach for fine-tuning, which we name Reprogrammer. Reprogrammer aims to improve the holistic performance of the downstream model across ID, OOD generalization, and OOD detection tasks. Our empirical evidence reveals that Reprogrammer is less intrusive and yields superior downstream models. Furthermore, we demonstrate that by appending an additional representation residual connection to Reprogrammer, we can further preserve pre-training representations, resulting in an even more safe and robust downstream model capable of excelling in many ID classification, OOD generalization, and OOD detection settings.

Title: Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples

Authors: Ziqi Zhou, Minghui Li, Wei Liu, Shengshan Hu, Yechao Zhang, Wei Wan, Lulu Xue, Leo Yu Zhang, Dezhong Yang, Hai Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10801
Pdf URL: https://arxiv.org/pdf/2403.10801
Copy Paste: [[2403.10801]] Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples(https://arxiv.org/abs/2403.10801)
Keywords: secure, defense, attack, robust
Abstract: With the evolution of self-supervised learning, the pre-training paradigm has emerged as a predominant solution within the deep learning landscape. Model providers furnish pre-trained encoders designed to function as versatile feature extractors, enabling downstream users to harness the benefits of expansive models with minimal effort through fine-tuning. Nevertheless, recent works have exposed a vulnerability in pre-trained encoders, highlighting their susceptibility to downstream-agnostic adversarial examples (DAEs) meticulously crafted by attackers. The lingering question pertains to the feasibility of fortifying the robustness of downstream models against DAEs, particularly in scenarios where the pre-trained encoders are publicly accessible to the attackers. In this paper, we initially delve into existing defensive mechanisms against adversarial examples within the pre-training paradigm. Our findings reveal that the failure of current defenses stems from the domain shift between pre-training data and downstream tasks, as well as the sensitivity of encoder parameters. In response to these challenges, we propose Genetic Evolution-Nurtured Adversarial Fine-tuning (Gen-AF), a two-stage adversarial fine-tuning approach aimed at enhancing the robustness of downstream models. Our extensive experiments, conducted across ten self-supervised training methods and six datasets, demonstrate that Gen-AF attains high testing accuracy and robust testing accuracy against state-of-the-art DAEs.

Title: Anomaly Detection Based on Isolation Mechanisms: A Survey

Authors: Yang Cao, Haolong Xiang, Hang Zhang, Ye Zhu, Kai Ming Ting
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.10802
Pdf URL: https://arxiv.org/pdf/2403.10802
Copy Paste: [[2403.10802]] Anomaly Detection Based on Isolation Mechanisms: A Survey(https://arxiv.org/abs/2403.10802)
Keywords: security, robust
Abstract: Anomaly detection is a longstanding and active research area that has many applications in domains such as finance, security, and manufacturing. However, the efficiency and performance of anomaly detection algorithms are challenged by the large-scale, high-dimensional, and heterogeneous data that are prevalent in the era of big data. Isolation-based unsupervised anomaly detection is a novel and effective approach for identifying anomalies in data. It relies on the idea that anomalies are few and different from normal instances, and thus can be easily isolated by random partitioning. Isolation-based methods have several advantages over existing methods, such as low computational complexity, low memory usage, high scalability, robustness to noise and irrelevant features, and no need for prior knowledge or heavy parameter tuning. In this survey, we review the state-of-the-art isolation-based anomaly detection methods, including their data partitioning strategies, anomaly score functions, and algorithmic details. We also discuss some extensions and applications of isolation-based methods in different scenarios, such as detecting anomalies in streaming data, time series, trajectory, and image datasets. Finally, we identify some open challenges and future directions for isolation-based anomaly detection research.
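
The isolation idea summarised in this abstract (anomalies are few and different, so random partitioning separates them quickly) can be illustrated with a minimal one-dimensional sketch. It is a simplified illustration rather than any specific method from the survey: it omits subsampling and score normalisation, and the names `path_length` and `mean_path_length`, the tree count, and the toy data are assumptions.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Number of random splits needed to isolate `point` within `data` (1-D)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth  # identical values cannot be separated further
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as `point`.
    same_side = [x for x in data if (x < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random partitionings."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Toy data (hypothetical): 16 evenly spaced inliers in [0, 1) plus one far outlier.
values = [i / 16 for i in range(16)] + [100.0]
outlier_depth = mean_path_length(100.0, values)
inlier_depth = mean_path_length(0.5, values)
```

A shorter average path means the point is easier to isolate, so `outlier_depth` comes out well below `inlier_depth` on this toy data. Practical isolation-based detectors turn such path lengths into a normalised anomaly score.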

Title: DarkGS: Learning Neural Illumination and 3D Gaussians Relighting for Robotic Exploration in the Dark

Authors: Tianyi Zhang, Kaining Huang, Weiming Zhi, Matthew Johnson-Roberson
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.10814
Pdf URL: https://arxiv.org/pdf/2403.10814
Copy Paste: [[2403.10814]] DarkGS: Learning Neural Illumination and 3D Gaussians Relighting for Robotic Exploration in the Dark(https://arxiv.org/abs/2403.10814)
Keywords: robust
Abstract: Humans have the remarkable ability to construct consistent mental models of an environment, even under limited or varying levels of illumination. We wish to endow robots with this same capability. In this paper, we tackle the challenge of constructing a photorealistic scene representation under poorly illuminated conditions and with a moving light source. We approach the task of modeling illumination as a learning problem, and utilize the developed illumination model to aid in scene reconstruction. We introduce an innovative framework that uses a data-driven approach, Neural Light Simulators (NeLiS), to model and calibrate the camera-light system. Furthermore, we present DarkGS, a method that applies NeLiS to create a relightable 3D Gaussian scene model capable of real-time, photorealistic rendering from novel viewpoints. We show the applicability and robustness of our proposed simulator and system in a variety of real-world environments.

Title: Active Label Correction for Semantic Segmentation with Foundation Models

Authors: Hoyoung Kim, Sehyun Hwang, Suha Kwak, Jungseul Ok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10820
Pdf URL: https://arxiv.org/pdf/2403.10820
Copy Paste: [[2403.10820]] Active Label Correction for Semantic Segmentation with Foundation Models(https://arxiv.org/abs/2403.10820)
Keywords: segmentation
Abstract: Training and validating models for semantic segmentation require datasets with pixel-wise annotations, which are notoriously labor-intensive. Although useful priors such as foundation models or crowdsourced datasets are available, they are error-prone. We hence propose an effective framework of active label correction (ALC) based on a design of correction query to rectify pseudo labels of pixels, which in turn is more annotator-friendly than the standard one inquiring to classify a pixel directly according to our theoretical analysis and user study. Specifically, leveraging foundation models providing useful zero-shot predictions on pseudo labels and superpixels, our method comprises two key techniques: (i) an annotator-friendly design of correction query with the pseudo labels, and (ii) an acquisition function looking ahead label expansions based on the superpixels. Experimental results on PASCAL, Cityscapes, and Kvasir-SEG datasets demonstrate the effectiveness of our ALC framework, outperforming prior methods for active semantic segmentation and label correction. Notably, utilizing our method, we obtained a revised dataset of PASCAL by rectifying errors in 2.6 million pixels in PASCAL dataset.

Title: Do Large Language Models understand Medical Codes?

Authors: Simon A. Lee, Timothy Lindsey
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.10822
Pdf URL: https://arxiv.org/pdf/2403.10822
Copy Paste: [[2403.10822]] Do Large Language Models understand Medical Codes?(https://arxiv.org/abs/2403.10822)
Keywords: large language model
Abstract: The overarching goal of recent AI research has been to make steady progress towards achieving Artificial General Intelligence (AGI), prompting the evaluation of Large Language Models (LLMs) across a variety of tasks and domains. One such domain is healthcare, where LLMs can greatly benefit clinical practice by assisting with a wide range of tasks. However, these models are also prone to producing "hallucinations" or incorrect responses when faced with queries they cannot adequately address, raising concerns and skepticism, especially within the healthcare community. Therefore, in this work, we investigate whether LLMs understand the inherent meaning of medical codes, which are widely used in healthcare practice. We evaluate various off-the-shelf LLMs (e.g., GPT, LLaMA, etc.) and LLMs specifically designed for biomedical applications to assess their awareness and understanding of these domain-specific terminologies. Our results indicate that these models do not comprehend the meaning of the medical codes, highlighting the need for better representation of these alphanumeric codes extensively used in healthcare. We call for improved strategies to effectively capture and represent the nuances of medical codes and terminologies within LLMs, enabling them to become more reliable and trustworthy tools for healthcare professionals.

Title: VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image Analysis

Authors: Hao Wei, Bowen Liu, Minqing Zhang, Peilun Shi, Wu Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10823
Pdf URL: https://arxiv.org/pdf/2403.10823
Copy Paste: [[2403.10823]] VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image Analysis(https://arxiv.org/abs/2403.10823)
Keywords: privacy
Abstract: Generalist foundation models have ushered in newfound capabilities in the medical domain. However, the tension between the growing demand for high-quality annotated data and patient privacy continues to intensify. The utilization of medical artificial intelligence generated content (Med-AIGC) as an inexhaustible resource repository arises as a potential solution to address the aforementioned challenge. Here we harness 1 million open-source synthetic fundus images paired with natural language descriptions, to curate an ethical language-image foundation model for retina image analysis named VisionCLIP. VisionCLIP achieves competitive performance on three external datasets compared with the existing method pre-trained on real-world data in a zero-shot fashion. The employment of artificially synthetic images alongside corresponding textual data for training enables the medical foundation model to successfully assimilate knowledge of disease symptomatology, thereby circumventing potential breaches of patient confidentiality.

Title: Affective Behaviour Analysis via Integrating Multi-Modal Knowledge

Authors: Wei Zhang, Feng Qiu, Chen Liu, Lincheng Li, Heming Du, Tiancheng Guo, Xin Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10825
Pdf URL: https://arxiv.org/pdf/2403.10825
Copy Paste: [[2403.10825]] Affective Behaviour Analysis via Integrating Multi-Modal Knowledge(https://arxiv.org/abs/2403.10825)
Keywords: extraction, transformer
Abstract: Affective Behavior Analysis aims to make technology emotionally smart, creating a world where devices can understand and react to our emotions as humans do. To comprehensively evaluate the authenticity and applicability of emotional behavior analysis techniques in natural environments, the 6th competition on Affective Behavior Analysis in-the-wild (ABAW) utilizes the Aff-Wild2, Hume-Vidmimic2, and C-EXPR-DB datasets to set up five competitive tracks, i.e., Valence-Arousal (VA) Estimation, Expression (EXPR) Recognition, Action Unit (AU) Detection, Compound Expression (CE) Recognition, and Emotional Mimicry Intensity (EMI) Estimation. In this paper, we present our method designs for the five tasks. Specifically, our design mainly includes three aspects: 1) We utilize a transformer-based feature fusion module to fully integrate emotional information provided by audio signals, visual images, and transcripts, offering high-quality expression features for the downstream tasks. 2) To achieve high-quality facial feature representations, we employ a Masked Autoencoder as the visual feature extraction model and fine-tune it with our facial dataset. 3) Considering the complexity of the video collection scenes, we conduct a more detailed dataset division based on scene characteristics and train a classifier for each scene. Extensive experiments demonstrate the superiority of our designs.

Title: Exploring Learning-based Motion Models in Multi-Object Tracking

Authors: Hsiang-Wei Huang, Cheng-Yen Yang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10826
Pdf URL: https://arxiv.org/pdf/2403.10826
Copy Paste: [[2403.10826]] Exploring Learning-based Motion Models in Multi-Object Tracking(https://arxiv.org/abs/2403.10826)
Keywords: extraction
Abstract: In the field of multi-object tracking (MOT), traditional methods often rely on the Kalman Filter for motion prediction, leveraging its strengths in linear motion scenarios. However, the inherent limitations of these methods become evident when confronted with complex, nonlinear motions and occlusions prevalent in dynamic environments like sports and dance. This paper explores the possibility of replacing the Kalman Filter with various learning-based motion models that enhance tracking accuracy and adaptability beyond the constraints of Kalman Filter-based systems. We propose MambaTrack, an online motion-based tracker that outperforms all existing motion-based trackers on the challenging DanceTrack and SportsMOT datasets. Moreover, we further exploit the potential of the state-space model in trajectory feature extraction to boost tracking performance and propose MambaTrack+, which achieves state-of-the-art performance on the DanceTrack dataset with 56.1 HOTA and 54.9 IDF1.
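
For context on the baseline this abstract argues against, the following is a minimal sketch of a constant-velocity Kalman filter in one dimension. It is not MambaTrack; the class name, the noise values `q` and `r`, and the 1-D state are illustrative assumptions. It shows why the filter suits linear motion: the prediction step simply advances position by velocity.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityKF:
    """1-D Kalman filter with state (position, velocity). Illustrative only."""
    x: float = 0.0    # position estimate
    v: float = 0.0    # velocity estimate
    # Entries of the symmetric covariance matrix [[pxx, pxv], [pxv, pvv]].
    pxx: float = 1.0
    pxv: float = 0.0
    pvv: float = 1.0
    q: float = 1e-2   # process noise added to each diagonal entry (assumed value)
    r: float = 1e-1   # measurement noise variance (assumed value)

    def predict(self, dt: float = 1.0) -> float:
        # State transition F = [[1, dt], [0, 1]]: position advances linearly.
        self.x += dt * self.v
        # Covariance update P <- F P F^T + Q.
        pxx = self.pxx + 2 * dt * self.pxv + dt * dt * self.pvv + self.q
        pxv = self.pxv + dt * self.pvv
        pvv = self.pvv + self.q
        self.pxx, self.pxv, self.pvv = pxx, pxv, pvv
        return self.x

    def update(self, z: float) -> None:
        # Measurement model H = [1, 0]: only position is observed.
        s = self.pxx + self.r                  # innovation variance
        kx, kv = self.pxx / s, self.pxv / s    # Kalman gain
        y = z - self.x                         # innovation
        self.x += kx * y
        self.v += kv * y
        # Covariance update P <- (I - K H) P.
        pxx = (1 - kx) * self.pxx
        pxv = (1 - kx) * self.pxv
        pvv = self.pvv - kv * self.pxv
        self.pxx, self.pxv, self.pvv = pxx, pxv, pvv
```

A tracker would typically run one such filter per object (over bounding-box coordinates rather than a scalar), calling `predict()` before association and `update()` with the matched detection. Because the transition is fixed and linear, abrupt direction changes such as those in dance footage are not modeled.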

Title: Data Availability and Decentralization: New Techniques for zk-Rollups in Layer 2 Blockchain Networks

Authors: Chengpeng Huang, Rui Song, Shang Gao, Yu Guo, Bin Xiao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.10828
Pdf URL: https://arxiv.org/pdf/2403.10828
Copy Paste: [[2403.10828]] Data Availability and Decentralization: New Techniques for zk-Rollups in Layer 2 Blockchain Networks(https://arxiv.org/abs/2403.10828)
Keywords: protect, attack, robust
Abstract: The scalability limitations of public blockchains have hindered their widespread adoption in real-world applications. While the Ethereum community is pushing forward in zk-rollup (zero-knowledge rollup) solutions, such as introducing the "blob transaction" in EIP-4844, Layer 2 networks encounter a data availability problem: storing transactions completely off-chain poses a risk of data loss, particularly when Layer 2 nodes are untrusted. Additionally, building Layer 2 blocks requires significant computational power, compromising the decentralization aspect of Layer 2 networks. This paper introduces new techniques to address the data availability and decentralization challenges in Layer 2 networks. To ensure data availability, we introduce the concept of "proof of download", which ensures that Layer 2 nodes cannot aggregate transactions without downloading historical data. Additionally, we design a "proof of storage" scheme that punishes nodes who maliciously delete historical data. For decentralization, we introduce a new role separation for Layer 2, allowing nodes with limited hardware to participate. To further avoid collusion among Layer 2 nodes, we design a "proof of luck" scheme, which also provides robust protection against maximal extractable value (MEV) attacks. Experimental results show our techniques not only ensure data availability but also improve overall network efficiency, which implies the practicality and potential of our techniques for real-world implementation.

Title: DUE: Dynamic Uncertainty-Aware Explanation Supervision via 3D Imputation

Authors: Qilong Zhao, Yifei Zhang, Mengdan Zhu, Siyi Gu, Yuyang Gao, Xiaofeng Yang, Liang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10831
Pdf URL: https://arxiv.org/pdf/2403.10831
Copy Paste: [[2403.10831]] DUE: Dynamic Uncertainty-Aware Explanation Supervision via 3D Imputation(https://arxiv.org/abs/2403.10831)
Keywords: explainability, diffusion
Abstract: Explanation supervision aims to enhance deep learning models by integrating additional signals to guide the generation of model explanations, showcasing notable improvements in both the predictability and explainability of the model. However, the application of explanation supervision to higher-dimensional data, such as 3D medical images, remains an under-explored domain. Challenges associated with supervising visual explanations in the presence of an additional dimension include: 1) changed spatial correlation, 2) a lack of direct 3D annotations, and 3) uncertainty that varies across different parts of the explanation. To address these challenges, we propose a Dynamic Uncertainty-aware Explanation supervision (DUE) framework for 3D explanation supervision that ensures uncertainty-aware explanation guidance when dealing with sparsely annotated 3D data with diffusion-based 3D interpolation. Our proposed framework is validated through comprehensive experiments on diverse real-world medical imaging datasets. The results demonstrate the effectiveness of our framework in enhancing the predictability and explainability of deep learning models in the context of medical imaging diagnosis applications.

Title: Twin Transformer using Gated Dynamic Learnable Attention mechanism for Fault Detection and Diagnosis in the Tennessee Eastman Process

Authors: Mohammad Ali Labbaf-Khaniki, Mohammad Manthouri, Hanieh Ajami
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10842
Pdf URL: https://arxiv.org/pdf/2403.10842
Copy Paste: [[2403.10842]] Twin Transformer using Gated Dynamic Learnable Attention mechanism for Fault Detection and Diagnosis in the Tennessee Eastman Process(https://arxiv.org/abs/2403.10842)
Keywords: robust, extraction, transformer
Abstract: Fault detection and diagnosis (FDD) is a crucial task for ensuring the safety and efficiency of industrial processes. We propose a novel FDD methodology for the Tennessee Eastman Process (TEP), a widely used benchmark for chemical process control. The model employs two separate Transformer branches, enabling independent processing of input data and potential extraction of diverse information. A novel attention mechanism, Gated Dynamic Learnable Attention (GDLAttention), is introduced which integrates a gating mechanism and dynamic learning capabilities. The gating mechanism modulates the attention weights, allowing the model to focus on the most relevant parts of the input. The dynamic learning approach adapts the attention strategy during training, potentially leading to improved performance. The attention mechanism uses a bilinear similarity function, providing greater flexibility in capturing complex relationships between query and key vectors. In order to assess the effectiveness of our approach, we tested it against 21 and 18 distinct fault scenarios in TEP, and compared its performance with several established FDD techniques. The outcomes indicate that the method outperforms others in terms of accuracy, false alarm rate, and misclassification rate. This underscores the robustness and efficacy of the approach for FDD in intricate industrial processes.
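
To make the attention description concrete, here is a minimal sketch of how a bilinear similarity score and a gate on the attention weights could be combined. The abstract does not give the exact formulation of GDLAttention, so the sigmoid gate, the renormalisation step, and all names (`W`, `gate_w`, `gated_bilinear_attention`) are assumptions for illustration. In a trained model `W` and `gate_w` would be learned parameters.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    return [dot(row, x) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_bilinear_attention(query, keys, values, W, gate_w):
    """Single-query attention with bilinear scores and a per-key gate (assumed form)."""
    # Bilinear similarity: score_i = query^T W key_i.
    scores = [dot(query, matvec(W, k)) for k in keys]
    weights = softmax(scores)
    # Hypothetical gate: one sigmoid value per key from a linear projection.
    gates = [1.0 / (1.0 + math.exp(-dot(gate_w, k))) for k in keys]
    gated = [w * g for w, g in zip(weights, gates)]
    total = sum(gated)
    gated = [g / total for g in gated]  # renormalise so the weights sum to 1
    dim = len(values[0])
    output = [sum(g * v[d] for g, v in zip(gated, values)) for d in range(dim)]
    return output, gated
```

With `W` set to the identity the score reduces to the usual dot product, which is one way to see the extra flexibility the bilinear form adds: `W` can reweight or mix feature dimensions before the comparison.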

Title: RETINAQA : A Knowledge Base Question Answering Model Robust to both Answerable and Unanswerable Questions

Authors: Prayushi Faldu, Indrajit Bhattacharya, Mausam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.10849
Pdf URL: https://arxiv.org/pdf/2403.10849
Copy Paste: [[2403.10849]] RETINAQA : A Knowledge Base Question Answering Model Robust to both Answerable and Unanswerable Questions(https://arxiv.org/abs/2403.10849)
Keywords: robust
Abstract: State-of-the-art KBQA models assume answerability of questions. Recent research has shown that while these can be adapted to detect unanswerability with suitable training and thresholding, this comes at the expense of accuracy for answerable questions, and no single model is able to handle all categories of unanswerability. We propose a new model for KBQA named RetinaQA that is robust against unanswerability. It complements KB-traversal based logical form retrieval with sketch-filling based logical form construction. This helps with questions that have valid logical forms but no data paths in the KB leading to an answer. Additionally, it uses discrimination instead of generation to better identify questions that do not have valid logical forms. We demonstrate that RetinaQA significantly outperforms adaptations of state-of-the-art KBQA models across answerable and unanswerable questions, while showing robustness across unanswerability categories. Remarkably, it also establishes a new state-of-the-art for answerable KBQA by surpassing existing models.

Title: Just Say the Name: Online Continual Learning with Category Names Only via Data Generation

Authors: Minhyuk Seo, Diganta Misra, Seongwon Cho, Minjae Lee, Jonghyun Choi
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2403.10853
Pdf URL: https://arxiv.org/pdf/2403.10853
Copy Paste: [[2403.10853]] Just Say the Name: Online Continual Learning with Category Names Only via Data Generation(https://arxiv.org/abs/2403.10853)
Keywords: privacy, generative
Abstract: In real-world scenarios, extensive manual annotation for continual learning is impractical due to prohibitive costs. Although prior work, influenced by large-scale webly supervised training, suggests leveraging web-scraped data in continual learning, this poses challenges such as data imbalance, usage restrictions, and privacy concerns. Addressing the risks of continual webly supervised training, we present an online continual learning framework - Generative Name only Continual Learning (G-NoCL). The proposed G-NoCL uses a set of generators G along with the learner. When encountering new concepts (i.e., classes), G-NoCL employs the novel sample complexity-guided data ensembling technique DIverSity and COmplexity enhancing ensemBlER (DISCOBER) to optimally sample training data from generated data. Through extensive experimentation, we demonstrate superior performance of DISCOBER in G-NoCL online CL benchmarks, covering both In-Distribution (ID) and Out-of-Distribution (OOD) generalization evaluations, compared to naive generator-ensembling, web-supervised, and manually annotated data.

Title: A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Authors: Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10854
Pdf URL: https://arxiv.org/pdf/2403.10854
Copy Paste: [[2403.10854]] A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment(https://arxiv.org/abs/2403.10854)
Keywords: large language model
Abstract: While Multimodal Large Language Models (MLLMs) have experienced significant advancement in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. Specifically, we first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., the standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with the respective optimal prompting systems. We assess three open-source and one closed-source MLLMs on several visual attributes of image quality (e.g., structural and textural distortions, color differences, and geometric transformations) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account of human perception of image quality, but is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing visual quality of multiple images, tasks humans can perform effortlessly.
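
The nine prompting systems mentioned above are the Cartesian product of the three psychophysical procedures and the three prompting strategies. A small sketch of that enumeration follows; the constant and function names are hypothetical, and the prompt wording used in the paper is not reproduced here.

```python
from itertools import product

# Labels taken from the abstract; the identifiers themselves are illustrative.
PSYCHOPHYSICAL_METHODS = ["single-stimulus", "double-stimulus", "multiple-stimulus"]
PROMPTING_STRATEGIES = ["standard", "in-context", "chain-of-thought"]


def prompting_systems():
    """Return every (testing procedure, prompting strategy) pair."""
    return list(product(PSYCHOPHYSICAL_METHODS, PROMPTING_STRATEGIES))
```

Each pair would then be turned into a concrete prompt template and evaluated separately, which is how an "optimal prompting system" per model can be selected.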

Title: Zero-shot Generative Linguistic Steganography

Authors: Ke Lin, Yiyang Luo, Zijian Zhang, Ping Luo
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2403.10856
Pdf URL: https://arxiv.org/pdf/2403.10856
Copy Paste: [[2403.10856]] Zero-shot Generative Linguistic Steganography(https://arxiv.org/abs/2403.10856)
Keywords: generative
Abstract: Generative linguistic steganography attempts to hide secret messages in covertext. Previous studies have generally focused on the statistical differences between the covertext and stegotext; however, ill-formed stegotext can readily be identified by humans. In this paper, we propose a novel zero-shot approach based on in-context learning for linguistic steganography to achieve better perceptual and statistical imperceptibility. We also design several new metrics and reproducible language evaluations to measure the imperceptibility of the stegotext. Our experimental results indicate that our method produces $1.926\times$ more innocent and intelligible stegotext than any other method.

Title: RetMIL: Retentive Multiple Instance Learning for Histopathological Whole Slide Image Classification

Authors: Hongbo Chu, Qiehe Sun, Jiawen Li, Yuxuan Chen, Lizhong Zhang, Tian Guan, Anjia Han, Yonghong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10858
Pdf URL: https://arxiv.org/pdf/2403.10858
Copy Paste: [[2403.10858]] RetMIL: Retentive Multiple Instance Learning for Histopathological Whole Slide Image Classification(https://arxiv.org/abs/2403.10858)
Keywords: transformer
Abstract: Histopathological whole slide image (WSI) analysis with deep learning has become a research focus in computational pathology. The current paradigm is mainly based on multiple instance learning (MIL), in which approaches with Transformer as the backbone are well discussed. These methods convert WSI tasks into sequence tasks by representing patches as tokens in the WSI sequence. However, the feature complexity brought by high heterogeneity and the ultra-long sequences brought by gigapixel size make Transformer-based MIL suffer from high memory consumption, slow inference speed, and limited performance. To this end, we propose a retentive MIL method called RetMIL, which processes WSI sequences through a hierarchical feature propagation structure. At the local level, the WSI sequence is divided into multiple subsequences. Tokens of each subsequence are updated through a parallel linear retention mechanism and aggregated utilizing an attention layer. At the global level, subsequences are fused into a global sequence, then updated through a serial retention mechanism, and finally the slide-level representation is obtained through a global attention pooling. We conduct experiments on the two public CAMELYON and BRACS datasets and a public-internal LUNG dataset, confirming that RetMIL not only achieves state-of-the-art performance but also significantly reduces computational overhead. Our code will be made available shortly.

Title: Characterizing the Solana NFT Ecosystem

Authors: Dechao Kong, Xiaoqi Li, Wenkai Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.10879
Pdf URL: https://arxiv.org/pdf/2403.10879
Copy Paste: [[2403.10879]] Characterizing the Solana NFT Ecosystem(https://arxiv.org/abs/2403.10879)
Keywords: security
Abstract: Non-Fungible Tokens (NFTs) are digital assets recorded on the blockchain, providing cryptographic proof of ownership over digital or physical items. Although Solana has only begun to gain popularity in recent years, its NFT market has seen substantial transaction volumes. In this paper, we conduct the first systematic research on the characteristics of Solana NFTs from two perspectives: longitudinal measurement and wash trading security audit. We gathered 132,736 Solana NFTs from Solscan and analyzed the sales data within these collections. Investigating users' economic activity and NFT owner information reveals that the top users in Solana NFT are skewed toward a higher distribution of purchases. Subsequently, we employ the Local Outlier Factor algorithm to conduct a wash trading audit on 2,175 popular Solana NFTs. We discovered that 138 NFT pools are involved in wash trading, with 8 of these NFTs having a wash trading rate exceeding 50%. Fortunately, none of these NFTs have been entirely washed out.
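
As background on the Local Outlier Factor (LOF) algorithm named above, the sketch below computes LOF scores for a handful of points using only the standard library. It is a simplified version (exactly `k` neighbours, no tie handling), and the idea of representing each trade or account as a numeric feature vector is an assumption; the features used in the paper are not stated in the abstract.

```python
import math


def local_outlier_factor(points, k=2):
    """Return one LOF score per point; scores well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_dist = [], []
    for i in range(n):
        # The k nearest neighbours of point i, excluding i itself.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / max(sum(reach), 1e-12))  # guard against duplicates

    # LOF: average density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]
```

For example, four points on a unit square plus one far-away point give scores of 1 for the square's corners and a much larger score for the distant point. In an audit setting, a threshold on this score would flag candidates for manual review rather than prove wash trading on its own.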

Title: Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean

Authors: ChangSu Choi, Yongbin Jeong, Seoyoon Park, InHo Won, HyeonSeok Lim, SangMin Kim, Yejee Kang, Chanhyuk Yoon, Jaewan Park, Yiseul Lee, HyeJin Lee, Younggyun Hahm, Hansaem Kim, KyungTae Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10882
Pdf URL: https://arxiv.org/pdf/2403.10882
Copy Paste: [[2403.10882]] Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean(https://arxiv.org/abs/2403.10882)
Keywords: large language model
Abstract: Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.

Title: Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction

Authors: Jiyuan Fu, Zhaoyu Chen, Kaixun Jiang, Haijing Guo, Jiafeng Wang, Shuyong Gao, Wenqiang Zhang
Subjects: cs.CV, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2403.10883
Pdf URL: https://arxiv.org/pdf/2403.10883
Copy Paste: [[2403.10883]] Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction(https://arxiv.org/abs/2403.10883)
Keywords: attack, robust
Abstract: Despite the substantial advancements in Vision-Language Pre-training (VLP) models, their susceptibility to adversarial attacks poses a significant challenge. Existing work rarely studies the transferability of attacks on VLP models, resulting in a substantial performance gap from white-box attacks. We observe that prior work overlooks the interaction mechanisms between modalities, which play a crucial role in understanding the intricacies of VLP models. In response, we propose a novel attack, called Collaborative Multimodal Interaction Attack (CMI-Attack), leveraging modality interaction through embedding guidance and interaction enhancement. Specifically, it attacks text at the embedding level while preserving semantics, and it utilizes interaction image gradients to enhance constraints on perturbations of texts and images. Notably, in the image-text retrieval task on the Flickr30K dataset, CMI-Attack raises the transfer success rates from ALBEF to TCL, $\text{CLIP}_{\text{ViT}}$ and $\text{CLIP}_{\text{CNN}}$ by 8.11%-16.75% over state-of-the-art methods. Moreover, CMI-Attack also demonstrates superior performance in cross-task generalization scenarios. Our work addresses the underexplored realm of transfer attacks on VLP models, shedding light on the importance of modality interaction for enhanced adversarial robustness.

Title: Fuzzy Rank-based Late Fusion Technique for Cytology image Segmentation

Authors: Soumyajyoti Dey, Sukanta Chakraborty, Utso Guha Roy, Nibaran Das
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10884
Pdf URL: https://arxiv.org/pdf/2403.10884
Copy Paste: [[2403.10884]] Fuzzy Rank-based Late Fusion Technique for Cytology image Segmentation(https://arxiv.org/abs/2403.10884)
Keywords: segmentation
Abstract: Cytology image segmentation is quite challenging due to its complex cellular structure and multiple overlapping regions. On the other hand, supervised machine learning techniques need a large amount of annotated data, which is costly. In recent years, late fusion techniques have shown promising performance in the field of image classification. In this paper, we explore a fuzzy-based late fusion technique for cytology image segmentation. This fusion rule integrates three traditional semantic segmentation models: UNet, SegNet, and PSPNet. The technique is applied to two cytology image datasets, i.e., the cervical cytology (HErlev) and breast cytology (JUCYT-v1) image datasets. With the proposed late fusion technique, we achieve maximum MeanIoU scores of 84.27% and 83.79% on the HErlev and JUCYT-v1 datasets, respectively, which are better than those of traditional fusion rules such as average probability, geometric mean, and Borda count. The code of the proposed model is available on GitHub.
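
The traditional fusion rules used as baselines above can be sketched for a single pixel as follows. This illustrates only the average-probability and geometric-mean rules, not the proposed fuzzy rank-based rule, whose details are not given in the abstract. The function names and the example probabilities are hypothetical.

```python
import math


def average_rule(probs):
    """Arithmetic mean of one class's probability across models."""
    return sum(probs) / len(probs)


def geometric_mean_rule(probs):
    """Geometric mean of one class's probability across models."""
    return math.prod(probs) ** (1.0 / len(probs))


def fuse_pixel(class_probs_per_model, rule):
    """Fuse per-model class probabilities for one pixel and return the winning class.

    class_probs_per_model holds one probability vector per model,
    e.g. [[p_background, p_foreground], ...].
    """
    n_classes = len(class_probs_per_model[0])
    scores = [
        rule([model[c] for model in class_probs_per_model])
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=scores.__getitem__)
```

The two rules can disagree: if one model is almost certain of background ([0.99, 0.01]) while two others lean toward foreground ([0.2, 0.8] each), the average picks foreground, whereas the geometric mean picks background because a single near-zero vote suppresses the product. Applying `fuse_pixel` to every pixel yields the fused segmentation mask.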

Title: A Watermark-Conditioned Diffusion Model for IP Protection

Authors: Rui Min, Sen Li, Hongyang Chen, Minhao Cheng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.10893
Pdf URL: https://arxiv.org/pdf/2403.10893
Copy Paste: [[2403.10893]] A Watermark-Conditioned Diffusion Model for IP Protection(https://arxiv.org/abs/2403.10893)
Keywords: protect, robust, steal, watermark, diffusion, generative
Abstract: The ethical need to protect AI-generated content has been a significant concern in recent years. While existing watermarking strategies have demonstrated success in detecting synthetic content (detection), there has been limited exploration in identifying the users responsible for generating these outputs from a single model (owner identification). In this paper, we focus on both practical scenarios and propose a unified watermarking framework for content copyright protection within the context of diffusion models. Specifically, we consider two parties: the model provider, who grants public access to a diffusion model via an API, and the users, who can solely query the model API and generate images in a black-box manner. Our task is to embed hidden information into the generated contents, which facilitates further detection and owner identification. To tackle this challenge, we propose a Watermark-conditioned Diffusion model called WaDiff, which manipulates the watermark as a conditioned input and incorporates fingerprinting into the generation process. All the generative outputs from our WaDiff carry user-specific information, which can be recovered by an image extractor and further facilitate forensic identification. Extensive experiments are conducted on two popular diffusion models, and we demonstrate that our method is effective and robust in both the detection and owner identification tasks. Meanwhile, our watermarking framework only exerts a negligible impact on the original generation and is more stealthy and efficient in comparison to existing watermarking strategies.

Title: Towards Robustness and Diversity: Continual Learning in Dialog Generation with Text-Mixup and Batch Nuclear-Norm Maximization

Authors: Zihan Wang, Jiayu Xiao, Mengxiang Li, Zhongjiang He, Yongxiang Li, Chao Wang, Shuangyong Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.10894
Pdf URL: https://arxiv.org/pdf/2403.10894
Copy Paste: [[2403.10894]] Towards Robustness and Diversity: Continual Learning in Dialog Generation with Text-Mixup and Batch Nuclear-Norm Maximization(https://arxiv.org/abs/2403.10894)
Keywords: robust
Abstract: In our dynamic world where data arrives in a continuous stream, continual learning enables us to incrementally add new tasks/domains without the need to retrain from scratch. A major challenge in continual learning of language models is catastrophic forgetting, the tendency of models to forget knowledge from previously trained tasks/domains when training on new ones. This paper studies dialog generation under the continual learning setting. We propose a novel method that 1) uses Text-Mixup as data augmentation to avoid model overfitting on replay memory and 2) leverages Batch-Nuclear Norm Maximization (BNNM) to alleviate the problem of mode collapse. Experiments on a 37-domain task-oriented dialog dataset and DailyDialog (a 10-domain chitchat dataset) demonstrate that our proposed approach outperforms the state-of-the-art in continual learning.
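
For readers unfamiliar with mixup-style augmentation, the sketch below shows the generic mixup interpolation applied to two embedding vectors and their label vectors. How Text-Mixup is applied inside the dialog model (for instance, at which layer the interpolation happens) is not specified in the abstract, so treating the inputs as fixed-length embeddings is an assumption, as are the function name and the default `alpha`.

```python
import random


def mixup(x_a, x_b, y_a, y_b, alpha=0.4, rng=random):
    """Interpolate two examples with a Beta(alpha, alpha) mixing weight.

    x_a, x_b: embedding vectors of equal length (assumed representation).
    y_a, y_b: label vectors of equal length, e.g. one-hot targets.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

In a replay setting, one natural use would be to mix a stored memory example with a current-task example so that the model does not see the small replay buffer verbatim on every pass, which is the overfitting concern the abstract raises.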

Title: Rethinking Multi-view Representation Learning via Distilled Disentangling

Authors: Guanzhou Ke, Bo Wang, Xiaoli Wang, Shengfeng He
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2403.10897
Pdf URL: https://arxiv.org/pdf/2403.10897
Copy Paste: [[2403.10897]] Rethinking Multi-view Representation Learning via Distilled Disentangling(https://arxiv.org/abs/2403.10897)
Keywords: robust, extraction
Abstract: Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain, highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To this end, we propose an innovative framework for multi-view representation learning, which incorporates a technique we term 'distilled disentangling'. Our method introduces the concept of masked cross-view prediction, enabling the extraction of compact, high-quality view-consistent representations from various sources without incurring extra computational overhead. Additionally, we develop a distilled disentangling module that efficiently filters out consistency-related information from multi-view representations, resulting in purer view-specific representations. This approach significantly reduces redundancy between view-consistent and view-specific representations, enhancing the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover, we find that reducing the dimensionality of view-consistent representations relative to that of view-specific representations further refines the quality of the combined representations. Our code is accessible at: https://github.com/Guanzhou-Ke/MRDD.

Title: BEnQA: A Question Answering and Reasoning Benchmark for Bengali and English

Authors: Sheikh Shafayat, H M Quamran Hasan, Minhajur Rahman Chowdhury Mahim, Rifki Afina Putri, James Thorne, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.10900
Pdf URL: https://arxiv.org/pdf/2403.10900
Copy Paste: [[2403.10900]] BEnQA: A Question Answering and Reasoning Benchmark for Bengali and English(https://arxiv.org/abs/2403.10900)
Keywords: large language model
Abstract: In this study, we introduce BEnQA, a dataset comprising parallel Bengali and English exam questions for middle and high school levels in Bangladesh. Our dataset consists of approximately 5K questions covering several subjects in science with different types of questions, including factual, application, and reasoning-based questions. We benchmark several Large Language Models (LLMs) with our parallel dataset and observe a notable performance disparity between the models in Bengali and English. We also investigate some prompting methods, and find that Chain-of-Thought prompting is beneficial mostly on reasoning questions, but not so much on factual ones. We also find that appending English translation helps to answer questions in Bengali. Our findings point to promising future research directions for improving the performance of LLMs in Bengali and more generally in low-resource languages.

Title: DTOR: Decision Tree Outlier Regressor to explain anomalies

Authors: Riccardo Crupi, Alessandro Damiano Sabatino, Immacolata Marano, Massimiliano Brinis, Luca Albertazzi, Andrea Cirillo, Andrea Claudio Cosentini
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2403.10903
Pdf URL: https://arxiv.org/pdf/2403.10903
Copy Paste: [[2403.10903]] DTOR: Decision Tree Outlier Regressor to explain anomalies(https://arxiv.org/abs/2403.10903)
Keywords: robust
Abstract: Explaining the occurrence of outliers and the mechanism behind it can be extremely important in a variety of domains. Malfunctions, frauds, and threats, in addition to being correctly identified, oftentimes need a valid explanation in order to take effective countermeasures. The ever more widespread use of sophisticated Machine Learning approaches to identify anomalies makes such explanations more challenging. We present the Decision Tree Outlier Regressor (DTOR), a technique for producing rule-based explanations for individual data points by estimating anomaly scores generated by an anomaly detection model. This is accomplished by first applying a Decision Tree Regressor, which computes the estimation score, and then extracting the relative path associated with the data point score. Our results demonstrate the robustness of DTOR even in datasets with a large number of features. Additionally, in contrast to other rule-based approaches, the generated rules are consistently satisfied by the points to be explained. Furthermore, our evaluation metrics indicate comparable performance to Anchors in outlier explanation tasks, with reduced execution time.
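
The two-step idea described above (fit a regression tree to anomaly scores, then read off the path for one data point) can be sketched with a tiny hand-written tree. This is not the authors' implementation: the tree learner is a bare-bones CART-style regressor, the anomaly scores are assumed to come from some external detector, and the names `fit_tree` and `explain` are hypothetical.

```python
def fit_tree(X, y, depth=0, max_depth=3, min_samples=2):
    """Fit a small regression tree on anomaly scores y; returns nested dicts."""
    mean = sum(y) / len(y)
    if depth == max_depth or len(y) < min_samples or min(y) == max(y):
        return {"value": mean}

    def sse(idx):
        m = sum(y[i] for i in idx) / len(idx)
        return sum((y[i] - m) ** 2 for i in idx)

    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # candidate split halfway between neighbours
            left = [i for i, row in enumerate(X) if row[f] <= thr]
            right = [i for i, row in enumerate(X) if row[f] > thr]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, thr, left, right)
    if best is None:  # every feature is constant, so no split exists
        return {"value": mean}
    _, f, thr, left, right = best
    return {
        "feature": f,
        "threshold": thr,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left],
                         depth + 1, max_depth, min_samples),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right],
                          depth + 1, max_depth, min_samples),
    }


def explain(tree, x, names):
    """Follow x down the tree and collect the rule satisfied at each split."""
    rules, node = [], tree
    while "value" not in node:
        f, thr = node["feature"], node["threshold"]
        if x[f] <= thr:
            rules.append(f"{names[f]} <= {thr}")
            node = node["left"]
        else:
            rules.append(f"{names[f]} > {thr}")
            node = node["right"]
    return rules, node["value"]
```

Because the rules are read directly from the path the point takes, they are satisfied by that point by construction, which matches the property the abstract highlights. With hypothetical features `amount` and `hour` and one high-scoring point with an unusually large amount, the explanation reduces to a single rule on `amount`.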

Title: Graph Regularized NMF with L20-norm for Unsupervised Feature Learning

Authors: Zhen Wang, Wenwen Min
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.10910
Pdf URL: https://arxiv.org/pdf/2403.10910
Copy Paste: [[2403.10910]] Graph Regularized NMF with L20-norm for Unsupervised Feature Learning(https://arxiv.org/abs/2403.10910)
Keywords: robust
Abstract: Nonnegative Matrix Factorization (NMF) is a widely applied technique in the fields of machine learning and data mining. Graph Regularized Non-negative Matrix Factorization (GNMF) is an extension of NMF that incorporates graph regularization constraints. GNMF has demonstrated exceptional performance in clustering and dimensionality reduction, effectively discovering inherent low-dimensional structures embedded within high-dimensional spaces. However, the sensitivity of GNMF to noise limits its stability and robustness in practical applications. In order to enhance feature sparsity and mitigate the impact of noise while mining row sparsity patterns in the data for effective feature selection, we introduce the $\ell_{2,0}$-norm constraint as the sparsity constraint for GNMF. We propose an unsupervised feature learning framework based on GNMF_$\ell_{20}$ and devise an algorithm based on PALM and its accelerated version to address this problem. Additionally, we establish the convergence of the proposed algorithms and validate the efficacy and superiority of our approach through experiments conducted on both simulated and real image data.

Title: Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation

Authors: Yeongtak Oh, Jonghyun Lee, Jooyoung Choi, Dahuin Jung, Uiwon Hwang, Sungroh Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10911
Pdf URL: https://arxiv.org/pdf/2403.10911
Copy Paste: [[2403.10911]] Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation(https://arxiv.org/abs/2403.10911)
Keywords: robust, diffusion
Abstract: Test-time adaptation (TTA) addresses the unforeseen distribution shifts occurring during test time. In TTA, performance as well as memory and time consumption are crucial considerations. A recent diffusion-based TTA approach for restoring corrupted images involves image-level updates. However, using pixel space diffusion significantly increases resource requirements compared to conventional model updating TTA approaches, revealing limitations as a TTA method. To address this, we propose a novel TTA method by leveraging a latent diffusion model (LDM) based image editing model and fine-tuning it with our newly introduced corruption modeling scheme. This scheme enhances the robustness of the diffusion model against distribution shifts by creating (clean, corrupted) image pairs and fine-tuning the model to edit corrupted images into clean ones. Moreover, we introduce a distilled variant to accelerate the model for corruption editing using only 4 network function evaluations (NFEs). We extensively validated our method across various architectures and datasets including image and video domains. Our model achieves the best performance with a 100 times faster runtime than that of a diffusion-based baseline. Furthermore, it outpaces the speed of the model updating TTA method based on data augmentation threefold, rendering an image-level updating approach more practical.

Title: FishNet: Deep Neural Networks for Low-Cost Fish Stock Estimation

Authors: Moseli Mots'oehli, Anton Nikolaev, Wawan B. IGede, John Lynham, Peter J. Mous, Peter Sadowski
Subjects: cs.CV, econ.GN
Abstract URL: https://arxiv.org/abs/2403.10916
Pdf URL: https://arxiv.org/pdf/2403.10916
Copy Paste: [[2403.10916]] FishNet: Deep Neural Networks for Low-Cost Fish Stock Estimation(https://arxiv.org/abs/2403.10916)
Keywords: segmentation
Abstract: Fish stock assessment often involves manual fish counting by taxonomy specialists, which is both time-consuming and costly. We propose an automated computer vision system that performs both taxonomic classification and fish size estimation from images taken with a low-cost digital camera. The system first performs object detection and segmentation using a Mask R-CNN to identify individual fish from images containing multiple fish, possibly consisting of different species. Then the species of each fish is classified and its length is predicted using separate machine learning models. These models are trained on a dataset of 50,000 hand-annotated images containing 163 different fish species, ranging in length from 10cm to 250cm. Evaluated on held-out test data, our system achieves a $92\%$ intersection over union on the fish segmentation task, an $89\%$ top-1 classification accuracy on single fish species classification, and a $2.3$ cm mean error on the fish length estimation task.

Title: Batch-oriented Element-wise Approximate Activation for Privacy-Preserving Neural Networks

Authors: Peng Zhang, Ao Duan, Xianglu Zou, Yuhong Liu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.10920
Pdf URL: https://arxiv.org/pdf/2403.10920
Copy Paste: [[2403.10920]] Batch-oriented Element-wise Approximate Activation for Privacy-Preserving Neural Networks(https://arxiv.org/abs/2403.10920)
Keywords: privacy, protect
Abstract: Privacy-Preserving Neural Networks (PPNN) have been developed to perform inference without breaching user privacy, and can serve as an essential tool for medical diagnosis that achieves big data utility and privacy protection simultaneously. As one of the key techniques enabling PPNN, Fully Homomorphic Encryption (FHE) faces a major challenge: homomorphic operations cannot be easily adapted to non-linear activation calculations. In this paper, batch-oriented element-wise data packing and approximate activation are proposed, which train linear low-degree polynomials to approximate the non-linear activation function ReLU. Compared with other approximate activation methods, the proposed fine-grained, trainable approximation scheme can effectively reduce the accuracy loss caused by approximation errors. Meanwhile, due to element-wise data packing, a large batch of images can be packed and inferred concurrently, leading to a much higher utility ratio of ciphertext slots. Therefore, although the total inference time increases sharply, the amortized time for each image actually decreases, especially when the batch size increases. Furthermore, knowledge distillation is adopted in the training process to further enhance the inference accuracy. Experimental results show that when ciphertext inference is performed on 4096 input images, compared with the current most efficient channel-wise method, the inference accuracy is improved by 1.65%, and the amortized inference time is reduced by 99.5%.
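
Illustrative sketch (not taken from the paper): homomorphic encryption schemes natively evaluate additions and multiplications, so a non-polynomial activation such as ReLU is commonly replaced by a low-degree polynomial. The snippet below fits a degree-2 polynomial to ReLU on the interval [-1, 1] by ordinary least squares. This is a stand-in for the trainable approximation described in the abstract, not the authors' method; the interval, the degree, and all function names are assumptions.

```python
def relu(x):
    return x if x > 0 else 0.0

def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns c such that p(x) = c[0] + c[1]*x + c[2]*x**2 + ..."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs

def poly_eval(coeffs, x):
    """Horner's rule: uses only additions and multiplications."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

xs = [i / 100 - 1 for i in range(201)]  # grid on [-1, 1]
ys = [relu(x) for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
max_err = max(abs(poly_eval(coeffs, x) - relu(x)) for x in xs)
print(coeffs, max_err)
```

The fitted polynomial is only accurate on the interval it was fitted on, which is one reason approximation error can translate into accuracy loss when inputs fall outside that range.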

Title: Interpretable Machine Learning for TabPFN

Authors: David Rundel, Julius Kobialka, Constantin von Crailsheim, Matthias Feurer, Thomas Nagler, David Rügamer
Subjects: cs.LG, cs.AI, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2403.10923
Pdf URL: https://arxiv.org/pdf/2403.10923
Copy Paste: [[2403.10923]] Interpretable Machine Learning for TabPFN(https://arxiv.org/abs/2403.10923)
Keywords: interpretability, transformer
Abstract: The recently developed Prior-Data Fitted Networks (PFNs) have shown very promising results for applications in low-data regimes. The TabPFN model, a special case of PFNs for tabular data, is able to achieve state-of-the-art performance on a variety of classification tasks while producing posterior predictive distributions in mere seconds by in-context learning without the need for learning parameters or hyperparameter tuning. This makes TabPFN a very attractive option for a wide range of domain applications. However, a major drawback of the method is its lack of interpretability. Therefore, we propose several adaptations of popular interpretability methods that we specifically design for TabPFN. By taking advantage of the unique properties of the model, our adaptations allow for more efficient computations than existing implementations. In particular, we show how in-context learning facilitates the estimation of Shapley values by avoiding approximate retraining and enables the use of Leave-One-Covariate-Out (LOCO) even when working with large-scale Transformers. In addition, we demonstrate how data valuation methods can be used to address scalability challenges of TabPFN. Our proposed methods are implemented in a package tabpfn_iml and made available at https://github.com/david-rundel/tabpfn_iml.

Title: Understanding Robustness of Visual State Space Models for Image Classification

Authors: Chengbin Du, Yanxi Li, Chang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10935
Pdf URL: https://arxiv.org/pdf/2403.10935
Copy Paste: [[2403.10935]] Understanding Robustness of Visual State Space Models for Image Classification(https://arxiv.org/abs/2403.10935)
Keywords: attack, robust, transformer
Abstract: Visual State Space Model (VMamba) has recently emerged as a promising architecture, exhibiting remarkable performance in various computer vision tasks. However, its robustness has not yet been thoroughly studied. In this paper, we delve into the robustness of this architecture through comprehensive investigations from multiple perspectives. Firstly, we investigate its robustness to adversarial attacks, employing both whole-image and patch-specific adversarial attacks. Results demonstrate superior adversarial robustness compared to Transformer architectures while revealing scalability weaknesses. Secondly, the general robustness of VMamba is assessed against diverse scenarios, including natural adversarial examples, out-of-distribution data, and common corruptions. VMamba exhibits exceptional generalizability with out-of-distribution data but shows scalability weaknesses against natural adversarial examples and common corruptions. Additionally, we explore VMamba's gradients and back-propagation during white-box attacks, uncovering unique vulnerabilities and defensive capabilities of its novel components. Lastly, the sensitivity of VMamba to image structure variations is examined, highlighting vulnerabilities associated with the distribution of disturbance areas and spatial information, with increased susceptibility closer to the image center. Through these comprehensive studies, we contribute to a deeper understanding of VMamba's robustness, providing valuable insights for refining and advancing the capabilities of deep neural networks in computer vision applications.

Title: ScanTalk: 3D Talking Heads from Unregistered Scans

Authors: Federico Nocentini, Thomas Besnier, Claudio Ferrari, Sylvain Arguillere, Stefano Berretti, Mohamed Daoudi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10942
Pdf URL: https://arxiv.org/pdf/2403.10942
Copy Paste: [[2403.10942]] ScanTalk: 3D Talking Heads from Unregistered Scans(https://arxiv.org/abs/2403.10942)
Keywords: diffusion
Abstract: Speech-driven 3D talking heads generation has emerged as a significant area of interest among researchers, presenting numerous challenges. Existing methods are constrained by animating faces with fixed topologies, wherein point-wise correspondence is established, and the number and order of points remain consistent across all identities the model can animate. In this work, we present ScanTalk, a novel framework capable of animating 3D faces in arbitrary topologies including scanned data. Our approach relies on the DiffusionNet architecture to overcome the fixed topology constraint, offering promising avenues for more flexible and realistic 3D animations. By leveraging the power of DiffusionNet, ScanTalk not only adapts to diverse facial structures but also maintains fidelity when dealing with scanned data, thereby enhancing the authenticity and versatility of generated 3D talking heads. Through comprehensive comparisons with state-of-the-art methods, we validate the efficacy of our approach, demonstrating its capacity to generate realistic talking heads comparable to existing techniques. While our primary objective is to develop a generic method free from topological constraints, all state-of-the-art methodologies are bound by such limitations. Code for reproducing our results and the pre-trained model will be made available.

Title: SelfIE: Self-Interpretation of Large Language Model Embeddings

Authors: Haozhe Chen, Carl Vondrick, Chengzhi Mao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10949
Pdf URL: https://arxiv.org/pdf/2403.10949
Copy Paste: [[2403.10949]] SelfIE: Self-Interpretation of Large Language Model Embeddings(https://arxiv.org/abs/2403.10949)
Keywords: large language model
Abstract: How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model developments. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in the hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings also open up new avenues to control LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation of an individual layer. We extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge in LLMs without supervision targets.

Title: Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

Authors: Hongxiang Zhao, Xili Dai, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10953
Pdf URL: https://arxiv.org/pdf/2403.10953
Copy Paste: [[2403.10953]] Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription(https://arxiv.org/abs/2403.10953)
Keywords: diffusion
Abstract: Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS). However, existing diffusion-based NVS methods struggle to generate novel views that are accurately consistent with the corresponding ground truth poses and appearances, even on the training set. This consequently limits the performance of downstream tasks, such as image-to-multiview generation and 3D reconstruction. We realize that such inconsistency is largely due to the fact that it is difficult to enforce accurate pose and appearance alignment directly in the diffusion training, as mostly done by existing methods such as Zero123. To remedy this problem, we propose Ctrl123, a closed-loop transcription-based NVS diffusion method that enforces alignment between the generated view and ground truth in a pose-sensitive feature space. Our extensive experiments demonstrate the effectiveness of Ctrl123 on the tasks of NVS and 3D reconstruction, achieving significant improvements in both multiview-consistency and pose-consistency over existing methods.

Title: Energy-Based Models with Applications to Speech and Language Processing

Authors: Zhijian Ou
Subjects: cs.LG, cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2403.10961
Pdf URL: https://arxiv.org/pdf/2403.10961
Copy Paste: [[2403.10961]] Energy-Based Models with Applications to Speech and Language Processing(https://arxiv.org/abs/2403.10961)
Keywords: generative
Abstract: Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are un-normalized and thus radically different from other popular self-normalized probabilistic models such as hidden Markov models (HMMs), autoregressive models, generative adversarial nets (GANs) and variational auto-encoders (VAEs). Over the past years, EBMs have attracted increasing interest not only from the core machine learning community, but also from application domains such as speech, vision, natural language processing (NLP) and so on, due to significant theoretical and algorithmic progress. The sequential nature of speech and language also presents special challenges and needs a different treatment from processing fixed-dimensional data (e.g., images). Therefore, the purpose of this monograph is to present a systematic introduction to energy-based models, including both algorithmic progress and applications in speech and language processing. First, the basics of EBMs are introduced, including classic models, recent models parameterized by neural networks, sampling methods, and various learning methods from the classic learning algorithms to the most advanced ones. Then, the application of EBMs in three different scenarios is presented, i.e., for modeling marginal, conditional and joint distributions, respectively. 1) EBMs for sequential data with applications in language modeling, where the main focus is on the marginal distribution of a sequence itself; 2) EBMs for modeling conditional distributions of target sequences given observation sequences, with applications in speech recognition, sequence labeling and text generation; 3) EBMs for modeling joint distributions of both sequences of observations and targets, and their applications in semi-supervised learning and calibrated natural language understanding.

Title: Exploiting Topological Prior for Boosting Point Cloud Generation

Authors: Baiyuan Chen
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.10962
Pdf URL: https://arxiv.org/pdf/2403.10962
Copy Paste: [[2403.10962]] Exploiting Topological Prior for Boosting Point Cloud Generation(https://arxiv.org/abs/2403.10962)
Keywords: generative
Abstract: This paper presents an innovative enhancement to the Sphere as Prior Generative Adversarial Network (SP-GAN) model, a state-of-the-art GAN designed for point cloud generation. A novel method is introduced for point cloud generation that elevates the structural integrity and overall quality of the generated point clouds by incorporating topological priors into the training process of the generator. Specifically, this work utilizes the K-means algorithm to segment a point cloud from the repository into clusters and extract centroids, which are then used as priors in the generation process of the SP-GAN. Furthermore, the discriminator component of the SP-GAN utilizes the identical point cloud that contributed the centroids, ensuring a coherent and consistent learning environment. This strategic use of centroids as intuitive guides not only boosts the efficiency of global feature learning but also substantially improves the structural coherence and fidelity of the generated point clouds. By applying the K-means algorithm to generate centroids as the prior, the work intuitively and experimentally demonstrates that such a prior enhances the quality of generated point clouds.
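
Illustrative sketch (not taken from the paper): the abstract describes running K-means on a reference point cloud and using the cluster centroids as a prior for the generator. The snippet below covers only the centroid-extraction step on a toy 3D point cloud; how the centroids enter SP-GAN is not shown. The farthest-first initialization, the function names, and the toy points are assumptions chosen to keep the example deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_init(points, k):
    """Deterministic initialization: start at points[0], then repeatedly add
    the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans_centroids(points, k, iters=50):
    """Plain Lloyd iterations; returns the k cluster centroids."""
    centroids = farthest_first_init(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sq_dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Hypothetical toy point cloud with two well-separated parts.
cloud = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
         (10, 10, 10), (11, 10, 10), (10, 11, 10), (10, 10, 11)]
prior_centroids = kmeans_centroids(cloud, k=2)
print(prior_centroids)
```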

Title: Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That!

Authors: Niyati Bafna, David Yarowsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.10963
Pdf URL: https://arxiv.org/pdf/2403.10963
Copy Paste: [[2403.10963]] Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That!(https://arxiv.org/abs/2403.10963)
Keywords: transformer
Abstract: While Transformer-based neural machine translation (NMT) is very effective in high-resource settings, many languages lack the necessary large parallel corpora to benefit from it. In the context of low-resource (LR) MT between two closely-related languages, a natural intuition is to seek benefits from structural "shortcuts", such as copying subwords from the source to the target, given that such language pairs often share a considerable number of identical words, cognates, and borrowings. We test Pointer-Generator Networks for this purpose for six language pairs over a variety of resource ranges, and find weak improvements for most settings. However, analysis shows that the model does not show greater improvements for closely-related vs. more distant language pairs, or for lower resource ranges, and that the models do not exhibit the expected usage of the mechanism for shared subwords. Our discussion of the reasons for this behaviour highlights several general challenges for LR NMT, such as modern tokenization strategies, noisy real-world conditions, and linguistic complexities. We call for better scrutiny of linguistically motivated improvements to NMT given the blackbox nature of Transformer models, as well as for a focus on the above problems in the field.

Title: Enhancing IoT Security Against DDoS Attacks through Federated Learning

Authors: Ghazaleh Shirvani, Saeid Ghasemshirazi, Mohammad Ali Alipour
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10968
Pdf URL: https://arxiv.org/pdf/2403.10968
Copy Paste: [[2403.10968]] Enhancing IoT Security Against DDoS Attacks through Federated Learning(https://arxiv.org/abs/2403.10968)
Keywords: security, privacy, attack, federate
Abstract: The rapid proliferation of the Internet of Things (IoT) has ushered in transformative connectivity between physical devices and the digital realm. Nonetheless, the escalating threat of Distributed Denial of Service (DDoS) attacks jeopardizes the integrity and reliability of IoT networks. Conventional DDoS mitigation approaches are ill-equipped to handle the intricacies of IoT ecosystems, potentially compromising data privacy. This paper introduces an innovative strategy to bolster the security of IoT networks against DDoS attacks by harnessing the power of Federated Learning that allows multiple IoT devices or edge nodes to collaboratively build a global model while preserving data privacy and minimizing communication overhead. The research aims to investigate Federated Learning's effectiveness in detecting and mitigating DDoS attacks in IoT. Our proposed framework leverages IoT devices' collective intelligence for real-time attack detection without compromising sensitive data. This study proposes innovative deep autoencoder approaches for data dimensionality reduction, retraining, and partial selection to enhance the performance and stability of the proposed model. Additionally, two renowned aggregation algorithms, FedAvg and FedAvgM, are employed in this research. Various metrics, including true positive rate, false positive rate, and F1-score, are employed to evaluate the model. The dataset utilized in this research, N-BaIoT, exhibits non-IID data distribution, where data categories are distributed quite differently. The negative impact of these distribution disparities is managed by employing retraining and partial selection techniques, enhancing the final model's stability. Furthermore, evaluation results demonstrate that the FedAvgM aggregation algorithm outperforms FedAvg, indicating that in non-IID datasets, FedAvgM provides better stability and performance.
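
Illustrative sketch (not taken from the paper): the two aggregation rules named in the abstract differ only on the server side. FedAvg replaces the global model with the sample-size-weighted average of the client models, while FedAvgM treats the difference between the current global model and that average as a pseudo-gradient and applies momentum to it. The snippet below uses plain lists as parameter vectors; the function names and the values of beta and server_lr are assumptions, not settings reported by the authors.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

def fedavgm_step(global_weights, client_weights, client_sizes, velocity,
                 beta=0.9, server_lr=1.0):
    """One FedAvgM round: momentum on the pseudo-gradient (global - average)."""
    avg = fedavg(client_weights, client_sizes)
    delta = [g - a for g, a in zip(global_weights, avg)]
    new_velocity = [beta * v + d for v, d in zip(velocity, delta)]
    new_global = [g - server_lr * v for g, v in zip(global_weights, new_velocity)]
    return new_global, new_velocity

# Two hypothetical clients holding 1 and 3 samples respectively.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [1, 3]
global_model = [0.0, 0.0]

averaged = fedavg(clients, sizes)
with_momentum, velocity = fedavgm_step(global_model, clients, sizes, velocity=[0.0, 0.0])
print(averaged, with_momentum, velocity)
```

With beta set to 0 and server_lr set to 1, the FedAvgM step reduces to FedAvg, which makes the momentum term the only difference between the two rules in this sketch.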

Title: Task-Aware Low-Rank Adaptation of Segment Anything Model

Authors: Xuehao Wang, Feiyang Ye, Yu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10971
Pdf URL: https://arxiv.org/pdf/2403.10971
Copy Paste: [[2403.10971]] Task-Aware Low-Rank Adaptation of Segment Anything Model(https://arxiv.org/abs/2403.10971)
Keywords: segmentation
Abstract: The Segment Anything Model (SAM), with its remarkable zero-shot capability, has been proven to be a powerful foundation model for image segmentation tasks, which is an important task in computer vision. However, the transfer of its rich semantic information to multiple different downstream tasks remains unexplored. In this paper, we propose the Task-Aware Low-Rank Adaptation (TA-LoRA) method, which enables SAM to work as a foundation model for multi-task learning. Specifically, TA-LoRA injects an update parameter tensor into each layer of the encoder in SAM and leverages a low-rank tensor decomposition method to incorporate both task-shared and task-specific information. Furthermore, we introduce modified SAM (mSAM) for multi-task learning where we remove the prompt encoder of SAM and use task-specific no mask embeddings and mask decoder for each task. Extensive experiments conducted on benchmark datasets substantiate the efficacy of TA-LoRA in enhancing the performance of mSAM across multiple downstream tasks.

Title: OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

Authors: Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, Bizhu Wu, Guanying Chen, Wei Liu, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.10983
Pdf URL: https://arxiv.org/pdf/2403.10983
Copy Paste: [[2403.10983]] OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models(https://arxiv.org/abs/2403.10983)
Keywords: diffusion
Abstract: Personalization is an important topic in text-to-image generation, especially the challenging multi-concept personalization. Current multi-concept methods struggle with identity preservation, occlusion, and the harmony between foreground and background. In this work, we propose OMG, an occlusion-friendly personalized generation framework designed to seamlessly integrate multiple concepts within a single image. We propose a novel two-stage sampling solution. The first stage takes charge of layout generation and visual comprehension information collection for handling occlusions. The second one utilizes the acquired visual comprehension information and the designed noise blending to integrate multiple concepts while considering occlusions. We also observe that the initiation denoising timestep for noise blending is the key to identity preservation and layout. Moreover, our method can be combined with various single-concept models, such as LoRA and InstantID, without additional tuning. In particular, LoRA models on civitai.com can be exploited directly. Extensive experiments demonstrate that OMG exhibits superior performance in multi-concept personalization.

Title: IoTCO2: Assessing the End-To-End Carbon Footprint of Internet-of-Things-Enabled Deep Learning

Authors: Ahmad Faiz, Shahzeen Attari, Gayle Buck, Fan Chen, Lei Jiang
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2403.10984
Pdf URL: https://arxiv.org/pdf/2403.10984
Copy Paste: [[2403.10984]] IoTCO2: Assessing the End-To-End Carbon Footprint of Internet-of-Things-Enabled Deep Learning(https://arxiv.org/abs/2403.10984)
Keywords: privacy
Abstract: To improve privacy and ensure quality-of-service (QoS), deep learning (DL) models are increasingly deployed on Internet of Things (IoT) devices for data processing, significantly increasing the carbon footprint associated with DL on IoT, covering both operational and embodied aspects. Existing operational energy predictors often overlook quantized DL models and emerging neural processing units (NPUs), while embodied carbon footprint modeling tools neglect non-computing hardware components common in IoT devices, creating a gap in accurate carbon footprint modeling tools for IoT-enabled DL. This paper introduces IoTCO2, an end-to-end modeling tool for precise carbon footprint estimation in IoT-enabled DL, demonstrating a maximum $\pm21\%$ deviation in carbon footprint values compared to actual measurements across various DL models. Additionally, practical applications of IoTCO2 are showcased through multiple user case studies.

Title: Boosting Flow-based Generative Super-Resolution Models via Learned Prior

Authors: Li-Yuan Tsao, Yi-Chen Lo, Chia-Che Chang, Hao-Wei Chen, Roy Tseng, Chien Feng, Chun-Yi Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.10988
Pdf URL: https://arxiv.org/pdf/2403.10988
Copy Paste: [[2403.10988]] Boosting Flow-based Generative Super-Resolution Models via Learned Prior(https://arxiv.org/abs/2403.10988)
Keywords: generative
Abstract: Flow-based super-resolution (SR) models have demonstrated astonishing capabilities in generating high-quality images. However, these methods encounter several challenges during image generation, such as grid artifacts, exploding inverses, and suboptimal results due to a fixed sampling temperature. To overcome these issues, this work introduces a conditional learned prior to the inference phase of a flow-based SR model. This prior is a latent code predicted by our proposed latent module conditioned on the low-resolution image, which is then transformed by the flow model into an SR image. Our framework is designed to seamlessly integrate with any contemporary flow-based SR model without modifying its architecture or pre-trained weights. We evaluate the effectiveness of our proposed framework through extensive experiments and ablation analyses. The proposed framework successfully addresses all the inherent issues in flow-based SR models and enhances their performance in various SR scenarios. Our code is available at: https://github.com/liyuantsao/FlowSR-LP

Title: Edge Private Graph Neural Networks with Singular Value Perturbation

Authors: Tingting Tang, Yue Niu, Salman Avestimehr, Murali Annavaram
Subjects: cs.LG, cs.AI, cs.CR, cs.SI
Abstract URL: https://arxiv.org/abs/2403.10995
Pdf URL: https://arxiv.org/pdf/2403.10995
Copy Paste: [[2403.10995]] Edge Private Graph Neural Networks with Singular Value Perturbation(https://arxiv.org/abs/2403.10995)
Keywords: privacy, protect, attack, extraction
Abstract: Graph neural networks (GNNs) play a key role in learning representations from graph-structured data and are demonstrated to be useful in many applications. However, the GNN training pipeline has been shown to be vulnerable to node feature leakage and edge extraction attacks. This paper investigates a scenario where an attacker aims to recover private edge information from a trained GNN model. Previous studies have employed differential privacy (DP) to add noise directly to the adjacency matrix or a compact graph representation. The added perturbations cause the graph structure to be substantially morphed, reducing the model utility. We propose a new privacy-preserving GNN training algorithm, Eclipse, that maintains good model utility while providing strong privacy protection on edges. Eclipse is based on two key observations. First, adjacency matrices in graph structures exhibit low-rank behavior. Thus, Eclipse trains GNNs with a low-rank format of the graph via singular value decomposition (SVD), rather than the original graph. Using the low-rank format, Eclipse preserves the primary graph topology and removes the remaining residual edges. Eclipse adds noise to the low-rank singular values instead of the entire graph, thereby preserving the graph privacy while still maintaining enough of the graph structure to maintain model utility. We theoretically show that Eclipse provides a formal DP guarantee on edges. Experiments on benchmark graph datasets show that Eclipse achieves a significantly better privacy-utility tradeoff compared to existing privacy-preserving GNN training methods. In particular, under strong privacy constraints ($\epsilon$ < 4), Eclipse shows significant gains in the model utility by up to 46%. We further demonstrate that Eclipse also has better resilience against common edge attacks (e.g., LPA), lowering the attack AUC by up to 5% compared to other state-of-the-art baselines.
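
Illustrative sketch (not taken from the paper): the abstract describes replacing the adjacency matrix by a low-rank version obtained through SVD and adding noise only to the retained singular values. The NumPy snippet below shows that pipeline on a toy graph. The Laplace noise, the value of noise_scale, and the function name are assumptions; the snippet makes no attempt to calibrate the noise and therefore carries no differential privacy guarantee.

```python
import numpy as np

def perturbed_low_rank_adjacency(adj, rank, noise_scale, seed=0):
    """Keep the top `rank` singular triplets of `adj` and add Laplace noise
    to the retained singular values before reconstructing the matrix."""
    rng = np.random.default_rng(seed)
    u, s, vt = np.linalg.svd(adj, full_matrices=False)
    noisy_s = s[:rank] + rng.laplace(loc=0.0, scale=noise_scale, size=rank)
    # Scale the columns of U by the noisy singular values, then recombine.
    return (u[:, :rank] * noisy_s) @ vt[:rank, :]

# Hypothetical toy graph: two disjoint triangles on six nodes.
triangle = np.ones((3, 3)) - np.eye(3)
adj = np.block([[triangle, np.zeros((3, 3))],
                [np.zeros((3, 3)), triangle]])

noisy_adj = perturbed_low_rank_adjacency(adj, rank=2, noise_scale=0.1)
print(noisy_adj.shape, np.linalg.matrix_rank(noisy_adj))
```

Only `rank` numbers are perturbed here instead of every entry of the adjacency matrix, which is the contrast with direct adjacency perturbation that the abstract draws.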

Title: N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

Authors: Yash Bhalgat, Iro Laina, João F. Henriques, Andrew Zisserman, Andrea Vedaldi
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.10997
Pdf URL: https://arxiv.org/pdf/2403.10997
Copy Paste: [[2403.10997]] N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields(https://arxiv.org/abs/2403.10997)
Keywords: segmentation
Abstract: Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to either the physical dimensions or semantics or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space, and query the CLIP vision-encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.

Title: MASSM: An End-to-End Deep Learning Framework for Multi-Anatomy Statistical Shape Modeling Directly From Images

Authors: Janmesh Ukey, Tushar Kataria, Shireen Y. Elhabian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11008
Pdf URL: https://arxiv.org/pdf/2403.11008
Copy Paste: [[2403.11008]] MASSM: An End-to-End Deep Learning Framework for Multi-Anatomy Statistical Shape Modeling Directly From Images(https://arxiv.org/abs/2403.11008)
Keywords: segmentation
Abstract: Statistical Shape Modeling (SSM) is an effective method for quantitatively analyzing anatomical variations within populations. However, its utility is limited by the need for manual segmentations of anatomies, a task that relies on the scarce expertise of medical professionals. Recent advances in deep learning have provided a promising approach that automatically generates statistical representations from unsegmented images. Once trained, these deep learning-based models eliminate the need for manual segmentation for new subjects. Nonetheless, most current methods still require manual pre-alignment of image volumes and specifying a bounding box around the target anatomy prior to inference, resulting in a partially manual inference process. Recent approaches facilitate anatomy localization but only estimate statistical representations at the population level. However, they cannot delineate anatomy directly in images and are limited to modeling a single anatomy. Here, we introduce MASSM, a novel end-to-end deep learning framework that simultaneously localizes multiple anatomies in an image, estimates population-level statistical representations, and delineates each anatomy. Our findings emphasize the crucial role of local correspondences, showcasing their indispensability in providing superior shape information for medical imaging tasks.

Title: EfficientMorph: Parameter-Efficient Transformer-Based Architecture for 3D Image Registration

Authors: Abu Zahid Bin Aziz, Mokshagna Sai Teja Karanam, Tushar Kataria, Shireen Y. Elhabian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11026
Pdf URL: https://arxiv.org/pdf/2403.11026
Copy Paste: [[2403.11026]] EfficientMorph: Parameter-Efficient Transformer-Based Architecture for 3D Image Registration(https://arxiv.org/abs/2403.11026)
Keywords: transformer
Abstract: Transformers have emerged as the state-of-the-art architecture in medical image registration, outperforming convolutional neural networks (CNNs) by addressing their limited receptive fields and overcoming gradient instability in deeper models. Despite their success, transformer-based models require substantial resources for training, including data, memory, and computational power, which may restrict their applicability for end users with limited resources. In particular, existing transformer-based 3D image registration architectures face three critical gaps that challenge their efficiency and effectiveness. Firstly, while mitigating the quadratic complexity of full attention by focusing on local regions, window-based attention mechanisms often fail to adequately integrate local and global information. Secondly, feature similarities across attention heads that were recently found in multi-head attention architectures indicate a significant computational redundancy, suggesting that the capacity of the network could be better utilized to enhance performance. Lastly, the granularity of tokenization, a key factor in registration accuracy, presents a trade-off; smaller tokens improve detail capture at the cost of higher computational complexity, increased memory demands, and a risk of overfitting. Here, we propose EfficientMorph, a transformer-based architecture for unsupervised 3D image registration. It optimizes the balance between local and global attention through a plane-based attention mechanism, reduces computational redundancy via cascaded group attention, and captures fine details without compromising computational efficiency, thanks to a Hi-Res tokenization strategy complemented by merging operations. Notably, EfficientMorph sets a new benchmark for performance on the OASIS dataset with 16-27x fewer parameters.

Title: Reward Guided Latent Consistency Distillation

Authors: Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11027
Pdf URL: https://arxiv.org/pdf/2403.11027
Copy Paste: [[2403.11027]] Reward Guided Latent Consistency Distillation(https://arxiv.org/abs/2403.11027)
Keywords: diffusion
Abstract: Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM's efficient inference is obtained at the cost of the sample quality. In this paper, we propose compensating the quality loss by aligning LCM's output with human preference during training. Specifically, we introduce Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM) into the LCD process by augmenting the original LCD loss with the objective of maximizing the reward associated with LCM's single-step generation. As validated through human evaluation, when trained with the feedback of a good RM, the 2-step generations from our RG-LCM are favored by humans over the 50-step DDIM samples from the teacher LDM, representing a 25 times inference acceleration without quality loss. As directly optimizing towards differentiable RMs can suffer from over-optimization, we overcome this difficulty by proposing the use of a latent proxy RM (LRM). This novel component serves as an intermediary, connecting our LCM with the RM. Empirically, we demonstrate that incorporating the LRM into our RG-LCD successfully avoids high-frequency noise in the generated images, contributing to both improved FID on MS-COCO and a higher HPSv2.1 score on HPSv2's test set, surpassing those achieved by the baseline LCM.
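
One way to write the augmented objective described in this abstract is shown below. The notation is an assumption made for illustration and is not taken from the paper: $f_\theta$ denotes the LCM's single-step generation from a latent $z$ and prompt $c$, $R$ the reward model, and $\beta$ a weighting coefficient.

```latex
\mathcal{L}_{\text{RG-LCD}}(\theta)
  = \mathcal{L}_{\text{LCD}}(\theta)
  - \beta \, \mathbb{E}_{z,\,c}\!\left[ R\big(f_\theta(z, c),\, c\big) \right]
```

Minimizing the second term maximizes the expected reward. The 25 times acceleration quoted in the abstract is the ratio of sampling steps, $50 / 2 = 25$.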

Title: Texture Edge detection by Patch consensus (TEP)

Authors: Guangyu Cui, Sung Ha Kang
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2403.11038
Pdf URL: https://arxiv.org/pdf/2403.11038
Copy Paste: [[2403.11038]] Texture Edge detection by Patch consensus (TEP)(https://arxiv.org/abs/2403.11038)
Keywords: segmentation
Abstract: We propose Texture Edge detection using Patch consensus (TEP), a training-free method for detecting texture boundaries. We propose a simple new way to identify the texture edge location using the consensus of segmented local patch information. On the boundary, the distinction between textures is typically unclear even with local patch information, but neighbor consensus gives a clear idea of the boundary. We utilize a local patch and its response against neighboring regions to emphasize the similarities and differences across textures. Segmenting the response further emphasizes the edge location, and neighborhood voting provides consensus and stabilizes the edge detection. We analyze texture as a stationary process to give insight into the patch width parameter versus the quality of edge detection. We derive the necessary condition for textures to be distinguished, and analyze the patch width with respect to the scale of textures. Various experiments are presented to validate the proposed model.

Title: FAGH: Accelerating Federated Learning with Approximated Global Hessian

Authors: Mrinmay Sen, A. K. Qin, Krishna Mohan C
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2403.11041
Pdf URL: https://arxiv.org/pdf/2403.11041
Copy Paste: [[2403.11041]] FAGH: Accelerating Federated Learning with Approximated Global Hessian(https://arxiv.org/abs/2403.11041)
Keywords: federate
Abstract: In federated learning (FL), the significant communication overhead due to the slow convergence speed of training the global model poses a great challenge. Specifically, a large number of communication rounds are required to achieve convergence in FL. One potential solution is to employ the Newton-based optimization method for training, known for its quadratic convergence rate. However, the existing Newton-based FL training methods suffer from either memory inefficiency or high computational costs for local clients or the server. To address this issue, we propose an FL with approximated global Hessian (FAGH) method to accelerate FL training. FAGH leverages the first moment of the approximated global Hessian and the first moment of the global gradient to train the global model. By harnessing the approximated global Hessian curvature, FAGH accelerates the convergence of global model training, leading to fewer communication rounds and thus a shorter training time. Experimental results verify FAGH's effectiveness in decreasing the number of communication rounds and the time required to achieve the pre-specified objectives of the global model performance in terms of training and test losses as well as test accuracy. Notably, FAGH outperforms several state-of-the-art FL training methods.

Title: From Pixels to Predictions: Spectrogram and Vision Transformer for Better Time Series Forecasting

Authors: Zhen Zeng, Rachneet Kaur, Suchetha Siddagangappa, Tucker Balch, Manuela Veloso
Subjects: cs.CV, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2403.11047
Pdf URL: https://arxiv.org/pdf/2403.11047
Copy Paste: [[2403.11047]] From Pixels to Predictions: Spectrogram and Vision Transformer for Better Time Series Forecasting(https://arxiv.org/abs/2403.11047)
Keywords: transformer
Abstract: Time series forecasting plays a crucial role in decision-making across various domains, but it presents significant challenges. Recent studies have explored image-driven approaches using computer vision models to address these challenges, often employing lineplots as the visual representation of time series data. In this paper, we propose a novel approach that uses time-frequency spectrograms as the visual representation of time series data. We introduce the use of a vision transformer for multimodal learning, showcasing the advantages of our approach across diverse datasets from different domains. To evaluate its effectiveness, we compare our method against statistical baselines (EMA and ARIMA), a state-of-the-art deep learning-based approach (DeepAR), other visual representations of time series data (lineplot images), and an ablation study on using only the time series as input. Our experiments demonstrate the benefits of utilizing spectrograms as a visual representation for time series data, along with the advantages of employing a vision transformer for simultaneous learning in both the time and frequency domains.
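
To make the "time-frequency spectrogram" representation concrete, the sketch below turns a 1-D series into a 2-D magnitude spectrogram with a short-time Fourier transform. It only illustrates the general technique: the function name, the Hann window, and the `window_size` and `hop` values are assumptions, not the authors' settings.

```python
import numpy as np

def spectrogram(series, window_size, hop):
    """Magnitude spectrogram via a short-time Fourier transform (Hann window)."""
    window = np.hanning(window_size)
    # Start index of every full-length frame.
    starts = range(0, len(series) - window_size + 1, hop)
    frames = np.stack([series[s:s + window_size] * window for s in starts])
    # rfft per frame gives (num_frames, window_size // 2 + 1);
    # transpose so rows are frequency bins and columns are time frames.
    return np.abs(np.fft.rfft(frames, axis=1)).T

t = np.arange(512)
series = np.sin(2 * np.pi * 8 * t / 64)  # 8 cycles per 64 samples
spec = spectrogram(series, window_size=64, hop=16)
```

The resulting array can be treated as a single-channel image, which is the form an image model such as a vision transformer consumes.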

Title: Endora: Video Generation Models as Endoscopy Simulators

Authors: Chenxin Li, Hengyu Liu, Yifan Liu, Brandon Y. Feng, Wuyang Li, Xinyu Liu, Zhen Chen, Jing Shao, Yixuan Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11050
Pdf URL: https://arxiv.org/pdf/2403.11050
Copy Paste: [[2403.11050]] Endora: Video Generation Models as Endoscopy Simulators(https://arxiv.org/abs/2403.11050)
Keywords: transformer, generative
Abstract: Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for machine learning. Despite progress in generating 2D medical images, the complex domain of clinical video generation has largely remained untapped. This paper introduces Endora, an innovative approach to generate medical videos that simulate clinical endoscopy scenes. We present a novel generative model design that integrates a meticulously crafted spatial-temporal video transformer with advanced 2D vision foundation model priors, explicitly modeling spatial-temporal dynamics during video generation. We also pioneer the first public benchmark for endoscopy simulation with video generation models, adapting existing state-of-the-art methods for this endeavor. Endora demonstrates exceptional visual quality in generating endoscopy videos, surpassing state-of-the-art methods in extensive testing. Moreover, we explore how this endoscopy simulator can empower downstream video analysis tasks and even generate 3D medical scenes with multi-view consistency. In a nutshell, Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research, setting a substantial stage for further advances in medical content generation. For more details, please visit our project page: https://endora-medvidgen.github.io/.

Title: Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

Authors: Jie Ren, Yaxin Li, Shenglai Zen, Han Xu, Lingjuan Lyu, Yue Xing, Jiliang Tang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2403.11052
Pdf URL: https://arxiv.org/pdf/2403.11052
Copy Paste: [[2403.11052]] Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention(https://arxiv.org/abs/2403.11052)
Keywords: privacy, diffusion
Abstract: Recent advancements in text-to-image diffusion models have demonstrated their remarkable capability to generate high-quality images from textual prompts. However, increasing research indicates that these models memorize and replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks. In our study, we provide a novel perspective to understand this memorization phenomenon by examining its relationship with cross-attention mechanisms. We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens. The diffusion model is overfitted to these token embeddings, memorizing corresponding training images. To elucidate this phenomenon, we further identify and discuss various intrinsic findings of cross-attention that contribute to memorization. Building on these insights, we introduce an innovative approach to detect and mitigate memorization in diffusion models. The advantage of our proposed method is that it will not compromise the speed of either the training or the inference processes in these models while preserving the quality of generated images. Our code is available at https://github.com/renjie3/MemAttn.

Title: Large Language Models Powered Context-aware Motion Prediction

Authors: Xiaoji Zheng, Lixiu Wu, Zhijie Yan, Yuanrong Tang, Hao Zhao, Chen Zhong, Bokui Chen, Jiangtao Gong
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11057
Pdf URL: https://arxiv.org/pdf/2403.11057
Copy Paste: [[2403.11057]] Large Language Models Powered Context-aware Motion Prediction(https://arxiv.org/abs/2403.11057)
Keywords: large language model
Abstract: Motion prediction is among the most fundamental tasks in autonomous driving. Traditional methods of motion forecasting primarily encode vector information of maps and historical trajectory data of traffic participants, lacking a comprehensive understanding of overall traffic semantics, which in turn affects the performance of prediction tasks. In this paper, we utilized Large Language Models (LLMs) to enhance the global traffic context understanding for motion prediction tasks. We first conducted systematic prompt engineering, visualizing complex traffic environments and historical trajectory information of traffic participants into image prompts -- Transportation Context Map (TC-Map), accompanied by corresponding text prompts. Through this approach, we obtained rich traffic context information from the LLM. By integrating this information into the motion prediction model, we demonstrate that such context can enhance the accuracy of motion predictions. Furthermore, considering the cost associated with LLMs, we propose a cost-effective deployment strategy: enhancing the accuracy of motion prediction tasks at scale with 0.7% LLM-augmented datasets. Our research offers valuable insights into enhancing the understanding of traffic scenes of LLMs and the motion prediction performance of autonomous driving.

Title: Intelligent Railroad Grade Crossing: Leveraging Semantic Segmentation and Object Detection for Enhanced Safety

Authors: Al Amin, Deo Chimba, Kamrul Hasan, Emmanuel Samson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11060
Pdf URL: https://arxiv.org/pdf/2403.11060
Copy Paste: [[2403.11060]] Intelligent Railroad Grade Crossing: Leveraging Semantic Segmentation and Object Detection for Enhanced Safety(https://arxiv.org/abs/2403.11060)
Keywords: segmentation
Abstract: Crashes and delays at Railroad Highway Grade Crossings (RHGC), where highways and railroads intersect, pose significant safety concerns for the U.S. Federal Railroad Administration (FRA). Despite the critical importance of addressing accidents and traffic delays at highway-railroad intersections, there is a notable dearth of research on practical solutions for managing these issues. In response to this gap in the literature, our study introduces an intelligent system that leverages machine learning and computer vision techniques to enhance safety at Railroad Highway Grade Crossings (RHGC). This research proposes a Non-Maximum Suppression (NMS)-based ensemble model that integrates a variety of YOLO variants, specifically YOLOv5S, YOLOv5M, and YOLOv5L, for grade-crossing object detection, and utilizes segmentation techniques from the UNet architecture for detecting approaching rail at a grade crossing. Both methods are implemented on a Raspberry Pi. Moreover, the strategy employs high-definition cameras installed at the RHGC. This framework enables the system to monitor objects within the Region of Interest (ROI) at crossings, detect the approach of trains, and clear the crossing area before a train arrives. Regarding accuracy, precision, recall, and Intersection over Union (IoU), the proposed state-of-the-art NMS-based object detection ensemble model achieved 96% precision. In addition, the UNet segmentation model obtained a 98% IoU value. This automated railroad grade crossing system powered by artificial intelligence represents a promising solution for enhancing safety at highway-railroad intersections.
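
The abstract does not spell out how the detectors are combined, so the sketch below shows only the standard building block it names: greedy non-maximum suppression applied to boxes pooled from several detectors, with IoU as the overlap measure. The box coordinates, scores, and the 0.5 threshold are made-up illustrative values, and the detector labels are hypothetical.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Negative extents mean no overlap, so clip them to zero.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

# Detections pooled from three hypothetical detectors.
boxes = np.array([
    [10, 10, 50, 50],      # detector A, object 1
    [12, 11, 51, 49],      # detector B, object 1
    [11, 9, 49, 52],       # detector C, object 1
    [100, 100, 140, 150],  # detector A, object 2
], dtype=float)
scores = np.array([0.90, 0.85, 0.80, 0.75])
kept = nms(boxes, scores)  # indices of the surviving boxes
```

Here the three overlapping boxes for the first object collapse to the highest-scoring one, and the second object's box survives because it overlaps nothing.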

Title: Tokensome: Towards a Genetic Vision-Language GPT for Explainable and Cognitive Karyotyping

Authors: Haoxi Zhang, Xinxu Zhang, Yuanxin Lin, Maiqi Wang, Yi Lai, Yu Wang, Linfeng Yu, Yufeng Xu, Ran Cheng, Edward Szczerbicki
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11073
Pdf URL: https://arxiv.org/pdf/2403.11073
Copy Paste: [[2403.11073]] Tokensome: Towards a Genetic Vision-Language GPT for Explainable and Cognitive Karyotyping(https://arxiv.org/abs/2403.11073)
Keywords: interpretability, explainability
Abstract: Automatic karyotype analysis is often defined as a visual perception task focused solely on chromosomal object-level modeling. This definition has led most existing methods to overlook componential and holistic information, significantly constraining model performance. Moreover, the lack of interpretability in current technologies hinders clinical adoption. In this paper, we introduce Tokensome, a novel vision-language model based on chromosome tokenization for explainable and cognitive karyotyping. Tokensome elevates the method from the conventional visual perception layer to the cognitive decision-making layer. This elevation enables the integration of domain knowledge and cognitive reasoning via knowledge graphs and LLMs, markedly enhancing the model's explainability and facilitating abnormality detection.

Title: Audio-Visual Segmentation via Unlabeled Frame Exploitation

Authors: Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang
Subjects: cs.CV, cs.AI, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2403.11074
Pdf URL: https://arxiv.org/pdf/2403.11074
Copy Paste: [[2403.11074]] Audio-Visual Segmentation via Unlabeled Frame Exploitation(https://arxiv.org/abs/2403.11074)
Keywords: segmentation
Abstract: Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods achieve only marginal performance gains from the use of the unlabeled frames, leading to an underutilization issue. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., neighboring frame (NF) and distant frame (DF). NFs, temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects. In contrast to NFs, DFs are temporally distant from the labeled frame and share semantically similar objects with appearance variations. Considering their unique characteristics, we propose a versatile framework that effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the motion cues as the dynamic guidance to improve the objectness localization. Besides, we exploit the semantic cues in DFs by treating them as valid augmentations to the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.

Title: Zippo: Zipping Color and Transparency Distributions into a Single Diffusion Model

Authors: Kangyang Xie, Binbin Yang, Hao Chen, Meng Wang, Cheng Zou, Hui Xue, Ming Yang, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11077
Pdf URL: https://arxiv.org/pdf/2403.11077
Copy Paste: [[2403.11077]] Zippo: Zipping Color and Transparency Distributions into a Single Diffusion Model(https://arxiv.org/abs/2403.11077)
Keywords: diffusion, generative
Abstract: Beyond the superiority of the text-to-image diffusion model in generating high-quality images, recent studies have attempted to uncover its potential for adapting the learned semantic knowledge to visual perception tasks. In this work, instead of translating a generative diffusion model into a visual perception model, we explore retaining the generative ability alongside the perceptive adaptation. To accomplish this, we present Zippo, a unified framework for zipping the color and transparency distributions into a single diffusion model by expanding the diffusion latent into a joint representation of RGB images and alpha mattes. By alternately selecting one modality as the condition and then applying the diffusion process to the counterpart modality, Zippo is capable of generating RGB images from alpha mattes and predicting transparency from input images. In addition to single-modality prediction, we propose a modality-aware noise reassignment strategy to further empower Zippo to jointly generate RGB images and their corresponding alpha mattes under text guidance. Our experiments showcase Zippo's capability for efficient text-conditioned transparent image generation and present plausible results of Matte-to-RGB and RGB-to-Matte translation.

Title: RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning

Authors: Javad Rafiei Asl, Prajwal Panzade, Eduardo Blanco, Daniel Takabi, Zhipeng Cai
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11082
Pdf URL: https://arxiv.org/pdf/2403.11082
Copy Paste: [[2403.11082]] RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning(https://arxiv.org/abs/2403.11082)
Keywords: attack, robust
Abstract: Pre-trained language models (PLMs) have consistently demonstrated outstanding performance across a diverse spectrum of natural language processing tasks. Nevertheless, despite their success with unseen data, current PLM-based representations often exhibit poor robustness in adversarial settings. In this paper, we introduce RobustSentEmbed, a self-supervised sentence embedding framework designed to improve both generalization and robustness in diverse text representation tasks and against a diverse set of adversarial attacks. Through the generation of high-risk adversarial perturbations and their utilization in a novel objective function, RobustSentEmbed adeptly learns high-quality and robust sentence embeddings. Our experiments confirm the superiority of RobustSentEmbed over state-of-the-art representations. Specifically, our framework achieves a significant reduction in the success rate of various adversarial attacks, notably reducing the BERTAttack success rate by almost half (from 75.51% to 38.81%). The framework also yields improvements of 1.59% and 0.23% in semantic textual similarity tasks and various transfer tasks, respectively.

Title: Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning

Authors: Xiaohao Xu, Yunkang Cao, Yongqi Chen, Weiming Shen, Xiaonan Huang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11083
Pdf URL: https://arxiv.org/pdf/2403.11083
Copy Paste: [[2403.11083]] Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning(https://arxiv.org/abs/2403.11083)
Keywords: robust
Abstract: Anomaly detection is vital in various industrial scenarios, including the identification of unusual patterns in production lines and the detection of manufacturing defects for quality control. Existing techniques tend to be specialized in individual scenarios and lack generalization capacities. In this study, we aim to develop a generic anomaly detection model applicable across multiple scenarios. To achieve this, we customize generic visual-language foundation models that possess extensive knowledge and robust reasoning abilities into anomaly detectors and reasoners. Specifically, we introduce a multi-modal prompting strategy that incorporates domain knowledge from experts as conditions to guide the models. Our approach considers multi-modal prompt types, including task descriptions, class context, normality rules, and reference images. In addition, we unify the input representation of multi-modality into a 2D image format, enabling multi-modal anomaly detection and reasoning. Our preliminary studies demonstrate that combining visual and language prompts as conditions for customizing the models enhances anomaly detection performance. The customized models showcase the ability to detect anomalies across different data modalities such as images and point clouds. Qualitative case studies further highlight the anomaly detection and reasoning capabilities, particularly for multi-object scenes and temporal data. Our code is available at https://github.com/Xiaohao-Xu/Customizable-VLM.

Title: Programming Frameworks for Differential Privacy

Authors: Marco Gaboardi, Michael Hay, Salil Vadhan
Subjects: cs.CR, cs.DB, cs.PL
Abstract URL: https://arxiv.org/abs/2403.11088
Pdf URL: https://arxiv.org/pdf/2403.11088
Copy Paste: [[2403.11088]] Programming Frameworks for Differential Privacy(https://arxiv.org/abs/2403.11088)
Keywords: privacy
Abstract: Many programming frameworks have been introduced to support the development of differentially private software applications. In this chapter, we survey some of the conceptual ideas underlying these frameworks in a way that we hope will be helpful for both practitioners and researchers. For practitioners, the survey can provide a starting point for understanding what features may be valuable when selecting a programming framework. For researchers, it can help organize existing work in a unified way and provide context for understanding new features in future frameworks.

Title: Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts

Authors: Michael Saxon, Yiran Luo, Sharon Levy, Chitta Baral, Yezhou Yang, William Yang Wang
Subjects: cs.CL, cs.AI, cs.CV, cs.CY, eess.IV
Abstract URL: https://arxiv.org/abs/2403.11092
Pdf URL: https://arxiv.org/pdf/2403.11092
Copy Paste: [[2403.11092]] Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts(https://arxiv.org/abs/2403.11092)
Keywords: fair
Abstract: Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction's impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions.

Title: Hierarchical Generative Network for Face Morphing Attacks

Authors: Zuyuan He, Zongyong Deng, Qiaoyun He, Qijun Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11101
Pdf URL: https://arxiv.org/pdf/2403.11101
Copy Paste: [[2403.11101]] Hierarchical Generative Network for Face Morphing Attacks(https://arxiv.org/abs/2403.11101)
Keywords: attack, generative
Abstract: Face morphing attacks circumvent face recognition systems (FRSs) by creating a morphed image that contains multiple identities. However, existing face morphing attack methods either sacrifice image quality or compromise the identity preservation capability. Consequently, these attacks fail to bypass FRSs verification well while still managing to deceive human observers. These methods typically rely on global information from contributing images, ignoring the detailed information from effective facial regions. To address the above issues, we propose a novel morphing attack method to improve the quality of morphed images and better preserve the contributing identities. Our proposed method leverages the hierarchical generative network to capture both local detailed and global consistency information. Additionally, a mask-guided image blending module is dedicated to removing artifacts from areas outside the face to improve the image's visual quality. The proposed attack method is compared to state-of-the-art methods on three public datasets in terms of FRSs' vulnerability, attack detectability, and image quality. The results show our method's potential threat of deceiving FRSs while being capable of passing multiple morphing attack detection (MAD) scenarios.

Title: ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models

Authors: Yuzhao Heng, Chunyuan Deng, Yitong Li, Yue Yu, Yinghao Li, Rongzhi Zhang, Chao Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11103
Pdf URL: https://arxiv.org/pdf/2403.11103
Copy Paste: [[2403.11103]] ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models(https://arxiv.org/abs/2403.11103)
Keywords: extraction, large language model
Abstract: Although Large Language Models (LLMs) exhibit remarkable adaptability across domains, these models often fall short in structured knowledge extraction tasks such as named entity recognition (NER). This paper explores an innovative, cost-efficient strategy to harness LLMs with modest NER capabilities for producing superior NER datasets. Our approach diverges from the basic class-conditional prompts by instructing LLMs to self-reflect on the specific domain, thereby generating domain-relevant attributes (such as category and emotions for movie reviews), which are utilized for creating attribute-rich training data. Furthermore, we preemptively generate entity terms and then develop NER context data around these entities, effectively bypassing the LLMs' challenges with complex structures. Our experiments across both general and niche domains reveal significant performance enhancements over conventional data generation methods while being more cost-effective than existing alternatives.

Title: Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

Authors: Ruibin Li, Ruihuang Li, Song Guo, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11105
Pdf URL: https://arxiv.org/pdf/2403.11105
Copy Paste: [[2403.11105]] Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models(https://arxiv.org/abs/2403.11105)
Keywords: diffusion
Abstract: Text-driven diffusion models have significantly advanced image editing performance by using text prompts as inputs. One crucial step in text-driven image editing is to invert the original image into a latent noise code conditioned on the source prompt. While previous methods have achieved promising results by refactoring the image synthesizing process, the inverted latent noise code is tightly coupled with the source prompt, limiting the image editability by target text prompts. To address this issue, we propose a novel method called Source Prompt Disentangled Inversion (SPDInv), which aims at reducing the impact of the source prompt, thereby enhancing the text-driven image editing performance of diffusion models. To make the inverted noise code as independent of the given source prompt as possible, we indicate that the iterative inversion process should satisfy a fixed-point constraint. Consequently, we transform the inversion problem into a search problem to find the fixed-point solution, and utilize the pre-trained diffusion models to facilitate the search process. The experimental results show that our proposed SPDInv method can effectively mitigate the conflicts between the target editing prompt and the source prompt, leading to a significant decrease in editing artifacts. In addition to text-driven image editing, with SPDInv we can easily adapt customized image generation models to localized editing tasks and produce promising performance. The source code is available at https://github.com/leeruibin/SPDInv.
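
Illustrative note (not from the paper): the abstract does not state the update rule SPDInv uses, so the following is only a generic sketch of what "searching for a fixed-point solution" means. The function name `fixed_point`, the tolerance, and the toy scalar map are assumptions made for illustration; the real method operates on latent noise codes with a pre-trained diffusion model.

```python
import math
from typing import Callable


def fixed_point(step: Callable[[float], float], z0: float,
                tol: float = 1e-10, max_iter: int = 1000) -> float:
    """Iterate z <- step(z) until successive iterates differ by less than tol.

    Hypothetical helper: a plain fixed-point iteration on a scalar,
    standing in for an iterative inversion step on a latent code.
    """
    z = z0
    for _ in range(max_iter):
        z_next = step(z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z  # best estimate if the tolerance was not reached


# Toy map: the fixed point of cos(z) satisfies z = cos(z).
z_star = fixed_point(math.cos, z0=1.0)
residual = abs(math.cos(z_star) - z_star)  # how well the constraint holds
```

The residual `|step(z) - z|` is the quantity a fixed-point constraint asks to be (near) zero; in this toy case the iteration converges because the map is a contraction near the solution.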

Title: Self-supervised co-salient object detection via feature correspondence at multiple scales

Authors: Souradeep Chakraborty, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11107
Pdf URL: https://arxiv.org/pdf/2403.11107
Copy Paste: [[2403.11107]] Self-supervised co-salient object detection via feature correspondence at multiple scales(https://arxiv.org/abs/2403.11107)
Keywords: segmentation
Abstract: Our paper introduces a novel two-stage self-supervised approach for detecting co-occurring salient objects (CoSOD) in image groups without requiring segmentation annotations. Unlike existing unsupervised methods that rely solely on patch-level information (e.g., clustering patch descriptors) or on computation-heavy off-the-shelf components for CoSOD, our lightweight model leverages feature correspondences at both patch and region levels, significantly improving prediction performance. In the first stage, we train a self-supervised network that detects co-salient regions by computing local patch-level feature correspondences across images. We obtain the segmentation predictions using confidence-based adaptive thresholding. In the next stage, we refine these intermediate segmentations by eliminating the detected regions (within each image) whose averaged feature representations are dissimilar to the foreground feature representation averaged across all the cross-attention maps (from the previous stage). Extensive experiments on three CoSOD benchmark datasets show that our self-supervised model outperforms the corresponding state-of-the-art models by a huge margin (e.g., on the CoCA dataset, our model has a 13.7% F-measure gain over the SOTA unsupervised CoSOD model). Notably, our self-supervised model also outperforms several recent fully supervised CoSOD models on the three test datasets (e.g., on the CoCA dataset, our model has a 4.6% F-measure gain over a recent supervised CoSOD model).

Title: 3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models

Authors: Yongtao Ge, Wenjia Wang, Yongfan Chen, Hao Chen, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11111
Pdf URL: https://arxiv.org/pdf/2403.11111
Copy Paste: [[2403.11111]] 3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models(https://arxiv.org/abs/2403.11111)
Keywords: diffusion, generative, segmentation
Abstract: In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape estimation (HPS). Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. We first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions and surface normal images. Then, we train a customized ControlNet model upon this dataset to generate diverse human images and initial ground-truth labels. At the core of this step is that we can easily obtain numerous surface normal images from a 3D human parametric model, e.g., SMPL-X, by rendering the 3D mesh onto the image plane. As there exists inevitable noise in the initial labels, we then apply an off-the-shelf foundation segmentation model, i.e., SAM, to filter negative data samples. Our data generation pipeline is flexible and customizable to facilitate different real-world tasks, e.g., ego-centric scenes and perspective-distortion scenes. The generated dataset comprises 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. We train various HPS regressors on top of the generated data and evaluate them on a wide range of benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify the effectiveness of the generated data. By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection.

Title: Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis

Authors: Yiyang Chen, Lunhao Duan, Shanshan Zhao, Changxing Ding, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11113
Pdf URL: https://arxiv.org/pdf/2403.11113
Copy Paste: [[2403.11113]] Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis(https://arxiv.org/abs/2403.11113)
Keywords: segmentation
Abstract: Rotation invariance is an important requirement for point shape analysis. To achieve this, current state-of-the-art methods attempt to construct the local rotation-invariant representation through learning or defining the local reference frame (LRF). Although efficient, these LRF-based methods suffer from perturbation of local geometric relations, resulting in suboptimal local rotation invariance. To alleviate this issue, we propose a Local-consistent Transformation (LocoTrans) learning strategy. Specifically, we first construct the local-consistent reference frame (LCRF) by considering the symmetry of the two axes in LRF. In comparison with previous LRFs, our LCRF is able to preserve local geometric relationships better through performing local-consistent transformation. However, as the consistency only exists in local regions, the relative pose information is still lost in the intermediate layers of the network. We mitigate such a relative pose issue by developing a relative pose recovery (RPR) module. RPR aims to restore the relative pose between adjacent transformed patches. Equipped with LCRF and RPR, our LocoTrans is capable of learning local-consistent transformation and preserving local geometry, which benefits rotation invariance learning. Competitive performance under arbitrary rotations on both shape classification and part segmentation tasks, together with ablations, demonstrates the effectiveness of our method. Code will be publicly available at https://github.com/wdttt/LocoTrans.

Title: PhD: A Prompted Visual Hallucination Evaluation Dataset

Authors: Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Xirong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11116
Pdf URL: https://arxiv.org/pdf/2403.11116
Copy Paste: [[2403.11116]] PhD: A Prompted Visual Hallucination Evaluation Dataset(https://arxiv.org/abs/2403.11116)
Keywords: large language model
Abstract: The rapid growth of Large Language Models (LLMs) has driven the development of Large Vision-Language Models (LVLMs). The challenge of hallucination, prevalent in LLMs, also emerges in LVLMs. However, most existing efforts mainly focus on object hallucination in LVLMs, ignoring diverse types of LVLM hallucinations. In this study, we delve into the Intrinsic Vision-Language Hallucination (IVL-Hallu) issue, thoroughly analyzing different types of IVL-Hallu on their causes and reflections. Specifically, we propose several novel IVL-Hallu tasks and categorize them into four types: (a) object hallucination, which arises from the misidentification of objects, (b) attribute hallucination, which is caused by the misidentification of attributes, (c) multi-modal conflicting hallucination, which derives from the contradictions between textual and visual information, and (d) counter-common-sense hallucination, which is due to the contradictions between the LVLM knowledge and actual images. Based on these taxonomies, we propose a more challenging benchmark named PhD to evaluate and explore IVL-Hallu. An automated pipeline is proposed for generating different types of IVL-Hallu data. Extensive experiments on five SOTA LVLMs reveal their inability to effectively tackle our proposed IVL-Hallu tasks, with detailed analyses and insights on the origins and possible solutions of these new challenging IVL-Hallu tasks, facilitating future research on IVL-Hallu and LVLMs. The benchmark can be accessed at https://github.com/jiazhen-code/IntrinsicHallu.

Title: Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence

Authors: Sunghwan Hong, Seokju Cho, Seungryong Kim, Stephen Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11120
Pdf URL: https://arxiv.org/pdf/2403.11120
Copy Paste: [[2403.11120]] Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence(https://arxiv.org/abs/2403.11120)
Keywords: transformer
Abstract: This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks. In the context of dense matching, many works benefit from one of two forms of aggregation: feature aggregation, which pertains to the alignment of similar features, or cost aggregation, a procedure aimed at instilling coherence in the flow estimates across neighboring pixels. In this work, we first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes. We then introduce a simple yet effective architecture that uses self- and cross-attention mechanisms to unify feature aggregation and cost aggregation and effectively harness the strengths of both techniques. Within the proposed attention layers, the features and cost volume complement each other, and the attention layers are interleaved through a coarse-to-fine design to further promote accurate correspondence estimation. Finally, at inference, our network produces multi-scale predictions, computes their confidence scores, and selects the most confident flow for final prediction. Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.

Title: LERENet: Eliminating Intra-class Differences for Metal Surface Defect Few-shot Semantic Segmentation

Authors: Hanze Ding, Zhangkai Wu, Jiyan Zhang, Ming Ping, Yanfang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11122
Pdf URL: https://arxiv.org/pdf/2403.11122
Copy Paste: [[2403.11122]] LERENet: Eliminating Intra-class Differences for Metal Surface Defect Few-shot Semantic Segmentation(https://arxiv.org/abs/2403.11122)
Keywords: segmentation
Abstract: Few-shot segmentation models excel in metal defect detection due to their rapid generalization ability to new classes and pixel-level segmentation, rendering them ideal for addressing data scarcity issues and achieving refined object delineation in industrial applications. Existing works neglect the Intra-Class Differences inherent in metal surface defect data, which hinder the model from learning sufficient knowledge from the support set to guide the query set segmentation. Specifically, these differences can be categorized into two types: the Semantic Difference induced by internal factors in metal samples and the Distortion Difference caused by external factors of surroundings. To address these differences, we introduce a Local dEscriptor based Reasoning and Excitation Network (LERENet) to learn the two-view guidance, i.e., local and global information from the graph and feature space, and fuse them to segment precisely. Since the relation structure of local features embedded in graph space will help to eliminate the Semantic Difference, we employ a Multi-Prototype Reasoning (MPR) module, extracting local-descriptor-based prototypes and analyzing local-view feature relevance in support-query pairs. Besides, since global information will assist in countering the Distortion Difference in observations, we utilize a Multi-Prototype Excitation (MPE) module to capture the global-view relations in support-query pairs. Finally, we employ an Information Fusion Module (IFM) to fuse learned prototypes in local and global views to generate pixel-level masks. Our comprehensive experiments on defect datasets demonstrate that LERENet outperforms existing benchmarks, establishing a new state-of-the-art.

Title: Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment

Authors: Feifan Song, Bowen Yu, Hao Lang, Haiyang Yu, Fei Huang, Houfeng Wang, Yongbin Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11124
Pdf URL: https://arxiv.org/pdf/2403.11124
Copy Paste: [[2403.11124]] Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment(https://arxiv.org/abs/2403.11124)
Keywords: large language model
Abstract: Alignment with human preference prevents large language models (LLMs) from generating misleading or toxic content while requiring high-cost human feedback. Assuming resources of human annotation are limited, there are two different ways of allocating them to consider: more diverse PROMPTS or more diverse RESPONSES to be labeled. Nonetheless, a straightforward comparison of their impact is absent. In this work, we first control the diversity of both sides according to the number of samples for fine-tuning, which can directly reflect their influence. We find that instead of numerous prompts, more responses but fewer prompts better trigger LLMs for human alignment. Additionally, the concept of diversity for prompts can be more complex than that for responses, which is typically quantified by single digits. Consequently, a new formulation of prompt diversity is proposed, further implying a linear correlation with the final performance of LLMs after fine-tuning. We also leverage it on data augmentation and conduct experiments to show its effect on different algorithms.

Title: Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities

Authors: Honglin Mu, Yang Xu, Yunlong Feng, Xiaofeng Han, Yitong Li, Yutai Hou, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11128
Pdf URL: https://arxiv.org/pdf/2403.11128
Copy Paste: [[2403.11128]] Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities(https://arxiv.org/abs/2403.11128)
Keywords: large language model
Abstract: With the rise of Large Language Models (LLMs), AI assistants' ability to utilize tools, especially through API calls, has advanced notably. This progress has necessitated more accurate evaluation methods. Many existing studies adopt static evaluation, where they assess AI assistants' API calls based on pre-defined dialogue histories. However, such an evaluation method can be misleading, as an AI assistant might fail in generating API calls from preceding human interaction in real cases. Instead of the resource-intensive method of direct human-machine interactions, we propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement. In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions, using an LLM-based user agent, equipped with a user script to ensure human alignment. Experimental results highlight that AutoDE uncovers errors overlooked by static evaluations, aligning more closely with human assessment. Testing four AI assistants using our crafted benchmark, our method mirrored human evaluation with a correlation of 0.99, marking an 8% enhancement compared to conventional static evaluations.

Title: Enhancing Event Causality Identification with Rationale and Structure-Aware Causal Question Answering

Authors: Baiyan Zhang, Qin Chen, Jie Zhou, Jian Jin, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11129
Pdf URL: https://arxiv.org/pdf/2403.11129
Copy Paste: [[2403.11129]] Enhancing Event Causality Identification with Rationale and Structure-Aware Causal Question Answering(https://arxiv.org/abs/2403.11129)
Keywords: large language model
Abstract: Document-level Event Causality Identification (DECI) aims to identify causal relations between two events in documents. Recent research tends to use pre-trained language models to generate the event causal relations. However, these methods are prone to the errors of sequential generation due to multiple events in a document. Moreover, potential structures such as event coreference and related causal chains are neglected. In this paper, we propose a multi-task learning framework to enhance event causality identification with rationale and structure-aware causal question answering. Specifically, the DECI task is transformed into multiple-choice question answering, and the causes and effects of the questioned event are generated with large language models. In addition, we generate the rationales to explain why these events have causal relations. Moreover, we construct an event structure graph, which models the multi-hop potential relations for causal reasoning of the current event. Experiments on two benchmark datasets show the great advantages of our proposed approach compared to the state-of-the-art methods. Moreover, we conduct both quantitative and qualitative analyses, which shed light on why each component of our approach can lead to great improvements.

Title: Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

Authors: Mohamed Taher Alrefaie, Nour Eldin Morsy, Nada Samir
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11130
Pdf URL: https://arxiv.org/pdf/2403.11130
Copy Paste: [[2403.11130]] Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models(https://arxiv.org/abs/2403.11130)
Keywords: robust, segmentation
Abstract: This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural language processing tasks. Our investigation focused on the effectiveness of four tokenizers across various tasks, including News Classification, Hate Speech Detection, Sentiment Analysis, and Natural Language Inference. Leveraging a diverse set of vocabulary sizes, we scrutinize the intricate interplay between tokenization approaches and model performance. The results reveal that Byte Pair Encoding (BPE) with Farasa outperforms other strategies in multiple tasks, underscoring the significance of morphological analysis in capturing the nuances of the Arabic language. However, challenges arise in sentiment analysis, where dialect-specific segmentation issues impact model efficiency. Computational efficiency analysis demonstrates the stability of BPE with Farasa, suggesting its practical viability. Our study uncovers limited impacts of vocabulary size on model performance while keeping the model size unchanged. This challenges the established beliefs about the relationship between vocabulary, model size, and downstream tasks, emphasizing the need for the study of models' size and their corresponding vocabulary size to generalize across domains and mitigate biases, particularly in dialect-based datasets. The paper's recommendations include refining tokenization strategies to address dialect challenges, enhancing model robustness across diverse linguistic contexts, and expanding datasets to encompass the rich dialect-based Arabic. This work not only advances our understanding of Arabic language models but also lays the foundation for responsible and ethical developments in natural language processing technologies tailored to the intricacies of the Arabic language.
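
Illustrative note (not from the paper): for readers unfamiliar with Byte Pair Encoding, the sketch below shows the textbook merge-learning loop on a tiny English word list. It does not include Farasa's morphological segmentation, and the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the toy corpus are assumptions made for illustration rather than the tokenizers evaluated in the paper.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` in each word with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges; vocabulary size grows with each merge."""
    words = {tuple(w): c for w, c in Counter(corpus).items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


toy_corpus = ["low", "low", "lower", "lowest"]  # hypothetical data
merges = learn_bpe(toy_corpus, num_merges=2)
```

The number of merges is what controls the vocabulary size in this scheme, which is the knob the abstract refers to when it varies vocabulary sizes.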

Title: Omni-Recon: Towards General-Purpose Neural Radiance Fields for Versatile 3D Applications

Authors: Yonggan Fu, Huaizhi Qu, Zhifan Ye, Chaojian Li, Kevin Zhao, Yingyan Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11131
Pdf URL: https://arxiv.org/pdf/2403.11131
Copy Paste: [[2403.11131]] Omni-Recon: Towards General-Purpose Neural Radiance Fields for Versatile 3D Applications(https://arxiv.org/abs/2403.11131)
Keywords: diffusion, transformer
Abstract: Recent breakthroughs in Neural Radiance Fields (NeRFs) have sparked significant demand for their integration into real-world 3D applications. However, the varied functionalities required by different 3D applications often necessitate diverse NeRF models with various pipelines, leading to tedious NeRF training for each target task and cumbersome trial-and-error experiments. Drawing inspiration from the generalization capability and adaptability of emerging foundation models, our work aims to develop one general-purpose NeRF for handling diverse 3D tasks. We achieve this by proposing a framework called Omni-Recon, which is capable of (1) generalizable 3D reconstruction and zero-shot multitask scene understanding, and (2) adaptability to diverse downstream 3D applications such as real-time rendering and scene editing. Our key insight is that an image-based rendering pipeline, with accurate geometry and appearance estimation, can lift 2D image features into their 3D counterparts, thus extending widely explored 2D tasks to the 3D world in a generalizable manner. Specifically, our Omni-Recon features a general-purpose NeRF model using image-based rendering with two decoupled branches: one complex transformer-based branch that progressively fuses geometry and appearance features for accurate geometry estimation, and one lightweight branch for predicting blending weights of source views. This design achieves state-of-the-art (SOTA) generalizable 3D surface reconstruction quality with blending weights reusable across diverse tasks for zero-shot multitask scene understanding. In addition, it can enable real-time rendering after baking the complex geometry branch into meshes, swift adaptation to achieve SOTA generalizable 3D understanding performance, and seamless integration with 2D diffusion models for text-guided 3D editing.

Title: Is Mamba Effective for Time Series Forecasting?

Authors: Zihan Wang, Fanheng Kong, Shi Feng, Ming Wang, Han Zhao, Daling Wang, Yifei Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.11144
Pdf URL: https://arxiv.org/pdf/2403.11144
Copy Paste: [[2403.11144]] Is Mamba Effective for Time Series Forecasting?(https://arxiv.org/abs/2403.11144)
Keywords: robust, transformer
Abstract: In the realm of time series forecasting (TSF), the Transformer has consistently demonstrated robust performance due to its ability to focus on the global context and effectively capture long-range dependencies within time, as well as discern correlations between multiple variables. However, due to the inefficiencies of the Transformer model and questions surrounding its ability to capture dependencies, ongoing efforts to refine the Transformer architecture persist. Recently, state space models (SSMs), e.g., Mamba, have gained traction due to their ability to capture complex dependencies in sequences, similar to the Transformer, while maintaining near-linear complexity. In text and image tasks, Mamba-based models can improve performance and reduce cost, creating a win-win situation. This has piqued our interest in exploring SSMs' potential in TSF tasks. In this paper, we introduce two straightforward SSM-based models for TSF, S-Mamba and D-Mamba, both employing the Mamba Block to extract variate correlations. Remarkably, S-Mamba and D-Mamba achieve superior performance while saving GPU memory and training time. Furthermore, we conduct extensive experiments to delve deeper into the potential of Mamba compared to the Transformer in TSF, aiming to explore a new research direction for this field. Our code is available at https://github.com/wzhwzhwzh0921/S-D-Mamba.

Title: Evaluation Ethics of LLMs in Legal Domain

Authors: Ruizhe Zhang, Haitao Li, Yueyue Wu, Qingyao Ai, Yiqun Liu, Min Zhang, Shaoping Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11152
Pdf URL: https://arxiv.org/pdf/2403.11152
Copy Paste: [[2403.11152]] Evaluation Ethics of LLMs in Legal Domain(https://arxiv.org/abs/2403.11152)
Keywords: robust, large language model
Abstract: In recent years, the utilization of large language models for natural language dialogue has gained momentum, leading to their widespread adoption across various domains. However, their universal competence in addressing challenges specific to specialized fields such as law remains a subject of scrutiny. The incorporation of legal ethics into the model has been overlooked by researchers. We assert that rigorous ethics evaluation is essential to ensure the effective integration of large language models in legal domains, emphasizing the need to assess domain-specific proficiency and domain-specific ethics. To address this, we propose a novel evaluation methodology, utilizing authentic legal cases to evaluate the fundamental language abilities, specialized legal knowledge and legal robustness of large language models (LLMs). The findings from our comprehensive evaluation contribute significantly to the academic discourse surrounding the suitability and performance of large language models in legal domains.

Title: Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model

Authors: Dian Zheng, Xiao-Ming Wu, Shuzhou Yang, Jian Zhang, Jian-Fang Hu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11157
Pdf URL: https://arxiv.org/pdf/2403.11157
Copy Paste: [[2403.11157]] Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model(https://arxiv.org/abs/2403.11157)
Keywords: diffusion
Abstract: Universal image restoration is a practical and promising computer vision task for real-world applications. The main challenge of this task is handling the different degradation distributions at once. Existing methods mainly utilize task-specific conditions (e.g., prompts) to guide the model to learn different distributions separately, a strategy named multi-partite mapping. However, it is not suitable for universal model learning as it ignores the shared information between different tasks. In this work, we propose an advanced selective hourglass mapping strategy based on a diffusion model, termed DiffUIR. Two novel considerations make our DiffUIR non-trivial. Firstly, we equip the model with strong condition guidance to obtain an accurate generation direction of the diffusion model (selective). More importantly, DiffUIR integrates a flexible shared distribution term (SDT) into the diffusion algorithm elegantly and naturally, which gradually maps different distributions into a shared one. In the reverse process, combined with SDT and strong condition guidance, DiffUIR iteratively guides the shared distribution to the task-specific distribution with high image quality (hourglass). Without bells and whistles, by only modifying the mapping strategy, we achieve state-of-the-art performance on five image restoration tasks and 22 benchmarks in the universal setting and zero-shot generalization setting. Surprisingly, by only using a lightweight model (only 0.89M), we could achieve outstanding performance. The source code and pre-trained models are available at https://github.com/iSEE-Laboratory/DiffUIR

Title: CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion

Authors: Xiaoyu Wu, Yang Hua, Chumeng Liang, Jiaru Zhang, Hao Wang, Tao Song, Haibing Guan
Subjects: cs.CV, cs.AI, cs.CR, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11162
Pdf URL: https://arxiv.org/pdf/2403.11162
Copy Paste: [[2403.11162]] CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion(https://arxiv.org/abs/2403.11162)
Keywords: diffusion
Abstract: Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot generation where a pretrained model is fine-tuned on a small set of images to capture a specific style or object. Despite their success, concerns exist about potential copyright violations stemming from the use of unauthorized data in this process. In response, we present Contrasting Gradient Inversion for Diffusion Models (CGI-DM), a novel method featuring vivid visual representations for digital copyright authentication. Our approach involves removing partial information of an image and recovering missing details by exploiting conceptual differences between the pretrained and fine-tuned models. We formulate the differences as KL divergence between latent variables of the two models when given the same input image, which can be maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). The similarity between original and recovered images serves as a strong indicator of potential infringements. Extensive experiments on the WikiArt and Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital copyright authentication, surpassing alternative validation techniques. Code implementation is available at https://github.com/Nicholas0228/Revelio.
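
Illustrative note (not from the paper): the abstract mentions maximizing an objective with Projected Gradient Descent (PGD) but does not specify the constraint set or step rule. The sketch below shows one common form, signed-gradient ascent projected onto an L-infinity ball around the starting point; that choice, the function name `pgd_maximize`, and the toy quadratic objective are all assumptions, and the KL objective and Monte Carlo sampling of CGI-DM are not reproduced here.

```python
import numpy as np


def pgd_maximize(grad_fn, x0, eps, step, n_steps):
    """Ascend along the gradient sign, then project back into the eps-ball.

    grad_fn: returns the gradient of the objective at x (hypothetical callable).
    eps:     radius of the L-infinity ball around x0 (assumed constraint).
    """
    x = x0.copy()
    for _ in range(n_steps):
        x = x + step * np.sign(grad_fn(x))   # ascent step
        x = np.clip(x, x0 - eps, x0 + eps)   # projection onto the ball
    return x


# Toy objective f(x) = -||x - target||^2, whose gradient is -2 (x - target).
target = np.array([1.0, -1.0, 0.05])
x_start = np.zeros(3)
x_adv = pgd_maximize(lambda x: -2.0 * (x - target), x_start,
                     eps=0.1, step=0.02, n_steps=20)
```

In this toy run the first two coordinates are stopped by the projection at the boundary of the ball, while the third stays inside it because its optimum lies within the allowed radius.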

Title: Pencil: Private and Extensible Collaborative Learning without the Non-Colluding Assumption

Authors: Xuanqi Liu, Zhuotao Liu, Qi Li, Ke Xu, Mingwei Xu
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11166
Pdf URL: https://arxiv.org/pdf/2403.11166
Copy Paste: [[2403.11166]] Pencil: Private and Extensible Collaborative Learning without the Non-Colluding Assumption(https://arxiv.org/abs/2403.11166)
Keywords: secure, security, privacy, attack, federate
Abstract: The escalating focus on data privacy poses significant challenges for collaborative neural network training, where data ownership and model training/deployment responsibilities reside with distinct entities. Our community has made substantial contributions to addressing this challenge, proposing various approaches such as federated learning (FL) and privacy-preserving machine learning based on cryptographic constructs like homomorphic encryption (HE) and secure multiparty computation (MPC). However, FL completely overlooks model privacy, and HE has limited extensibility (confined to only one data provider). While the state-of-the-art MPC frameworks provide reasonable throughput and simultaneously ensure model/data privacy, they rely on a critical non-colluding assumption on the computing servers, and relaxing this assumption is still an open problem. In this paper, we present Pencil, the first private training framework for collaborative learning that simultaneously offers data privacy, model privacy, and extensibility to multiple data providers, without relying on the non-colluding assumption. Our fundamental design principle is to construct the n-party collaborative training protocol based on an efficient two-party protocol, while ensuring that switching to different data providers during model training introduces no extra cost. We introduce several novel cryptographic protocols to realize this design principle and conduct a rigorous security and privacy analysis. Our comprehensive evaluations of Pencil demonstrate that (i) models trained in plaintext and models trained privately using Pencil exhibit nearly identical test accuracies; (ii) the training overhead of Pencil is greatly reduced: Pencil achieves 10 to 260x higher throughput and 2 orders of magnitude less communication than prior art; (iii) Pencil is resilient against both existing and adaptive (white-box) attacks.

Title: Correcting misinformation on social media with a large language model

Authors: Xinyi Zhou, Ashish Sharma, Amy X. Zhang, Tim Althoff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11169
Pdf URL: https://arxiv.org/pdf/2403.11169
Copy Paste: [[2403.11169]] Correcting misinformation on social media with a large language model(https://arxiv.org/abs/2403.11169)
Keywords: large language model
Abstract: Misinformation undermines public trust in science and democracy, particularly on social media where inaccuracies can spread rapidly. Experts and laypeople have been shown to be effective in correcting misinformation by manually identifying and explaining inaccuracies. Nevertheless, this approach is difficult to scale, a concern as technologies like large language models (LLMs) make misinformation easier to produce. LLMs also have versatile capabilities that could accelerate misinformation correction; however, they struggle due to a lack of recent information, a tendency to produce plausible but false content and references, and limitations in addressing multimodal information. To address these issues, we propose MUSE, an LLM augmented with access to and credibility evaluation of up-to-date information. By retrieving contextual evidence and refutations, MUSE can provide accurate and trustworthy explanations and references. It also describes visuals and conducts multimodal searches for correcting multimodal misinformation. We recruit fact-checking and journalism experts to evaluate corrections to real social media posts across 13 dimensions, ranging from the factuality of explanation to the relevance of references. The results demonstrate MUSE's ability to correct misinformation promptly after appearing on social media; overall, MUSE outperforms GPT-4 by 37% and even high-quality corrections from laypeople by 29%. This work underscores the potential of LLMs to combat real-world misinformation effectively and efficiently.

Title: A Tip for IOTA Privacy: IOTA Light Node Deanonymization via Tip Selection

Authors: Hojung Yang, Suhyeon Lee, Seungjoo Kim
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.11171
Pdf URL: https://arxiv.org/pdf/2403.11171
Copy Paste: [[2403.11171]] A Tip for IOTA Privacy: IOTA Light Node Deanonymization via Tip Selection(https://arxiv.org/abs/2403.11171)
Keywords: privacy, attack
Abstract: IOTA is a distributed ledger technology that uses a Directed Acyclic Graph (DAG) structure called the Tangle. It is known for its efficiency and is widely used in the Internet of Things (IoT) environment. Tangle can be configured by utilizing the tip selection process. Due to performance issues with light nodes, full nodes are being asked to perform the tip selections of light nodes. However, in this paper, we demonstrate that tip selection can be exploited to compromise users' privacy. An adversary full node can associate a transaction with the identity of a light node by comparing the light node's request with its ledger. We show that these types of attacks are viable not only in the current IOTA environment but also in IOTA 2.0 and the privacy improvements being studied. We also provide solutions to mitigate these attacks and propose ways to enhance anonymity in the IOTA network while maintaining efficiency and scalability.

Title: Artifact Feature Purification for Cross-domain Detection of AI-generated Images

Authors: Zheling Meng, Bo Peng, Jing Dong, Tieniu Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11172
Pdf URL: https://arxiv.org/pdf/2403.11172
Copy Paste: [[2403.11172]] Artifact Feature Purification for Cross-domain Detection of AI-generated Images(https://arxiv.org/abs/2403.11172)
Keywords: security, extraction, diffusion
Abstract: In the era of AIGC, the fast development of visual content generation technologies, such as diffusion models, brings potential security risks to our society. Existing generated image detection methods suffer from a performance drop when faced with out-of-domain generators and image scenes. To relieve this problem, we propose Artifact Purification Network (APN) to facilitate the artifact extraction from generated images through the explicit and implicit purification processes. For the explicit one, a suspicious frequency-band proposal method and a spatial feature decomposition method are proposed to extract artifact-related features. For the implicit one, a training strategy based on mutual information estimation is proposed to further purify the artifact-related features. Experiments show that for cross-generator detection, the average accuracy of APN is 5.6% ~ 16.4% higher than the previous 10 methods on the GenImage dataset and 1.7% ~ 50.1% higher on the DiffusionForensics dataset. For cross-scene detection, APN maintains its high performance. Via visualization analysis, we find that the proposed method extracts flexible forgery patterns and condenses the forgery information diluted in irrelevant features. We also find that the artifact features APN focuses on across generators and scenes are global and diverse. The code will be available on GitHub.

Title: Quality-Aware Image-Text Alignment for Real-World Image Quality Assessment

Authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11176
Pdf URL: https://arxiv.org/pdf/2403.11176
Copy Paste: [[2403.11176]] Quality-Aware Image-Text Alignment for Real-World Image Quality Assessment(https://arxiv.org/abs/2403.11176)
Keywords: robust, explainability
Abstract: No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods to measure image quality in alignment with human perception when a high-quality reference image is unavailable. The reliance on annotated Mean Opinion Scores (MOS) in the majority of state-of-the-art NR-IQA approaches limits their scalability and broader applicability to real-world scenarios. To overcome this limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based self-supervised opinion-unaware method that does not require labeled MOS. In particular, we introduce a quality-aware image-text alignment strategy to make CLIP generate representations that correlate with the inherent quality of the images. Starting from pristine images, we synthetically degrade them with increasing levels of intensity. Then, we train CLIP to rank these degraded images based on their similarity to quality-related antonym text prompts, while guaranteeing consistent representations for images with comparable quality. Our method achieves state-of-the-art performance on several datasets with authentic distortions. Moreover, despite not requiring MOS, QualiCLIP outperforms supervised methods when their training dataset differs from the testing one, thus proving to be more suitable for real-world scenarios. Furthermore, our approach demonstrates greater robustness and improved explainability than competing methods. The code and the model are publicly available at https://github.com/miccunifi/QualiCLIP.

Title: usfAD Based Effective Unknown Attack Detection Focused IDS Framework

Authors: Md. Ashraf Uddin, Sunil Aryal, Mohamed Reda Bouadjenek, Muna Al-Hawawreh, Md. Alamin Talukder
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11180
Pdf URL: https://arxiv.org/pdf/2403.11180
Copy Paste: [[2403.11180]] usfAD Based Effective Unknown Attack Detection Focused IDS Framework(https://arxiv.org/abs/2403.11180)
Keywords: protect, attack, robust
Abstract: The rapid expansion of varied network systems, including the Internet of Things (IoT) and Industrial Internet of Things (IIoT), has led to an increasing range of cyber threats. Ensuring robust protection against these threats necessitates the implementation of an effective Intrusion Detection System (IDS). For more than a decade, researchers have delved into supervised machine learning techniques to develop IDS to classify normal and attack traffic. However, building effective IDS models using supervised learning requires a substantial number of benign and attack samples. Collecting a sufficient number of attack samples from real-life scenarios is not possible since cyber attacks occur only occasionally. Further, IDS trained and tested on known datasets fail to detect zero-day or unknown attacks due to the swift evolution of attack patterns. To address this challenge, we put forth two strategies for semi-supervised learning based IDS where training samples of attacks are not required: 1) training a supervised machine learning model using randomly and uniformly dispersed synthetic attack samples; 2) building a One Class Classification (OCC) model that is trained exclusively on benign network traffic. We have implemented both approaches and compared their performances using 10 recent benchmark IDS datasets. Our findings demonstrate that the OCC model based on the state-of-the-art anomaly detection technique called usfAD significantly outperforms conventional supervised classification and other OCC based techniques when trained and tested considering real-life scenarios, particularly to detect previously unseen attacks.
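
The second strategy in the abstract above (a one-class model trained only on benign traffic) can be illustrated with a minimal sketch. This is not usfAD, whose details the abstract does not give; it swaps in a plain per-feature z-score profile as a stand-in. The function names, the two toy features, and the threshold of 3.0 are all assumptions made for illustration.

```python
import statistics

def fit_benign_profile(benign_rows):
    """Learn per-feature mean and standard deviation from benign samples only."""
    columns = list(zip(*benign_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 when a feature has zero spread, to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def anomaly_score(row, profile):
    """Largest absolute z-score of the row across all features."""
    means, stds = profile
    return max(abs(x - m) / s for x, m, s in zip(row, means, stds))

def is_attack(row, profile, threshold=3.0):
    """Flag a sample whose score exceeds the (assumed) threshold."""
    return anomaly_score(row, profile) > threshold

# Hypothetical benign traffic: [packets per second, fraction of SYN packets].
benign = [[100.0, 0.20], [110.0, 0.25], [95.0, 0.18], [105.0, 0.22]]
profile = fit_benign_profile(benign)

print(is_attack([102.0, 0.21], profile))  # close to the benign profile
print(is_attack([900.0, 0.90], profile))  # far outside the benign profile
```

No attack sample is needed at training time; anything sufficiently far from the benign profile is treated as suspicious, which is the property the abstract relies on for unseen attacks.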

Title: DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation

Authors: Yuanchen Wu, Xichen Ye, Kequan Yang, Jide Li, Xiaoqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11184
Pdf URL: https://arxiv.org/pdf/2403.11184
Copy Paste: [[2403.11184]] DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation(https://arxiv.org/abs/2403.11184)
Keywords: robust, segmentation
Abstract: Recently, One-stage Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has gained increasing interest due to simplification over its cumbersome multi-stage counterpart. Limited by the inherent ambiguity of Class Activation Map (CAM), we observe that one-stage pipelines often encounter confirmation bias caused by incorrect CAM pseudo-labels, impairing their final segmentation performance. Although recent works discard many unreliable pseudo-labels to implicitly alleviate this issue, they fail to exploit sufficient supervision for their models. To this end, we propose a dual student framework with trustworthy progressive learning (DuPL). Specifically, we propose a dual student network with a discrepancy loss to yield diverse CAMs for each sub-net. The two sub-nets generate supervision for each other, mitigating the confirmation bias caused by learning their own incorrect pseudo-labels. In this process, we progressively introduce more trustworthy pseudo-labels to be involved in the supervision through dynamic threshold adjustment with an adaptive noise filtering strategy. Moreover, we believe that every pixel, even if discarded from supervision due to its unreliability, is important for WSSS. Thus, we develop consistency regularization on these discarded regions, providing supervision of every pixel. Experimental results demonstrate the superiority of the proposed DuPL over the recent state-of-the-art alternatives on PASCAL VOC 2012 and MS COCO datasets. Code is available at https://github.com/Wu0409/DuPL.

Title: NetTrack: Tracking Highly Dynamic Objects with a Net

Authors: Guangze Zheng, Shijie Lin, Haobo Zuo, Changhong Fu, Jia Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11186
Pdf URL: https://arxiv.org/pdf/2403.11186
Copy Paste: [[2403.11186]] NetTrack: Tracking Highly Dynamic Objects with a Net(https://arxiv.org/abs/2403.11186)
Keywords: robust
Abstract: The complex dynamicity of open-world objects presents non-negligible challenges for multi-object tracking (MOT), often manifested as severe deformations, fast motion, and occlusions. Most methods that solely depend on coarse-grained object cues, such as boxes and the overall appearance of the object, are susceptible to degradation due to distorted internal relationships of dynamic objects. To address this problem, this work proposes NetTrack, an efficient, generic, and affordable tracking framework to introduce fine-grained learning that is robust to dynamicity. Specifically, NetTrack constructs a dynamicity-aware association with a fine-grained Net, leveraging point-level visual cues. Correspondingly, a fine-grained sampler and matching method have been incorporated. Furthermore, NetTrack learns object-text correspondence for fine-grained localization. To evaluate MOT in extremely dynamic open-world scenarios, a bird flock tracking (BFT) dataset is constructed, which exhibits high dynamicity with diverse species and open-world scenarios. Comprehensive evaluation on BFT validates the effectiveness of fine-grained learning on object dynamicity, and thorough transfer experiments on challenging open-world benchmarks, i.e., TAO, TAO-OW, AnimalTrack, and GMOT-40, validate the strong generalization ability of NetTrack even without finetuning. Project page: https://george-zhuang.github.io/nettrack/.

Title: Boosting Semi-Supervised Temporal Action Localization by Learning from Non-Target Classes

Authors: Kun Xia, Le Wang, Sanping Zhou, Gang Hua, Wei Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11189
Pdf URL: https://arxiv.org/pdf/2403.11189
Copy Paste: [[2403.11189]] Boosting Semi-Supervised Temporal Action Localization by Learning from Non-Target Classes(https://arxiv.org/abs/2403.11189)
Keywords: robust
Abstract: The crux of semi-supervised temporal action localization (SS-TAL) lies in excavating valuable information from abundant unlabeled videos. However, current approaches predominantly focus on building models that are robust to the error-prone target class (i.e., the predicted class with the highest confidence) while ignoring informative semantics within non-target classes. This paper approaches SS-TAL from a novel perspective by advocating for learning from non-target classes, transcending the conventional focus solely on the target class. The proposed approach involves partitioning the label space of the predicted class distribution into distinct subspaces: target class, positive classes, negative classes, and ambiguous classes, aiming to mine both positive and negative semantics that are absent in the target class, while excluding ambiguous classes. To this end, we first devise innovative strategies to adaptively select high-quality positive and negative classes from the label space, by modeling both the confidence and rank of a class in relation to those of the target class. Then, we introduce novel positive and negative losses designed to guide the learning process, pushing predictions closer to positive classes and away from negative classes. Finally, the positive and negative processes are integrated into a hybrid positive-negative learning framework, facilitating the utilization of non-target classes in both labeled and unlabeled videos. Experimental results on THUMOS14 and ActivityNet v1.3 demonstrate the superiority of the proposed method over prior state-of-the-art approaches.

Title: MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation

Authors: Yasufumi Kawano, Yoshimitsu Aoki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11194
Pdf URL: https://arxiv.org/pdf/2403.11194
Copy Paste: [[2403.11194]] MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation(https://arxiv.org/abs/2403.11194)
Keywords: diffusion, segmentation
Abstract: Semantic segmentation is essential in computer vision for various applications, yet traditional approaches face significant challenges, including the high cost of annotation and extensive training for supervised learning. Additionally, due to the limited predefined categories in supervised learning, models typically struggle with infrequent classes and are unable to predict novel classes. To address these limitations, we propose MaskDiffusion, an innovative approach that leverages pretrained frozen Stable Diffusion to achieve open-vocabulary semantic segmentation without the need for additional training or annotation, leading to improved performance compared to similar methods. We also demonstrate the superior performance of MaskDiffusion in handling open vocabularies, including fine-grained and proper noun-based categories, thus expanding the scope of segmentation applications. Overall, our MaskDiffusion shows significant qualitative and quantitative improvements in contrast to other comparable unsupervised segmentation methods, i.e. on the Potsdam dataset (+10.5 mIoU compared to GEM) and COCO-Stuff (+14.8 mIoU compared to DiffSeg). All code and data will be released at https://github.com/Valkyrja3607/MaskDiffusion.

Title: TAG: Guidance-free Open-Vocabulary Semantic Segmentation

Authors: Yasufumi Kawano, Yoshimitsu Aoki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11197
Pdf URL: https://arxiv.org/pdf/2403.11197
Copy Paste: [[2403.11197]] TAG: Guidance-free Open-Vocabulary Semantic Segmentation(https://arxiv.org/abs/2403.11197)
Keywords: segmentation
Abstract: Semantic segmentation is a crucial task in computer vision, where each pixel in an image is classified into a category. However, traditional methods face significant challenges, including the need for pixel-level annotations and extensive training. Furthermore, because supervised learning uses a limited set of predefined categories, models typically struggle with rare classes and cannot recognize new ones. Unsupervised and open-vocabulary segmentation, proposed to tackle these issues, faces challenges, including the inability to assign specific class labels to clusters and the necessity of user-provided text queries for guidance. In this context, we propose a novel approach, TAG, which achieves Training, Annotation, and Guidance-free open-vocabulary semantic segmentation. TAG utilizes pre-trained models such as CLIP and DINO to segment images into meaningful categories without additional training or dense annotations. It retrieves class labels from an external database, providing flexibility to adapt to new scenarios. Our TAG achieves state-of-the-art results on PascalVOC, PascalContext and ADE20K for open-vocabulary segmentation without given class names, e.g., an improvement of +15.3 mIoU on PascalVOC. All code and data will be released at https://github.com/Valkyrja3607/TAG.

Title: TRELM: Towards Robust and Efficient Pre-training for Knowledge-Enhanced Language Models

Authors: Junbing Yan, Chengyu Wang, Taolin Zhang, Xiaofeng He, Jun Huang, Longtao Huang, Hui Xue, Wei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11203
Pdf URL: https://arxiv.org/pdf/2403.11203
Copy Paste: [[2403.11203]] TRELM: Towards Robust and Efficient Pre-training for Knowledge-Enhanced Language Models(https://arxiv.org/abs/2403.11203)
Keywords: robust
Abstract: Knowledge-enhanced pre-trained language models (KEPLMs) are pre-trained models that utilize external knowledge to enhance language understanding. Previous language models facilitated knowledge acquisition by incorporating knowledge-related pre-training tasks learned from relation triples in knowledge graphs. However, these models do not prioritize learning embeddings for entity-related tokens. Moreover, updating the entire set of parameters in KEPLMs is computationally demanding. This paper introduces TRELM, a Robust and Efficient Pre-training framework for Knowledge-Enhanced Language Models. We observe that entities in text corpora usually follow the long-tail distribution, where the representations of some entities are poorly optimized and hinder the pre-training process for KEPLMs. To tackle this, we employ a robust approach to inject knowledge triples and employ a knowledge-augmented memory bank to capture valuable information. Furthermore, updating a small subset of neurons in the feed-forward networks (FFNs) that store factual knowledge is both sufficient and efficient. Specifically, we utilize dynamic knowledge routing to identify knowledge paths in FFNs and selectively update parameters during pre-training. Experimental results show that TRELM reduces pre-training time by at least 50% and outperforms other KEPLMs in knowledge probing tasks and multiple knowledge-aware language understanding tasks.

Title: MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data

Authors: Paul S. Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, Tanishq Mathew Abraham
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2403.11207
Pdf URL: https://arxiv.org/pdf/2403.11207
Copy Paste: [[2403.11207]] MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data(https://arxiv.org/abs/2403.11207)
Keywords: diffusion
Abstract: Reconstructions of visual perception from brain activity have improved tremendously, but the practical utility of such methods has been limited. This is because such models are trained independently per subject where each subject requires dozens of hours of expensive fMRI training data to attain high-quality results. The present work showcases high-quality reconstructions using only 1 hour of fMRI training data. We pretrain our model across 7 subjects and then fine-tune on minimal data from a new subject. Our novel functional alignment procedure linearly maps all brain data to a shared-subject latent space, followed by a shared non-linear mapping to CLIP image space. We then map from CLIP space to pixel space by fine-tuning Stable Diffusion XL to accept CLIP latents as inputs instead of text. This approach improves out-of-subject generalization with limited training data and also attains state-of-the-art image retrieval and reconstruction metrics compared to single-subject approaches. MindEye2 demonstrates how accurate reconstructions of perception are possible from a single visit to the MRI facility. All code is available on GitHub.

Title: THOR: Text to Human-Object Interaction Diffusion via Relation Intervention

Authors: Qianyang Wu, Ye Shi, Xiaoshui Huang, Jingyi Yu, Lan Xu, Jingya Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11208
Pdf URL: https://arxiv.org/pdf/2403.11208
Copy Paste: [[2403.11208]] THOR: Text to Human-Object Interaction Diffusion via Relation Intervention(https://arxiv.org/abs/2403.11208)
Keywords: diffusion
Abstract: This paper addresses new methodologies to deal with the challenging task of generating dynamic Human-Object Interactions from textual descriptions (Text2HOI). While most existing works assume interactions with limited body parts or static objects, our task involves addressing the variation in human motion, the diversity of object shapes, and the semantic vagueness of object motion simultaneously. To tackle this, we propose a novel Text-guided Human-Object Interaction diffusion model with Relation Intervention (THOR). THOR is a cohesive diffusion model equipped with a relation intervention mechanism. In each diffusion step, we initiate text-guided human and object motion and then leverage human-object relations to intervene in object motion. This intervention enhances the spatial-temporal relations between humans and objects, with human-centric interaction representation providing additional guidance for synthesizing consistent motion from text. To achieve more reasonable and realistic results, interaction losses are introduced at different levels of motion granularity. Moreover, we construct Text-BEHAVE, a Text2HOI dataset that seamlessly integrates textual descriptions with the currently largest publicly available 3D HOI dataset. Both quantitative and qualitative experiments demonstrate the effectiveness of our proposed model.

Title: RCdpia: A Renal Carcinoma Digital Pathology Image Annotation dataset based on pathologists

Authors: Qingrong Sun, Weixiang Zhong, Jie Zhou, Chong Lai, Xiaodong Teng, Maode Lai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11211
Pdf URL: https://arxiv.org/pdf/2403.11211
Copy Paste: [[2403.11211]] RCdpia: A Renal Carcinoma Digital Pathology Image Annotation dataset based on pathologists(https://arxiv.org/abs/2403.11211)
Keywords: segmentation
Abstract: The annotation of digital pathological slide data for renal cell carcinoma is of paramount importance for correct diagnosis by artificial intelligence models due to the heterogeneous nature of the tumor. This process not only facilitates a deeper understanding of renal cell cancer heterogeneity but also aims to minimize noise in the data for more accurate studies. To enhance the applicability of the data, two pathologists were enlisted to meticulously curate, screen, and label a kidney cancer pathology image dataset from The Cancer Genome Atlas Program (TCGA) database. Subsequently, a ResNet model was developed to validate the annotated dataset against an additional dataset from the First Affiliated Hospital of Zhejiang University. Based on these results, we have meticulously compiled the TCGA digital pathological dataset with independent labeling of tumor regions and adjacent areas (RCdpia), which includes 109 cases of kidney chromophobe cell carcinoma, 486 cases of kidney clear cell carcinoma, and 292 cases of kidney papillary cell carcinoma. This dataset is now publicly accessible at this http URL. Furthermore, model analysis has revealed significant discrepancies in predictive outcomes when applying the same model to datasets from different centers. Leveraging the RCdpia, we can now develop more precise digital pathology artificial intelligence models for tasks such as normalization, classification, and segmentation. These advancements underscore the potential for more nuanced and accurate AI applications in the field of digital pathology.

Title: SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream

Authors: Lin Zhu, Kangmin Jia, Yifan Zhao, Yunshan Qi, Lizhi Wang, Hua Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11222
Pdf URL: https://arxiv.org/pdf/2403.11222
Copy Paste: [[2403.11222]] SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream(https://arxiv.org/abs/2403.11222)
Keywords: robust
Abstract: Spike cameras, leveraging spike-based integration sampling and high temporal resolution, offer distinct advantages over standard cameras. However, existing approaches reliant on spike cameras often assume optimal illumination, a condition frequently unmet in real-world scenarios. To address this, we introduce SpikeNeRF, the first work that derives a NeRF-based volumetric scene representation from spike camera data. Our approach leverages NeRF's multi-view consistency to establish robust self-supervision, effectively eliminating erroneous measurements and uncovering coherent structures within exceedingly noisy input amidst diverse real-world illumination scenarios. The framework comprises two core elements: a spike generation model incorporating an integrate-and-fire neuron layer and parameters accounting for non-idealities, such as threshold variation, and a spike rendering loss capable of generalizing across varying illumination conditions. We describe how to effectively optimize neural radiance fields to render photorealistic novel views from the novel continuous spike stream, demonstrating advantages over other vision sensors in certain scenes. Empirical evaluations conducted on both real and novel realistically simulated sequences affirm the efficacy of our methodology. The dataset and source code are released at https://github.com/BIT-Vision/SpikeNeRF.

Title: Cheap Ways of Extracting Clinical Markers from Texts

Authors: Anastasia Sandu, Teodor Mihailescu, Sergiu Nisioi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11227
Pdf URL: https://arxiv.org/pdf/2403.11227
Copy Paste: [[2403.11227]] Cheap Ways of Extracting Clinical Markers from Texts(https://arxiv.org/abs/2403.11227)
Keywords: large language model
Abstract: This paper describes the work of the UniBuc Archaeology team for CLPsych's 2024 Shared Task, which involved finding evidence within the text supporting the assigned suicide risk level. Two types of evidence were required: highlights (extracting relevant spans within the text) and summaries (aggregating evidence into a synthesis). Our work focuses on evaluating Large Language Models (LLMs) as opposed to an alternative method that is much more memory and resource efficient. The first approach employs a good old-fashioned machine learning (GOML) pipeline consisting of a tf-idf vectorizer with a logistic regression classifier, whose representative features are used to extract relevant highlights. The second, more resource-intensive approach uses an LLM for generating the summaries and is guided by chain-of-thought to provide sequences of text indicating clinical markers.
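
The first approach in the abstract above (tf-idf features, a logistic regression classifier, and highlights taken from the most representative features) can be sketched in plain Python. This is an illustrative toy, not the team's pipeline: the whitespace tokenizer, the smoothed idf formula, the training settings, and the four example sentences are all assumptions, and a real system would more likely use a library implementation.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(tokenize(d)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(text, idf):
    """L2-normalised sparse tf-idf vector; unseen tokens are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Binary logistic regression trained with plain stochastic gradient descent."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            z = bias + sum(weights[t] * v for t, v in vec.items())
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for t, v in vec.items():
                weights[t] -= lr * err * v
            bias -= lr * err
    return weights, bias

def highlights(text, weights, top_k=2):
    """Tokens of `text` with the largest positive learned weights."""
    toks = {t for t in tokenize(text) if weights.get(t, 0.0) > 0}
    return sorted(toks, key=lambda t: -weights[t])[:top_k]

# Hypothetical toy data: label 1 marks text that carries the marker of interest.
docs = ["i feel hopeless and alone", "i feel happy and rested",
        "everything seems hopeless lately", "lunch was happy and fun"]
labels = [1, 0, 1, 0]
idf = fit_idf(docs)
weights, bias = train_logreg([tfidf(d, idf) for d in docs], labels)
print(highlights("so hopeless and tired", weights))
```

The design point is that the classifier's weights double as an explanation: tokens with large positive weights are the ones returned as highlights, so no separate extraction model is needed.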

Title: Concatenate, Fine-tuning, Re-training: A SAM-enabled Framework for Semi-supervised 3D Medical Image Segmentation

Authors: Shumeng Li, Lei Qi, Qian Yu, Jing Huo, Yinghuan Shi, Yang Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11229
Pdf URL: https://arxiv.org/pdf/2403.11229
Copy Paste: [[2403.11229]] Concatenate, Fine-tuning, Re-training: A SAM-enabled Framework for Semi-supervised 3D Medical Image Segmentation(https://arxiv.org/abs/2403.11229)
Keywords: robust, segmentation
Abstract: Segment Anything Model (SAM) fine-tuning has shown remarkable performance in medical image segmentation in a fully supervised manner, but requires precise annotations. To reduce the annotation cost and maintain satisfactory performance, in this work, we leverage the capabilities of SAM for establishing semi-supervised medical image segmentation models. Rethinking the requirements of effectiveness, efficiency, and compatibility, we propose a three-stage framework, i.e., Concatenate, Fine-tuning, and Re-training (CFR). The current fine-tuning approaches mostly involve 2D slice-wise fine-tuning that disregards the contextual information between adjacent slices. Our concatenation strategy mitigates the mismatch between natural and 3D medical images. The concatenated images are then used for fine-tuning SAM, providing robust initialization pseudo-labels. Afterwards, we train a 3D semi-supervised segmentation model while maintaining the same parameter size as a conventional segmenter such as V-Net. Our CFR framework is plug-and-play, and easily compatible with various popular semi-supervised methods. Extensive experiments validate that our CFR achieves significant improvements in both moderate annotation and scarce annotation across four datasets. In particular, the CFR framework improves the Dice score of Mean Teacher from 29.68% to 74.40% with only one labeled sample on the LA dataset.

Title: Compact 3D Gaussian Splatting For Dense Visual SLAM

Authors: Tianchen Deng, Yaohui Chen, Leyan Zhang, Jianfei Yang, Shenghai Yuan, Danwei Wang, Weidong Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11247
Pdf URL: https://arxiv.org/pdf/2403.11247
Copy Paste: [[2403.11247]] Compact 3D Gaussian Splatting For Dense Visual SLAM(https://arxiv.org/abs/2403.11247)
Keywords: robust
Abstract: Recent work has shown that 3D Gaussian-based SLAM enables high-quality reconstruction, accurate pose estimation, and real-time rendering of scenes. However, these approaches are built on a tremendous number of redundant 3D Gaussian ellipsoids, leading to high memory and storage costs, and slow training speed. To address this limitation, we propose a compact 3D Gaussian Splatting SLAM system that reduces the number and the parameter size of Gaussian ellipsoids. A sliding window-based masking strategy is first proposed to reduce the redundant ellipsoids. Then we observe that the covariance matrices (geometry) of most 3D Gaussian ellipsoids are extremely similar, which motivates a novel geometry codebook to compress 3D Gaussian geometric attributes, i.e., the parameters. Robust and accurate pose estimation is achieved by a global bundle adjustment method with reprojection loss. Extensive experiments demonstrate that our method achieves faster training and rendering speed while maintaining the state-of-the-art (SOTA) quality of the scene representation.
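
The codebook idea in the abstract above can be illustrated generically: if many ellipsoids have nearly the same geometry, each one can store a small index into a shared table instead of its own parameters. The sketch below is plain nearest-neighbour vector quantization under assumed inputs; how the paper builds and trains its codebook is not stated in the abstract, and the three-number geometry vectors and codebook entries here are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_index(vec, codebook):
    """Index of the codebook entry closest to `vec` in Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vec, codebook[i]))

def encode(geometries, codebook):
    """Replace each geometry vector by the index of its nearest codebook entry."""
    return [nearest_index(g, codebook) for g in geometries]

def decode(indices, codebook):
    """Recover approximate geometry vectors from the stored indices."""
    return [codebook[i] for i in indices]

# Hypothetical per-Gaussian scale vectors and a two-entry shared codebook.
codebook = [(1.0, 1.0, 1.0), (2.0, 0.5, 0.5)]
geometries = [(1.1, 0.9, 1.0), (1.9, 0.6, 0.5), (1.0, 1.0, 1.05)]

indices = encode(geometries, codebook)
print(indices)                    # one small integer per Gaussian
print(decode(indices, codebook))  # approximate geometry after compression
```

Storage drops from one full parameter vector per ellipsoid to one integer per ellipsoid plus the shared table, at the cost of a quantization error that stays small when the geometries really are similar.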

Title: Uncertainty-Aware Pseudo-Label Filtering for Source-Free Unsupervised Domain Adaptation

Authors: Xi Chen, Haosen Yang, Huicong Zhang, Hongxun Yao, Xiatian Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11256
Pdf URL: https://arxiv.org/pdf/2403.11256
Copy Paste: [[2403.11256]] Uncertainty-Aware Pseudo-Label Filtering for Source-Free Unsupervised Domain Adaptation(https://arxiv.org/abs/2403.11256)
Keywords: robust
Abstract: Source-free unsupervised domain adaptation (SFUDA) aims to enable the utilization of a pre-trained source model in an unlabeled target domain without access to source data. Self-training is a way to solve SFUDA, where confident target samples are iteratively selected as pseudo-labeled samples to guide target model learning. However, prior heuristic noisy pseudo-label filtering methods all involve introducing extra models, which are sensitive to model assumptions and may introduce additional errors or mislabeling. In this work, we propose a method called Uncertainty-aware Pseudo-label-filtering Adaptation (UPA) to efficiently address this issue in a coarse-to-fine manner. Specifically, we first introduce a sample selection module named Adaptive Pseudo-label Selection (APS), which is responsible for filtering noisy pseudo labels. The APS utilizes a simple sample uncertainty estimation method by aggregating knowledge from neighboring samples, and confident samples are selected as clean pseudo-labeled samples. Additionally, we incorporate Class-Aware Contrastive Learning (CACL) to mitigate the memorization of pseudo-label noise by learning robust pair-wise representation supervised by pseudo labels. Through extensive experiments conducted on three widely used benchmarks, we demonstrate that our proposed method achieves competitive performance on par with state-of-the-art SFUDA methods. Code is available at https://github.com/chenxi52/UPA.

Title: Understanding Diffusion Models by Feynman's Path Integral

Authors: Yuji Hirono, Akinori Tanaka, Kenji Fukushima
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, hep-th
Abstract URL: https://arxiv.org/abs/2403.11262
Pdf URL: https://arxiv.org/pdf/2403.11262
Copy Paste: [[2403.11262]] Understanding Diffusion Models by Feynman's Path Integral(https://arxiv.org/abs/2403.11262)
Keywords: diffusion, generative
Abstract: Score-based diffusion models have proven effective in image generation and have gained widespread usage; however, the underlying factors contributing to the performance disparity between stochastic and deterministic (i.e., the probability flow ODEs) sampling schemes remain unclear. We introduce a novel formulation of diffusion models using Feynman's path integral, which is a formulation originally developed for quantum physics. We find that this formulation provides comprehensive descriptions of score-based generative models, and demonstrate the derivation of backward stochastic differential equations and loss functions. The formulation accommodates an interpolating parameter connecting stochastic and deterministic sampling schemes, and we identify this parameter as a counterpart of Planck's constant in quantum physics. This analogy enables us to apply the Wentzel-Kramers-Brillouin (WKB) expansion, a well-established technique in quantum physics, for evaluating the negative log-likelihood to assess the performance disparity between stochastic and deterministic sampling schemes.

Title: Stylized Face Sketch Extraction via Generative Prior with Limited Data

Authors: Kwan Yun, Kwanggyoon Seo, Chang Wook Seo, Soyeon Yoon, Seongcheol Kim, Soohyun Ji, Amirsaman Ashtari, Junyong Noh
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2403.11263
Pdf URL: https://arxiv.org/pdf/2403.11263
Copy Paste: [[2403.11263]] Stylized Face Sketch Extraction via Generative Prior with Limited Data(https://arxiv.org/abs/2403.11263)
Keywords: extraction, generative
Abstract: Facial sketches are both a concise way of showing the identity of a person and a means to express artistic intention. While a few techniques have recently emerged that allow sketches to be extracted in different styles, they typically rely on a large amount of data that is difficult to obtain. Here, we propose StyleSketch, a method for extracting high-resolution stylized sketches from a face image. Using the rich semantics of the deep features from a pretrained StyleGAN, we are able to train a sketch generator with 16 pairs of face and the corresponding sketch images. The sketch generator utilizes part-based losses with two-stage learning for fast convergence during training for high-quality sketch extraction. Through a set of comparisons, we show that StyleSketch outperforms existing state-of-the-art sketch extraction methods and few-shot image adaptation methods for the task of extracting high-resolution abstract face sketches. We further demonstrate the versatility of StyleSketch by extending its use to other domains and explore the possibility of semantic editing. The project page can be found at https://kwanyun.github.io/stylesketch_project.

Title: Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation

Authors: Silvia Corbara, Alejandro Moreo
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11265
Pdf URL: https://arxiv.org/pdf/2403.11265
Copy Paste: [[2403.11265]] Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation(https://arxiv.org/abs/2403.11265)
Keywords: attack, transformer, generative
Abstract: Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else. It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author. In this paper, we investigate the potential benefits of augmenting the classifier training set with (negative) synthetic examples. These synthetic examples are generated to imitate the style of the author of interest. We analyze the improvements in classifier prediction that this augmentation brings to bear in the task of AV in an adversarial setting. In particular, we experiment with three different generator architectures (one based on Recurrent Neural Networks, another based on small-scale transformers, and another based on the popular GPT model) and with two training strategies (one inspired by standard Language Models, and another inspired by Wasserstein Generative Adversarial Networks). We evaluate our hypothesis on five datasets (three of which have been specifically collected to represent an adversarial setting) and using two learning algorithms for the AV classifier (Support Vector Machines and Convolutional Neural Networks). This experimentation has yielded negative results, revealing that, although our methodology proves effective in many adversarial settings, its benefits are too sporadic for a pragmatical application.

Title: BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis

Authors: Lutao Jiang, Lin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11273
Pdf URL: https://arxiv.org/pdf/2403.11273
Copy Paste: [[2403.11273]] BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis(https://arxiv.org/abs/2403.11273)
Keywords: generative
Abstract: Text-to-3D synthesis has recently seen intriguing advances by combining the text-to-image models with 3D representation methods, e.g., Gaussian Splatting (GS), via Score Distillation Sampling (SDS). However, existing methods suffer from low efficiency because they require per-prompt optimization for a single 3D object. A paradigm shift from per-prompt optimization to one-stage generation for any unseen text prompt is therefore imperative, yet it remains challenging. One hurdle is how to directly generate a set of millions of 3D Gaussians to represent a 3D object. This paper presents BrightDreamer, an end-to-end single-stage approach that can achieve generalizable and fast (77 ms) text-to-3D generation. Our key idea is to formulate the generation process as estimating the 3D deformation from an anchor shape with predefined positions. For this, we first propose a Text-guided Shape Deformation (TSD) network to predict the deformed shape and its new positions, used as the centers (one attribute) of 3D Gaussians. To estimate the other four attributes (i.e., scaling, rotation, opacity, and SH coefficient), we then design a novel Text-guided Triplane Generator (TTG) to generate a triplane representation for a 3D object. The center of each Gaussian enables us to transform the triplane feature into the four attributes. The generated 3D Gaussians can be finally rendered at 705 frames per second. Extensive experiments demonstrate the superiority of our method over existing methods. Also, BrightDreamer possesses a strong semantic understanding capability even for complex text prompts. The project code is available at https://vlislab22.github.io/BrightDreamer.

Title: Fast Personalized Text-to-Image Syntheses With Attention Injection

Authors: Yuxuan Zhang, Yiren Song, Jinpeng Yu, Han Pan, Zhongliang Jing
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11284
Pdf URL: https://arxiv.org/pdf/2403.11284
Copy Paste: [[2403.11284]] Fast Personalized Text-to-Image Syntheses With Attention Injection(https://arxiv.org/abs/2403.11284)
Keywords: diffusion
Abstract: Currently, personalized image generation methods mostly require considerable time to finetune and often overfit the concept resulting in generated images that are similar to custom concepts but difficult to edit by prompts. We propose an effective and fast approach that could balance the text-image consistency and identity consistency of the generated image and reference image. Our method can generate personalized images without any fine-tuning while maintaining the inherent text-to-image generation ability of diffusion models. Given a prompt and a reference image, we merge the custom concept into generated images by manipulating cross-attention and self-attention layers of the original diffusion model to generate personalized images that match the text description. Comprehensive experiments highlight the superiority of our method.

Title: Advanced Knowledge Extraction of Physical Design Drawings, Translation and conversion to CAD formats using Deep Learning

Authors: Jesher Joshua M, Ragav V, Syed Ibrahim S P
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11291
Pdf URL: https://arxiv.org/pdf/2403.11291
Copy Paste: [[2403.11291]] Advanced Knowledge Extraction of Physical Design Drawings, Translation and conversion to CAD formats using Deep Learning(https://arxiv.org/abs/2403.11291)
Keywords: extraction
Abstract: Maintaining, archiving, and using design drawings in physical form over long periods is cumbersome across industries. It is hard to extract information by simply scanning drawing sheets. Converting them to digital formats such as Computer-Aided Design (CAD), together with the needed knowledge extraction, can solve this problem. Converting these machine drawings to digital form is a crucial challenge that requires advanced techniques. This research proposes a methodology based on Deep Learning methods. The approach employs object detection models, such as YOLOv7 and Faster R-CNN, to detect physical drawing objects present in the images, followed by edge detection algorithms such as the Canny filter to extract and refine the identified lines from the drawing region, and curve detection techniques to detect circles. Ornaments (complex shapes) within the drawings are also extracted. To ensure comprehensive conversion, an Optical Character Recognition (OCR) tool is integrated to identify and extract the text elements from the drawings. The extracted data, which includes the lines, shapes, and text, is consolidated and stored in a structured comma-separated values (.csv) file format. The accuracy and efficiency of the conversion are evaluated. Through this, conversion can be automated to help organizations enhance their productivity, facilitate seamless collaboration, and preserve valuable design information in an easily accessible digital format. Overall, this study contributes to the advancement of CAD conversion, providing accurate results from the translation process. Future research can focus on handling diverse drawing types and on improving the accuracy of shape and line detection and extraction.

Title: A Modified Word Saliency-Based Adversarial Attack on Text Classification Models

Authors: Hetvi Waghela, Sneha Rakshit, Jaydip Sen
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11297
Pdf URL: https://arxiv.org/pdf/2403.11297
Copy Paste: [[2403.11297]] A Modified Word Saliency-Based Adversarial Attack on Text Classification Models(https://arxiv.org/abs/2403.11297)
Keywords: attack
Abstract: This paper introduces a novel adversarial attack method targeting text classification models, termed the Modified Word Saliency-based Adversarial Attack (MWSAA). The technique builds upon the concept of word saliency to strategically perturb input texts, aiming to mislead classification models while preserving semantic coherence. By refining the traditional adversarial attack approach, MWSAA significantly enhances its efficacy in evading detection by classification systems. The methodology involves first identifying salient words in the input text through a saliency estimation process, which prioritizes words most influential to the model's decision-making process. Subsequently, these salient words are subjected to carefully crafted modifications, guided by semantic similarity metrics to ensure that the altered text remains coherent and retains its original meaning. Empirical evaluations conducted on diverse text classification datasets demonstrate the effectiveness of the proposed method in generating adversarial examples capable of successfully deceiving state-of-the-art classification models. Comparative analyses with existing adversarial attack techniques further indicate the superiority of the proposed approach in terms of both attack success rate and preservation of text coherence.

Title: SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Authors: Guohao Sun, Can Qin, Jiamian Wang, Zeyuan Chen, Ran Xu, Zhiqiang Tao
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11299
Pdf URL: https://arxiv.org/pdf/2403.11299
Copy Paste: [[2403.11299]] SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant(https://arxiv.org/abs/2403.11299)
Keywords: large language model
Abstract: Recent advancements in the vision-language model have shown notable generalization in vision-language tasks after visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language models becomes the whole network's bottleneck. To improve cross-modality alignment, existing works usually consider more visual instruction data covering a broader range of vision tasks to fine-tune the model for question-answering, which are costly to obtain. However, the image contains rich contextual information that has been largely under-explored. This paper first attempts to harness this overlooked context within visual instruction data, training the model to self-supervised 'learning' how to ask high-quality questions. In this way, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing the visual clue and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data shows a consistent performance improvement compared with traditional visual-instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts.

Title: A Brief Study of Computer Network Security Technologies

Authors: Tulasi Udupa A, Sushma Jayaram, Shreya Ganesh Hegde
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.11303
Pdf URL: https://arxiv.org/pdf/2403.11303
Copy Paste: [[2403.11303]] A Brief Study of Computer Network Security Technologies(https://arxiv.org/abs/2403.11303)
Keywords: security, attack
Abstract: The rapid development of computer network systems brings both great convenience and new security threats for users. Network security problems generally include network system security and data security. Specifically, this refers to the reliability of the network system and the confidentiality, integrity and availability of data in the system. This paper introduces the significance of network security systems and highlights related technologies, mainly authentication, data encryption, firewall and antivirus technology. Network security problems can be faced by any network user; therefore, we must prioritize network security, try to prevent hostile attacks and ensure the overall security of the network system.

Title: Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding

Authors: Zichen Wu, HsiuYuan Huang, Fanyi Qu, Yunfang Wu
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2403.11311
Pdf URL: https://arxiv.org/pdf/2403.11311
Copy Paste: [[2403.11311]] Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding(https://arxiv.org/abs/2403.11311)
Keywords: transformer
Abstract: Deep multimodal semantic understanding that goes beyond the mere superficial content relation mining has received increasing attention in the realm of artificial intelligence. The challenges of collecting and annotating high-quality multi-modal data have underscored the significance of few-shot learning. In this paper, we focus on two critical tasks under this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on the unified vision-language model (VLM). Specifically, we design three experts of soft prompts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representation, and a unified prompt to assist multi-modal interaction. Additionally, we reorganize Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smoothens the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in few-shot setting, our proposed model not only surpasses the 8.2B model InstructBLIP with merely 2% parameters (150M), but also significantly outperforms other widely-used prompt methods on VLMs or task-specific methods.

Title: Reasoning in Transformers - Mitigating Spurious Correlations and Reasoning Shortcuts

Authors: Daniel Enström, Viktor Kjellberg, Moa Johansson
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11314
Pdf URL: https://arxiv.org/pdf/2403.11314
Copy Paste: [[2403.11314]] Reasoning in Transformers - Mitigating Spurious Correlations and Reasoning Shortcuts(https://arxiv.org/abs/2403.11314)
Keywords: transformer, generative
Abstract: Transformer language models are neural networks used for a wide variety of tasks concerning natural language, including some that also require logical reasoning. However, a transformer model may easily learn spurious patterns in the data, short-circuiting actual reasoning. In this paper we investigate to what extent transformers can be trained to a) approximate reasoning in propositional logic while b) avoiding known reasoning shortcuts via spurious correlations in the training data. To do so, we use a dataset with known spurious correlation between truth and e.g. the number of rules in the problem. We augment the data with proofs, and train two models: a generative transformer, WP-BART, trained on problems and their whole proofs, and a neuro-symbolic model, SIP-BART, trained on individual proof steps and combining the generative transformer model BART with a symbolic proof checker. We find that SIP-BART succeeds in avoiding reasoning shortcuts, while WP-BART does not. For SIP-BART, we then identify a few remaining reasoning errors, not previously described in the literature, arising from using a pre-trained language model. These are qualitatively analysed to create a taxonomy of four different types of additional pitfalls.

Title: Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches

Authors: Igor Sterner, Weizhe Lin, Jinghong Chen, Bill Byrne
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2403.11317
Pdf URL: https://arxiv.org/pdf/2403.11317
Copy Paste: [[2403.11317]] Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches(https://arxiv.org/abs/2403.11317)
Keywords: large language model
Abstract: Two approaches have emerged to input images into large language models (LLMs). The first is to caption images into natural language. The second is to map image feature embeddings into the domain of the LLM and pass the mapped embeddings directly to the LLM. The majority of recent few-shot multimodal work reports performance using architectures that employ variations of one of these two approaches. But they overlook an important comparison between them. We design a controlled and focused experiment to compare these two approaches to few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find using textual image captions is better. In the few-shot regimes, how the in-context examples are selected determines which is better.

Title: StateFlow: Enhancing LLM Task-Solving through State-Driven Workflows

Authors: Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, Qingyun Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11322
Pdf URL: https://arxiv.org/pdf/2403.11322
Copy Paste: [[2403.11322]] StateFlow: Enhancing LLM Task-Solving through State-Driven Workflows(https://arxiv.org/abs/2403.11322)
Keywords: large language model
Abstract: It is a notable trend to use Large Language Models (LLMs) to tackle complex tasks, e.g., tasks that require a sequence of actions and dynamic interaction with tools and environments. In this paper, we propose StateFlow, a novel LLM-based task-solving paradigm that conceptualizes complex task-solving processes backed by LLMs as state machines. With proper construction of states and definition of state transitions, StateFlow grounds the progress of task-solving, ensuring clear tracking and management of LLMs' responses throughout the task-solving process. Within each state, StateFlow allows execution of a series of actions, involving not only the generation of LLM's responses guided by a specific prompt, but also the utilization of external tools as needed. State transitions are controlled by specific rules or decisions made by the LLM, allowing for a dynamic and adaptive progression through the task's pre-defined StateFlow model. Evaluations on the InterCode SQL and Bash benchmarks show that StateFlow significantly enhances LLMs' efficiency.
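
The state-machine framing described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the StateFlow implementation: the state names (Init, Solve, Verify, End), the `run_workflow` helper, and the lambda stand-ins for LLM or tool calls are all hypothetical.

```python
def run_workflow(task, actions, transitions, start="Init", end="End", max_steps=10):
    """Run one action per state until the end state or the step budget is reached."""
    state, history = start, []
    for _ in range(max_steps):
        if state == end:
            break
        # Stand-in for an LLM response or an external tool call in this state.
        output = actions[state](task, history)
        history.append((state, output))
        # Rule-based transition chosen from the output of the current state.
        state = transitions[state](output)
    return state, history


# Hypothetical actions: the first Solve attempt fails, the second succeeds.
actions = {
    "Init": lambda task, hist: f"plan for: {task}",
    "Solve": lambda task, hist: "error" if len(hist) < 2 else "SELECT 1",
    "Verify": lambda task, hist: "ok",
}
# Hypothetical transition rules: retry Solve on error, finish once verified.
transitions = {
    "Init": lambda out: "Solve",
    "Solve": lambda out: "Solve" if out == "error" else "Verify",
    "Verify": lambda out: "End" if out == "ok" else "Solve",
}

final_state, history = run_workflow("count rows", actions, transitions)
print(final_state, [state for state, _ in history])
```

Keeping transitions separate from actions is the design point: the control flow stays inspectable even when the per-state actions are opaque model calls.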

Title: GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering

Authors: Yanyan Li, Chenyu Lyu, Yan Di, Guangyao Zhai, Gim Hee Lee, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11324
Pdf URL: https://arxiv.org/pdf/2403.11324
Copy Paste: [[2403.11324]] GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering(https://arxiv.org/abs/2403.11324)
Keywords: generative
Abstract: During the Gaussian Splatting optimization process, the scene's geometry can gradually deteriorate if its structure is not deliberately preserved, especially in non-textured regions such as walls, ceilings, and furniture surfaces. This degradation significantly affects the rendering quality of novel views that deviate significantly from the viewpoints in the training data. To mitigate this issue, we propose a novel approach called GeoGaussian. Based on the smoothly connected areas observed from point clouds, this method introduces a novel pipeline to initialize thin Gaussians aligned with the surfaces, where the characteristic can be transferred to new generations through a carefully designed densification strategy. Finally, the pipeline ensures that the scene's geometry and texture are maintained through constrained optimization processes with explicit geometry constraints. Benefiting from the proposed architecture, the generative ability of 3D Gaussians is enhanced, especially in structured regions. Our proposed pipeline achieves state-of-the-art performance in novel view synthesis and geometric reconstruction, as evaluated qualitatively and quantitatively on public datasets.

Title: Domain-Guided Masked Autoencoders for Unique Player Identification

Authors: Bavesh Balaji, Jerrin Bright, Sirisha Rambhatla, Yuhao Chen, Alexander Wong, John Zelek, David A Clausi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11328
Pdf URL: https://arxiv.org/pdf/2403.11328
Copy Paste: [[2403.11328]] Domain-Guided Masked Autoencoders for Unique Player Identification(https://arxiv.org/abs/2403.11328)
Keywords: robust, extraction
Abstract: Unique player identification is a fundamental module in vision-driven sports analytics. Identifying players from broadcast videos can aid with various downstream tasks such as player assessment, in-game analysis, and broadcast production. However, automatic detection of jersey numbers using deep features is challenging primarily due to: a) motion blur, b) low resolution video feed, and c) occlusions. With their recent success in various vision tasks, masked autoencoders (MAEs) have emerged as a superior alternative to conventional feature extractors. However, most MAEs simply zero-out image patches either randomly or focus on where to mask rather than how to mask. Motivated by human vision, we devise a novel domain-guided masking policy for MAEs termed d-MAE to facilitate robust feature extraction in the presence of motion blur for player identification. We further introduce a new spatio-temporal network leveraging our novel d-MAE for unique player identification. We conduct experiments on three large-scale sports datasets, including a curated baseball dataset, the SoccerNet dataset, and an in-house ice hockey dataset. We preprocess the datasets using an upgraded keyframe identification (KfID) module by focusing on frames containing jersey numbers. Additionally, we propose a keyframe-fusion technique to augment keyframes, preserving spatial and temporal context. Our spatio-temporal network showcases significant improvements, surpassing the current state-of-the-art by 8.58%, 4.29%, and 1.20% in the test set accuracies, respectively. Rigorous ablations highlight the effectiveness of our domain-guided masking approach and the refined KfID module, resulting in performance enhancements of 1.48% and 1.84% respectively, compared to original architectures.

Title: Enhancing Bandwidth Efficiency for Video Motion Transfer Applications using Deep Learning Based Keypoint Prediction

Authors: Xue Bai, Tasmiah Haque, Sumit Mohan, Yuliang Cai, Byungheon Jeong, Adam Halasz, Srinjoy Das
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11337
Pdf URL: https://arxiv.org/pdf/2403.11337
Copy Paste: [[2403.11337]] Enhancing Bandwidth Efficiency for Video Motion Transfer Applications using Deep Learning Based Keypoint Prediction(https://arxiv.org/abs/2403.11337)
Keywords: privacy
Abstract: We propose a deep learning based novel prediction framework for enhanced bandwidth reduction in motion transfer enabled video applications such as video conferencing, virtual reality gaming and privacy preservation for patient health monitoring. To model complex motion, we use the First Order Motion Model (FOMM) that represents dynamic objects using learned keypoints along with their local affine transformations. Keypoints are extracted by a self-supervised keypoint detector and organized in a time series corresponding to the video frames. Prediction of keypoints, to enable transmission using lower frames per second on the source device, is performed using a Variational Recurrent Neural Network (VRNN). The predicted keypoints are then synthesized to video frames using an optical flow estimator and a generator network. The efficacy of leveraging keypoint based representations in conjunction with VRNN based prediction for both video animation and reconstruction is demonstrated on three diverse datasets. For real-time applications, our results show the effectiveness of our proposed architecture by enabling up to 2x additional bandwidth reduction over existing keypoint based video motion transfer frameworks without significantly compromising video quality.

Title: Federated Transfer Learning with Differential Privacy

Authors: Mengchu Li, Ye Tian, Yang Feng, Yi Yu
Subjects: cs.LG, cs.CR, math.ST, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11343
Pdf URL: https://arxiv.org/pdf/2403.11343
Copy Paste: [[2403.11343]] Federated Transfer Learning with Differential Privacy(https://arxiv.org/abs/2403.11343)
Keywords: privacy, federate
Abstract: Federated learning is gaining increasing popularity, with data heterogeneity and privacy being two prominent challenges. In this paper, we address both issues within a federated transfer learning framework, aiming to enhance learning on a target data set by leveraging information from multiple heterogeneous source data sets while adhering to privacy constraints. We rigorously formulate the notion of federated differential privacy, which offers privacy guarantees for each data set without assuming a trusted central server. Under this privacy constraint, we study three classical statistical problems, namely univariate mean estimation, low-dimensional linear regression, and high-dimensional linear regression. By investigating the minimax rates and identifying the costs of privacy for these problems, we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy. Our analyses incorporate data heterogeneity and privacy, highlighting the fundamental costs of both in federated learning and underscoring the benefit of knowledge transfer across data sets.
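
As background for the univariate mean estimation problem mentioned above, the textbook central-model baseline clips each value to a bounded range and adds Laplace noise calibrated to the sensitivity of the mean. The sketch below shows only that standard mechanism; it is not the federated differential privacy procedure studied in the paper, and the function name, the clipping bound, and the replace-one-record neighboring relation are assumptions.

```python
import random


def private_mean(values, epsilon, bound, rng=random):
    """epsilon-DP estimate of the mean via clipping and the Laplace mechanism."""
    n = len(values)
    # Clip every record to [-bound, bound] so one record has bounded influence.
    clipped = [min(max(v, -bound), bound) for v in values]
    # Replacing one record moves the clipped mean by at most 2 * bound / n.
    sensitivity = 2.0 * bound / n
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(clipped) / n + noise


# Toy data; 3.0 is clipped to 1.0 before averaging.
data = [0.2, -0.4, 0.9, 3.0]
estimate = private_mean(data, epsilon=1.0, bound=1.0, rng=random.Random(0))
print(f"private mean estimate: {estimate:.3f}")
```

A smaller epsilon or a smaller data set gives a larger noise scale, which is the kind of accuracy cost of privacy that the minimax analysis quantifies.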

Title: COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits

Authors: Mintong Kang, Nezihe Merve Gürel, Linyi Li, Bo Li
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11348
Pdf URL: https://arxiv.org/pdf/2403.11348
Copy Paste: [[2403.11348]] COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits(https://arxiv.org/abs/2403.11348)
Keywords: robust
Abstract: Conformal prediction has shown spurring performance in constructing statistically rigorous prediction sets for arbitrary black-box machine learning models, assuming the data is exchangeable. However, even small adversarial perturbations during the inference can violate the exchangeability assumption, challenge the coverage guarantees, and result in a subsequent decline in empirical coverage. In this work, we propose a certifiably robust learning-reasoning conformal prediction framework (COLEP) via probabilistic circuits, which comprise a data-driven learning component that trains statistical models to learn different semantic concepts, and a reasoning component that encodes knowledge and characterizes the relationships among the trained models for logic reasoning. To achieve exact and efficient reasoning, we employ probabilistic circuits (PCs) within the reasoning component. Theoretically, we provide end-to-end certification of prediction coverage for COLEP in the presence of bounded adversarial perturbations. We also provide certified coverage considering the finite size of the calibration set. Furthermore, we prove that COLEP achieves higher prediction coverage and accuracy over a single model as long as the utilities of knowledge models are non-trivial. Empirically, we show the validity and tightness of our certified coverage, demonstrating the robust conformal prediction of COLEP on various datasets, including GTSRB, CIFAR-10, and AwA2. We show that COLEP achieves up to 12% improvement in certified coverage on GTSRB, 9% on CIFAR-10, and 14% on AwA2.
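
The prediction sets referred to above come from standard split conformal prediction, which can be sketched in a few lines. This is the plain, non-robust baseline under the exchangeability assumption, not COLEP itself: it includes no certification and no reasoning component, and the function names, the score 1 - p(label), and the toy numbers are assumptions for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    # Rank of the order statistic that gives at least 1 - alpha marginal coverage.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: every label is kept
    return sorted(cal_scores)[k - 1]


def prediction_set(class_probs, threshold):
    """Keep every label whose nonconformity score 1 - p(label) is within the threshold."""
    return [label for label, p in enumerate(class_probs) if 1.0 - p <= threshold]


# Toy calibration scores: 1 - (probability the model assigned to the true label).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.70]
q = conformal_threshold(cal_scores, alpha=0.25)
labels = prediction_set([0.55, 0.43, 0.02], q)
print(q, labels)
```

An adversarial perturbation at inference time shifts the test score away from the calibration distribution, which is why the coverage guarantee of this baseline no longer applies in that setting.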

Title: IGANN Sparse: Bridging Sparsity and Interpretability with Non-linear Insight

Authors: Theodor Stoecker, Nico Hambauer, Patrick Zschech, Mathias Kraus
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2403.11363
Pdf URL: https://arxiv.org/pdf/2403.11363
Copy Paste: [[2403.11363]] IGANN Sparse: Bridging Sparsity and Interpretability with Non-linear Insight(https://arxiv.org/abs/2403.11363)
Keywords: interpretability
Abstract: Feature selection is a critical component in predictive analytics that significantly affects the prediction accuracy and interpretability of models. Intrinsic methods for feature selection are built directly into model learning, providing a fast and attractive option for large amounts of data. Machine learning algorithms, such as penalized regression models (e.g., lasso), are the most common choice when it comes to in-built feature selection. However, they fail to capture non-linear relationships, which ultimately affects their ability to predict outcomes in intricate datasets. In this paper, we propose IGANN Sparse, a novel machine learning model from the family of generalized additive models, which promotes sparsity through a non-linear feature selection process during training. This ensures interpretability through improved model sparsity without sacrificing predictive performance. Moreover, IGANN Sparse serves as an exploratory tool for information systems researchers to unveil important non-linear relationships in domains that are characterized by complex patterns. Our ongoing research is directed at a thorough evaluation of the IGANN Sparse model, including user studies that allow us to assess how well users of the model can benefit from the reduced number of features. This will allow for a deeper understanding of the interactions between linear vs. non-linear modeling, number of selected features, and predictive performance.

Title: JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning

Authors: Anique Tahir, Lu Cheng, Huan Liu
Subjects: cs.LG, cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2403.11366
Pdf URL: https://arxiv.org/pdf/2403.11366
Copy Paste: [[2403.11366]] JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning(https://arxiv.org/abs/2403.11366)
Keywords: large language model
Abstract: The scaling of Large Language Models (LLMs) for retrieval-based tasks, particularly in Retrieval Augmented Generation (RAG), faces significant memory constraints, especially when fine-tuning extensive prompt sequences. Current open-source libraries support full-model inference and fine-tuning across multiple GPUs but fall short of accommodating the efficient parameter distribution required for retrieved context. Addressing this gap, we introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training. Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management, thereby enabling accelerated fine-tuning with reduced memory requirements. This advancement significantly improves the scalability and feasibility of fine-tuning LLMs for complex RAG applications, even on systems with limited GPU resources. Our experiments show more than 12x improvement in runtime compared to Hugging Face/DeepSpeed implementation with four GPUs while consuming less than half the VRAM per GPU. Our library will be open-sourced in due course.

Title: What Makes Math Word Problems Challenging for LLMs?

Authors: KV Aditya Srivatsa, Ekaterina Kochmar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11369
Pdf URL: https://arxiv.org/pdf/2403.11369
Copy Paste: [[2403.11369]] What Makes Math Word Problems Challenging for LLMs?(https://arxiv.org/abs/2403.11369)
Keywords: large language model
Abstract: This paper investigates the question of what makes math word problems (MWPs) challenging for large language models (LLMs). We conduct an in-depth analysis of the key linguistic and mathematical characteristics of MWPs. In addition, we train feature-based classifiers to better understand the impact of each feature on the overall difficulty of MWPs for prominent LLMs and investigate whether this helps predict how well LLMs fare against specific categories of MWPs.

Title: DynamicGlue: Epipolar and Time-Informed Data Association in Dynamic Environments using Graph Neural Networks

Authors: Theresa Huber, Simon Schaefer, Stefan Leutenegger
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11370
Pdf URL: https://arxiv.org/pdf/2403.11370
Copy Paste: [[2403.11370]] DynamicGlue: Epipolar and Time-Informed Data Association in Dynamic Environments using Graph Neural Networks(https://arxiv.org/abs/2403.11370)
Keywords: robust
Abstract: The assumption of a static environment is common in many geometric computer vision tasks like SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network designed to perform robust matching under challenging conditions while excluding keypoints on moving objects. We employ a similar scheme of attentional aggregation over graph edges to enhance keypoint representations as state-of-the-art feature-matching networks but augment the graph with epipolar and temporal information and vastly reduce the number of graph edges. Furthermore, we introduce a self-supervised training scheme to extract pseudo labels for image pairs in dynamic environments from exclusively unprocessed visual-inertial data. A series of experiments show the superior performance of our network as it excludes keypoints on moving objects compared to state-of-the-art feature matching networks while still achieving similar results regarding conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.

Title: Reconstruct before Query: Continual Missing Modality Learning with Decomposed Prompt Collaboration

Authors: Shu Zhao, Xiaohan Zou, Tan Yu, Huijuan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11373
Pdf URL: https://arxiv.org/pdf/2403.11373
Copy Paste: [[2403.11373]] Reconstruct before Query: Continual Missing Modality Learning with Decomposed Prompt Collaboration(https://arxiv.org/abs/2403.11373)
Keywords: privacy
Abstract: Pre-trained large multi-modal models (LMMs) exploit fine-tuning to adapt to diverse user applications. Nevertheless, fine-tuning may face challenges due to deactivated sensors (e.g., cameras turned off for privacy or technical issues), yielding modality-incomplete data and leading to inconsistency in training data and the data for inference. Additionally, continuous training leads to catastrophic forgetting, diluting the knowledge in pre-trained LMMs. To overcome these challenges, we introduce a novel task, Continual Missing Modality Learning (CMML), to investigate how models can generalize when data of certain modalities is missing during continual fine-tuning. Our preliminary benchmarks reveal that existing methods suffer from a significant performance drop in CMML, even with the aid of advanced continual learning techniques. Therefore, we devise a framework termed Reconstruct before Query (RebQ). It decomposes prompts into modality-specific ones and breaks them into components stored in pools accessible via a key-query mechanism, which facilitates Parameter-Efficient Fine-Tuning and enhances knowledge transferability for subsequent tasks. Meanwhile, our RebQ leverages extensive multi-modal knowledge from pre-trained LMMs to reconstruct the data of missing modality. Comprehensive experiments demonstrate that RebQ effectively reconstructs the missing modality information and retains pre-trained knowledge. Specifically, compared with the baseline, RebQ improves average precision from 20.00 to 50.92 and decreases average forgetting from 75.95 to 8.56. Code and datasets are available on https://github.com/Tree-Shu-Zhao/RebQ.pytorch

Title: ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation

Authors: Minh Tran, Winston Bounsavy, Khoa Vo, Anh Nguyen, Tri Nguyen, Ngan Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11376
Pdf URL: https://arxiv.org/pdf/2403.11376
Copy Paste: [[2403.11376]] ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation(https://arxiv.org/abs/2403.11376)
Keywords: transformer, segmentation
Abstract: Amodal Instance Segmentation (AIS) presents a challenging task as it involves predicting both visible and occluded parts of objects within images. Existing AIS methods rely on a bidirectional approach, encompassing both the transition from amodal features to visible features (amodal-to-visible) and from visible features to amodal features (visible-to-amodal). Our observation shows that the utilization of amodal features through the amodal-to-visible transition can confuse the visible features due to the extra information of occluded/hidden segments not presented in visible display. Consequently, this compromises the quality of visible features during the subsequent visible-to-amodal transition. To tackle this issue, we introduce ShapeFormer, a decoupled Transformer-based model with a visible-to-amodal transition. It facilitates the explicit relationship between output segmentations and avoids the need for amodal-to-visible transitions. ShapeFormer comprises three key modules: (i) Visible-Occluding Mask Head for predicting visible segmentation with occlusion awareness, (ii) Shape-Prior Amodal Mask Head for predicting amodal and occluded masks, and (iii) Category-Specific Shape Prior Retriever for providing shape prior knowledge. Comprehensive experiments and extensive ablation studies across various AIS benchmarks demonstrate the effectiveness of our ShapeFormer. The code is available at: https://github.com/UARK-AICV/ShapeFormer

Title: Investigating the Benefits of Projection Head for Representation Learning

Authors: Yihao Xue, Eric Gan, Jiayi Ni, Siddharth Joshi, Baharan Mirzasoleiman
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2403.11391
Pdf URL: https://arxiv.org/pdf/2403.11391
Copy Paste: [[2403.11391]] Investigating the Benefits of Projection Head for Representation Learning(https://arxiv.org/abs/2403.11391)
Keywords: robust
Abstract: An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better? In this work, we provide a rigorous theoretical answer to this question. We start by examining linear models trained with self-supervised contrastive loss. We reveal that the implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers. Consequently, lower layers tend to have more normalized and less specialized representations. We theoretically characterize scenarios where such representations are more beneficial, highlighting the intricate interplay between data augmentation and input features. Additionally, we demonstrate that introducing non-linearity into the network allows lower layers to learn features that are completely absent in higher layers. Finally, we show how this mechanism improves the robustness in supervised contrastive learning and supervised learning. We empirically validate our results through various experiments on CIFAR-10/100, UrbanCars and shifted versions of ImageNet. We also introduce a potential alternative to projection head, which offers a more interpretable and controllable design.

Title: Automated data processing and feature engineering for deep learning and big data applications: a survey

Authors: Alhassan Mumuni, Fuseini Mumuni
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2403.11395
Pdf URL: https://arxiv.org/pdf/2403.11395
Copy Paste: [[2403.11395]] Automated data processing and feature engineering for deep learning and big data applications: a survey(https://arxiv.org/abs/2403.11395)
Keywords: extraction, generative
Abstract: Modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in the sphere of supervised deep learning. It has also simplified the design of machine learning systems as the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases data has to be manually collected, preprocessed and further extended through data augmentation before they can be effective for training. Recently, special techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming them into useful features for Big Data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines, including automated data preprocessing--e.g., data cleaning, labeling, missing data imputation, and categorical data encoding--as well as data augmentation (including synthetic data generation using generative AI methods) and feature engineering--specifically, automated feature extraction, feature construction and feature selection. In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.

Title: Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization

Authors: Yujia Liu, Chenxi Yang, Dingquan Li, Jianhao Ding, Tingting Jiang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.11397
Pdf URL: https://arxiv.org/pdf/2403.11397
Copy Paste: [[2403.11397]] Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization(https://arxiv.org/abs/2403.11397)
Keywords: defense, attack, robust
Abstract: The task of No-Reference Image Quality Assessment (NR-IQA) is to estimate the quality score of an input image without additional information. NR-IQA models play a crucial role in the media industry, aiding in performance evaluation and optimization guidance. However, these models are found to be vulnerable to adversarial attacks, which introduce imperceptible perturbations to input images, resulting in significant changes in predicted scores. In this paper, we propose a defense method to improve the stability in predicted scores when attacked by small perturbations, thus enhancing the adversarial robustness of NR-IQA models. To be specific, we present theoretical evidence showing that the magnitude of score changes is related to the $\ell_1$ norm of the model's gradient with respect to the input image. Building upon this theoretical foundation, we propose a norm regularization training strategy aimed at reducing the $\ell_1$ norm of the gradient, thereby boosting the robustness of NR-IQA models. Experiments conducted on four NR-IQA baseline models demonstrate the effectiveness of our strategy in reducing score changes in the presence of adversarial attacks. To the best of our knowledge, this work marks the first attempt to defend against adversarial attacks on NR-IQA models. Our study offers valuable insights into the adversarial robustness of NR-IQA models and provides a foundation for future research in this area.
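
The link between score changes and the $\ell_1$ norm of the input gradient can be illustrated with a toy linear scorer, for which the gradient with respect to the input is simply the weight vector. This is only a sketch: the linear model, the weights, the pixel values, and the assumption that the perturbation is bounded in the $\ell_\infty$ norm are all illustrative and are not the paper's NR-IQA models or its training strategy.

```python
def score(weights, image):
    """Toy linear quality score; its gradient w.r.t. the image is `weights`."""
    return sum(w * x for w, x in zip(weights, image))

def l1_norm(vector):
    """Sum of absolute values."""
    return sum(abs(v) for v in vector)

def worst_case_change(weights, eps):
    """Largest score change under a perturbation with max-norm at most eps.

    For a linear scorer this is eps * ||gradient||_1 (dual-norm inequality).
    """
    return eps * l1_norm(weights)

def sign(value):
    return 1 if value > 0 else -1 if value < 0 else 0

# Hypothetical weights and a three-pixel "image".
weights = [0.5, -1.5, 2.0]
image = [0.2, 0.4, 0.6]
eps = 0.01

# The perturbation that attains the bound moves each pixel by eps along sign(gradient).
delta = [eps * sign(w) for w in weights]
perturbed = [x + d for x, d in zip(image, delta)]

change = abs(score(weights, perturbed) - score(weights, image))
bound = worst_case_change(weights, eps)

# Halving the gradient's l1 norm halves the worst-case score change.
smaller_bound = worst_case_change([w / 2 for w in weights], eps)
print(change, bound, smaller_bound)
```

For non-linear models the same relation holds only to first order around the input, which is why penalizing the gradient norm during training is a natural way to reduce sensitivity to small perturbations.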

Title: X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment

Authors: Dongjae Shin, Hyunseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11399
Pdf URL: https://arxiv.org/pdf/2403.11399
Copy Paste: [[2403.11399]] X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment(https://arxiv.org/abs/2403.11399)
Keywords: large language model
Abstract: The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant expenses in the creation of training data. Furthermore, constructing multilingual data for LMMs presents its own set of challenges due to language diversity and complexity. Therefore, in this study, we propose two cost-effective methods to solve this problem: (1) vocabulary expansion and pretraining of multilingual LLM for specific languages, and (2) automatic and elaborate construction of multimodal datasets using GPT4-V. Based on these methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal training dataset. Additionally, we developed a bilingual multimodal model that exhibits excellent performance in both Korean and English, surpassing existing approaches.

Title: Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Authors: Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11401
Pdf URL: https://arxiv.org/pdf/2403.11401
Copy Paste: [[2403.11401]] Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning(https://arxiv.org/abs/2403.11401)
Keywords: large language model
Abstract: This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation, that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features in the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.

Title: DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation

Authors: Jeongsol Kim, Geon Yeong Park, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11415
Pdf URL: https://arxiv.org/pdf/2403.11415
Copy Paste: [[2403.11415]] DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation(https://arxiv.org/abs/2403.11415)
Keywords: diffusion
Abstract: Reverse sampling and score-distillation have emerged as main workhorses in recent years for image manipulation using latent diffusion models (LDMs). While reverse diffusion sampling often requires adjustments of LDM architecture or feature engineering, score distillation offers a simple yet powerful model-agnostic approach, but it is often prone to mode-collapsing. To address these limitations and leverage the strengths of both approaches, here we introduce a novel framework called DreamSampler, which seamlessly integrates these two distinct approaches through the lens of regularized latent optimization. Similar to score-distillation, DreamSampler is a model-agnostic approach applicable to any LDM architecture, but it allows both distillation and reverse sampling with additional guidance for image editing and reconstruction. Through experiments involving image editing, SVG reconstruction, etc., we demonstrate the competitive performance of DreamSampler compared to existing approaches, while providing new applications.

Title: VmambaIR: Visual State Space Model for Image Restoration

Authors: Yuan Shi, Bin Xia, Xiaoyu Jin, Xing Wang, Tianyu Zhao, Xin Xia, Xuefeng Xiao, Wenming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11423
Pdf URL: https://arxiv.org/pdf/2403.11423
Copy Paste: [[2403.11423]] VmambaIR: Visual State Space Model for Image Restoration(https://arxiv.org/abs/2403.11423)
Keywords: diffusion, transformer, generative
Abstract: Image restoration is a critical task in low-level computer vision, aiming to restore high-quality images from degraded inputs. Various models, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), transformers, and diffusion models (DMs), have been employed to address this problem with significant impact. However, CNNs have limitations in capturing long-range dependencies. DMs require large prior models and computationally intensive denoising steps. Transformers have powerful modeling capabilities but face challenges due to quadratic complexity with input image size. To address these challenges, we propose VmambaIR, which introduces State Space Models (SSMs) with linear complexity into comprehensive image restoration tasks. We utilize a Unet architecture to stack our proposed Omni Selective Scan (OSS) blocks, consisting of an OSS module and an Efficient Feed-Forward Network (EFFN). Our proposed omni selective scan mechanism overcomes the unidirectional modeling limitation of SSMs by efficiently modeling image information flows in all six directions. Furthermore, we conducted a comprehensive evaluation of our VmambaIR across multiple image restoration tasks, including image deraining, single image super-resolution, and real-world image super-resolution. Extensive experimental results demonstrate that our proposed VmambaIR achieves state-of-the-art (SOTA) performance with much fewer computational resources and parameters. Our research highlights the potential of state space models as promising alternatives to the transformer and CNN architectures in serving as foundational frameworks for next-generation low-level visual tasks.

Title: Benchmarking the Robustness of UAV Tracking Against Common Corruptions

Authors: Xiaoqiong Liu, Yunhe Feng, Shu Hu, Xiaohui Yuan, Heng Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11424
Pdf URL: https://arxiv.org/pdf/2403.11424
Copy Paste: [[2403.11424]] Benchmarking the Robustness of UAV Tracking Against Common Corruptions(https://arxiv.org/abs/2403.11424)
Keywords: robust
Abstract: The robustness of unmanned aerial vehicle (UAV) tracking is crucial in many tasks like surveillance and robotics. Despite its importance, little attention is paid to the performance of UAV trackers under common corruptions due to the lack of a dedicated platform. Addressing this, we propose UAV-C, a large-scale benchmark for assessing robustness of UAV trackers under common corruptions. Specifically, UAV-C is built upon two popular UAV datasets by introducing 18 common corruptions from 4 representative categories including adversarial, sensor, blur, and composite corruptions at different levels. Finally, UAV-C contains more than 10K sequences. To understand the robustness of existing UAV trackers against corruptions, we extensively evaluate 12 representative algorithms on UAV-C. Our study reveals several key findings: 1) Current trackers are vulnerable to corruptions, indicating more attention is needed in enhancing the robustness of UAV trackers; 2) When combined, composite corruptions result in more severe degradation to trackers; and 3) While each tracker has its unique performance profile, some trackers may be more sensitive to specific corruptions. By releasing UAV-C, we hope it, along with comprehensive analysis, serves as a valuable resource for advancing the robustness of UAV tracking against corruption. Our UAV-C will be available at https://github.com/Xiaoqiong-Liu/UAV-C.

Title: Narrative Feature or Structured Feature? A Study of Large Language Models to Identify Cancer Patients at Risk of Heart Failure

Authors: Ziyi Chen, Mengyuan Zhang, Mustafa Mohammed Ahmed, Yi Guo, Thomas J. George, Jiang Bian, Yonghui Wu
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11425
Pdf URL: https://arxiv.org/pdf/2403.11425
Copy Paste: [[2403.11425]] Narrative Feature or Structured Feature? A Study of Large Language Models to Identify Cancer Patients at Risk of Heart Failure(https://arxiv.org/abs/2403.11425)
Keywords: transformer, large language model
Abstract: Cancer treatments are known to introduce cardiotoxicity, negatively impacting outcomes and survivorship. Identifying cancer patients at risk of heart failure (HF) is critical to improving cancer treatment outcomes and safety. This study examined machine learning (ML) models to identify cancer patients at risk of HF using electronic health records (EHRs), including traditional ML, Time-Aware long short-term memory (T-LSTM), and large language models (LLMs) using novel narrative features derived from the structured medical codes. We identified a cancer cohort of 12,806 patients from the University of Florida Health, diagnosed with lung, breast, and colorectal cancers, among which 1,602 individuals developed HF after cancer. The LLM, GatorTron-3.9B, achieved the best F1 scores, outperforming the traditional support vector machines by 39%, the T-LSTM deep learning model by 7%, and a widely used transformer model, BERT, by 5.6%. The analysis shows that the proposed narrative features remarkably increased feature density and improved performance.

Title: BAGS: Building Animatable Gaussian Splatting from a Monocular Video with Diffusion Priors

Authors: Tingyang Zhang, Qingzhe Gao, Weiyu Li, Libin Liu, Baoquan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11427
Pdf URL: https://arxiv.org/pdf/2403.11427
Copy Paste: [[2403.11427]] BAGS: Building Animatable Gaussian Splatting from a Monocular Video with Diffusion Priors(https://arxiv.org/abs/2403.11427)
Keywords: diffusion
Abstract: Animatable 3D reconstruction has significant applications across various fields, primarily relying on artists' handcraft creation. Recently, some studies have successfully constructed animatable 3D models from monocular videos. However, these approaches require sufficient view coverage of the object within the input video and typically necessitate significant time and computational costs for training and rendering. This limitation restricts the practical applications. In this work, we propose a method to build animatable 3D Gaussian Splatting from monocular video with diffusion priors. The 3D Gaussian representations significantly accelerate the training and rendering process, and the diffusion priors allow the method to learn 3D models with limited viewpoints. We also present the rigid regularization to enhance the utilization of the priors. We perform an extensive evaluation across various real-world videos, demonstrating its superior performance compared to the current state-of-the-art methods.

Title: A Novel Paradigm Boosting Translation Capabilities of Large Language Models

Authors: Jiaxin Guo, Hao Yang, Zongyao Li, Daimeng Wei, Hengchao Shang, Xiaoyu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11430
Pdf URL: https://arxiv.org/pdf/2403.11430
Copy Paste: [[2403.11430]] A Novel Paradigm Boosting Translation Capabilities of Large Language Models(https://arxiv.org/abs/2403.11430)
Keywords: large language model
Abstract: This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experimental results conducted using the Llama2 model, particularly on Chinese-Llama2 after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B training data, making our method highly efficient. Additionally, in Stage3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.

Title: InsCL: A Data-efficient Continual Learning Paradigm for Fine-tuning Large Language Models with Instructions

Authors: Yifan Wang, Yafei Liu, Chufan Shi, Haoling Li, Chen Chen, Haonan Lu, Yujiu Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11435
Pdf URL: https://arxiv.org/pdf/2403.11435
Copy Paste: [[2403.11435]] InsCL: A Data-efficient Continual Learning Paradigm for Fine-tuning Large Language Models with Instructions(https://arxiv.org/abs/2403.11435)
Keywords: large language model
Abstract: Instruction tuning effectively optimizes Large Language Models (LLMs) for downstream tasks. Due to the changing environment in real-life applications, LLMs necessitate continual task-specific adaptation without catastrophic forgetting. Considering the heavy computational cost, replay-based Continual Learning (CL) methods are the simplest and most widely used for LLMs to address the forgetting issue. However, traditional replay-based methods do not fully utilize instructions to customize the replay strategy. In this work, we propose a novel paradigm called Instruction-based Continual Learning (InsCL). InsCL dynamically replays previous data based on task similarity, calculated by Wasserstein Distance with instructions. Moreover, we further introduce an Instruction Information Metric (InsInfo) to quantify the complexity and diversity of instructions. According to InsInfo, InsCL guides the replay process more inclined to high-quality data. We conduct extensive experiments over 16 tasks with different training orders, observing consistent performance improvements of InsCL. When all tasks have been trained, InsCL achieves performance gains of 3.0 Relative Gain compared with Random Replay, and 27.96 Relative Gain compared with No Replay.

Title: StyleChat: Learning Recitation-Augmented Memory in LLMs for Stylized Dialogue Generation

Authors: Jinpeng Li, Zekai Zhang, Quan Tu, Xin Cheng, Dongyan Zhao, Rui Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11439
Pdf URL: https://arxiv.org/pdf/2403.11439
Copy Paste: [[2403.11439]] StyleChat: Learning Recitation-Augmented Memory in LLMs for Stylized Dialogue Generation(https://arxiv.org/abs/2403.11439)
Keywords: generative, large language model
Abstract: Large Language Models (LLMs) demonstrate superior performance in generative scenarios and have attracted widespread attention. Among them, stylized dialogue generation is essential in the context of LLMs for building intelligent and engaging dialogue agents. However, the ability of LLMs is data-driven and limited by data bias, leading to poor performance on specific tasks. In particular, stylized dialogue generation suffers from a severe lack of supervised data. Furthermore, although many prompt-based methods have been proposed to accomplish specific tasks, their performance in complex real-world scenarios involving a wide variety of dialogue styles still requires further enhancement. In this work, we first introduce a stylized dialogue dataset StyleEval with 38 styles by leveraging the generative power of LLMs comprehensively, which has been carefully constructed with rigorous human-led quality control. Based on this, we propose the stylized dialogue framework StyleChat via recitation-augmented memory strategy and multi-task style learning strategy to promote generalization ability. To evaluate the effectiveness of our approach, we created a test benchmark that included both a generation task and a choice task to comprehensively evaluate trained models and assess whether styles and preferences are remembered and understood. Experimental results show that our proposed framework StyleChat outperforms all the baselines and helps to break the style boundary of LLMs.

Title: Boosting Continuous Emotion Recognition with Self-Pretraining using Masked Autoencoders, Temporal Convolutional Networks, and Transformers

Authors: Weiwei Zhou, Jiada Lu, Chenkun Ling, Weifeng Wang, Shaowei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11440
Pdf URL: https://arxiv.org/pdf/2403.11440
Copy Paste: [[2403.11440]] Boosting Continuous Emotion Recognition with Self-Pretraining using Masked Autoencoders, Temporal Convolutional Networks, and Transformers(https://arxiv.org/abs/2403.11440)
Keywords: robust, transformer
Abstract: Human emotion recognition holds a pivotal role in facilitating seamless human-computer interaction. This paper delineates our methodology in tackling the Valence-Arousal (VA) Estimation Challenge, Expression (Expr) Classification Challenge, and Action Unit (AU) Detection Challenge within the ambit of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Our study advocates a novel approach aimed at refining continuous emotion recognition. We achieve this by initially harnessing pre-training with Masked Autoencoders (MAE) on facial datasets, followed by fine-tuning on the aff-wild2 dataset annotated with expression (Expr) labels. The pre-trained model serves as an adept visual feature extractor, thereby enhancing the model's robustness. Furthermore, we bolster the performance of continuous emotion recognition by integrating Temporal Convolutional Network (TCN) modules and Transformer Encoder modules into our framework.

Title: Budget Recycling Differential Privacy

Authors: Bo Jiang, Jian Du, Sagar Shamar, Qiang Yan
Subjects: cs.CR, cs.DS, eess.SP
Abstract URL: https://arxiv.org/abs/2403.11445
Pdf URL: https://arxiv.org/pdf/2403.11445
Copy Paste: [[2403.11445]] Budget Recycling Differential Privacy(https://arxiv.org/abs/2403.11445)
Keywords: privacy
Abstract: Differential Privacy (DP) mechanisms usually force reduction in data utility by producing "out-of-bound" noisy results for a tight privacy budget. We introduce the Budget Recycling Differential Privacy (BR-DP) framework, designed to provide soft-bounded noisy outputs for a broad range of existing DP mechanisms. By "soft-bounded," we refer to the mechanism's ability to release most outputs within a predefined error boundary, thereby improving utility and maintaining privacy simultaneously. The core of BR-DP consists of two components: a DP kernel responsible for generating a noisy answer per iteration, and a recycler that probabilistically recycles/regenerates or releases the noisy answer. We delve into the privacy accounting of BR-DP, culminating in the development of a budgeting principle that optimally sub-allocates the available budget between the DP kernel and the recycler. Furthermore, we introduce algorithms for tight BR-DP accounting in composition scenarios, and our findings indicate that BR-DP achieves reduced privacy leakage post-composition compared to DP. Additionally, we explore the concept of privacy amplification via subsampling within the BR-DP framework and propose optimal sampling rates for BR-DP across various queries. We experiment with real data, and the results demonstrate BR-DP's effectiveness in lifting the utility-privacy tradeoff provided by DP mechanisms.

Title: Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM

Authors: Linyu Tang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11448
Pdf URL: https://arxiv.org/pdf/2403.11448
Copy Paste: [[2403.11448]] Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM(https://arxiv.org/abs/2403.11448)
Keywords: defense, attack, robust
Abstract: Numerous studies have demonstrated the susceptibility of deep neural networks (DNNs) to subtle adversarial perturbations, prompting the development of many advanced adversarial defense methods aimed at mitigating adversarial attacks. Current defense strategies usually train DNNs for a specific adversarial attack method and can achieve good robustness in defense against this type of adversarial attack. Nevertheless, when subjected to evaluations involving unfamiliar attack modalities, empirical evidence reveals a pronounced deterioration in the robustness of DNNs. Meanwhile, there is a trade-off between the classification accuracy of clean examples and adversarial examples. Most defense methods sacrifice the accuracy of clean examples in order to improve the adversarial robustness of DNNs. To alleviate these problems and enhance the overall robust generalization of DNNs, we propose the Test-Time Pixel-Level Adversarial Purification (TPAP) method. This approach is based on the robust overfitting characteristic of DNNs to the fast gradient sign method (FGSM) on training and test datasets. It utilizes FGSM for adversarial purification, to process images for purifying unknown adversarial perturbations from pixels at testing time in a "counter changes with changelessness" manner, thereby enhancing the defense capability of DNNs against various unknown adversarial attacks. Extensive experimental results show that our method can effectively improve the overall robust generalization of DNNs, notably over previous methods.
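
For readers unfamiliar with FGSM, the following is a minimal sketch of a single FGSM step. It uses a logistic-regression model as a stand-in for a DNN so that the input gradient can be written in closed form. The function name `fgsm_step` and its parameters are hypothetical, and the sketch shows only the generic FGSM update, not how TPAP applies it for purification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_step(x, y, w, b, eps):
    """One FGSM step on a logistic-regression model (illustrative stand-in for a DNN).

    x: input vector with entries in [0, 1]
    y: binary label (0.0 or 1.0)
    w, b: model weights and bias
    eps: L-infinity step size
    """
    p = sigmoid(w @ x + b)
    # Gradient of the binary cross-entropy loss with respect to the input.
    grad_x = (p - y) * w
    # Move every input coordinate by eps in the direction that increases the loss.
    x_new = x + eps * np.sign(grad_x)
    # Keep the result in the valid pixel range.
    return np.clip(x_new, 0.0, 1.0)
```

Each coordinate changes by at most `eps`, which is what makes FGSM a single-step, L-infinity-bounded perturbation.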

Title: Graph Partial Label Learning with Potential Cause Discovering

Authors: Hang Gao, Jiaguo Yuan, Jiangmeng Li, Chengyu Yao, Fengge Wu, Junsuo Zhao, Changwen Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.11449
Pdf URL: https://arxiv.org/pdf/2403.11449
Copy Paste: [[2403.11449]] Graph Partial Label Learning with Potential Cause Discovering(https://arxiv.org/abs/2403.11449)
Keywords: extraction
Abstract: Graph Neural Networks (GNNs) have gained considerable attention for their potential in addressing challenges posed by complex graph-structured data in diverse domains. However, accurately annotating graph data for training is difficult due to the inherent complexity and interconnectedness of graphs. To tackle this issue, we propose a novel graph representation learning method that enables GNN models to effectively learn discriminative information even in the presence of noisy labels within the context of Partially Labeled Learning (PLL). PLL is a critical weakly supervised learning problem, where each training instance is associated with a set of candidate labels, including both the true label and additional noisy labels. Our approach leverages potential cause extraction to obtain graph data that exhibit a higher likelihood of possessing a causal relationship with the labels. By incorporating auxiliary training based on the extracted graph data, our model can effectively filter out the noise contained in the labels. We support the rationale behind our approach with a series of theoretical analyses. Moreover, we conduct extensive evaluations and ablation studies on multiple datasets, demonstrating the superiority of our proposed method.

Title: CasSR: Activating Image Power for Real-World Image Super-Resolution

Authors: Haolan Chen, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, Wei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11451
Pdf URL: https://arxiv.org/pdf/2403.11451
Copy Paste: [[2403.11451]] CasSR: Activating Image Power for Real-World Image Super-Resolution(https://arxiv.org/abs/2403.11451)
Keywords: extraction, diffusion
Abstract: The objective of image super-resolution is to generate clean and high-resolution images from degraded versions. Recent advancements in diffusion modeling have led to the emergence of various image super-resolution techniques that leverage pretrained text-to-image (T2I) models. Nevertheless, due to the prevalent severe degradation in low-resolution images and the inherent characteristics of diffusion models, achieving high-fidelity image restoration remains challenging. Existing methods often exhibit issues including semantic loss, artifacts, and the introduction of spurious content not present in the original image. To tackle this challenge, we propose Cascaded diffusion for Super-Resolution, CasSR, a novel method designed to produce highly detailed and realistic images. In particular, we develop a cascaded controllable diffusion model that aims to optimize the extraction of information from low-resolution images. This model generates a preliminary reference image to facilitate initial information extraction and degradation mitigation. Furthermore, we propose a multi-attention mechanism to enhance the T2I model's capability in maximizing the restoration of the original image content. Through a comprehensive blend of qualitative and quantitative analyses, we substantiate the efficacy and superiority of our approach.

Title: HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models

Authors: Huy Nghiem, Hal Daumé III
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2403.11456
Pdf URL: https://arxiv.org/pdf/2403.11456
Copy Paste: [[2403.11456]] HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models(https://arxiv.org/abs/2403.11456)
Keywords: large language model
Abstract: The ubiquitousness of social media has led to the need for reliable and efficient detection of offensive content to limit harmful effects. This has led to a proliferation of datasets and models related to detecting offensive content. While sophisticated models have attained strong performance on individual datasets, these models often do not generalize due to differences between how "offensive content" is conceptualized, and the resulting differences in how these datasets are labeled. In this paper, we introduce HateCOT, a dataset of 52,000 samples drawn from diverse existing sources with explanations generated by GPT-3.5-Turbo and human-curated. We show that pre-training models for the detection of offensive content on HateCOT significantly boosts open-sourced Language Models on three benchmark datasets in both zero and few-shot settings, despite differences in domain and task. We further find that HateCOT enables effective K-shot fine-tuning in low-resource settings.

Title: Fed3DGS: Scalable 3D Gaussian Splatting with Federated Learning

Authors: Teppei Suzuki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11460
Pdf URL: https://arxiv.org/pdf/2403.11460
Copy Paste: [[2403.11460]] Fed3DGS: Scalable 3D Gaussian Splatting with Federated Learning(https://arxiv.org/abs/2403.11460)
Keywords: federate
Abstract: In this work, we present Fed3DGS, a scalable 3D reconstruction framework based on 3D Gaussian splatting (3DGS) with federated learning. Existing city-scale reconstruction methods typically adopt a centralized approach, which gathers all data in a central server and reconstructs scenes. The approach hampers scalability because it places a heavy load on the server and demands extensive data storage when reconstructing scenes on a scale beyond city-scale. In pursuit of a more scalable 3D reconstruction, we propose a federated learning framework with 3DGS, which is a decentralized framework and can potentially use distributed computational resources across millions of clients. We tailor a distillation-based model update scheme for 3DGS and introduce appearance modeling for handling non-IID data in the scenario of 3D reconstruction with federated learning. We simulate our method on several large-scale benchmarks, and our method demonstrates rendered image quality comparable to centralized approaches. In addition, we also simulate our method with data collected in different seasons, demonstrating that our framework can reflect changes in the scenes and our appearance modeling captures changes due to seasonal variations.

Title: Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Authors: Chaolei Tan, Jianhuang Lai, Wei-Shi Zheng, Jian-Fang Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11463
Pdf URL: https://arxiv.org/pdf/2403.11463
Copy Paste: [[2403.11463]] Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding(https://arxiv.org/abs/2403.11463)
Keywords: transformer
Abstract: Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.

Title: FedSPU: Personalized Federated Learning for Resource-constrained Devices with Stochastic Parameter Update

Authors: Ziru Niu, Hai Dong, A. K. Qin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.11464
Pdf URL: https://arxiv.org/pdf/2403.11464
Copy Paste: [[2403.11464]] FedSPU: Personalized Federated Learning for Resource-constrained Devices with Stochastic Parameter Update(https://arxiv.org/abs/2403.11464)
Keywords: privacy, robust, federate
Abstract: Personalized Federated Learning (PFL) is widely employed in IoT applications to handle high-volume, non-iid client data while ensuring data privacy. However, heterogeneous edge devices owned by clients may impose varying degrees of resource constraints, causing computation and communication bottlenecks for PFL. Federated Dropout has emerged as a popular strategy to address this challenge, wherein only a subset of the global model, i.e. a sub-model, is trained on a client's device, thereby reducing computation and communication overheads. Nevertheless, the dropout-based model-pruning strategy may introduce bias, particularly towards non-iid local data. When biased sub-models absorb highly divergent parameters from other clients, performance degradation becomes inevitable. In response, we propose federated learning with stochastic parameter update (FedSPU). Unlike dropout that tailors the global model to small-size local sub-models, FedSPU maintains the full model architecture on each device but randomly freezes a certain percentage of neurons in the local model during training while updating the remaining neurons. This approach ensures that a portion of the local model remains personalized, thereby enhancing the model's robustness against biased parameters from other clients. Experimental results demonstrate that FedSPU outperforms federated dropout by 7.57% on average in terms of accuracy. Furthermore, an introduced early stopping scheme leads to a significant reduction of the training time by 24.8% to 70.4% while maintaining high accuracy.
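
The stochastic parameter update described above can be illustrated with a small sketch. It assumes a single dense layer whose weight matrix has one row per neuron, a plain gradient-descent step, and a uniformly random choice of frozen neurons. The function `spu_step` and its arguments are hypothetical names for illustration and are not taken from the paper.

```python
import numpy as np

def spu_step(weights, grads, freeze_ratio, lr, rng):
    """Apply one gradient step while freezing a random fraction of neurons.

    weights, grads: arrays of shape (n_neurons, n_inputs), one row per neuron
    freeze_ratio: fraction of neurons left untouched in this step
    lr: learning rate
    rng: numpy random Generator
    Returns the updated weights and a boolean mask of the updated neurons.
    """
    n_neurons = weights.shape[0]
    n_frozen = int(round(freeze_ratio * n_neurons))
    frozen = rng.choice(n_neurons, size=n_frozen, replace=False)
    updated = np.ones(n_neurons, dtype=bool)
    updated[frozen] = False
    new_weights = weights.copy()
    # Only the rows of non-frozen neurons receive the gradient update.
    new_weights[updated] -= lr * grads[updated]
    return new_weights, updated
```

In contrast to dropout-style pruning, the full weight matrix stays on the device here; frozen rows simply keep their previous values for that step.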

Title: Collage Prompting: Budget-Friendly Visual Recognition with GPT-4V

Authors: Siyu Xu, Yunke Wang, Daochang Liu, Chang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11468
Pdf URL: https://arxiv.org/pdf/2403.11468
Copy Paste: [[2403.11468]] Collage Prompting: Budget-Friendly Visual Recognition with GPT-4V(https://arxiv.org/abs/2403.11468)
Keywords: generative
Abstract: Recent advancements in generative AI have suggested that, given visual prompts, GPT-4V can demonstrate significant proficiency in image recognition tasks. Despite its impressive capabilities, the financial cost associated with GPT-4V's inference presents a substantial barrier to its wide use. To address this challenge, our work introduces Collage Prompting, a budget-friendly prompting approach that concatenates multiple images into a single visual input. With a collage prompt, GPT-4V is able to perform image recognition on several images simultaneously. Based on the observation that the accuracy of GPT-4V's image recognition varies significantly with the order of images within the collage prompt, our method further learns to optimize the arrangement of images for maximum recognition accuracy. A graph predictor is trained to indicate the accuracy of each collage prompt, then we propose an optimization method to navigate the search space of possible image arrangements. Experimental results across various datasets demonstrate that the cost-efficiency score of the collage prompt is much higher than that of the standard prompt. Additionally, a collage prompt with a learned arrangement achieves clearly better accuracy than a collage prompt with a random arrangement in GPT-4V's visual recognition.
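
As a rough illustration of what concatenating multiple images into a single visual input can look like, the sketch below tiles equally sized images into a grid in a caller-specified order. The helper `make_collage`, the row-major layout, and the equal-size requirement are assumptions made for this example; the learned arrangement and the accuracy predictor from the abstract are not modeled.

```python
import numpy as np

def make_collage(images, order, grid):
    """Tile equally sized H x W x C images into one grid image.

    images: list of arrays with identical shape (H, W, C)
    order: indices into `images`, listed row by row (the arrangement)
    grid: (rows, cols) of the collage
    """
    rows, cols = grid
    if len(order) != rows * cols:
        raise ValueError("order must fill the grid exactly")
    tiles = [images[i] for i in order]
    # Join each row of tiles side by side, then stack the rows vertically.
    strips = [
        np.concatenate(tiles[r * cols:(r + 1) * cols], axis=1)
        for r in range(rows)
    ]
    return np.concatenate(strips, axis=0)
```

Changing `order` changes where each image lands in the grid, which is the degree of freedom the abstract says affects recognition accuracy.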

Title: Generative Motion Stylization within Canonical Motion Space

Authors: Jiaxu Zhang, Xin Chen, Gang Yu, Zhigang Tu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2403.11469
Pdf URL: https://arxiv.org/pdf/2403.11469
Copy Paste: [[2403.11469]] Generative Motion Stylization within Canonical Motion Space(https://arxiv.org/abs/2403.11469)
Keywords: diffusion, generative
Abstract: Stylized motion breathes life into characters. However, the fixed skeleton structure and style representation hinder existing data-driven motion synthesis methods from generating stylized motion for various characters. In this work, we propose a generative motion stylization pipeline, named MotionS, for synthesizing diverse and stylized motion on cross-structure characters using cross-modality style prompts. Our key insight is to embed motion style into a cross-modality latent space and perceive the cross-structure skeleton topologies, allowing for motion stylization within a canonical motion space. Specifically, the large-scale Contrastive-Language-Image-Pre-training (CLIP) model is leveraged to construct the cross-modality latent space, enabling flexible style representation within this space. Additionally, two topology-encoded tokens are learned to capture the canonical and specific skeleton topologies, facilitating cross-structure topology shifting. Subsequently, the topology-shifted stylization diffusion is designed to generate motion content for the specific skeleton and stylize it in the shifted canonical motion space using multi-modality style descriptions. Through an extensive set of examples, we demonstrate the flexibility and generalizability of our pipeline across various characters and style descriptions. Qualitative and quantitative experiments underscore the superiority of our pipeline over state-of-the-art methods, consistently delivering high-quality stylized motion across a broad spectrum of skeletal structures.

Title: Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs

Authors: Matthew Zurek, Yudong Chen
Subjects: cs.LG, cs.IT, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11477
Pdf URL: https://arxiv.org/pdf/2403.11477
Copy Paste: [[2403.11477]] Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs(https://arxiv.org/abs/2403.11477)
Keywords: generative
Abstract: We study the sample complexity of learning an $\epsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. For weakly communicating MDPs, we establish the complexity bound $\tilde{O}(SA\frac{H}{\epsilon^2})$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\epsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. We further investigate sample complexity in general (non-weakly-communicating) average-reward MDPs. We argue a new transient time parameter $B$ is necessary, establish an $\tilde{O}(SA\frac{B+H}{\epsilon^2})$ complexity bound, and prove a matching (up to log factors) minimax lower bound. Both results are based on reducing the average-reward MDP to a discounted MDP, which requires new ideas in the general setting. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\tilde{O}\left(SA\frac{H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice to learn an $\epsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma\geq 1-1/H$, and $\tilde{O}\left(SA\frac{B+H}{(1-\gamma)^2\epsilon^2}\right)$ samples suffice in general MDPs when $\gamma\geq 1-\frac{1}{B+H}$. Both these results circumvent the well-known lower bound of $\tilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\epsilon^2}\right)$ for arbitrary $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span and transient time parameters. The weakly communicating bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.
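
As a rough consistency check on how a discounted bound of the form $SA\frac{H}{(1-\gamma)^2\epsilon_\gamma^2}$ can yield the average-reward rate $SA\frac{H}{\epsilon^2}$, suppose the reduction picks a discount factor with $1-\gamma=\epsilon/H$ and solves the discounted problem to accuracy $\epsilon_\gamma$ of order $H$. Both choices, and the dropped constants and log factors, are assumptions made here for illustration; the abstract does not state them. For $\epsilon\leq 1$ the choice satisfies the stated regime $\gamma\geq 1-1/H$.

```latex
% Assumed parameter choices (illustrative, constants and log factors dropped):
%   1 - \gamma = \epsilon / H   and   \epsilon_\gamma = H
\[
SA\,\frac{H}{(1-\gamma)^2\,\epsilon_\gamma^2}
  \;=\; SA\,\frac{H}{(\epsilon/H)^2\,H^2}
  \;=\; SA\,\frac{H}{\epsilon^2}.
\]
```

The same substitution with $B+H$ in place of $H$ gives $SA\frac{B+H}{\epsilon^2}$ for the general case.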

Title: VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Authors: Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11481
Pdf URL: https://arxiv.org/pdf/2403.11481
Copy Paste: [[2403.11481]] VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding(https://arxiv.org/abs/2403.11481)
Keywords: large language model
Abstract: We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.

Title: SeisFusion: Constrained Diffusion Model with Input Guidance for 3D Seismic Data Interpolation and Reconstruction

Authors: Shuang Wang, Fei Deng, Peifan Jiang, Zishan Gong, Xiaolin Wei, Yuqing Wang
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2403.11482
Pdf URL: https://arxiv.org/pdf/2403.11482
Copy Paste: [[2403.11482]] SeisFusion: Constrained Diffusion Model with Input Guidance for 3D Seismic Data Interpolation and Reconstruction(https://arxiv.org/abs/2403.11482)
Keywords: diffusion
Abstract: Geographical, physical, or economic constraints often result in missing traces within seismic data, making the reconstruction of complete seismic data a crucial step in seismic data processing. Traditional methods for seismic data reconstruction require the selection of multiple empirical parameters and struggle to handle large-scale continuous missing data. With the development of deep learning, various neural networks have demonstrated powerful reconstruction capabilities. However, these convolutional neural networks represent a point-to-point reconstruction approach that may not cover the entire distribution of the dataset. Consequently, when dealing with seismic data featuring complex missing patterns, such networks may experience varying degrees of performance degradation. In response to this challenge, we propose a novel diffusion model reconstruction framework tailored for 3D seismic data. To constrain the results generated by the diffusion model, we introduce conditional supervision constraints into the diffusion model, constraining the generated data of the diffusion model based on the input data to be reconstructed. We introduce a 3D neural network architecture into the diffusion model, successfully extending the 2D diffusion model to 3D space. Additionally, we refine the model's generation process by incorporating missing data into the generation process, resulting in reconstructions with higher consistency. Through ablation studies determining optimal parameter values, our method exhibits superior reconstruction accuracy when applied to both field datasets and synthetic datasets, effectively addressing a wide range of complex missing patterns. Our implementation is available at https://github.com/WAL-l/SeisFusion.

Title: Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting

Authors: Mingkui Tan, Guohao Chen, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Peilin Zhao, Shuaicheng Niu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.11491
Pdf URL: https://arxiv.org/pdf/2403.11491
Copy Paste: [[2403.11491]] Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting(https://arxiv.org/abs/2403.11491)
Keywords: segmentation
Abstract: Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample. Although recent TTA has shown promising performance, we still face two key challenges: 1) prior methods perform backpropagation for each test sample, resulting in prohibitive optimization costs for many applications; 2) while existing TTA methods can significantly improve the test performance on out-of-distribution data, they often suffer from severe performance degradation on in-distribution data after TTA (known as forgetting). To this end, we have proposed an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples for test-time entropy minimization. To alleviate forgetting, EATA introduces a Fisher regularizer estimated from test samples to constrain important model parameters from drastic changes. However, in EATA, the adopted entropy loss consistently assigns higher confidence to predictions even for samples that are inherently uncertain, leading to overconfident predictions. To tackle this, we further propose EATA with Calibration (EATA-C) to separately exploit the reducible model uncertainty and the inherent data uncertainty for calibrated TTA. Specifically, we measure the model uncertainty by the divergence between predictions from the full network and its sub-networks, on which we propose a divergence loss to encourage consistent predictions instead of overconfident ones. To further recalibrate prediction confidence, we utilize the disagreement among predicted labels as an indicator of the data uncertainty, and then devise a min-max entropy regularizer to selectively increase and decrease prediction confidence for different samples. Experiments on image classification and semantic segmentation verify the effectiveness of our methods.
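
One plausible reading of selecting reliable samples for test-time entropy minimization is to keep only samples whose prediction entropy falls below a threshold. The sketch below illustrates that idea only. The function `select_reliable`, the threshold rule, and the omission of the non-redundancy check are all assumptions made for this example and are not taken from the abstract.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def select_reliable(logits, threshold):
    """Flag samples whose prediction entropy is below `threshold`.

    logits: array of shape (n_samples, n_classes)
    Returns a boolean keep-mask and the per-sample entropies (in nats).
    """
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return entropy < threshold, entropy
```

A caller would then backpropagate the entropy loss only for the flagged samples, for example with a hypothetical threshold such as `0.4 * np.log(n_classes)`, which avoids spending gradient steps on high-entropy predictions.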

Title: CCC++: Optimized Color Classified Colorization with Segment Anything Model (SAM) Empowered Object Selective Color Harmonization

Authors: Mrityunjoy Gain, Avi Deb Raha, Rameswar Debnath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11494
Pdf URL: https://arxiv.org/pdf/2403.11494
Copy Paste: [[2403.11494]] CCC++: Optimized Color Classified Colorization with Segment Anything Model (SAM) Empowered Object Selective Color Harmonization(https://arxiv.org/abs/2403.11494)
Keywords: generative
Abstract: In this paper, we formulate the colorization problem into a multinomial classification problem and then apply a weighted function to classes. We propose a set of formulas to transform color values into color classes and vice versa. To optimize the classes, we experiment with different bin sizes for color class transformation. Observing class appearance, standard deviation, and model parameters on various extremely large-scale real-time images in practice, we propose 532 color classes for our classification task. During training, we propose a class-weighted function based on true class appearance in each batch to ensure proper saturation of individual objects. We adjust the weights of the major classes, which are more frequently observed, by lowering them, while escalating the weights of the minor classes, which are less commonly observed. In our class re-weight formula, we propose a hyper-parameter for finding the optimal trade-off between the major and minor classes. As we apply regularization to enhance the stability of the minor class, occasional minor noise may appear at the object's edges. We propose a novel object-selective color harmonization method empowered by the Segment Anything Model (SAM) to refine and enhance these edges. We propose two new color image evaluation metrics, the Color Class Activation Ratio (CCAR), and the True Activation Ratio (TAR), to quantify the richness of color components. We compare our proposed model with state-of-the-art models using six different datasets: Place, ADE, Celeba, COCO, Oxford 102 Flower, and ImageNet, both qualitatively and quantitatively. The experimental results show that our proposed model outstrips other models in visualization, CNR and in our proposed CCAR and TAR measurement criteria while maintaining satisfactory performance in regression (MSE, PSNR), similarity (SSIM, LPIPS, UIUI), and generative criteria (FID).

Title: Semantic-Enhanced Representation Learning for Road Networks with Temporal Dynamics

Authors: Yile Chen, Xiucheng Li, Gao Cong, Zhifeng Bao, Cheng Long
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11495
Pdf URL: https://arxiv.org/pdf/2403.11495
Copy Paste: [[2403.11495]] Semantic-Enhanced Representation Learning for Road Networks with Temporal Dynamics(https://arxiv.org/abs/2403.11495)
Keywords: transformer
Abstract: In this study, we introduce a novel framework called Toast for learning general-purpose representations of road networks, along with its advanced counterpart DyToast, designed to enhance the integration of temporal dynamics to boost the performance of various time-sensitive downstream tasks. Specifically, we propose to encode two pivotal semantic characteristics intrinsic to road networks: traffic patterns and traveling semantics. To achieve this, we refine the skip-gram module by incorporating auxiliary objectives aimed at predicting the traffic context associated with a target road segment. Moreover, we leverage trajectory data and design pre-training strategies based on Transformer to distill traveling semantics on road networks. DyToast further augments this framework by employing unified trigonometric functions characterized by their beneficial properties, enabling the capture of temporal evolution and dynamic nature of road networks more effectively. With these proposed techniques, we can obtain representations that encode multi-faceted aspects of knowledge within road networks, applicable across both road segment-based applications and trajectory-based applications. Extensive experiments on two real-world datasets across three tasks demonstrate that our proposed framework consistently outperforms the state-of-the-art baselines by a significant margin.

Title: Do CLIPs Always Generalize Better than ImageNet Models?

Authors: Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, Tong Zhang
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11497
Pdf URL: https://arxiv.org/pdf/2403.11497
Copy Paste: [[2403.11497]] Do CLIPs Always Generalize Better than ImageNet Models?(https://arxiv.org/abs/2403.11497)
Keywords: robust
Abstract: Large vision language models, such as CLIPs, have revolutionized modern machine learning. CLIPs have demonstrated great generalizability under distribution shifts, supported by an increasing body of literature. However, the evaluation datasets for CLIPs are variations primarily designed for ImageNet benchmarks, which may not fully reflect the extent to which CLIPs, e.g., pre-trained on LAION, are robust to spurious correlations. To bridge the gap, we collect a real-world dataset called CounterAnimal that contains realistic spurious features found in animal photos. CounterAnimal consists of a) the common group: comprising animals on common backgrounds, and b) the counter group: including animals on unusual backgrounds. The performance drops from the common to counter groups quantify the reliance of models on spurious features (i.e., backgrounds) to predict the animals. We find that CLIPs trained on either LAION or the OpenAI data exhibit notable performance drops on the counter group. Surprisingly, we observe that single-modal models trained on ImageNet are more robust than CLIPs. We provide both theoretical and empirical explanations for why CLIPs still learn spurious features. Our findings suggest that distribution shifts remain an open problem for CLIPs, and one needs to be cautious about test setups when evaluating foundation models pre-trained on a significantly different scale and distribution.

Title: Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors

Authors: Ruicheng Wang, Jianfeng Xiang, Jiaolong Yang, Xin Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11503
Pdf URL: https://arxiv.org/pdf/2403.11503
Copy Paste: [[2403.11503]] Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors(https://arxiv.org/abs/2403.11503)
Keywords: diffusion
Abstract: We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retains their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.

Title: Circle Representation for Medical Instance Object Segmentation

Authors: Juming Xiong, Ethan H. Nguyen, Yilin Liu, Ruining Deng, Regina N Tyree, Hernan Correa, Girish Hiremath, Yaohong Wang, Haichun Yang, Agnes B. Fogo, Yuankai Huo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11507
Pdf URL: https://arxiv.org/pdf/2403.11507
Copy Paste: [[2403.11507]] Circle Representation for Medical Instance Object Segmentation(https://arxiv.org/abs/2403.11507)
Keywords: robust, segmentation
Abstract: Recently, circle representation has been introduced for medical imaging, designed specifically to enhance the detection of instance objects that are spherically shaped (e.g., cells, glomeruli, and nuclei). Given its outstanding effectiveness in instance detection, it is compelling to consider the application of circle representation for segmenting instance medical objects. In this study, we introduce CircleSnake, a simple end-to-end segmentation approach that utilizes circle contour deformation for segmenting ball-shaped medical objects at the instance level. The innovation of CircleSnake lies in these three areas: (1) It substitutes the complex bounding box-to-octagon contour transformation with a more consistent and rotation-invariant bounding circle-to-circle contour adaptation. This adaptation specifically targets ball-shaped medical objects. (2) The circle representation employed in CircleSnake significantly reduces the degrees of freedom to two, compared to eight in the octagon representation. This reduction enhances both the robustness of the segmentation performance and the rotational consistency of the method. (3) CircleSnake is the first end-to-end deep instance segmentation pipeline to incorporate circle representation, encompassing consistent circle detection, circle contour proposal, and circular convolution in a unified framework. This integration is achieved through the novel application of circular graph convolution within the context of circle detection and instance segmentation. In practical applications, such as the detection of glomeruli, nuclei, and eosinophils in pathological images, CircleSnake has demonstrated superior performance and greater rotation invariance when compared to benchmarks. The code has been made publicly available: https://github.com/hrlblab/CircleSnake.
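
Note: the abstract above contrasts a bounding-box-to-octagon contour with a bounding-circle-to-circle contour. As a rough illustration only, the sketch below samples an initial contour from a circle given by a center and a radius. The function name `circle_contour` and the choice of `num_points` are hypothetical, and the code is not taken from the CircleSnake repository.

```python
import math


def circle_contour(cx, cy, radius, num_points=8):
    """Sample num_points evenly spaced vertices on a circle.

    The vertices could serve as an initial contour that a later stage deforms;
    that later stage is not shown here.
    """
    return [
        (
            cx + radius * math.cos(2.0 * math.pi * k / num_points),
            cy + radius * math.sin(2.0 * math.pi * k / num_points),
        )
        for k in range(num_points)
    ]


# Every vertex lies at the same distance from the center, so rotating the
# object about its center maps the circle onto itself.
contour = circle_contour(10.0, 20.0, 5.0, num_points=16)
```

Because the shape is fixed by a center and a single radius, it is unaffected by rotation of the object about that center, which is the property the abstract refers to as rotation consistency.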

Title: DEE: Dual-stage Explainable Evaluation Method for Text Generation

Authors: Shenyu Zhang, Yu Li, Rui Wu, Xiutian Huang, Yongrui Chen, Wenhao Xu, Guilin Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11509
Pdf URL: https://arxiv.org/pdf/2403.11509
Copy Paste: [[2403.11509]] DEE: Dual-stage Explainable Evaluation Method for Text Generation(https://arxiv.org/abs/2403.11509)
Keywords: explainability, generative, large language model
Abstract: Automatic methods for evaluating machine-generated texts hold significant importance due to the expanding applications of generative systems. Conventional methods tend to grapple with a lack of explainability, issuing a solitary numerical score to signify the assessment outcome. Recent advancements have sought to mitigate this limitation by incorporating large language models (LLMs) to offer more detailed error analyses, yet their applicability remains constrained, particularly in industrial contexts where comprehensive error coverage and swift detection are paramount. To alleviate these challenges, we introduce DEE, a Dual-stage Explainable Evaluation method for estimating the quality of text generation. Built upon Llama 2, DEE follows a dual-stage principle guided by stage-specific instructions to perform efficient identification of errors in generated texts in the initial stage and subsequently delves into providing comprehensive diagnostic reports in the second stage. DEE is fine-tuned on our elaborately assembled dataset AntEval, which encompasses 15K examples from 4 real-world applications of Alipay that employ generative systems. The dataset concerns newly emerged issues like hallucination and toxicity, thereby broadening the scope of DEE's evaluation criteria. Experimental results affirm DEE's superiority over existing evaluation methods, achieving significant improvements in both human correlation and efficiency.

Title: SSAP: A Shape-Sensitive Adversarial Patch for Comprehensive Disruption of Monocular Depth Estimation in Autonomous Navigation Applications

Authors: Amira Guesmi, Muhammad Abdullah Hanif, Ihsen Alouani, Bassem Ouni, Muhammad Shafique
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11515
Pdf URL: https://arxiv.org/pdf/2403.11515
Copy Paste: [[2403.11515]] SSAP: A Shape-Sensitive Adversarial Patch for Comprehensive Disruption of Monocular Depth Estimation in Autonomous Navigation Applications(https://arxiv.org/abs/2403.11515)
Keywords: attack, transformer
Abstract: Monocular depth estimation (MDE) has advanced significantly, primarily through the integration of convolutional neural networks (CNNs) and more recently, Transformers. However, concerns about their susceptibility to adversarial attacks have emerged, especially in safety-critical domains like autonomous driving and robotic navigation. Existing approaches for assessing CNN-based depth prediction methods have fallen short in inducing comprehensive disruptions to the vision system, often limited to specific local areas. In this paper, we introduce SSAP (Shape-Sensitive Adversarial Patch), a novel approach designed to comprehensively disrupt monocular depth estimation (MDE) in autonomous navigation applications. Our patch is crafted to selectively undermine MDE in two distinct ways: by distorting estimated distances or by creating the illusion of an object disappearing from the system's perspective. Notably, our patch is shape-sensitive, meaning it considers the specific shape and scale of the target object, thereby extending its influence beyond immediate proximity. Furthermore, our patch is trained to effectively address different scales and distances from the camera. Experimental results demonstrate that our approach induces a mean depth estimation error surpassing 0.5, impacting up to 99% of the targeted region for CNN-based MDE models. Additionally, we investigate the vulnerability of Transformer-based MDE models to patch-based attacks, revealing that SSAP yields a significant error of 0.59 and exerts substantial influence over 99% of the target region on these models.

Title: Efficient and Privacy-Preserving Federated Learning based on Full Homomorphic Encryption

Authors: Yuqi Guo, Lin Li, Zhongxiang Zheng, Hanrui Yun, Ruoyan Zhang, Xiaolin Chang, Zhixuan Gao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.11519
Pdf URL: https://arxiv.org/pdf/2403.11519
Copy Paste: [[2403.11519]] Efficient and Privacy-Preserving Federated Learning based on Full Homomorphic Encryption(https://arxiv.org/abs/2403.11519)
Keywords: security, privacy, biometric, federate
Abstract: Since the first theoretically feasible full homomorphic encryption (FHE) scheme was proposed in 2009, great progress has been achieved. These improvements have moved FHE schemes beyond theory and made them useful for solving practical problems. In this paper, we propose a set of novel Federated Learning Schemes that use the latest homomorphic encryption technologies to improve security, functionality and practicality at the same time. We give comparisons on four practical data sets drawn respectively from the medical, business, biometric and financial fields, covering both horizontal and vertical federated learning scenarios. The experimental results show that our scheme achieves significant improvements in security, efficiency and practicality compared with classical horizontal and vertical federated learning schemes.

Title: Video Object Segmentation with Dynamic Query Modulation

Authors: Hantao Zhou, Runze Hu, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11529
Pdf URL: https://arxiv.org/pdf/2403.11529
Copy Paste: [[2403.11529]] Video Object Segmentation with Dynamic Query Modulation(https://arxiv.org/abs/2403.11529)
Keywords: segmentation
Abstract: Storing intermediate frame segmentations as memory for long-range context modeling, spatial-temporal memory-based methods have recently showcased impressive results in semi-supervised video object segmentation (SVOS). However, these methods face two key limitations: 1) relying on non-local pixel-level matching to read memory, resulting in noisy retrieved features for segmentation; 2) segmenting each object independently without interaction. These shortcomings make the memory-based methods struggle in similar object and multi-object segmentation. To address these issues, we propose a query modulation method, termed QMVOS. This method summarizes object features into dynamic queries and then treats them as dynamic filters for mask prediction, thereby providing high-level descriptions and object-level perception for the model. Efficient and effective multi-object interactions are realized through inter-query attention. Extensive experiments demonstrate that our method can bring significant improvements to the memory-based SVOS method and achieve competitive performance on standard SVOS benchmarks. The code is available at https://github.com/zht8506/QMVOS.

Title: Continual Forgetting for Pre-trained Vision Models

Authors: Hongbo Zhao, Bolin Ni, Haochen Wang, Junsong Fan, Fei Zhu, Yuxi Wang, Yuntao Chen, Gaofeng Meng, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11530
Pdf URL: https://arxiv.org/pdf/2403.11530
Copy Paste: [[2403.11530]] Continual Forgetting for Pre-trained Vision Models(https://arxiv.org/abs/2403.11530)
Keywords: security, privacy, transformer
Abstract: For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners. These requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify two key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. To address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. GS-LoRA is effective, parameter-efficient, data-efficient, and easy to implement. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that GS-LoRA manages to forget specific classes with minimal impact on other classes. Code will be released at https://github.com/bjzhb666/GS-LoRA.
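
Note: the abstract above mentions a group sparse regularization that selects some LoRA groups and zeroes out the others. A standard way to obtain group-level sparsity is a group-lasso penalty, the sum of the L2 norms of each parameter group. The sketch below shows that generic penalty only. The function name `group_sparse_penalty`, the parameter `alpha`, and the flat-list representation of a group are assumptions, and the exact regularizer used by GS-LoRA may differ.

```python
import math


def group_sparse_penalty(groups, alpha=1.0):
    """Group-lasso penalty: alpha times the sum of per-group L2 norms.

    groups is a list of flat lists of floats, one list per parameter group
    (for example, all LoRA weights attached to one Transformer block).
    Penalising the un-squared norm of each group pushes whole groups to zero.
    """
    return alpha * sum(math.sqrt(sum(w * w for w in g)) for g in groups)


# One active group with norm 5 and one group that is already zeroed out.
penalty = group_sparse_penalty([[3.0, 4.0], [0.0, 0.0]], alpha=0.5)
```

A zeroed-out group contributes nothing to the penalty, so the optimizer has no incentive to reactivate it unless the task loss demands it.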

Title: EchoReel: Enhancing Action Generation of Existing Video Diffusion Models

Authors: Jianzhi Liu, Junchen Zhu, Lianli Gao, Jingkuan Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11535
Pdf URL: https://arxiv.org/pdf/2403.11535
Copy Paste: [[2403.11535]] EchoReel: Enhancing Action Generation of Existing Video Diffusion Models(https://arxiv.org/abs/2403.11535)
Keywords: diffusion
Abstract: Recent large-scale video datasets have facilitated the generation of diverse open-domain videos by Video Diffusion Models (VDMs). Nonetheless, the efficacy of VDMs in assimilating complex knowledge from these datasets remains constrained by their inherent scale, leading to suboptimal comprehension and synthesis of numerous actions. In this paper, we introduce EchoReel, a novel approach to augment the capability of VDMs in generating intricate actions by emulating motions from pre-existing videos, which are readily accessible from databases or online repositories. EchoReel seamlessly integrates with existing VDMs, enhancing their ability to produce realistic motions without compromising their fundamental capabilities. Specifically, the Action Prism (AP) is introduced to distill motion information from reference videos, which requires training on only a small dataset. Leveraging the knowledge from pre-trained VDMs, EchoReel incorporates new action features into VDMs through the additional layers, eliminating the need for any further fine-tuning of untrained actions. Extensive experiments demonstrate that EchoReel is not merely replicating the whole content from references, and it significantly improves the generation of realistic actions, even in situations where existing VDMs might directly fail.

Title: OCR is All you need: Importing Multi-Modality into Image-based Defect Detection System

Authors: Chih-Chung Hsu, Chia-Ming Lee, Chun-Hung Sun, Kuang-Ming Wu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11536
Pdf URL: https://arxiv.org/pdf/2403.11536
Copy Paste: [[2403.11536]] OCR is All you need: Importing Multi-Modality into Image-based Defect Detection System(https://arxiv.org/abs/2403.11536)
Keywords: robust
Abstract: Automatic optical inspection (AOI) plays a pivotal role in the manufacturing process, predominantly leveraging high-resolution imaging instruments for scanning purposes. It detects anomalies by analyzing image textures or patterns, making it an essential tool in industrial manufacturing and quality control. Despite its importance, the deployment of models for AOI often faces challenges. These include limited sample sizes, which hinder effective feature learning, variations among source domains, and sensitivities to changes in lighting and camera positions during imaging. These factors collectively compromise the accuracy of model predictions. Traditional AOI often fails to capitalize on the rich mechanism-parameter information from machines or inside images, including statistical parameters, which typically benefit AOI classification. To address this, we introduce an external modality-guided data mining framework, primarily rooted in optical character recognition (OCR), to extract statistical features from images as a second modality to enhance performance, termed OANet (Ocr-Aoi-Net). A key aspect of our approach is the alignment of external modality features, extracted using a single modality-aware model, with image features encoded by a convolutional neural network. This synergy enables a more refined fusion of semantic representations from different modalities. We further introduce feature refinement and a gating function in our OANet to optimize the combination of these features, enhancing inference and decision-making capabilities. Experimental outcomes show that our methodology considerably boosts the recall rate of the defect detection model and maintains high robustness even in challenging scenarios.

Title: Reinforcement Learning with Token-level Feedback for Controllable Text Generation

Authors: Wendi Li, Wei Wei, Kaihe Xu, Wenfeng Xie, Dangyang Chen, Yu Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11558
Pdf URL: https://arxiv.org/pdf/2403.11558
Copy Paste: [[2403.11558]] Reinforcement Learning with Token-level Feedback for Controllable Text Generation(https://arxiv.org/abs/2403.11558)
Keywords: robust, large language model
Abstract: To meet the requirements of real-world applications, it is essential to control generations of large language models (LLMs). Prior research has tried to introduce reinforcement learning (RL) into controllable text generation while most existing methods suffer from overfitting issues (finetuning-based methods) or semantic collapse (post-processing methods). However, current RL methods are generally guided by coarse-grained (sentence/paragraph-level) feedback, which may lead to suboptimal performance owing to semantic twists or progressions within sentences. To tackle that, we propose a novel reinforcement learning algorithm named TOLE which formulates TOken-LEvel rewards for controllable text generation, and employs a "first-quantize-then-noise" paradigm to enhance the robustness of the RL algorithm. Furthermore, TOLE can be flexibly extended to multiple constraints with little computational expense. Experimental results show that our algorithm can achieve superior performance on both single-attribute and multi-attribute control tasks. We have released our code at https://github.com/WindyLee0822/CTG

Title: Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection

Authors: Liren He, Zhengkai Jiang, Jinlong Peng, Liang Liu, Qiangang Du, Xiaobin Hu, Wenbing Zhu, Mingmin Chi, Yabiao Wang, Chengjie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11561
Pdf URL: https://arxiv.org/pdf/2403.11561
Copy Paste: [[2403.11561]] Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection(https://arxiv.org/abs/2403.11561)
Keywords: robust
Abstract: In the field of multi-class anomaly detection, reconstruction-based methods derived from single-class anomaly detection face the well-known challenge of "learning shortcuts", wherein the model fails to learn the patterns of normal samples as it should, opting instead for shortcuts such as identity mapping or artificial noise elimination. Consequently, the model becomes unable to reconstruct genuine anomalies as normal instances, resulting in a failure of anomaly detection. To counter this issue, we present a novel unified feature reconstruction-based anomaly detection framework termed RLR (Reconstruct features from a Learnable Reference representation). Unlike previous methods, RLR utilizes learnable reference representations to compel the model to learn normal feature patterns explicitly, thereby preventing the model from succumbing to the "learning shortcuts" issue. Additionally, RLR incorporates locality constraints into the learnable reference to facilitate more effective normal pattern capture and utilizes a masked learnable key attention mechanism to enhance robustness. Evaluation of RLR on the 15-category MVTec-AD dataset and the 12-category VisA dataset shows superior performance compared to state-of-the-art methods under the unified setting. The code of RLR will be publicly available.

Title: EffiVED: Efficient Video Editing via Text-instruction Diffusion Models

Authors: Zhenghao Zhang, Zuozhuo Dai, Long Qin, Weizhi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11568
Pdf URL: https://arxiv.org/pdf/2403.11568
Copy Paste: [[2403.11568]] EffiVED: Efficient Video Editing via Text-instruction Diffusion Models(https://arxiv.org/abs/2403.11568)
Keywords: diffusion
Abstract: Large-scale text-to-video models have shown remarkable abilities, but their direct application in video editing remains challenging due to limited available datasets. Current video editing methods commonly require per-video fine-tuning of diffusion models or specific inversion optimization to ensure high-fidelity edits. In this paper, we introduce EffiVED, an efficient diffusion-based model that directly supports instruction-guided video editing. To achieve this, we present two efficient workflows to gather video editing pairs, utilizing augmentation and fundamental vision-language techniques. These workflows transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED. Experimental results reveal that EffiVED not only generates high-quality editing videos but also executes rapidly. Finally, we demonstrate that our data collection method significantly improves editing performance and can potentially tackle the scarcity of video editing data. The datasets will be made publicly available upon publication.

Title: Augment Before Copy-Paste: Data and Memory Efficiency-Oriented Instance Segmentation Framework for Sport-scenes

Authors: Chih-Chung Hsu, Chia-Ming Lee, Ming-Shyen Wu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2403.11572
Pdf URL: https://arxiv.org/pdf/2403.11572
Copy Paste: [[2403.11572]] Augment Before Copy-Paste: Data and Memory Efficiency-Oriented Instance Segmentation Framework for Sport-scenes(https://arxiv.org/abs/2403.11572)
Keywords: segmentation
Abstract: Instance segmentation is a fundamental task in computer vision with broad applications across various industries. In recent years, with the proliferation of deep learning and artificial intelligence applications, how to train effective models with limited data has become a pressing issue for both academia and industry. In the Visual Inductive Priors challenge (VIPriors2023), participants must train a model capable of precisely locating individuals on a basketball court, all while working with limited data and without the use of transfer learning or pre-trained models. We propose a Memory effIciency inStance Segmentation framework based on visual inductive prior flow propagation that effectively incorporates inherent prior information from the dataset into both the data preprocessing and data augmentation stages, as well as the inference phase. Experiments by our team (ACVLAB) demonstrate that our model achieves promising performance (0.509 AP@0.50:0.95) even under limited data and memory constraints.

Title: MISS: Memory-efficient Instance Segmentation Framework By Visual Inductive Priors Flow Propagation

Authors: Chih-Chung Hsu, Chia-Ming Lee
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2403.11576
Pdf URL: https://arxiv.org/pdf/2403.11576
Copy Paste: [[2403.11576]] MISS: Memory-efficient Instance Segmentation Framework By Visual Inductive Priors Flow Propagation(https://arxiv.org/abs/2403.11576)
Keywords: robust, segmentation
Abstract: Instance segmentation, a cornerstone task in computer vision, has wide-ranging applications in diverse industries. The advent of deep learning and artificial intelligence has underscored the criticality of training effective models, particularly in data-scarce scenarios - a concern that resonates in both academic and industrial circles. A significant impediment in this domain is the resource-intensive nature of procuring high-quality, annotated data for instance segmentation, a hurdle that amplifies the challenge of developing robust models under resource constraints. In this context, the strategic integration of a visual prior into the training dataset emerges as a potential solution to enhance congruity with the testing data distribution, consequently reducing the dependency on computational resources and the need for highly complex models. However, effectively embedding a visual prior into the learning process remains a complex endeavor. Addressing this challenge, we introduce the MISS (Memory-efficient Instance Segmentation System) framework. MISS leverages visual inductive prior flow propagation, integrating intrinsic prior knowledge from the Synergy-basketball dataset at various stages: data preprocessing, augmentation, training, and inference. Our empirical evaluations underscore the efficacy of MISS, demonstrating commendable performance in scenarios characterized by limited data availability and memory constraints.

Title: 3DGS-Calib: 3D Gaussian Splatting for Multimodal SpatioTemporal Calibration

Authors: Quentin Herau, Moussab Bennehar, Arthur Moreau, Nathan Piasco, Luis Roldao, Dzmitry Tsishkou, Cyrille Migniot, Pascal Vasseur, Cédric Demonceaux
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11577
Pdf URL: https://arxiv.org/pdf/2403.11577
Copy Paste: [[2403.11577]] 3DGS-Calib: 3D Gaussian Splatting for Multimodal SpatioTemporal Calibration(https://arxiv.org/abs/2403.11577)
Keywords: robust
Abstract: Reliable multimodal sensor fusion algorithms require accurate spatiotemporal calibration. Recently, targetless calibration techniques based on implicit neural representations have proven to provide precise and robust results. Nevertheless, such methods are inherently slow to train given the high computational overhead caused by the large number of sampled points required for volume rendering. With the recent introduction of 3D Gaussian Splatting as a faster alternative to implicit representation methods, we propose to leverage this new rendering approach to achieve faster multi-sensor calibration. We introduce 3DGS-Calib, a new calibration method that relies on the speed and rendering accuracy of 3D Gaussian Splatting to achieve multimodal spatiotemporal calibration that is accurate, robust, and with a substantial speed-up compared to methods relying on implicit neural representations. We demonstrate the superiority of our proposal with experimental results on sequences from KITTI-360, a widely used driving dataset.

Title: OurDB: Ouroboric Domain Bridging for Multi-Target Domain Adaptive Semantic Segmentation

Authors: Seungbeom Woo, Geonwoo Baek, Taehoon Kim, Jaemin Na, Joong-won Hwang, Wonjun Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11582
Pdf URL: https://arxiv.org/pdf/2403.11582
Copy Paste: [[2403.11582]] OurDB: Ouroboric Domain Bridging for Multi-Target Domain Adaptive Semantic Segmentation(https://arxiv.org/abs/2403.11582)
Keywords: segmentation
Abstract: Multi-target domain adaptation (MTDA) for semantic segmentation poses a significant challenge, as it involves multiple target domains with varying distributions. The goal of MTDA is to minimize the domain discrepancies among a single source and multi-target domains, aiming to train a single model that excels across all target domains. Previous MTDA approaches typically employ multiple teacher architectures, where each teacher specializes in one target domain to simplify the task. However, these architectures hinder the student model from fully assimilating comprehensive knowledge from all target-specific teachers and escalate training costs with increasing target domains. In this paper, we propose an ouroboric domain bridging (OurDB) framework, offering an efficient solution to the MTDA problem using a single teacher architecture. This framework dynamically cycles through multiple target domains, aligning each domain individually to restrain the biased alignment problem, and utilizes Fisher information to minimize the forgetting of knowledge from previous target domains. We also propose a context-guided class-wise mixup (CGMix) that leverages contextual information tailored to diverse target contexts in MTDA. Experimental evaluations conducted on four urban driving datasets (i.e., GTA5, Cityscapes, IDD, and Mapillary) demonstrate the superiority of our method over existing state-of-the-art approaches.
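
Note: the abstract above says the framework uses Fisher information to limit forgetting of earlier target domains but does not state the loss. One well-known way to do this is an elastic-weight-consolidation-style quadratic penalty, where parameters with high Fisher information are held closer to their earlier values. The sketch below shows that generic penalty under this assumption. The function name `fisher_penalty` and the parameter `lam` are hypothetical, and the loss used by OurDB may differ.

```python
def fisher_penalty(theta, theta_prev, fisher, lam=1.0):
    """EWC-style penalty: (lam / 2) * sum_i F_i * (theta_i - theta_prev_i) ** 2.

    theta      : current parameters (flat list of floats)
    theta_prev : parameters saved after adapting to an earlier target domain
    fisher     : per-parameter (diagonal) Fisher information estimates
    """
    if not (len(theta) == len(theta_prev) == len(fisher)):
        raise ValueError("theta, theta_prev and fisher must have equal length")
    return 0.5 * lam * sum(
        f * (t - p) ** 2 for t, p, f in zip(theta, theta_prev, fisher)
    )


# The second parameter has twice the Fisher information, so moving it by the
# same amount costs twice as much.
penalty = fisher_penalty([1.0, 1.0], [0.0, 0.0], [1.0, 2.0], lam=2.0)
```

Adding such a term to the adaptation loss for a new target domain discourages changes to the parameters that mattered most for domains seen earlier in the cycle.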

Title: Linguacodus: A Synergistic Framework for Transformative Code Generation in Machine Learning Pipelines

Authors: Ekaterina Trofimova, Emil Sataev, Andrey E. Ustyuzhanin
Subjects: cs.LG, cs.AI, cs.CL, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2403.11585
Pdf URL: https://arxiv.org/pdf/2403.11585
Copy Paste: [[2403.11585]] Linguacodus: A Synergistic Framework for Transformative Code Generation in Machine Learning Pipelines(https://arxiv.org/abs/2403.11585)
Keywords: large language model
Abstract: In the ever-evolving landscape of machine learning, seamless translation of natural language descriptions into executable code remains a formidable challenge. This paper introduces Linguacodus, an innovative framework designed to tackle this challenge by deploying a dynamic pipeline that iteratively transforms natural language task descriptions into code through high-level data-shaping instructions. The core of Linguacodus is a fine-tuned large language model (LLM), empowered to evaluate diverse solutions for various problems and select the most fitting one for a given task. This paper details the fine-tuning process, and sheds light on how natural language descriptions can be translated into functional code. Linguacodus represents a substantial leap towards automated code generation, effectively bridging the gap between task descriptions and executable code. It holds great promise for advancing machine learning applications across diverse domains. Additionally, we propose an algorithm capable of transforming a natural description of an ML task into code with minimal human interaction. In extensive experiments on a vast machine learning code dataset originating from Kaggle, we showcase the effectiveness of Linguacodus. The investigations highlight its potential applications across diverse domains, emphasizing its impact on applied machine learning in various scientific fields.

Title: End-to-end multi-modal product matching in fashion e-commerce

Authors: Sándor Tóth, Stephen Wilson, Alexia Tsoukara, Enric Moreu, Anton Masalovich, Lars Roemheld
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11593
Pdf URL: https://arxiv.org/pdf/2403.11593
Copy Paste: [[2403.11593]] End-to-end multi-modal product matching in fashion e-commerce(https://arxiv.org/abs/2403.11593)
Keywords: robust
Abstract: Product matching, the task of identifying different representations of the same product for better discoverability, curation, and pricing, is a key capability for online marketplace and e-commerce companies. We present a robust multi-modal product matching system in an industry setting, where large datasets, data distribution shifts and unseen domains pose challenges. We compare different approaches and conclude that a relatively straightforward projection of pretrained image and text encoders, trained through contrastive learning, yields state-of-the-art results, while balancing cost and performance. Our solution outperforms single modality matching systems and large pretrained models, such as CLIP. Furthermore we show how a human-in-the-loop process can be combined with model-based predictions to achieve near perfect precision in a production system.

Title: CRS-Diff: Controllable Generative Remote Sensing Foundation Model

Authors: Datao Tang, Xiangyong Cao, Xingsong Hou, Zhongyuan Jiang, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11614
Pdf URL: https://arxiv.org/pdf/2403.11614
Copy Paste: [[2403.11614]] CRS-Diff: Controllable Generative Remote Sensing Foundation Model(https://arxiv.org/abs/2403.11614)
Keywords: diffusion, generative
Abstract: The emergence of diffusion models has revolutionized the field of image generation, providing new methods for creating high-quality, high-resolution images across various applications. However, the potential of these models for generating domain-specific images, particularly remote sensing (RS) images, remains largely untapped. RS images that are notable for their high resolution, extensive coverage, and rich information content, bring new challenges that general diffusion models may not adequately address. This paper proposes CRS-Diff, a pioneering diffusion modeling framework specifically tailored for generating remote sensing imagery, leveraging the inherent advantages of diffusion models while integrating advanced control mechanisms to ensure that the imagery is not only visually clear but also enriched with geographic and temporal information. The model integrates global and local control inputs, enabling precise combinations of generation conditions to refine the generation process. A comprehensive evaluation of CRS-Diff has demonstrated its superior capability to generate RS imagery both in a single condition and multiple conditions compared with previous methods in terms of image quality and diversity.

Title: Let's Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model

Authors: Haoyun Xu, Runzhe Zhan, Derek F. Wong, Lidia S. Chao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11621
Pdf URL: https://arxiv.org/pdf/2403.11621
Copy Paste: [[2403.11621]] Let's Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model(https://arxiv.org/abs/2403.11621)
Keywords: large language model
Abstract: Large Language Models (LLMs) are composed of neurons that exhibit various behaviors and roles, which become increasingly diversified as models scale. Recent studies have revealed that not all neurons are active across different datasets, and this sparsity correlates positively with the task-specific ability, leading to advancements in model pruning and training efficiency. Traditional fine-tuning methods engage all parameters of LLMs, which is computationally expensive and may not be necessary. In contrast, Parameter-Efficient Fine-Tuning (PEFT) approaches aim to minimize the number of trainable parameters, yet they still operate at a relatively macro scale (e.g., layer-level). We introduce Neuron-Level Fine-Tuning (NeFT), a novel approach that refines the granularity of parameter training down to the individual neuron, enabling more precise and computationally efficient model updates. The experimental results show that NeFT not only exceeded the performance of full-parameter fine-tuning and PEFT but also provided insights into the analysis of neurons.

Title: LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models

Authors: Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Boxi Wu, Wei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11627
Pdf URL: https://arxiv.org/pdf/2403.11627
Copy Paste: [[2403.11627]] LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models(https://arxiv.org/abs/2403.11627)
Keywords: diffusion
Abstract: Customization generation techniques have significantly advanced the synthesis of specific concepts across varied contexts. Multi-concept customization emerges as the challenging task within this domain. Existing approaches often rely on training a Low-Rank Adaptations (LoRA) fusion matrix of multiple LoRA to merge various concepts into a single image. However, we identify this straightforward method faces two major challenges: 1) concept confusion, which occurs when the model cannot preserve distinct individual characteristics, and 2) concept vanishing, where the model fails to generate the intended subjects. To address these issues, we introduce LoRA-Composer, a training-free framework designed for seamlessly integrating multiple LoRAs, thereby enhancing the harmony among different concepts within generated images. LoRA-Composer addresses concept vanishing through Concept Injection Constraints, enhancing concept visibility via an expanded cross-attention mechanism. To combat concept confusion, Concept Isolation Constraints are introduced, refining the self-attention computation. Furthermore, Latent Re-initialization is proposed to effectively stimulate concept-specific latent within designated regions. Our extensive testing showcases a notable enhancement in LoRA-Composer's performance compared to standard baselines, especially when eliminating the image-based conditions like canny edge or pose estimations. Code is released at https://github.com/Young98CN/LoRA_Composer.

Title: Arc2Face: A Foundation Model of Human Faces

Authors: Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, Jiankang Deng, Bernhard Kainz, Stefanos Zafeiriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11641
Pdf URL: https://arxiv.org/pdf/2403.11641
Copy Paste: [[2403.11641]] Arc2Face: A Foundation Model of Human Faces(https://arxiv.org/abs/2403.11641)
Keywords: robust, diffusion
Abstract: This paper presents Arc2Face, an identity-conditioned face foundation model, which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a higher degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g. FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.

Title: Diffusion-Based Environment-Aware Trajectory Prediction

Authors: Theodor Westny, Björn Olofsson, Erik Frisk
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11643
Pdf URL: https://arxiv.org/pdf/2403.11643
Copy Paste: [[2403.11643]] Diffusion-Based Environment-Aware Trajectory Prediction(https://arxiv.org/abs/2403.11643)
Keywords: diffusion, generative
Abstract: The ability to predict the future trajectories of traffic participants is crucial for the safe and efficient operation of autonomous vehicles. In this paper, a diffusion-based generative model for multi-agent trajectory prediction is proposed. The model is capable of capturing the complex interactions between traffic participants and the environment, accurately learning the multimodal nature of the data. The effectiveness of the approach is assessed on large-scale datasets of real-world traffic scenarios, showing that our model outperforms several well-established methods in terms of prediction accuracy. By the incorporation of differential motion constraints on the model output, we illustrate that our model is capable of generating a diverse set of realistic future trajectories. Through the use of an interaction-aware guidance signal, we further demonstrate that the model can be adapted to predict the behavior of less cooperative agents, emphasizing its practical applicability under uncertain traffic conditions.

Title: LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model

Authors: Yuxin Cao, Jinghao Li, Xi Xiao, Derui Wang, Minhui Xue, Hao Ge, Wei Liu, Guangwu Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11656
Pdf URL: https://arxiv.org/pdf/2403.11656
Copy Paste: [[2403.11656]] LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model(https://arxiv.org/abs/2403.11656)
Keywords: security, attack, segmentation
Abstract: Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can invade such models with a low query budget when the perturbations are semantic-invariant, such as StyleFool. Despite the query efficiency, the naturalness of the minutia areas still requires amelioration, since StyleFool leverages style transfer to all pixels in each frame. To close the gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Benefiting from the popularity and scalable usability of the Segment Anything Model (SAM), we first extract different regions according to semantic information and then track them through the video stream to maintain the temporal consistency. Then, we add style-transfer-based perturbations to several regions selected based on the associative criterion of transfer-based gradient information and regional area. Perturbation fine adjustment is followed to make stylized videos adversarial. We demonstrate that LocalStyleFool can improve both intra-frame and inter-frame naturalness through a human-assessed survey, while maintaining competitive fooling rate and query efficiency. Successful experiments on the high-resolution dataset also showcase that scrupulous segmentation of SAM helps to improve the scalability of adversarial attacks under high-resolution data.

Title: Normalized Validity Scores for DNNs in Regression based Eye Feature Extraction

Authors: Wolfgang Fuhl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11665
Pdf URL: https://arxiv.org/pdf/2403.11665
Copy Paste: [[2403.11665]] Normalized Validity Scores for DNNs in Regression based Eye Feature Extraction(https://arxiv.org/abs/2403.11665)
Keywords: extraction, segmentation
Abstract: We propose an improvement to the landmark validity loss. Landmark detection is widely used in head pose estimation, eyelid shape extraction, as well as pupil and iris segmentation. There are numerous additional applications where landmark detection is used to estimate the shape of complex objects. One part of this process is the accurate and fine-grained detection of the shape. The other part is the validity or inaccuracy per landmark, which can be used to detect unreliable areas, where the shape possibly does not fit, and to improve the accuracy of the entire shape extraction by excluding inaccurate landmarks. We propose a normalization in the loss formulation, which improves the accuracy of the entire approach due to the numerical balance of the normalized inaccuracy. In addition, we propose a margin for the inaccuracy to reduce the impact of gradients, which are produced by negligible errors close to the ground truth.

Title: Binary Noise for Binary Tasks: Masked Bernoulli Diffusion for Unsupervised Anomaly Detection

Authors: Julia Wolleb, Florentin Bieder, Paul Friedrich, Peter Zhang, Alicia Durrer, Philippe C. Cattin
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.11667
Pdf URL: https://arxiv.org/pdf/2403.11667
Copy Paste: [[2403.11667]] Binary Noise for Binary Tasks: Masked Bernoulli Diffusion for Unsupervised Anomaly Detection(https://arxiv.org/abs/2403.11667)
Keywords: diffusion
Abstract: The high performance of denoising diffusion models for image generation has paved the way for their application in unsupervised medical anomaly detection. As diffusion-based methods require a lot of GPU memory and have long sampling times, we present a novel and fast unsupervised anomaly detection approach based on latent Bernoulli diffusion models. We first apply an autoencoder to compress the input images into a binary latent representation. Next, a diffusion model that follows a Bernoulli noise schedule is applied to this latent space and trained to restore binary latent representations from perturbed ones. The binary nature of this diffusion model allows us to identify entries in the latent space that have a high probability of flipping their binary code during the denoising process, which indicates out-of-distribution data. We propose a masking algorithm based on these probabilities, which improves the anomaly detection scores. We achieve state-of-the-art performance compared to other diffusion-based unsupervised anomaly detection algorithms while significantly reducing sampling time and memory consumption. The code is available at https://github.com/JuliaWolleb/Anomaly_berdiff.

Title: Semantic Data Representation for Explainable Windows Malware Detection Models

Authors: Peter Švec, Štefan Balogh, Martin Homola, Ján Kľuka, Tomáš Bisták
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.11669
Pdf URL: https://arxiv.org/pdf/2403.11669
Copy Paste: [[2403.11669]] Semantic Data Representation for Explainable Windows Malware Detection Models(https://arxiv.org/abs/2403.11669)
Keywords: security
Abstract: Ontologies are a standard tool for creating semantic schemata in many knowledge intensive domains of human interest. They are becoming increasingly important also in the areas that have been until very recently dominated by subsymbolic knowledge representation and machine-learning (ML) based data processing. One such area is information security, and specifically, malware detection. We thus propose PE Malware Ontology that offers a reusable semantic schema for Portable Executable (PE - the Windows binary format) malware files. This ontology is inspired by the structure of the EMBER dataset, which focuses on the static malware analysis of PE files. With this proposal, we hope to provide a unified semantic representation for the existing and future PE-malware datasets and facilitate the application of symbolic, neuro-symbolic, or otherwise explainable approaches in the PE-malware-detection domain, which may produce interpretable results described by the terms defined in our ontology. In addition, we also publish semantically treated EMBER data, including fractional datasets, to support the reproducibility of experiments on EMBER. We supplement our work with a preliminary case study, conducted using concept learning, to show the general feasibility of our approach. While we were not able to match the precision of the state-of-the-art ML tools, the learned malware discriminators were interesting and highly interpretable.

Title: Better (pseudo-)labels for semi-supervised instance segmentation

Authors: François Porcher, Camille Couprie, Marc Szafraniec, Jakob Verbeek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11675
Pdf URL: https://arxiv.org/pdf/2403.11675
Copy Paste: [[2403.11675]] Better (pseudo-)labels for semi-supervised instance segmentation(https://arxiv.org/abs/2403.11675)
Keywords: segmentation
Abstract: Despite the availability of large datasets for tasks like image classification and image-text alignment, labeled data for more complex recognition tasks, such as detection and segmentation, is less abundant. In particular, for instance segmentation, annotations are time-consuming to produce, and the distribution of instances is often highly skewed across classes. While semi-supervised teacher-student distillation methods show promise in leveraging vast amounts of unlabeled data, they suffer from miscalibration, resulting in overconfidence in frequently represented classes and underconfidence in rarer ones. Additionally, these methods encounter difficulties in efficiently learning from a limited set of examples. We introduce a dual-strategy to enhance the teacher model's training process, substantially improving the performance on few-shot learning. Secondly, we propose a calibration correction mechanism that enables the student model to correct the teacher's calibration errors. Using our approach, we observed marked improvements over a state-of-the-art supervised baseline performance on the LVIS dataset, with an increase of 2.8% in average precision (AP) and 10.3% gain in AP for rare classes.

Title: NEDS-SLAM: A Novel Neural Explicit Dense Semantic SLAM Framework using 3D Gaussian Splatting

Authors: Yiming Ji, Yang Liu, Guanghu Xie, Boyu Ma, Zongwu Xie
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11679
Pdf URL: https://arxiv.org/pdf/2403.11679
Copy Paste: [[2403.11679]] NEDS-SLAM: A Novel Neural Explicit Dense Semantic SLAM Framework using 3D Gaussian Splatting(https://arxiv.org/abs/2403.11679)
Keywords: robust, segmentation
Abstract: We propose NEDS-SLAM, an Explicit Dense semantic SLAM system based on 3D Gaussian representation, that enables robust 3D semantic mapping, accurate camera tracking, and high-quality rendering in real-time. In the system, we propose a Spatially Consistent Feature Fusion model to reduce the effect of erroneous estimates from pre-trained segmentation head on semantic reconstruction, achieving robust 3D semantic Gaussian mapping. Additionally, we employ a lightweight encoder-decoder to compress the high-dimensional semantic features into a compact 3D Gaussian representation, mitigating the burden of excessive memory consumption. Furthermore, we leverage the advantage of 3D Gaussian splatting, which enables efficient and differentiable novel view rendering, and propose a Virtual Camera View Pruning method to eliminate outlier GS points, thereby effectively enhancing the quality of scene representations. Our NEDS-SLAM method demonstrates competitive performance over existing dense semantic SLAM methods in terms of mapping and tracking accuracy on Replica and ScanNet datasets, while also showing excellent capabilities in 3D dense semantic mapping.

Title: Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding

Authors: Tatsunori Taniai, Ryo Igarashi, Yuta Suzuki, Naoya Chiba, Kotaro Saito, Yoshitaka Ushiku, Kanta Ono
Subjects: cs.LG, cond-mat.mtrl-sci, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2403.11686
Pdf URL: https://arxiv.org/pdf/2403.11686
Copy Paste: [[2403.11686]] Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding(https://arxiv.org/abs/2403.11686)
Keywords: transformer
Abstract: Predicting physical properties of materials from their crystal structures is a fundamental problem in materials science. In peripheral areas such as the prediction of molecular properties, fully connected attention networks have been shown to be successful. However, unlike these finite atom arrangements, crystal structures are infinitely repeating, periodic arrangements of atoms, whose fully connected attention results in infinitely connected attention. In this work, we show that this infinitely connected attention can lead to a computationally tractable formulation, interpreted as neural potential summation, that performs infinite interatomic potential summations in a deeply learned feature space. We then propose a simple yet effective Transformer-based encoder architecture for crystal structures called Crystalformer. Compared to an existing Transformer-based model, the proposed model requires only 29.4% of the number of parameters, with minimal modifications to the original Transformer architecture. Despite the architectural simplicity, the proposed method outperforms state-of-the-art methods for various property regression tasks on the Materials Project and JARVIS-DFT datasets.

Title: TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models

Authors: Lisa Weijler, Muhammad Jehanzeb Mirza, Leon Sick, Can Ekkazan, Pedro Hermosilla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11691
Pdf URL: https://arxiv.org/pdf/2403.11691
Copy Paste: [[2403.11691]] TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models(https://arxiv.org/abs/2403.11691)
Keywords: segmentation
Abstract: Test-Time Training (TTT) proposes to adapt a pre-trained network to changing data distributions on-the-fly. In this work, we propose the first TTT method for 3D semantic segmentation, TTT-KD, which models Knowledge Distillation (KD) from foundation models (e.g. DINOv2) as a self-supervised objective for adaptation to distribution shifts at test-time. Given access to paired image-pointcloud (2D-3D) data, we first optimize a 3D segmentation backbone for the main task of semantic segmentation using the pointclouds and the task of 2D $\to$ 3D KD by using an off-the-shelf 2D pre-trained foundation model. At test-time, our TTT-KD updates the 3D segmentation backbone for each test sample, by using the self-supervised task of knowledge distillation, before performing the final prediction. Extensive evaluations on multiple indoor and outdoor 3D segmentation benchmarks show the utility of TTT-KD, as it improves performance for both in-distribution (ID) and out-of-distribution (OOD) test datasets. We achieve a gain of up to 13% mIoU (7% on average) when the train and test distributions are similar and up to 45% (20% on average) when adapting to OOD test samples.

Title: Urban Scene Diffusion through Semantic Occupancy Map

Authors: Junge Zhang, Qihang Zhang, Li Zhang, Ramana Rao Kompella, Gaowen Liu, Bolei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11697
Pdf URL: https://arxiv.org/pdf/2403.11697
Copy Paste: [[2403.11697]] Urban Scene Diffusion through Semantic Occupancy Map(https://arxiv.org/abs/2403.11697)
Keywords: diffusion
Abstract: Generating unbounded 3D scenes is crucial for large-scale scene understanding and simulation. Urban scenes, unlike natural landscapes, consist of various complex man-made objects and structures such as roads, traffic signs, vehicles, and buildings. To create a realistic and detailed urban scene, it is crucial to accurately represent the geometry and semantics of the underlying objects, going beyond their visual appearance. In this work, we propose UrbanDiffusion, a 3D diffusion model that is conditioned on a Bird's-Eye View (BEV) map and generates an urban scene with geometry and semantics in the form of semantic occupancy map. Our model introduces a novel paradigm that learns the data distribution of scene-level structures within a latent space and further enables the expansion of the synthesized scene into an arbitrary scale. After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes given the BEV maps from the held-out set and also generalize to the synthesized maps from a driving simulator. We further demonstrate its application to scene image synthesis with a pretrained image generator as a prior.

Title: PITA: Physics-Informed Trajectory Autoencoder

Authors: Johannes Fischer, Kevin Rösch, Martin Lauer, Christoph Stiller
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2403.11728
Pdf URL: https://arxiv.org/pdf/2403.11728
Copy Paste: [[2403.11728]] PITA: Physics-Informed Trajectory Autoencoder(https://arxiv.org/abs/2403.11728)
Keywords: generative
Abstract: Validating robotic systems in safety-critical applications requires testing in many scenarios including rare edge cases that are unlikely to occur, requiring to complement real-world testing with testing in simulation. Generative models can be used to augment real-world datasets with generated data to produce edge case scenarios by sampling in a learned latent space. Autoencoders can learn said latent representation for a specific domain by learning to reconstruct the input data from a lower-dimensional intermediate representation. However, the resulting trajectories are not necessarily physically plausible, but instead typically contain noise that is not present in the input trajectory. To resolve this issue, we propose the novel Physics-Informed Trajectory Autoencoder (PITA) architecture, which incorporates a physical dynamics model into the loss function of the autoencoder. This results in smooth trajectories that not only reconstruct the input trajectory but also adhere to the physical model. We evaluate PITA on a real-world dataset of vehicle trajectories and compare its performance to a normal autoencoder and a state-of-the-art action-space autoencoder.

Title: LSKNet: A Foundation Lightweight Backbone for Remote Sensing

Authors: Yuxuan Li, Xiang Li, Yimain Dai, Qibin Hou, Li Liu, Yongxiang Liu, Ming-Ming Cheng, Jian Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11735
Pdf URL: https://arxiv.org/pdf/2403.11735
Copy Paste: [[2403.11735]] LSKNet: A Foundation Lightweight Backbone for Remote Sensing(https://arxiv.org/abs/2403.11735)
Keywords: segmentation
Abstract: Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at https://github.com/zcablii/LSKNet.

Title: Post-Quantum Cryptography: Securing Digital Communication in the Quantum Era

Authors: Dr. G S Mamatha, Namya Dimri, Rasha Sinha
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.11741
Pdf URL: https://arxiv.org/pdf/2403.11741
Copy Paste: [[2403.11741]] Post-Quantum Cryptography: Securing Digital Communication in the Quantum Era(https://arxiv.org/abs/2403.11741)
Keywords: security, attack
Abstract: The advent of quantum computing poses a profound threat to traditional cryptographic systems, exposing vulnerabilities that compromise the security of digital communication channels reliant on RSA, ECC, and similar classical encryption methods. Quantum algorithms, notably Shor's algorithm, exploit the inherent computational power of quantum computers to efficiently solve mathematical problems underlying these cryptographic schemes. In response, post-quantum cryptography (PQC) emerged as a critical field aimed at developing resilient cryptographic algorithms impervious to quantum attacks. This paper delineates the vulnerabilities of classical cryptographic systems to quantum attacks, elucidates the principles of quantum computing, and introduces various PQC algorithms such as lattice-based cryptography, code-based cryptography, hash-based cryptography, and multivariate polynomial cryptography. Highlighting the importance of PQC in securing digital communication amidst quantum computing advancements, this research underscores its pivotal role in safeguarding data integrity, confidentiality, and authenticity in the face of emerging quantum threats.

Title: Embedded Named Entity Recognition using Probing Classifiers

Authors: Nicholas Popovič, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11747
Pdf URL: https://arxiv.org/pdf/2403.11747
Copy Paste: [[2403.11747]] Embedded Named Entity Recognition using Probing Classifiers(https://arxiv.org/abs/2403.11747)
Keywords: extraction
Abstract: Extracting semantic information from generated text is a useful tool for applications such as automated fact checking or retrieval augmented generation. Currently, this requires either separate models during inference, which increases computational cost, or destructive fine-tuning of the language model. Instead, we propose directly embedding information extraction capabilities into pre-trained language models using probing classifiers, enabling efficient simultaneous text generation and information extraction. For this, we introduce an approach called EMBER and show that it enables named entity recognition in decoder-only language models without fine-tuning them and while incurring minimal additional computational cost at inference time. Specifically, our experiments using GPT-2 show that EMBER maintains high token generation rates during streaming text generation, with only a negligible decrease in speed of around 1% compared to a 43.64% slowdown measured for a baseline using a separate NER model. Code and data are available at https://github.com/nicpopovic/EMBER.

Title: Relational Representation Learning Network for Cross-Spectral Image Patch Matching

Authors: Chuang Yu, Yunpeng Liu, Jinmiao Zhao, Dou Quan, Zelin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11751
Pdf URL: https://arxiv.org/pdf/2403.11751
Copy Paste: [[2403.11751]] Relational Representation Learning Network for Cross-Spectral Image Patch Matching(https://arxiv.org/abs/2403.11751)
Keywords: extraction
Abstract: Recently, feature relation learning has drawn widespread attention in cross-spectral image patch matching. However, existing related research focuses on extracting diverse relations between image patch features and ignores sufficient intrinsic feature representations of individual image patches. Therefore, an innovative relational representation learning idea is proposed for the first time, which simultaneously focuses on sufficiently mining the intrinsic features of individual image patches and the relations between image patch features. Based on this, we construct a lightweight Relational Representation Learning Network (RRL-Net). Specifically, we innovatively construct an autoencoder to fully characterize the individual intrinsic features, and introduce a Feature Interaction Learning (FIL) module to extract deep-level feature relations. To further fully mine individual intrinsic features, a lightweight Multi-dimensional Global-to-Local Attention (MGLA) module is constructed to enhance the global feature extraction of individual image patches and capture local dependencies within global features. By combining the MGLA module, we further explore the feature extraction network and construct an Attention-based Lightweight Feature Extraction (ALFE) network. In addition, we propose a Multi-Loss Post-Pruning (MLPP) optimization strategy, which greatly promotes network optimization while avoiding increases in parameters and inference time. Extensive experiments demonstrate that our RRL-Net achieves state-of-the-art (SOTA) performance on multiple public datasets. Our code will be made public later.

Title: Revisiting The Classics: A Study on Identifying and Rectifying Gender Stereotypes in Rhymes and Poems

Authors: Aditya Narayan Sankaran, Vigneshwaran Shankaran, Sampath Lonka, Rajesh Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11752
Pdf URL: https://arxiv.org/pdf/2403.11752
Copy Paste: [[2403.11752]] Revisiting The Classics: A Study on Identifying and Rectifying Gender Stereotypes in Rhymes and Poems(https://arxiv.org/abs/2403.11752)
Keywords: large language model
Abstract: Rhymes and poems are a powerful medium for transmitting cultural norms and societal roles. However, the pervasive existence of gender stereotypes in these works perpetuates biased perceptions and limits the scope of individuals' identities. Past works have shown that stereotyping and prejudice emerge in early childhood, and developmental research on causal mechanisms is critical for understanding and controlling stereotyping and prejudice. This work contributes by gathering a dataset of rhymes and poems to identify gender stereotypes and by proposing a model that identifies gender bias with 97% accuracy. Gender stereotypes were rectified using a Large Language Model (LLM), and its effectiveness was evaluated in a comparative survey against human educator rectifications. To summarize, this work highlights the pervasive nature of gender stereotypes in literary works and reveals the potential of LLMs to rectify gender stereotypes. This study raises awareness and promotes inclusivity within artistic expressions, making a significant contribution to the discourse on gender equality.

Title: Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuhene, Horst Possegger
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11755
Pdf URL: https://arxiv.org/pdf/2403.11755
Copy Paste: [[2403.11755]] Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs(https://arxiv.org/abs/2403.11755)
Keywords: large language model
Abstract: Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.
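Note: The prompt ensembling mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each class already has several prompt embeddings (here, hand-written toy vectors standing in for VLM text-encoder outputs) and that classification uses cosine similarity against the averaged class embedding. The function names and toy data are hypothetical and are not taken from MPVR.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ensemble_class_embedding(prompt_embeddings):
    """Average the normalized prompt embeddings of one class, then renormalize."""
    normed = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def zero_shot_classify(image_embedding, class_prompt_embeddings):
    """Return the label whose ensembled embedding is most similar to the image."""
    img = l2_normalize(image_embedding)
    scores = {}
    for label, embeddings in class_prompt_embeddings.items():
        cls = ensemble_class_embedding(embeddings)
        scores[label] = sum(a * b for a, b in zip(img, cls))  # cosine similarity
    return max(scores, key=scores.get), scores

# Toy stand-ins for text-encoder outputs (assumed values, for illustration only).
toy_prompts = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]],
    "dog": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.2]],
}
label, scores = zero_shot_classify([0.7, 0.2, 0.1], toy_prompts)
```

In a real pipeline the toy vectors would be replaced by embeddings of the LLM-generated prompts; the averaging step is what makes the classifier less sensitive to any single prompt's wording.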

Title: DVN-SLAM: Dynamic Visual Neural SLAM Based on Local-Global Encoding

Authors: Wenhua Wu, Guangming Wang, Ting Deng, Sebastian Aegidius, Stuart Shanks, Valerio Modugno, Dimitrios Kanoulas, Hesheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11776
Pdf URL: https://arxiv.org/pdf/2403.11776
Copy Paste: [[2403.11776]] DVN-SLAM: Dynamic Visual Neural SLAM Based on Local-Global Encoding(https://arxiv.org/abs/2403.11776)
Keywords: robust
Abstract: Recent research on Simultaneous Localization and Mapping (SLAM) based on implicit representation has shown promising results in indoor environments. However, there are still some challenges: the limited scene representation capability of implicit encodings, the uncertainty in the rendering process from implicit representations, and the disruption of consistency by dynamic objects. To address these challenges, we propose a real-time dynamic visual SLAM system based on local-global fusion neural implicit representation, named DVN-SLAM. To improve the scene representation capability, we introduce a local-global fusion neural implicit representation that enables the construction of an implicit map while considering both global structure and local details. To tackle uncertainties arising from the rendering process, we design an information concentration loss for optimization, aiming to concentrate scene information on object surfaces. The proposed DVN-SLAM achieves competitive performance in localization and mapping across multiple datasets. More importantly, DVN-SLAM demonstrates robustness in dynamic scenes, a trait that sets it apart from other NeRF-based methods.

Title: Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm

Authors: Yi Wu, Ziqiang Li, Heliang Zheng, Chaoyue Wang, Bin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11781
Pdf URL: https://arxiv.org/pdf/2403.11781
Copy Paste: [[2403.11781]] Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm(https://arxiv.org/abs/2403.11781)
Keywords: diffusion
Abstract: Drawing on recent advancements in diffusion models for text-to-image generation, identity-preserved personalization has made significant progress in accurately capturing specific identities with just a single reference image. However, existing methods primarily integrate reference images within the text embedding space, leading to a complex entanglement of image and text information, which poses challenges for preserving both identity fidelity and semantic consistency. To tackle this challenge, we propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization. Specifically, we introduce identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information while deactivating the original text cross-attention module of the diffusion model. This ensures that the image stream faithfully represents the identity provided by the reference image while mitigating interference from textual input. Additionally, we introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams. This mechanism not only enhances the fidelity of identity and semantic consistency but also enables convenient control over the styles of the generated images. Extensive experimental results on both raw photo generation and style image generation demonstrate the superior performance of our proposed method.

Title: Construction of Hyper-Relational Knowledge Graphs Using Pre-Trained Large Language Models

Authors: Preetha Datta, Fedor Vitiugin, Anastasiia Chizhikova, Nitin Sawhney
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11786
Pdf URL: https://arxiv.org/pdf/2403.11786
Copy Paste: [[2403.11786]] Construction of Hyper-Relational Knowledge Graphs Using Pre-Trained Large Language Models(https://arxiv.org/abs/2403.11786)
Keywords: large language model
Abstract: Extracting hyper-relations is crucial for constructing comprehensive knowledge graphs, but there are limited supervised methods available for this task. To address this gap, we introduce a zero-shot prompt-based method using OpenAI's GPT-3.5 model for extracting hyper-relational knowledge from text. Comparing our model with a baseline, we achieved promising results, with a recall of 0.77. Although our precision is currently lower, a detailed analysis of the model outputs has uncovered potential pathways for future research in this area.

Title: Deep Medial Voxels: Learned Medial Axis Approximations for Anatomical Shape Modeling

Authors: Antonio Pepe, Richard Schussnig, Jianning Li, Christina Gsaxner, Dieter Schmalstieg, Jan Egger
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11790
Pdf URL: https://arxiv.org/pdf/2403.11790
Copy Paste: [[2403.11790]] Deep Medial Voxels: Learned Medial Axis Approximations for Anatomical Shape Modeling(https://arxiv.org/abs/2403.11790)
Keywords: segmentation
Abstract: Shape reconstruction from imaging volumes is a recurring need in medical image analysis. Common workflows start with a segmentation step, followed by careful post-processing and, finally, ad hoc meshing algorithms. As this sequence can be time-consuming, neural networks are trained to reconstruct shapes through template deformation. These networks deliver state-of-the-art results without manual intervention, but, so far, they have primarily been evaluated on anatomical shapes with little topological variety between individuals. In contrast, other works favor learning implicit shape models, which have multiple benefits for meshing and visualization. Our work follows this direction by introducing deep medial voxels, a semi-implicit representation that faithfully approximates the topological skeleton from imaging volumes and eventually leads to shape reconstruction via convolution surfaces. Our reconstruction technique shows potential for both visualization and computer simulations.

Title: SETA: Semantic-Aware Token Augmentation for Domain Generalization

Authors: Jintao Guo, Lei Qi, Yinghuan Shi, Yang Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11792
Pdf URL: https://arxiv.org/pdf/2403.11792
Copy Paste: [[2403.11792]] SETA: Semantic-Aware Token Augmentation for Domain Generalization(https://arxiv.org/abs/2403.11792)
Keywords: robust, transformer
Abstract: Domain generalization (DG) aims to enhance the model robustness against domain shifts without accessing target domains. A prevalent category of methods for DG is data augmentation, which focuses on generating virtual samples to simulate domain shifts. However, existing augmentation techniques in DG are mainly tailored for convolutional neural networks (CNNs), with limited exploration in token-based architectures, i.e., vision transformer (ViT) and multi-layer perceptrons (MLP) models. In this paper, we study the impact of prior CNN-based augmentation methods on token-based models, revealing their performance is suboptimal due to the lack of incentivizing the model to learn holistic shape information. To tackle the issue, we propose the SEmantic-aware Token Augmentation (SETA) method. SETA transforms token features by perturbing local edge cues while preserving global shape features, thereby enhancing the model learning of shape information. To further enhance the generalization ability of the model, we introduce two stylized variants of our method combined with two state-of-the-art style augmentation methods in DG. We provide a theoretical insight into our method, demonstrating its effectiveness in reducing the generalization risk bound. Comprehensive experiments on five benchmarks prove that our method achieves SOTA performances across various ViT and MLP architectures. Our code is available at https://github.com/lingeringlight/SETA.

Title: Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus

Authors: Seungpil Lee, Woochang Sim, Donghyeon Shin, Sanha Hwang, Wongyu Seo, Jiwon Park, Seokki Lee, Sejin Kim, Sundong Kim
Subjects: cs.CL, cs.AI, cs.ET, cs.SC
Abstract URL: https://arxiv.org/abs/2403.11793
Pdf URL: https://arxiv.org/pdf/2403.11793
Copy Paste: [[2403.11793]] Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus(https://arxiv.org/abs/2403.11793)
Keywords: large language model
Abstract: The existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been results-centric, making it difficult to assess the inference process. We introduce a new approach using the Abstraction and Reasoning Corpus (ARC) dataset to evaluate the inference and contextual understanding abilities of large language models in a process-centric manner. ARC demands rigorous logical structures for problem-solving, making it a benchmark that facilitates the comparison of model inference abilities with humans. Experimental results confirm that while large language models possess weak inference abilities, they still lag in terms of logical coherence, compositionality, and productivity. Our experiments highlight the reasoning capabilities of LLMs, proposing development paths for achieving human-level reasoning.

Title: Low-Cost Privacy-Aware Decentralized Learning

Authors: Sayan Biswas, Davide Frey, Romaric Gaudel, Anne-Marie Kermarrec, Dimitri Lerévérend, Rafael Pires, Rishi Sharma, François Taïani
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2403.11795
Pdf URL: https://arxiv.org/pdf/2403.11795
Copy Paste: [[2403.11795]] Low-Cost Privacy-Aware Decentralized Learning(https://arxiv.org/abs/2403.11795)
Keywords: privacy, protect, attack, membership infer
Abstract: This paper introduces ZIP-DL, a novel privacy-aware decentralized learning (DL) algorithm that relies on adding correlated noise to each model update during the model training process. This technique ensures that the added noise almost neutralizes itself during the aggregation process due to its correlation, thus minimizing the impact on model accuracy. In addition, ZIP-DL does not require multiple communication rounds for noise cancellation, addressing the common trade-off between privacy protection and communication overhead. We provide theoretical guarantees for both convergence speed and privacy, thereby making ZIP-DL applicable to practical scenarios. Our extensive experimental study shows that ZIP-DL achieves the best trade-off between vulnerability and accuracy. In particular, ZIP-DL (i) reduces the effectiveness of a linkability attack by up to 52 points compared to baseline DL, and (ii) achieves up to 37 more accuracy points for the same vulnerability under membership inference attacks against a privacy-preserving competitor.
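Note: The idea of correlated noise that cancels during aggregation can be sketched as follows. This is a simplified illustration, not ZIP-DL itself: it assumes a node draws one Gaussian noise vector per outgoing message and subtracts their mean so the noise sums to zero, and it assumes all messages end up in the same plain average. The helper names are hypothetical. In a real decentralized topology the messages reach different neighbors, so cancellation would only be approximate, as the abstract states.

```python
import random

def zero_sum_noise(num_messages, dim, sigma, rng):
    """Draw Gaussian noise vectors, then center them so they sum to zero."""
    raw = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(num_messages)]
    mean = [sum(r[i] for r in raw) / num_messages for i in range(dim)]
    return [[r[i] - mean[i] for i in range(dim)] for r in raw]

def noisy_messages(model, num_messages, sigma, rng):
    """Create one noisy copy of the model per outgoing message."""
    noise = zero_sum_noise(num_messages, len(model), sigma, rng)
    return [[m + n for m, n in zip(model, nz)] for nz in noise]

rng = random.Random(0)
model = [1.0, -2.0, 0.5]
messages = noisy_messages(model, num_messages=4, sigma=1.0, rng=rng)

# Each individual message is perturbed, but the average recovers the model.
average = [sum(msg[i] for msg in messages) / len(messages) for i in range(len(model))]
```

Each message on its own hides the true parameters behind noise, while the aggregate is unaffected, which is the intuition behind the small accuracy impact described above.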

Title: Is It Really You Who Forgot the Password? When Account Recovery Meets Risk-Based Authentication

Authors: Andre Büttner, Andreas Thue Pedersen, Stephan Wiefling, Nils Gruschka, Luigi Lo Iacono
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.11798
Pdf URL: https://arxiv.org/pdf/2403.11798
Copy Paste: [[2403.11798]] Is It Really You Who Forgot the Password? When Account Recovery Meets Risk-Based Authentication(https://arxiv.org/abs/2403.11798)
Keywords: security, protect, attack
Abstract: Risk-based authentication (RBA) is used in online services to protect user accounts from unauthorized takeover. RBA commonly uses contextual features that indicate a suspicious login attempt when the characteristic attributes of the login context deviate from known and thus expected values. Previous research on RBA and anomaly detection in authentication has mainly focused on the login process. However, recent attacks have revealed vulnerabilities in other parts of the authentication process, specifically in the account recovery function. Consequently, to ensure comprehensive authentication security, the use of anomaly detection in the context of account recovery must also be investigated. This paper presents the first study to investigate risk-based account recovery (RBAR) in the wild. We analyzed the adoption of RBAR by five prominent online services (that are known to use RBA). Our findings confirm the use of RBAR at Google, LinkedIn, and Amazon. Furthermore, we provide insights into the different RBAR mechanisms of these services and explore the impact of multi-factor authentication on them. Based on our findings, we create a first maturity model for RBAR challenges. The goal of our work is to help developers, administrators, and policy-makers gain an initial understanding of RBAR and to encourage further research in this direction.

Title: Counting-Stars: A Simple, Efficient, and Reasonable Strategy for Evaluating Long-Context Large Language Models

Authors: Mingyang Song, Mao Zheng, Xuan Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11802
Pdf URL: https://arxiv.org/pdf/2403.11802
Copy Paste: [[2403.11802]] Counting-Stars: A Simple, Efficient, and Reasonable Strategy for Evaluating Long-Context Large Language Models(https://arxiv.org/abs/2403.11802)
Keywords: robust, large language model
Abstract: While recent research endeavors have concentrated on developing Large Language Models (LLMs) with robust long-context capabilities, due to the lack of appropriate evaluation strategies, relatively little is known about the long-context processing abilities and performance of leading LLMs (e.g., ChatGPT and Kimi Chat). To address this gap, we propose a simple, efficient, and reasonable strategy for evaluating long-context LLMs as a new benchmark, named Counting-Stars. Counting-Stars is designed to require LLMs to fully understand and capture long dependencies in long contexts and be able to collect inter-dependency across multiple pieces of evidence spanning the entire context to finish the task. Based on Counting-Stars, we conduct experiments to evaluate the two leading long-context LLMs, i.e., GPT-4 Turbo and Kimi Chat. The experimental results indicate that GPT-4 Turbo and Kimi Chat achieve significant performance in the long context from 4K to 128K. We further present two intriguing analyses regarding the behavior of LLMs processing long context.

Title: Federated Modality-specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation

Authors: Qian Dai, Dong Wei, Hong Liu, Jinghan Sun, Liansheng Wang, Yefeng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11803
Pdf URL: https://arxiv.org/pdf/2403.11803
Copy Paste: [[2403.11803]] Federated Modality-specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation(https://arxiv.org/abs/2403.11803)
Keywords: federate, segmentation
Abstract: Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, it is not uncommon that some FL participants only possess a subset of the complete imaging modalities, posing inter-modal heterogeneity as a challenge to effectively training a global model on all participants' data. In addition, each participant would expect to obtain a personalized model tailored for its local data characteristics from the FL in such a scenario. In this work, we propose a new FL framework with federated modality-specific encoders and multimodal anchors (FedMEMA) to simultaneously address the two concurrent issues. Above all, FedMEMA employs an exclusive encoder for each modality to account for the inter-modal heterogeneity in the first place. In the meantime, while the encoders are shared by the participants, the decoders are personalized to meet individual needs. Specifically, a server with full-modal data employs a fusion decoder to aggregate and fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation reversely. Meanwhile, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the encoder parameters. On the other end, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up the information loss due to absent modalities while adapting the representations of present ones. FedMEMA is validated on the BraTS 2020 benchmark for multimodal brain tumor segmentation. Results show that it outperforms various up-to-date methods for multimodal and personalized FL and that its novel designs are effective. Our code is available.
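Note: The scaled dot-product cross-attention mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions: client representations act as queries, the global anchors act as both keys and values, and there are no learned projection matrices. The function names and toy vectors are hypothetical and do not come from the FedMEMA code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, anchors):
    """Scaled dot-product attention with anchors used as keys and values."""
    d = len(anchors[0])
    outputs = []
    for q in queries:
        # Similarity of the query to every anchor, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in anchors]
        weights = softmax(scores)
        # Output is the attention-weighted mixture of the anchors.
        outputs.append([sum(w * k[i] for w, k in zip(weights, anchors)) for i in range(d)])
    return outputs

anchors = [[1.0, 0.0], [0.0, 1.0]]   # toy "full-modal" anchors
queries = [[4.0, 0.0], [1.0, 1.0]]   # toy client representations
calibrated = cross_attention(queries, anchors)
```

A query that aligns strongly with one anchor is pulled toward that anchor, while a query equally similar to both receives their even mixture.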

Title: Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation

Authors: Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11808
Pdf URL: https://arxiv.org/pdf/2403.11808
Copy Paste: [[2403.11808]] Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation(https://arxiv.org/abs/2403.11808)
Keywords: transformer, segmentation
Abstract: Existing parameter-efficient fine-tuning (PEFT) methods have achieved significant success on vision transformers (ViTs) adaptation by improving parameter efficiency. However, enhancing inference efficiency during adaptation remains underexplored. This limits the broader application of pre-trained ViT models, especially when the model is computationally expensive. In this paper, we propose Dynamic Tuning (DyT), a novel approach to improve both parameter and inference efficiency for ViT adaptation. Specifically, besides using the lightweight adapter modules, we propose a token dispatcher to distinguish informative tokens from less important ones, allowing the latter to dynamically skip the original block, thereby reducing the redundant computation during inference. Additionally, we explore multiple design variants to find the best practice of DyT. Finally, inspired by the mixture-of-experts (MoE) mechanism, we introduce an enhanced adapter to further boost the adaptation performance. We validate DyT across various tasks, including image/video recognition and semantic segmentation. For instance, DyT achieves comparable or even superior performance compared to existing PEFT methods while incurring only 71%-85% of their FLOPs on the VTAB-1K benchmark.

Title: Metaphor Understanding Challenge Dataset for LLMs

Authors: Xiaoyu Tong, Rochelle Choenni, Martha Lewis, Ekaterina Shutova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11810
Pdf URL: https://arxiv.org/pdf/2403.11810
Copy Paste: [[2403.11810]] Metaphor Understanding Challenge Dataset for LLMs(https://arxiv.org/abs/2403.11810)
Keywords: large language model
Abstract: Metaphors in natural language are a reflection of fundamental cognitive processes such as analogical reasoning and categorisation, and are deeply rooted in everyday communication. Metaphor understanding is therefore an essential task for large language models (LLMs). We release the Metaphor Understanding Challenge Dataset (MUNCH), designed to evaluate the metaphor understanding capabilities of LLMs. The dataset provides over 10k paraphrases for sentences containing metaphor use, as well as 1.5k instances containing inapt paraphrases. The inapt paraphrases were carefully selected to serve as control to determine whether the model indeed performs full metaphor interpretation or rather resorts to lexical similarity. All apt and inapt paraphrases were manually annotated. The metaphorical sentences cover natural metaphor uses across 4 genres (academic, news, fiction, and conversation), and they exhibit different levels of novelty. Experiments with LLaMA and GPT-3.5 demonstrate that MUNCH presents a challenging task for LLMs. The dataset is freely accessible at https://github.com/xiaoyuisrain/metaphor-understanding-challenge.

Title: Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery

Authors: Yuqi Zhang, Guanying Chen, Jiaxing Chen, Shuguang Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11812
Pdf URL: https://arxiv.org/pdf/2403.11812
Copy Paste: [[2403.11812]] Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery(https://arxiv.org/abs/2403.11812)
Keywords: segmentation
Abstract: We present a neural radiance field method for urban-scale semantic and building-level instance segmentation from aerial images by lifting noisy 2D labels to 3D. This is a challenging problem due to two primary reasons. Firstly, objects in urban aerial images exhibit substantial variations in size, including buildings, cars, and roads, which pose a significant challenge for accurate 2D segmentation. Secondly, the 2D labels generated by existing segmentation methods suffer from the multi-view inconsistency problem, especially in the case of aerial images, where each image captures only a small portion of the entire scene. To overcome these limitations, we first introduce a scale-adaptive semantic label fusion strategy that enhances the segmentation of objects of varying sizes by combining labels predicted from different altitudes, harnessing the novel-view synthesis capabilities of NeRF. We then introduce a novel cross-view instance label grouping strategy based on the 3D scene representation to mitigate the multi-view inconsistency problem in the 2D instance labels. Furthermore, we exploit multi-view reconstructed depth priors to improve the geometric quality of the reconstructed radiance field, resulting in enhanced segmentation results. Experiments on multiple real-world urban-scale datasets demonstrate that our approach outperforms existing methods, highlighting its effectiveness.

Title: Problem space structural adversarial attacks for Network Intrusion Detection Systems based on Graph Neural Networks

Authors: Andrea Venturi, Dario Stabili, Mirco Marchetti
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11830
Pdf URL: https://arxiv.org/pdf/2403.11830
Copy Paste: [[2403.11830]] Problem space structural adversarial attacks for Network Intrusion Detection Systems based on Graph Neural Networks(https://arxiv.org/abs/2403.11830)
Keywords: attack, robust
Abstract: Machine Learning (ML) algorithms have become increasingly popular for supporting Network Intrusion Detection Systems (NIDS). Nevertheless, extensive research has shown their vulnerability to adversarial attacks, which involve subtle perturbations to the inputs of the models aimed at compromising their performance. Recent proposals have effectively leveraged Graph Neural Networks (GNN) to produce predictions based also on the structural patterns exhibited by intrusions to enhance the detection robustness. However, the adoption of GNN-based NIDS introduces new types of risks. In this paper, we propose the first formalization of adversarial attacks specifically tailored for GNN in network intrusion detection. Moreover, we outline and model the problem space constraints that attackers need to consider to carry out feasible structural attacks in real-world scenarios. As a final contribution, we conduct an extensive experimental campaign in which we launch the proposed attacks against state-of-the-art GNN-based NIDS. Our findings demonstrate the increased robustness of the models against classical feature-based adversarial attacks, while highlighting their susceptibility to structure-based attacks.

Title: SSCAE -- Semantic, Syntactic, and Context-aware natural language Adversarial Examples generator

Authors: Javad Rafiei Asl, Mohammad H. Rafiei, Manar Alohaly, Daniel Takabi
Subjects: cs.CL, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11833
Pdf URL: https://arxiv.org/pdf/2403.11833
Copy Paste: [[2403.11833]] SSCAE -- Semantic, Syntactic, and Context-aware natural language Adversarial Examples generator(https://arxiv.org/abs/2403.11833)
Keywords: attack, robust
Abstract: Machine learning models are vulnerable to maliciously crafted Adversarial Examples (AEs). Training a machine learning model with AEs improves its robustness and stability against adversarial attacks. It is essential to develop models that produce high-quality AEs. Developing such models has been much slower in natural language processing (NLP) than in areas such as computer vision. This paper introduces a practical and efficient adversarial attack model called SSCAE for Semantic, Syntactic, and Context-aware natural language AEs generator. SSCAE identifies important words and uses a masked language model to generate an early set of substitutions. Next, two well-known language models are employed to evaluate the initial set in terms of semantic and syntactic characteristics. We introduce (1) a dynamic threshold to capture more efficient perturbations and (2) a local greedy search to generate high-quality AEs. As a black-box method, SSCAE generates humanly imperceptible and context-aware AEs that preserve semantic consistency and the source language's syntactical and grammatical requirements. The effectiveness and superiority of the proposed SSCAE model are illustrated with fifteen comparative experiments and extensive sensitivity analysis for parameter optimization. SSCAE outperforms the existing models in all experiments while maintaining a higher semantic consistency with a lower query number and a comparable perturbation rate.

Title: Towards Understanding the Relationship between In-context Learning and Compositional Generalization

Authors: Sungjun Han, Sebastian Padó
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11834
Pdf URL: https://arxiv.org/pdf/2403.11834
Copy Paste: [[2403.11834]] Towards Understanding the Relationship between In-context Learning and Compositional Generalization(https://arxiv.org/abs/2403.11834)
Keywords: transformer
Abstract: According to the principle of compositional generalization, the meaning of a complex expression can be understood as a function of the meaning of its parts and of how they are combined. This principle is crucial for human language processing and also, arguably, for NLP models in the face of out-of-distribution data. However, many neural network models, including Transformers, have been shown to struggle with compositional generalization. In this paper, we hypothesize that forcing models to in-context learn can provide an inductive bias to promote compositional generalization. To test this hypothesis, we train a causal Transformer in a setting that renders ordinary learning very difficult: we present it with different orderings of the training instance and shuffle instance labels. This corresponds to training the model on all possible few-shot learning problems attainable from the dataset. The model can solve the task, however, by utilizing earlier examples to generalize to later ones (i.e. in-context learning). In evaluations on the datasets, SCAN, COGS, and GeoQuery, models trained in this manner indeed show improved compositional generalization. This indicates the usefulness of in-context learning problems as an inductive bias for generalization.
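The training setup described in this abstract (reordering training instances and shuffling their labels so that only in-context inference can solve the task) can be illustrated with a minimal sketch. The function name `make_episode`, the toy data, and the reading of "shuffle instance labels" as a per-episode permutation of the label set are all assumptions for illustration, not the authors' implementation.

```python
import random

def make_episode(dataset, k, rng):
    """Build one few-shot episode (hypothetical helper).

    Samples k (input, label) pairs in random order and relabels them with a
    permutation of the label set drawn fresh for this episode, so the mapping
    from inputs to labels must be inferred from earlier examples in context.
    """
    examples = rng.sample(dataset, k)           # random subset, random order
    labels = sorted({y for _, y in dataset})    # fixed label vocabulary
    shuffled = labels[:]
    rng.shuffle(shuffled)                       # per-episode label permutation
    mapping = dict(zip(labels, shuffled))
    return [(x, mapping[y]) for x, y in examples], mapping

# Toy data: made-up inputs and labels, not taken from SCAN, COGS, or GeoQuery.
toy = [("jump", "A"), ("walk", "B"), ("run", "C"), ("look", "D")]
episode, mapping = make_episode(toy, 3, random.Random(0))
```

Because the permutation changes from episode to episode, memorizing a fixed input-to-label association does not help; within an episode the relabeling stays consistent, which is what makes earlier examples informative about later ones.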

Title: Agent3D-Zero: An Agent for Zero-shot 3D Understanding

Authors: Sha Zhang, Di Huang, Jiajun Deng, Shixiang Tang, Wanli Ouyang, Tong He, Yanyong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11835
Pdf URL: https://arxiv.org/pdf/2403.11835
Copy Paste: [[2403.11835]] Agent3D-Zero: An Agent for Zero-shot 3D Understanding(https://arxiv.org/abs/2403.11835)
Keywords: large language model
Abstract: The ability to understand and reason about the 3D real world is a crucial milestone towards artificial general intelligence. The current common practice is to finetune Large Language Models (LLMs) with 3D data and texts to enable 3D understanding. Despite their effectiveness, these approaches are inherently limited by the scale and diversity of the available 3D data. Alternatively, in this work, we introduce Agent3D-Zero, an innovative 3D-aware agent framework addressing 3D scene understanding in a zero-shot manner. The essence of our approach centers on reconceptualizing the challenge of 3D scene perception as a process of understanding and synthesizing insights from multiple images, inspired by how human beings attempt to understand 3D scenes. By consolidating this idea, we propose a novel way to make use of a Large Visual Language Model (VLM) via actively selecting and analyzing a series of viewpoints for 3D understanding. Specifically, given an input 3D scene, Agent3D-Zero first processes a bird's-eye view image with custom-designed visual prompts, then iteratively chooses the next viewpoints to observe and summarize the underlying knowledge. A distinctive advantage of Agent3D-Zero is the introduction of novel visual prompts, which significantly unleash the VLMs' ability to identify the most informative viewpoints and thus facilitate observing 3D scenes. Extensive experiments demonstrate the effectiveness of the proposed framework in understanding diverse and previously unseen 3D environments.

Title: Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models

Authors: Yi Luo, Zhenghao Lin, Yuhao Zhang, Jiashuo Sun, Chen Lin, Chengjin Xu, Xiangdong Su, Yelong Shen, Jian Guo, Yeyun Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11838
Pdf URL: https://arxiv.org/pdf/2403.11838
Copy Paste: [[2403.11838]] Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models(https://arxiv.org/abs/2403.11838)
Keywords: security, privacy, large language model
Abstract: Large Language Models (LLMs) exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One of the current alignment techniques includes principle-driven integration, but it faces challenges arising from the imprecision of manually crafted rules and inadequate risk perception in models without safety training. To address these, we introduce Guide-Align, a two-stage approach. Initially, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, thereby establishing a comprehensive library of guidelines and models for input-guidelines retrieval. Subsequently, the retrieval model correlates new inputs with pertinent guidelines, guiding LLMs in response generation to ensure safe and high-quality outputs, thus aligning with human values. An additional optional stage involves fine-tuning a model with new well-aligned datasets generated through the process implemented in the second stage. Our method customizes guidelines to accommodate diverse inputs, thereby enhancing the fine-grainedness and comprehensiveness of the guideline library. Furthermore, it incorporates safety expertise from a safety-trained LLM through a lightweight retrieval model. We evaluated our approach on three benchmarks, demonstrating significant improvements in LLM security and quality. Notably, our fine-tuned model, Labrador, even at 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities.

Title: Near-Optimal Solutions of Constrained Learning Problems

Authors: Juan Elenter, Luiz F. O. Chamon, Alejandro Ribeiro
Subjects: cs.LG, eess.SP, math.OC
Abstract URL: https://arxiv.org/abs/2403.11844
Pdf URL: https://arxiv.org/pdf/2403.11844
Copy Paste: [[2403.11844]] Near-Optimal Solutions of Constrained Learning Problems(https://arxiv.org/abs/2403.11844)
Keywords: robust, fair
Abstract: With the widespread adoption of machine learning systems, the need to curtail their behavior has become increasingly apparent. This is evidenced by recent advancements towards developing models that satisfy robustness, safety, and fairness requirements. These requirements can be imposed (with generalization guarantees) by formulating constrained learning problems that can then be tackled by dual ascent algorithms. Yet, though these algorithms converge in objective value, even in non-convex settings, they cannot guarantee that their outcome is feasible. Doing so requires randomizing over all iterates, which is impractical in virtually any modern application. Still, final iterates have been observed to perform well in practice. In this work, we address this gap between theory and practice by characterizing the constraint violation of Lagrangian minimizers associated with optimal dual variables, despite lack of convexity. To do this, we leverage the fact that non-convex, finite-dimensional constrained learning problems can be seen as parametrizations of convex, functional problems. Our results show that rich parametrizations effectively mitigate the issue of feasibility in dual methods, shedding light on prior empirical successes of dual learning. We illustrate our findings in fair learning tasks.
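For readers unfamiliar with dual ascent, the following minimal sketch shows the generic primal-minimize, dual-update loop on a toy convex problem: minimize (x - 2)^2 subject to x <= 1. The problem, the step size, and the function name `dual_ascent` are illustrative assumptions; the paper itself concerns non-convex learning problems, which this toy does not capture.

```python
def dual_ascent(step=0.5, iters=200):
    """Dual ascent on: minimize (x - 2)^2 subject to x - 1 <= 0.

    The Lagrangian is L(x, lam) = (x - 2)^2 + lam * (x - 1), whose minimizer
    over x has the closed form x = 2 - lam / 2 for this toy problem.
    """
    lam = 0.0
    x = 2.0
    for _ in range(iters):
        x = 2.0 - lam / 2.0                       # primal step: minimize the Lagrangian in x
        lam = max(0.0, lam + step * (x - 1.0))    # dual step: ascend along the constraint slack
    return x, lam

x_final, lam_final = dual_ascent()
```

On this toy problem the final iterate approaches the constrained optimum x = 1 with multiplier 2. The gap the abstract addresses is that no such feasibility guarantee for the final iterate holds in general non-convex settings.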

Title: GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

Authors: Ziying Song, Lei Yang, Shaoqing Xu, Lin Liu, Dongyang Xu, Caiyan Jia, Feiyang Jia, Li Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11848
Pdf URL: https://arxiv.org/pdf/2403.11848
Copy Paste: [[2403.11848]] GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection(https://arxiv.org/abs/2403.11848)
Keywords: robust
Abstract: Integrating LiDAR and camera information into Bird's-Eye-View (BEV) representation has emerged as a crucial aspect of 3D object detection in autonomous driving. However, existing methods are susceptible to the inaccurate calibration relationship between LiDAR and the camera sensor. Such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a robust fusion framework called GraphBEV. Addressing errors caused by inaccurate point cloud projection, we introduce a Local Align module that employs neighbor-aware depth features via Graph matching. Additionally, we propose a Global Align module to rectify the misalignment between LiDAR and camera BEV features. Our GraphBEV framework achieves state-of-the-art performance, with an mAP of 70.1%, surpassing BEV Fusion by 1.6% on the nuScenes validation set. Importantly, our GraphBEV outperforms BEV Fusion by 8.3% under conditions with misalignment noise.

Title: Complete and Efficient Graph Transformers for Crystal Material Property Prediction

Authors: Keqiang Yan, Cong Fu, Xiaofeng Qian, Xiaoning Qian, Shuiwang Ji
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2403.11857
Pdf URL: https://arxiv.org/pdf/2403.11857
Copy Paste: [[2403.11857]] Complete and Efficient Graph Transformers for Crystal Material Property Prediction(https://arxiv.org/abs/2403.11857)
Keywords: transformer
Abstract: Crystal structures are characterized by atomic bases within a primitive unit cell that repeats along a regular lattice throughout 3D space. The periodic and infinite nature of crystals poses unique challenges for geometric graph representation learning. Specifically, constructing graphs that effectively capture the complete geometric information of crystals and handle chiral crystals remains an unsolved and challenging problem. In this paper, we introduce a novel approach that utilizes the periodic patterns of unit cells to establish the lattice-based representation for each atom, enabling efficient and expressive graph representations of crystals. Furthermore, we propose ComFormer, an SE(3) transformer designed specifically for crystalline materials. ComFormer includes two variants: iComFormer, which employs invariant geometric descriptors of Euclidean distances and angles, and eComFormer, which utilizes equivariant vector representations. Experimental results demonstrate the state-of-the-art predictive accuracy of ComFormer variants on various tasks across three widely-used crystal benchmarks. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).

Title: GPT-4 as Evaluator: Evaluating Large Language Models on Pest Management in Agriculture

Authors: Shanglong Yang, Zhipeng Yuan, Shunbao Li, Ruoling Peng, Kang Liu, Po Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11858
Pdf URL: https://arxiv.org/pdf/2403.11858
Copy Paste: [[2403.11858]] GPT-4 as Evaluator: Evaluating Large Language Models on Pest Management in Agriculture(https://arxiv.org/abs/2403.11858)
Keywords: transformer, generative, large language model
Abstract: In the rapidly evolving field of artificial intelligence (AI), the application of large language models (LLMs) in agriculture, particularly in pest management, remains nascent. We aimed to prove the feasibility by evaluating the content of the pest management advice generated by LLMs, including the Generative Pre-trained Transformer (GPT) series from OpenAI and the FLAN series from Google. Considering the context-specific properties of agricultural advice, automatically measuring or quantifying the quality of text generated by LLMs becomes a significant challenge. We proposed an innovative approach, using GPT-4 as an evaluator, to score the generated content on Coherence, Logical Consistency, Fluency, Relevance, Comprehensibility, and Exhaustiveness. Additionally, we integrated an expert system based on crop threshold data as a baseline to obtain scores for Factual Accuracy on whether pests found in crop fields warrant management action. Each model's score was weighted by percentage to obtain a final score. The results showed that GPT-3.5 and GPT-4 outperform the FLAN models in most evaluation categories. Furthermore, the use of instruction-based prompting containing domain-specific knowledge proved the feasibility of LLMs as an effective tool in agriculture, with an accuracy rate of 72%, demonstrating LLMs' effectiveness in providing pest management suggestions.

Title: Towards automated formal security analysis of SAML V2.0 Web Browser SSO standard - the POST/Artifact use case

Authors: Zvonimir Hartl, Ante Đerek
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.11859
Pdf URL: https://arxiv.org/pdf/2403.11859
Copy Paste: [[2403.11859]] Towards automated formal security analysis of SAML V2.0 Web Browser SSO standard - the POST/Artifact use case(https://arxiv.org/abs/2403.11859)
Keywords: security
Abstract: Single Sign-On (SSO) protocols streamline user authentication with a unified login for multiple online services, improving usability and security. One of the most common SSO protocol frameworks - the Security Assertion Markup Language V2.0 (SAML) Web SSO Profile - has been in use for more than two decades, primarily in government, education and enterprise environments. Despite its mission-critical nature, only certain deployments and configurations of the Web SSO Profile have been formally analyzed. This paper attempts to bridge this gap by performing a comprehensive formal security analysis of the SAML V2.0 SP-initiated SSO with POST/Artifact Bindings use case. Rather than focusing on a specific deployment and configuration, we closely follow the specification with the goal of capturing many different deployments allowed by the standard. Modeling and analysis are performed using the Tamarin prover, a state-of-the-art tool for automated verification of security protocols in the symbolic model of cryptography. Technically, we build a meta-model of the use case that we instantiate to eight different protocol variants. Using the Tamarin prover, we formally verify a number of critical security properties for those protocol variants, while identifying certain drawbacks and potential vulnerabilities.

Title: IDF-CR: Iterative Diffusion Process for Divide-and-Conquer Cloud Removal in Remote-sensing Images

Authors: Meilin Wang, Yexing Song, Pengxu Wei, Xiaoyu Xian, Yukai Shi, Liang Lin
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.11870
Pdf URL: https://arxiv.org/pdf/2403.11870
Copy Paste: [[2403.11870]] IDF-CR: Iterative Diffusion Process for Divide-and-Conquer Cloud Removal in Remote-sensing Images(https://arxiv.org/abs/2403.11870)
Keywords: diffusion, generative
Abstract: Deep learning technologies have demonstrated their effectiveness in removing cloud cover from optical remote-sensing images. Convolutional Neural Networks (CNNs) exert dominance in the cloud removal tasks. However, constrained by the inherent limitations of convolutional operations, CNNs can address only a modest fraction of cloud occlusion. In recent years, diffusion models have achieved state-of-the-art (SOTA) proficiency in image generation and reconstruction due to their formidable generative capabilities. Inspired by the rapid development of diffusion models, we first present an iterative diffusion process for cloud removal (IDF-CR), which exhibits strong generative capabilities to achieve component divide-and-conquer cloud removal. IDF-CR consists of a pixel space cloud removal module (Pixel-CR) and a latent space iterative noise diffusion network (IND). Specifically, IDF-CR is divided into two-stage models that address pixel space and latent space. The two-stage model facilitates a strategic transition from preliminary cloud reduction to meticulous detail refinement. In the pixel space stage, Pixel-CR initiates the processing of cloudy images, yielding a suboptimal cloud removal prior to providing the diffusion model with prior cloud removal knowledge. In the latent space stage, the diffusion model transforms low-quality cloud removal into high-quality clean output. We refine the Stable Diffusion by implementing ControlNet. In addition, an unsupervised iterative noise refinement (INR) module is introduced for the diffusion model to optimize the distribution of the predicted noise, thereby enhancing advanced detail recovery. Our model performs best compared with other SOTA methods, including image reconstruction and optical remote-sensing cloud removal methods, on optical remote-sensing datasets.

Title: CO3: Low-resource Contrastive Co-training for Generative Conversational Query Rewrite

Authors: Yifei Yuan, Chen Shi, Runze Wang, Liyi Chen, Renjun Hu, Zengming Zhang, Feijun Jiang, Wai Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.11873
Pdf URL: https://arxiv.org/pdf/2403.11873
Copy Paste: [[2403.11873]] CO3: Low-resource Contrastive Co-training for Generative Conversational Query Rewrite(https://arxiv.org/abs/2403.11873)
Keywords: robust, generative
Abstract: Generative query rewrite generates reconstructed query rewrites using the conversation history but relies heavily on gold rewrite pairs that are expensive to obtain. Recently, few-shot learning has been gaining popularity for this task, but these methods are sensitive to the inherent noise due to limited data size. Besides, both attempts face performance degradation when there is a language style shift between training and testing cases. To this end, we study low-resource generative conversational query rewrite that is robust to both noise and language style shift. The core idea is to utilize massive unlabeled data to make further improvements via a contrastive co-training paradigm. Specifically, we co-train two dual models (namely Rewriter and Simplifier) such that each of them provides extra guidance through pseudo-labeling for enhancing the other in an iterative manner. We also leverage contrastive learning with data augmentation, which enables our model to pay more attention to the truly valuable information than to the noise. Extensive experiments demonstrate the superiority of our model under both few-shot and zero-shot scenarios. We also verify the better generalization ability of our model when encountering language style shift.

Title: Towards Real-Time Fast Unmanned Aerial Vehicle Detection Using Dynamic Vision Sensors

Authors: Jakub Mandula, Jonas Kühne, Luca Pascarella, Michele Magno
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.11875
Pdf URL: https://arxiv.org/pdf/2403.11875
Copy Paste: [[2403.11875]] Towards Real-Time Fast Unmanned Aerial Vehicle Detection Using Dynamic Vision Sensors(https://arxiv.org/abs/2403.11875)
Keywords: security, privacy
Abstract: Unmanned Aerial Vehicles (UAVs) are gaining popularity in civil and military applications. However, uncontrolled access to restricted areas threatens privacy and security. Thus, prevention and detection of UAVs are pivotal to guarantee confidentiality and safety. Although active scanning, mainly based on radars, is one of the most accurate technologies, it can be expensive and less versatile than passive inspections, e.g., object recognition. Dynamic vision sensors (DVS) are bio-inspired event-based vision models that leverage timestamped pixel-level brightness changes in fast-moving scenes that adapt well to low-latency object detection. This paper presents F-UAV-D (Fast Unmanned Aerial Vehicle Detector), an embedded system that enables fast-moving drone detection. In particular, we propose a setup to exploit DVS as an alternative to RGB cameras in a real-time and low-power configuration. Our approach leverages the high-dynamic range (HDR) and background suppression of DVS and, when trained with various fast-moving drones, outperforms RGB input in suboptimal ambient conditions such as low illumination and fast-moving scenes. Our results show that F-UAV-D can (i) detect drones using less than 15 W on average and (ii) perform real-time inference (i.e., <50 ms) by leveraging the CPU and GPU nodes of our edge computer.

Title: InTeX: Interactive Text-to-texture Synthesis via Unified Depth-aware Inpainting

Authors: Jiaxiang Tang, Ruijie Lu, Xiaokang Chen, Xiang Wen, Gang Zeng, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11878
Pdf URL: https://arxiv.org/pdf/2403.11878
Copy Paste: [[2403.11878]] InTeX: Interactive Text-to-texture Synthesis via Unified Depth-aware Inpainting(https://arxiv.org/abs/2403.11878)
Keywords: diffusion
Abstract: Text-to-texture synthesis has become a new frontier in 3D content creation thanks to the recent advances in text-to-image models. Existing methods primarily adopt a combination of pretrained depth-aware diffusion and inpainting models, yet they exhibit shortcomings such as 3D inconsistency and limited controllability. To address these challenges, we introduce InTeX, a novel framework for interactive text-to-texture synthesis. 1) InTeX includes a user-friendly interface that facilitates interaction and control throughout the synthesis process, enabling region-specific repainting and precise texture editing. 2) Additionally, we develop a unified depth-aware inpainting model that integrates depth information with inpainting cues, effectively mitigating 3D inconsistencies and improving generation speed. Through extensive experiments, our framework has proven to be both practical and effective in text-to-texture synthesis, paving the way for high-quality 3D content creation.

Title: ReGenNet: Towards Human Action-Reaction Synthesis

Authors: Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, Wenjun Zeng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11882
Pdf URL: https://arxiv.org/pdf/2403.11882
Copy Paste: [[2403.11882]] ReGenNet: Towards Human Action-Reaction Synthesis(https://arxiv.org/abs/2403.11882)
Keywords: diffusion, transformer, generative
Abstract: Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects, while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic interaction periods. In this paper, we comprehensively analyze the asymmetric, dynamic, synchronous, and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with, we propose to annotate the actor-reactor order of the interaction sequences for the NTU120, InterHuman, and Chi3D datasets. Based on them, a diffusion-based generative model with a Transformer decoder architecture called ReGenNet together with an explicit distance-based interaction loss is proposed to predict human reactions in an online manner, where the future states of actors are unavailable to reactors. Quantitative and qualitative results show that our method can generate instant and plausible human reactions compared to the baselines, and can generalize to unseen actor motions and viewpoint changes.

Title: QueryAgent: A Reliable and Efficient Reasoning Framework with Environmental Feedback based Self-Correction

Authors: Xiang Huang, Sitao Cheng, Shanshan Huang, Jiayu Shen, Yong Xu, Chaoyun Zhang, Yuzhong Qu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11886
Pdf URL: https://arxiv.org/pdf/2403.11886
Copy Paste: [[2403.11886]] QueryAgent: A Reliable and Efficient Reasoning Framework with Environmental Feedback based Self-Correction(https://arxiv.org/abs/2403.11886)
Keywords: large language model
Abstract: Employing Large Language Models (LLMs) for semantic parsing has achieved remarkable success. However, we find existing methods fall short in terms of reliability and efficiency when hallucinations are encountered. In this paper, we address these challenges with a framework called QueryAgent, which solves a question step-by-step and performs step-wise self-correction. We introduce an environmental feedback-based self-correction method called ERASER. Unlike traditional approaches, ERASER leverages rich environmental feedback in the intermediate steps to perform selective and differentiated self-correction only when necessary. Experimental results demonstrate that QueryAgent notably outperforms all previous few-shot methods using only one example on GrailQA and GraphQ by 7.0 and 15.0 F1. Moreover, our approach exhibits superiority in terms of efficiency, including runtime, query overhead, and API invocation costs. By leveraging ERASER, we further improve another baseline (i.e., AgentBench) by approximately 10 points, revealing the strong transferability of our approach.

Title: SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules

Authors: Xiangyu Chen, Jing Liu, Ye Wang, Pu (Perry) Wang, Matthew Brand, Guanghui Wang, Toshiaki Koike-Akino
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11887
Pdf URL: https://arxiv.org/pdf/2403.11887
Copy Paste: [[2403.11887]] SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules(https://arxiv.org/abs/2403.11887)
Keywords: diffusion, large language model
Abstract: Low-rank adaptation (LoRA) and its variants are widely employed in fine-tuning large models, including large language models for natural language processing and diffusion models for computer vision. This paper proposes a generalized framework called SuperLoRA that unifies and extends different LoRA variants, which can be realized under different hyper-parameter settings. Introducing grouping, folding, shuffling, projecting, and tensor factoring, SuperLoRA offers high flexibility compared with other LoRA variants and demonstrates superior performance for transfer learning tasks especially in the extremely few-parameter regimes.
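As background, the basic LoRA update that SuperLoRA generalizes replaces a full weight update with a low-rank product: the adapted weight is W + scale * (B A), where B has shape (d_out, r) and A has shape (r, d_in) with small rank r. The sketch below illustrates only this plain LoRA form, using pure-Python lists; the names `matmul` and `lora_update` and the toy values are assumptions for illustration, and none of SuperLoRA's grouping, folding, shuffling, projecting, or tensor factoring is shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_update(W, B, A, scale=1.0):
    """Return W + scale * (B @ A); W itself is left unmodified (frozen)."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 3x3 frozen weight with a rank-1 adapter (values are made up).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [[1.0], [2.0], [3.0]]     # shape (3, 1)
A = [[0.5, 0.0, -0.5]]        # shape (1, 3)
W_adapted = lora_update(W, B, A)

# Trainable parameters: r * (d_in + d_out) for the adapter vs d_in * d_out for W.
adapter_params = len(B) * len(B[0]) + len(A) * len(A[0])
full_params = len(W) * len(W[0])
```

In this toy case the adapter has 6 trainable values against 9 in the full matrix; the saving grows quickly with layer width, since the adapter scales linearly and the full matrix quadratically.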

Title: KnFu: Effective Knowledge Fusion

Authors: S. Jamal Seyedmohammadi, S. Kawa Atapour, Jamshid Abouei, Arash Mohammadi
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2403.11892
Pdf URL: https://arxiv.org/pdf/2403.11892
Copy Paste: [[2403.11892]] KnFu: Effective Knowledge Fusion(https://arxiv.org/abs/2403.11892)
Keywords: security, privacy, attack, extraction, federate
Abstract: Federated Learning (FL) has emerged as a prominent alternative to the traditional centralized learning approach. Generally speaking, FL is a decentralized approach that allows for collaborative training of Machine Learning (ML) models across multiple local nodes, ensuring data privacy and security while leveraging diverse datasets. Conventional FL, however, is susceptible to gradient inversion attacks, restrictively enforces a uniform architecture on local models, and suffers from model heterogeneity (model drift) due to non-IID local datasets. To mitigate some of these challenges, the new paradigm of Federated Knowledge Distillation (FKD) has emerged. FKD is developed based on the concept of Knowledge Distillation (KD), which involves extraction and transfer of a large and well-trained teacher model's knowledge to lightweight student models. FKD, however, still faces the model drift issue. Intuitively speaking, not all knowledge is universally beneficial due to the inherent diversity of data among local nodes. This calls for innovative mechanisms to evaluate the relevance and effectiveness of each client's knowledge for others, to prevent propagation of adverse knowledge. In this context, the paper proposes the Effective Knowledge Fusion (KnFu) algorithm, which evaluates knowledge of local models to only fuse semantic neighbors' effective knowledge for each client. KnFu is a personalized effective knowledge fusion scheme for each client that analyzes the effectiveness of different local models' knowledge prior to the aggregation phase. Comprehensive experiments were performed on MNIST and CIFAR10 datasets illustrating effectiveness of the proposed KnFu in comparison to its state-of-the-art counterparts. A key conclusion of the work is that in scenarios with large and highly heterogeneous local datasets, local training could be preferable to knowledge fusion-based solutions.

Title: From explainable to interpretable deep learning for natural language processing in healthcare: how far from reality?

Authors: Guangming Huang, Yunfei Long, Yingya Li, Giorgos Papanastasiou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11894
Pdf URL: https://arxiv.org/pdf/2403.11894
Copy Paste: [[2403.11894]] From explainable to interpretable deep learning for natural language processing in healthcare: how far from reality?(https://arxiv.org/abs/2403.11894)
Keywords: interpretability, explainability
Abstract: Deep learning (DL) has substantially enhanced healthcare research by addressing various natural language processing (NLP) tasks. Yet, the increasing complexity of DL-based NLP methods necessitates transparent model interpretability, or at least explainability, for reliable decision-making. This work presents a thorough scoping review on explainable and interpretable DL in healthcare NLP. The term "XIAI" (eXplainable and Interpretable Artificial Intelligence) was introduced to distinguish XAI from IAI. Methods were further categorized based on their functionality (model-, input-, output-based) and scope (local, global). Our analysis shows that attention mechanisms were the most dominant emerging IAI. Moreover, IAI is increasingly used against XAI. The major challenges identified are that most XIAI do not explore "global" modeling processes, the lack of best practices, and the unmet need for systematic evaluation and benchmarks. Important opportunities were raised such as using "attention" to enhance multi-modal XIAI for personalized medicine and combine DL with causal reasoning. Our discussion encourages the integration of XIAI in LLMs and domain-specific smaller models. Our review can stimulate further research and benchmarks toward improving inherent IAI and engaging complex NLP in healthcare.

Title: Investigating Markers and Drivers of Gender Bias in Machine Translations

Authors: Peter J Barclay, Ashkan Sami (Edinburgh Napier University)
Subjects: cs.CL, cs.CY, cs.SE
Abstract URL: https://arxiv.org/abs/2403.11896
Pdf URL: https://arxiv.org/pdf/2403.11896
Copy Paste: [[2403.11896]] Investigating Markers and Drivers of Gender Bias in Machine Translations(https://arxiv.org/abs/2403.11896)
Keywords: robust, large language model
Abstract: Implicit gender bias in Large Language Models (LLMs) is a well-documented problem, and implications of gender introduced into automatic translations can perpetuate real-world biases. However, some LLMs use heuristics or post-processing to mask such bias, making investigation difficult. Here, we examine bias in LLMs via back-translation, using the DeepL translation API to investigate the bias evinced when repeatedly translating a set of 56 Software Engineering tasks used in a previous study. Each statement starts with 'she', and is translated first into a 'genderless' intermediate language then back into English; we then examine pronoun choice in the back-translated texts. We expand prior research in the following ways: (1) by comparing results across five intermediate languages, namely Finnish, Indonesian, Estonian, Turkish and Hungarian; (2) by proposing a novel metric for assessing the variation in gender implied in the repeated translations, avoiding the over-interpretation of individual pronouns, apparent in earlier work; (3) by investigating sentence features that drive bias; and (4) by comparing results from three time-lapsed datasets to establish the reproducibility of the approach. We found that some languages display similar patterns of pronoun use, falling into three loose groups, but that patterns vary between groups; this underlines the need to work with multiple languages. We also identify the main verb appearing in a sentence as a likely significant driver of implied gender in the translations. Moreover, we see a good level of replicability in the results, and establish that our variation metric proves robust despite an obvious change in the behaviour of the DeepL translation API during the course of the study. These results show that the back-translation method can provide further insights into bias in language models.

Title: Larimar: Large Language Models with Episodic Memory Control

Authors: Payel Das, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarath Swaminathan, Sihui Dai, Aurélie Lozano, Georgios Kollias, Vijil Chenthamarakshan, Jiří Navrátil, Soham Dan, Pin-Yu Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11901
Pdf URL: https://arxiv.org/pdf/2403.11901
Copy Paste: [[2403.11901]] Larimar: Large Language Models with Episodic Memory Control(https://arxiv.org/abs/2403.11901)
Keywords: large language model
Abstract: Efficient and accurate updating of knowledge stored in Large Language Models (LLMs) is one of the most pressing research challenges today. This paper presents Larimar - a novel, brain-inspired architecture for enhancing LLMs with a distributed episodic memory. Larimar's memory allows for dynamic, one-shot updates of knowledge without the need for computationally expensive re-training or fine-tuning. Experimental results on multiple fact editing benchmarks demonstrate that Larimar attains accuracy comparable to most competitive baselines, even in the challenging sequential editing setup, but also excels in speed - yielding speed-ups of 4-10x depending on the base LLM - as well as flexibility due to the proposed architecture being simple, LLM-agnostic, and hence general. We further provide mechanisms for selective fact forgetting and input context length generalization with Larimar and show their effectiveness.

Title: CICLe: Conformal In-Context Learning for Largescale Multi-Class Food Risk Classification

Authors: Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11904
Pdf URL: https://arxiv.org/pdf/2403.11904
Copy Paste: [[2403.11904]] CICLe: Conformal In-Context Learning for Largescale Multi-Class Food Risk Classification(https://arxiv.org/abs/2403.11904)
Keywords: transformer
Abstract: Contaminated or adulterated food poses a substantial risk to human health. Given sets of labeled web texts for training, Machine Learning and Natural Language Processing can be applied to automatically detect such risks. We publish a dataset of 7,546 short texts describing public food recall announcements. Each text is manually labeled, on two granularity levels (coarse and fine), for food products and hazards that the recall corresponds to. We describe the dataset and benchmark naive, traditional, and Transformer models. Based on our analysis, Logistic Regression based on a tf-idf representation outperforms RoBERTa and XLM-R on classes with low support. Finally, we discuss different prompting strategies and present an LLM-in-the-loop framework, based on Conformal Prediction, which boosts the performance of the base classifier while reducing energy consumption compared to normal prompting.
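The Conformal Prediction step mentioned above can be illustrated with a minimal split-conformal sketch. This is not the authors' implementation: the function names `conformal_threshold` and `prediction_set` are hypothetical, the class probabilities are assumed to come from some base classifier (for example the tf-idf Logistic Regression), and the nonconformity score `1 - p(true class)` is one common choice rather than a detail confirmed by the abstract.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set (split conformal).

    cal_probs  : list of per-class probability lists from a base classifier
    cal_labels : list of true class indices, one per calibration example
    alpha      : target miscoverage rate (assumed value, not from the paper)
    """
    # Nonconformity score: 1 minus the probability assigned to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: admit every class.
        return 1.0
    return scores[k - 1]


def prediction_set(probs, threshold):
    """All class indices whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


if __name__ == "__main__":
    cal_probs = [[0.75, 0.25, 0.0], [0.5, 0.5, 0.0], [0.125, 0.0, 0.875]]
    cal_labels = [0, 1, 2]
    t = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
    print(t, prediction_set([0.5, 0.5, 0.0], t))
```

A smaller prediction set means fewer candidate classes have to be passed on to a later, more expensive stage, which is one plausible way such a framework could save prompting cost.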

Title: RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF

Authors: Sibi Catley-Chandar, Richard Shaw, Gregory Slabaugh, Eduardo Perez-Pellitero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11909
Pdf URL: https://arxiv.org/pdf/2403.11909
Copy Paste: [[2403.11909]] RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF(https://arxiv.org/abs/2403.11909)
Keywords: robust
Abstract: Recent advances in neural rendering have enabled highly photorealistic 3D scene reconstruction and novel view synthesis. Despite this progress, current state-of-the-art methods struggle to reconstruct high frequency detail, due to factors such as a low-frequency bias of radiance fields and inaccurate camera calibration. One approach to mitigate this issue is to enhance images post-rendering. 2D enhancers can be pre-trained to recover some detail but are agnostic to scene geometry and do not easily generalize to new distributions of image degradation. Conversely, existing 3D enhancers are able to transfer detail from nearby training images in a generalizable manner, but suffer from inaccurate camera calibration and can propagate errors from the geometry into rendered images. We propose a neural rendering enhancer, RoGUENeRF, which exploits the best of both paradigms. Our method is pre-trained to learn a general enhancer while also leveraging information from nearby training images via robust 3D alignment and geometry-aware fusion. Our approach restores high-frequency textures while maintaining geometric consistency and is also robust to inaccurate camera calibration. We show that RoGUENeRF substantially enhances the rendering quality of a wide range of neural rendering baselines, e.g. improving the PSNR of MipNeRF360 by 0.63dB and Nerfacto by 1.34dB on the real world 360v2 dataset.
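To put the reported PSNR gains in context, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) to convert a gain in dB into the corresponding reduction in mean squared error. The helper names `psnr` and `mse_ratio_for_gain` are hypothetical, and the conversion assumes the same peak value before and after enhancement; it says nothing about how the paper computes its metrics.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db (same peak)."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    # Gains quoted in the abstract: 0.63 dB (MipNeRF360) and 1.34 dB (Nerfacto).
    for gain in (0.63, 1.34):
        ratio = mse_ratio_for_gain(gain)
        print(f"+{gain} dB -> MSE multiplied by {ratio:.3f}")
```

Under these assumptions, a 0.63 dB gain corresponds to roughly 13.5% lower MSE and a 1.34 dB gain to roughly 26.5% lower MSE.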

Title: LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

Authors: Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, Hang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11929
Pdf URL: https://arxiv.org/pdf/2403.11929
Copy Paste: [[2403.11929]] LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model(https://arxiv.org/abs/2403.11929)
Keywords: diffusion, generative
Abstract: Despite the success of generating high-quality images given any text prompts by diffusion-based generative models, prior works directly generate the entire images, but cannot provide object-wise manipulation capability. To support wider real applications like professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and associated mask layers for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the specific-content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.

Title: HyperColorization: Propagating spatially sparse noisy spectral clues for reconstructing hyperspectral images

Authors: M. Kerem Aydin, Qi Guo, Emma Alexander
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.11935
Pdf URL: https://arxiv.org/pdf/2403.11935
Copy Paste: [[2403.11935]] HyperColorization: Propagating spatially sparse noisy spectral clues for reconstructing hyperspectral images(https://arxiv.org/abs/2403.11935)
Keywords: robust
Abstract: Hyperspectral cameras face challenging spatial-spectral resolution trade-offs and are more affected by shot noise than RGB photos taken over the same total exposure time. Here, we present a colorization algorithm to reconstruct hyperspectral images from a grayscale guide image and spatially sparse spectral clues. We demonstrate that our algorithm generalizes to varying spectral dimensions for hyperspectral images, and show that colorizing in a low-rank space reduces compute time and the impact of shot noise. To enhance robustness, we incorporate guided sampling, edge-aware filtering, and dimensionality estimation techniques. Our method surpasses previous algorithms in various performance metrics, including SSIM, PSNR, GFC, and EMD, which we analyze as metrics for characterizing hyperspectral image quality. Collectively, these findings provide a promising avenue for overcoming the time-space-wavelength resolution trade-off by reconstructing a dense hyperspectral image from samples obtained by whisk or push broom scanners, as well as hybrid spatial-spectral computational imaging systems.

Title: Subjective-Aligned Dateset and Metric for Text-to-Video Quality Assessment

Authors: Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11956
Pdf URL: https://arxiv.org/pdf/2403.11956
Copy Paste: [[2403.11956]] Subjective-Aligned Dateset and Metric for Text-to-Video Quality Assessment(https://arxiv.org/abs/2403.11956)
Keywords: transformer, generative, large language model
Abstract: With the rapid development of generative models, Artificial Intelligence-Generated Contents (AIGC) have exponentially increased in daily lives. Among them, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating high perceptual quality videos, there is still a lack of a method to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We also conduct a subjective study to obtain each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from text-video alignment and video fidelity perspectives, then it leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjective-aligned predictions, validating its effectiveness. The dataset and code will be released at https://github.com/QMME/T2VQA.

Title: Enhanced Event-Based Video Reconstruction with Motion Compensation

Authors: Siying Liu, Pier Luigi Dragotti
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11961
Pdf URL: https://arxiv.org/pdf/2403.11961
Copy Paste: [[2403.11961]] Enhanced Event-Based Video Reconstruction with Motion Compensation(https://arxiv.org/abs/2403.11961)
Keywords: interpretability
Abstract: Deep neural networks for event-based video reconstruction often suffer from a lack of interpretability and have high memory demands. A lightweight network called CISTA-LSTC has recently been introduced showing that high-quality reconstruction can be achieved through the systematic design of its architecture. However, its modelling assumption that input signals and output reconstructed frame share the same sparse representation neglects the displacement caused by motion. To address this, we propose warping the input intensity frames and sparse codes to enhance reconstruction quality. A CISTA-Flow network is constructed by integrating a flow network with CISTA-LSTC for motion compensation. The system relies solely on events, in which predicted flow aids in reconstruction and then reconstructed frames are used to facilitate flow estimation. We also introduce an iterative training framework for this combined system. Results demonstrate that our approach achieves state-of-the-art reconstruction accuracy and simultaneously provides reliable dense flow estimation. Furthermore, our model exhibits flexibility in that it can integrate different flow networks, suggesting its potential for further performance enhancement.

Title: Transfer Learning Beyond Bounded Density Ratios

Authors: Alkis Kalavasis, Ilias Zadik, Manolis Zampetakis
Subjects: cs.LG, cs.DS, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11963
Pdf URL: https://arxiv.org/pdf/2403.11963
Copy Paste: [[2403.11963]] Transfer Learning Beyond Bounded Density Ratios(https://arxiv.org/abs/2403.11963)
Keywords: transformer
Abstract: We study the fundamental problem of transfer learning where a learning algorithm collects data from some source distribution $P$ but needs to perform well with respect to a different target distribution $Q$. A standard change of measure argument implies that transfer learning happens when the density ratio $dQ/dP$ is bounded. Yet, prior thought-provoking works by Kpotufe and Martinet (COLT, 2018) and Hanneke and Kpotufe (NeurIPS, 2019) demonstrate cases where the ratio $dQ/dP$ is unbounded, but transfer learning is possible. In this work, we focus on transfer learning over the class of low-degree polynomial estimators. Our main result is a general transfer inequality over the domain $\mathbb{R}^n$, proving that non-trivial transfer learning for low-degree polynomials is possible under very mild assumptions, going well beyond the classical assumption that $dQ/dP$ is bounded. For instance, it always applies if $Q$ is a log-concave measure and the inverse ratio $dP/dQ$ is bounded. To demonstrate the applicability of our inequality, we obtain new results in the settings of: (1) the classical truncated regression setting, where $dQ/dP$ equals infinity, and (2) the more recent out-of-distribution generalization setting for in-context learning linear functions with transformers. We also provide a discrete analogue of our transfer inequality on the Boolean Hypercube $\{-1,1\}^n$, and study its connections with the recent problem of Generalization on the Unseen of Abbe, Bengio, Lotfi and Rizk (ICML, 2023). Our main conceptual contribution is that the maximum influence of the error of the estimator $\widehat{f}-f^*$ under $Q$, $\mathrm{I}_{\max}(\widehat{f}-f^*)$, acts as a sufficient condition for transferability; when $\mathrm{I}_{\max}(\widehat{f}-f^*)$ is appropriately bounded, transfer is possible over the Boolean domain.
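The "standard change of measure argument" referred to above can be written out in one line. As an illustration, take the squared error of the estimator as the quantity being transferred, assume $Q$ is absolutely continuous with respect to $P$, and let $C$ denote an assumed upper bound on the density ratio (the symbol $C$ and the choice of squared error are not from the abstract):

```latex
\mathbb{E}_{Q}\big[(\widehat{f}-f^*)^2\big]
  = \int (\widehat{f}-f^*)^2 \,\frac{dQ}{dP}\, dP
  \le C \cdot \mathbb{E}_{P}\big[(\widehat{f}-f^*)^2\big],
\qquad \text{whenever } \frac{dQ}{dP} \le C \quad P\text{-almost surely}.
```

Small error under the source $P$ therefore implies small error under the target $Q$, up to the factor $C$. When $dQ/dP$ is unbounded no finite $C$ exists and this bound gives nothing, which is the regime the paper addresses.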

Title: Informed Spectral Normalized Gaussian Processes for Trajectory Prediction

Authors: Christian Schlauch, Christian Wirth, Nadja Klein
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.11966
Pdf URL: https://arxiv.org/pdf/2403.11966
Copy Paste: [[2403.11966]] Informed Spectral Normalized Gaussian Processes for Trajectory Prediction(https://arxiv.org/abs/2403.11966)
Keywords: robust
Abstract: Prior parameter distributions provide an elegant way to represent prior expert and world knowledge for informed learning. Previous work has shown that using such informative priors to regularize probabilistic deep learning (DL) models increases their performance and data-efficiency. However, commonly used sampling-based approximations for probabilistic DL models can be computationally expensive, requiring multiple inference passes and longer training times. Promising alternatives are compute-efficient last layer kernel approximations like spectral normalized Gaussian processes (SNGPs). We propose a novel regularization-based continual learning method for SNGPs, which enables the use of informative priors that represent prior knowledge learned from previous tasks. Our proposal builds upon well-established methods and requires no rehearsal memory or parameter expansion. We apply our informed SNGP model to the trajectory prediction problem in autonomous driving by integrating prior drivability knowledge. On two public datasets, we investigate its performance under diminishing training data and across locations, and thereby demonstrate an increase in data-efficiency and robustness to location-transfers over non-informed and informed baselines.

Title: Unveil Conditional Diffusion Models with Classifier-free Guidance: A Sharp Statistical Theory

Authors: Hengyu Fu, Zhuoran Yang, Mengdi Wang, Minshuo Chen
Subjects: cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2403.11968
Pdf URL: https://arxiv.org/pdf/2403.11968
Copy Paste: [[2403.11968]] Unveil Conditional Diffusion Models with Classifier-free Guidance: A Sharp Statistical Theory(https://arxiv.org/abs/2403.11968)
Keywords: diffusion
Abstract: Conditional diffusion models serve as the foundation of modern image synthesis and find extensive application in fields like computational biology and reinforcement learning. In these applications, conditional diffusion models incorporate various conditional information, such as prompt input, to guide the sample generation towards desired properties. Despite the empirical success, theory of conditional diffusion models is largely missing. This paper bridges this gap by presenting a sharp statistical theory of distribution estimation using conditional diffusion models. Our analysis yields a sample complexity bound that adapts to the smoothness of the data distribution and matches the minimax lower bound. The key to our theoretical development lies in an approximation result for the conditional score function, which relies on a novel diffused Taylor approximation technique. Moreover, we demonstrate the utility of our statistical theory in elucidating the performance of conditional diffusion models across diverse applications, including model-based transition kernel estimation in reinforcement learning, solving inverse problems, and reward conditioned sample generation.

Title: Diffusion Denoising as a Certified Defense against Clean-label Poisoning

Authors: Sanghyun Hong, Nicholas Carlini, Alexey Kurakin
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.11981
Pdf URL: https://arxiv.org/pdf/2403.11981
Copy Paste: [[2403.11981]] Diffusion Denoising as a Certified Defense against Clean-label Poisoning(https://arxiv.org/abs/2403.11981)
Keywords: defense, attack, robust, diffusion
Abstract: We present a certified defense to clean-label poisoning attacks. These attacks work by injecting a small number of poisoning samples (e.g., 1%) that contain $p$-norm bounded adversarial perturbations into the training data to induce a targeted misclassification of a test-time input. Inspired by the adversarial robustness achieved by denoised smoothing, we show how an off-the-shelf diffusion model can sanitize the tampered training data. We extensively test our defense against seven clean-label poisoning attacks and reduce their attack success to 0-16% with only a negligible drop in the test time accuracy. We compare our defense with existing countermeasures against clean-label poisoning, showing that the defense reduces the attack success the most and offers the best model utility. Our results highlight the need for future work on developing stronger clean-label attacks and using our certified yet practical defense as a strong baseline to evaluate these attacks.

Title: Using Generative Text Models to Create Qualitative Codebooks for Student Evaluations of Teaching

Authors: Andrew Katz, Mitchell Gerhardt, Michelle Soledad
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2403.11984
Pdf URL: https://arxiv.org/pdf/2403.11984
Copy Paste: [[2403.11984]] Using Generative Text Models to Create Qualitative Codebooks for Student Evaluations of Teaching(https://arxiv.org/abs/2403.11984)
Keywords: generative, large language model
Abstract: Feedback is a critical aspect of improvement. Unfortunately, when there is a lot of feedback from multiple sources, it can be difficult to distill the information into actionable insights. Consider student evaluations of teaching (SETs), which are important sources of feedback for educators. They can give instructors insights into what worked during a semester. A collection of SETs can also be useful to administrators as signals for courses or entire programs. However, on a large scale as in high-enrollment courses or administrative records over several years, the volume of SETs can render them difficult to analyze. In this paper, we discuss a novel method for analyzing SETs using natural language processing (NLP) and large language models (LLMs). We demonstrate the method by applying it to a corpus of 5,000 SETs from a large public university. We show that the method can be used to extract, embed, cluster, and summarize the SETs to identify the themes they express. More generally, this work illustrates how to use the combination of NLP techniques and LLMs to generate a codebook for SETs. We conclude by discussing the implications of this method for analyzing SETs and other types of student writing in teaching and research settings.

Title: GetMesh: A Controllable Model for High-quality Mesh Generation and Manipulation

Authors: Zhaoyang Lyu, Ben Fei, Jinyi Wang, Xudong Xu, Ya Zhang, Weidong Yang, Bo Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.11990
Pdf URL: https://arxiv.org/pdf/2403.11990
Copy Paste: [[2403.11990]] GetMesh: A Controllable Model for High-quality Mesh Generation and Manipulation(https://arxiv.org/abs/2403.11990)
Keywords: robust, generative
Abstract: Mesh is a fundamental representation of 3D assets in various industrial applications, and is widely supported by professional software. However, due to its irregular structure, mesh creation and manipulation are often time-consuming and labor-intensive. In this paper, we propose a highly controllable generative model, GetMesh, for mesh generation and manipulation across different categories. By taking a varying number of points as the latent representation, and re-organizing them as triplane representation, GetMesh generates meshes with rich and sharp details, outperforming both single-category and multi-category counterparts. Moreover, it also enables fine-grained control over the generation process that previous mesh generative models cannot achieve, where changing global/local mesh topologies, adding/removing mesh parts, and combining mesh parts across categories can be intuitively, efficiently, and robustly accomplished by adjusting the number, positions or features of latent points. The project page is https://getmesh.github.io.

Title: Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning

Authors: Markus J. Buehler
Subjects: cs.LG, cond-mat.mes-hall, cond-mat.mtrl-sci, cond-mat.soft, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.11996
Pdf URL: https://arxiv.org/pdf/2403.11996
Copy Paste: [[2403.11996]] Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning(https://arxiv.org/abs/2403.11996)
Keywords: extraction, generative
Abstract: Using generative Artificial Intelligence (AI), we transformed a set of 1,000 scientific papers in the area of biological materials into detailed ontological knowledge graphs, revealing their inherently scale-free nature. Using graph traversal path detection between dissimilar concepts based on combinatorial ranking of node similarity and betweenness centrality, we reveal deep insights into unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, and propose never-before-seen material designs and their behaviors. One comparison revealed detailed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. The algorithm further created an innovative hierarchical mycelium-based composite that incorporates joint synthesis of graph sampling with principles extracted from Kandinsky's Composition VII painting, where the resulting composite reflects a balance of chaos and order, with features like adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across physical, biological, and artistic spheres, revealing a nuanced ontology of immanence and material flux that resonates with postmodern philosophy, and positions these interconnections within a heterarchical framework. Our findings reveal the dynamic, context-dependent interplay of entities beyond traditional hierarchical paradigms, emphasizing the significant role of individual components and their fluctuative relationships within the system. Our predictions achieve a far higher degree of novelty, technical detail and explorative capacity than conventional generative AI methods. The approach establishes a widely useful framework for innovation by revealing hidden connections that facilitate discovery.

Title: Learning Useful Representations of Recurrent Neural Network Weight Matrices

Authors: Vincent Herrmann, Francesco Faccio, Jürgen Schmidhuber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.11998
Pdf URL: https://arxiv.org/pdf/2403.11998
Copy Paste: [[2403.11998]] Learning Useful Representations of Recurrent Neural Network Weight Matrices(https://arxiv.org/abs/2403.11998)
Keywords: generative
Abstract: Recurrent Neural Networks (RNNs) are general-purpose parallel-sequential computers. The program of an RNN is its weight matrix. How to learn useful representations of RNN weights that facilitate RNN analysis as well as downstream tasks? While the mechanistic approach directly looks at some RNN's weights to predict its behavior, the functionalist approach analyzes its overall functionality -- specifically, its input-output mapping. We consider several mechanistic approaches for RNN weights and adapt the permutation equivariant Deep Weight Space layer for RNNs. Our two novel functionalist approaches extract information from RNN weights by 'interrogating' the RNN through probing inputs. We develop a theoretical framework that demonstrates conditions under which the functionalist approach can generate rich representations that help determine RNN behavior. We create and release the first two 'model zoo' datasets for RNN weight representation learning. One consists of generative models of a class of formal languages, and the other one of classifiers of sequentially processed MNIST digits. With the help of an emulation-based self-supervised learning technique we compare and evaluate the different RNN weight encoding techniques on multiple downstream applications. On the most challenging one, namely predicting which exact task the RNN was trained on, functionalist approaches show clear superiority.

Title: HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs

Authors: Ting Yao, Yehao Li, Yingwei Pan, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2403.11999
Pdf URL: https://arxiv.org/pdf/2403.11999
Copy Paste: [[2403.11999]] HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs(https://arxiv.org/abs/2403.11999)
Keywords: transformer
Abstract: The hybrid deep models of Vision Transformer (ViT) and Convolution Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably suffers from heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), that upgrades prevalent four-stage ViT to five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built upon the seminal idea of decomposing the typical CNN operations into two parallel CNN branches in a cost-efficient manner. One high-resolution branch directly takes primary high-resolution features as inputs, but uses fewer convolution operations. The other low-resolution branch first performs down-sampling and then utilizes more convolution operations over such low-resolution features. Experiments on both recognition task (ImageNet-1K dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the superiority of HIRI-ViT. More remarkably, under comparable computational cost ($\sim$5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224$\times$224 inputs.
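The abstract does not say in which quantity the cost is quadratic, so the following is only a rough back-of-the-envelope sketch of why doubling the input side length is expensive. It assumes square inputs and a patch size of 16 (an assumption, not a detail from the abstract), and the helper names `num_tokens` and `relative_cost` are hypothetical. Operations that are linear in the number of spatial positions grow with the square of the side length, while global self-attention grows with the square of the token count.

```python
def num_tokens(side, patch=16):
    """Number of non-overlapping patches in a square image of given side."""
    return (side // patch) ** 2


def relative_cost(side, base_side=224, patch=16):
    """Cost multipliers relative to base_side under two simple cost models."""
    ratio = num_tokens(side, patch) / num_tokens(base_side, patch)
    return {
        # e.g. per-position convolutions or MLPs: linear in token count
        "linear_in_tokens": ratio,
        # e.g. global self-attention: quadratic in token count
        "quadratic_in_tokens": ratio ** 2,
    }


if __name__ == "__main__":
    # The two resolutions compared in the abstract.
    print(num_tokens(224), num_tokens(448))
    print(relative_cost(448))
```

With these assumptions, moving from 224 to 448 inputs quadruples the token count, so per-position work grows 4 times and pairwise attention work grows 16 times, which gives a sense of why holding the budget near 5.0 GFLOPs at 448 resolution requires a redesigned backbone.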

Title: DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing

Authors: Hyeonho Jeong, Jinho Chang, Geon Yeong Park, Jong Chul Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12002
Pdf URL: https://arxiv.org/pdf/2403.12002
Copy Paste: [[2403.12002]] DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing(https://arxiv.org/abs/2403.12002)
Keywords: diffusion
Abstract: Text-driven diffusion-based video editing presents a unique challenge not encountered in image editing literature: establishing real-world motion. Unlike existing video editing approaches, here we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match space-time self-similarities of the original video and the edited video during the score distillation. Thanks to the use of score distillation, our approach is model-agnostic, which can be applied for both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearances while accurately preserving the original structure and motion.

Title: GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

Authors: Xiaojie Li, Yibo Yang, Xiangtai Li, Jianlong Wu, Yue Yu, Bernard Ghanem, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12003
Pdf URL: https://arxiv.org/pdf/2403.12003
Copy Paste: [[2403.12003]] GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning(https://arxiv.org/abs/2403.12003)
Keywords: generative
Abstract: Self-supervised learning has achieved remarkable success in acquiring high-quality representations from unlabeled data. The widely adopted contrastive learning framework aims to learn invariant representations by minimizing the distance between positive views originating from the same image. However, existing techniques to construct positive views highly rely on manual transformations, resulting in limited diversity and potentially false positive pairs. To tackle these challenges, we present GenView, a controllable framework that augments the diversity of positive views leveraging the power of pretrained generative models while preserving semantics. We develop an adaptive view generation method that dynamically adjusts the noise level in sampling to ensure the preservation of essential semantic meaning while introducing variability. Additionally, we introduce a quality-driven contrastive loss, which assesses the quality of positive pairs by considering both foreground similarity and background diversity. This loss prioritizes the high-quality positive pairs we construct while reducing the influence of low-quality pairs, thereby mitigating potential semantic inconsistencies introduced by generative models and aggressive data augmentation. Thanks to the improved positive view quality and the quality-driven contrastive loss, GenView significantly improves self-supervised learning across various tasks. For instance, GenView improves MoCov2 performance by 2.5%/2.2% on ImageNet linear/semi-supervised classification. Moreover, GenView even performs much better than naively augmenting the ImageNet dataset with Laion400M or ImageNet21K. Code is available at https://github.com/xiaojieli0903/genview.

Title: SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

Authors: Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12008
Pdf URL: https://arxiv.org/pdf/2403.12008
Copy Paste: [[2403.12008]] SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion(https://arxiv.org/abs/2403.12008)
Keywords: diffusion, generative
Abstract: We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent work on 3D generation proposes techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affecting the performance of 3D object generation. In this work, we propose SV3D, which adapts an image-to-video diffusion model for novel multi-view synthesis and 3D generation, thereby leveraging the generalization and multi-view consistency of video models while further adding explicit camera control for NVS. We also propose improved 3D optimization techniques to use SV3D and its NVS outputs for image-to-3D generation. Extensive experimental results on multiple datasets with 2D and 3D metrics, as well as a user study, demonstrate SV3D's state-of-the-art performance on NVS as well as 3D reconstruction compared to prior works.

Title: Leveraging Spatial and Semantic Feature Extraction for Skin Cancer Diagnosis with Capsule Networks and Graph Neural Networks

Authors: K. P. Santoso, R. V. H. Ginardi, R. A. Sastrowardoyo, F. A. Madany
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12009
Pdf URL: https://arxiv.org/pdf/2403.12009
Copy Paste: [[2403.12009]] Leveraging Spatial and Semantic Feature Extraction for Skin Cancer Diagnosis with Capsule Networks and Graph Neural Networks(https://arxiv.org/abs/2403.12009)
Keywords: extraction, generative
Abstract: In the realm of skin lesion image classification, the intricate spatial and semantic features pose significant challenges for conventional Convolutional Neural Network (CNN)-based methodologies. These challenges are compounded by the imbalanced nature of skin lesion datasets, which hampers the ability of models to learn minority class features effectively. Despite augmentation strategies, such as those using Generative Adversarial Networks (GANs), previous attempts have not fully addressed these complexities. This study introduces an innovative approach by integrating Graph Neural Networks (GNNs) with Capsule Networks to enhance classification performance. GNNs, known for their proficiency in handling graph-structured data, offer an advanced mechanism for capturing complex patterns and relationships beyond the capabilities of traditional CNNs. Capsule Networks further contribute by providing superior recognition of spatial hierarchies within images. Our research focuses on evaluating and enhancing the Tiny Pyramid Vision GNN (Tiny Pyramid ViG) architecture by incorporating it with a Capsule Network. This hybrid model was applied to the MNIST:HAM10000 dataset, a comprehensive skin lesion dataset designed for benchmarking classification models. After 75 epochs of training, our model achieved a significant accuracy improvement, reaching 89.23% and 95.52%, surpassing established benchmarks such as GoogLeNet (83.94%), InceptionV3 (86.82%), MobileNet V3 (89.87%), EfficientNet-7B (92.07%), ResNet18 (92.22%), ResNet34 (91.90%), ViT-Base (73.70%), and IRv2-SA (93.47%) on the same dataset. This outcome underscores the potential of our approach in overcoming the inherent challenges of skin lesion classification, contributing to the advancement of image-based diagnosis in dermatology.

Title: VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

Authors: Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, Qixing Huang
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2403.12010
Pdf URL: https://arxiv.org/pdf/2403.12010
Copy Paste: [[2403.12010]] VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model(https://arxiv.org/abs/2403.12010)
Keywords: diffusion, generative
Abstract: Generating multi-view images based on text or single-image prompts is a critical capability for the creation of 3D content. Two fundamental questions on this topic are what data we use for training and how to ensure multi-view consistency. This paper introduces a novel framework that makes fundamental contributions to both questions. Rather than leveraging images from 2D diffusion models for training, we propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models. Images from video generative models are more suitable for multi-view generation because the underlying network architecture that generates them employs a temporal module to enforce frame consistency. Moreover, the video datasets used to train these models are abundant and diverse, leading to a reduced train-finetuning domain gap. To enhance multi-view consistency, we introduce a 3D-Aware Denoising Sampling, which first employs a feed-forward reconstruction module to get an explicit global 3D model, and then adopts a sampling strategy that effectively involves images rendered from the global 3D model into the denoising sampling loop to improve the multi-view consistency of the final images. As a by-product, this module also provides a fast way to create 3D assets represented by 3D Gaussians within a few seconds. Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches (4 GPU hours versus many thousand GPU hours) with comparable visual quality and consistency. By further fine-tuning, our approach outperforms existing state-of-the-art methods in both quantitative metrics and visual effects. Our project page is aigc3d.github.io/VideoMV.

Title: HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

Authors: Mengqi Zhang, Yang Fu, Zheng Ding, Sifei Liu, Zhuowen Tu, Xiaolong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12011
Pdf URL: https://arxiv.org/pdf/2403.12011
Copy Paste: [[2403.12011]] HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data(https://arxiv.org/abs/2403.12011)
Keywords: diffusion
Abstract: 3D hand-object interaction data is scarce due to the hardware constraints in scaling up the data collection process. In this paper, we propose HOIDiffusion for generating realistic and diverse 3D hand-object interaction data. Our model is a conditional diffusion model that takes both the 3D hand-object geometric structure and text description as inputs for image synthesis. This offers a more controllable and realistic synthesis as we can specify the structure and style inputs in a disentangled manner. HOIDiffusion is trained by leveraging a diffusion model pre-trained on large-scale natural images and a few 3D human demonstrations. Beyond controllable image synthesis, we adopt the generated 3D data for learning 6D object pose estimation and show its effectiveness in improving perception systems. Project page: https://mq-zhang1.github.io/HOIDiffusion

Title: GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Authors: Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, Xiaoxiao Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12013
Pdf URL: https://arxiv.org/pdf/2403.12013
Copy Paste: [[2403.12013]] GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image(https://arxiv.org/abs/2403.12013)
Keywords: diffusion, transformer, generative
Abstract: We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes, e.g., depth and normals, from single images. While significant research has already been conducted in this area, the progress has been substantially limited by the low diversity and poor quality of publicly available datasets. As a result, the prior works either are constrained to limited scenarios or suffer from the inability to capture geometric details. In this paper, we demonstrate that generative models, as opposed to traditional discriminative models (e.g., CNNs and Transformers), can effectively address the inherently ill-posed problem. We further show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage. Specifically, we extend the original stable diffusion model to jointly predict depth and normal, allowing mutual information exchange and high consistency between the two representations. More importantly, we propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions. This strategy enables our model to recognize different scene layouts, capturing 3D geometry with remarkable fidelity. GeoWizard sets new benchmarks for zero-shot depth and normal prediction, significantly enhancing many downstream applications such as 3D reconstruction, 2D content creation, and novel viewpoint synthesis.

Title: EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents

Authors: Abhay Zala, Jaemin Cho, Han Lin, Jaehong Yoon, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12014
Pdf URL: https://arxiv.org/pdf/2403.12014
Copy Paste: [[2403.12014]] EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents(https://arxiv.org/abs/2403.12014)
Keywords: large language model
Abstract: Recent SOTA approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance than previous smaller agents based on reinforcement learning (RL); however, frequently calling LLMs is slow and expensive. Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at? We propose EnvGen, a novel framework to address this question. First, we prompt an LLM to generate training environments that allow agents to quickly learn different tasks in parallel. Concretely, the LLM is given the task description and simulator objectives that the agents should learn and is then asked to generate a set of environment configurations (e.g., different terrains, items given to agents, etc.). Next, we train a small RL agent in a mixture of the original and LLM-generated environments. Then, we enable the LLM to continuously adapt the generated environments to progressively improve the skills that the agent is weak at, by providing feedback to the LLM in the form of the agent's performance. We demonstrate the usefulness of EnvGen with comprehensive experiments in Crafter and Heist environments. We find that a small RL agent trained with EnvGen can outperform SOTA methods, including a GPT-4 agent, and learns long-horizon tasks significantly faster. We show qualitatively how the LLM adapts training environments to help improve RL agents' weaker skills over time. Additionally, EnvGen is substantially more efficient as it only uses a small number of LLM calls (e.g., 4 in total), whereas LLM agents require thousands of LLM calls. Lastly, we present detailed ablation studies for our design choices.

Title: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Authors: Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, Robin Rombach
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12015
Pdf URL: https://arxiv.org/pdf/2403.12015
Copy Paste: [[2403.12015]] Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation(https://arxiv.org/abs/2403.12015)
Keywords: diffusion, generative
Abstract: Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.

Title: Supervised Fine-Tuning as Inverse Reinforcement Learning

Authors: Hao Sun
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.12017
Pdf URL: https://arxiv.org/pdf/2403.12017
Copy Paste: [[2403.12017]] Supervised Fine-Tuning as Inverse Reinforcement Learning(https://arxiv.org/abs/2403.12017)
Keywords: large language model
Abstract: The prevailing approach to aligning Large Language Models (LLMs) typically relies on human or AI feedback and assumes access to specific types of preference datasets. In our work, we question the efficacy of such datasets and explore various scenarios where alignment with expert demonstrations proves more realistic. We build a sequential decision-making framework to formulate the problem of aligning LLMs using demonstration datasets. Drawing insights from inverse reinforcement learning and imitation learning, we introduce various approaches for divergence minimization in the LLM alignment tasks. Our analysis highlights the mass-covering and mode-seeking behaviors of these different approaches. Additionally, we examine the pros and cons of the classical supervised fine-tuning method, elaborating on scenarios where different methods shine.
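
Digest note: the "mass-covering" and "mode-seeking" behaviors mentioned in the abstract can be illustrated with the textbook contrast between forward and reverse KL divergence. The sketch below is only that illustration, not the paper's method; the abstract does not say which divergences it studies, and the distributions `p`, `q_cover`, and `q_mode` are made-up toy values.

```python
import math

def kl(a, b):
    """KL(a || b) for two discrete distributions with strictly positive entries."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))

# Hypothetical bimodal "expert" distribution over five outcomes.
p = [0.45, 0.04, 0.02, 0.04, 0.45]

# Two hypothetical candidate models:
q_cover = [0.20, 0.20, 0.20, 0.20, 0.20]  # spreads mass over everything
q_mode = [0.90, 0.07, 0.01, 0.01, 0.01]   # commits to a single mode

# Forward KL(p || q) penalizes q for missing mass where p has it,
# so it favors the broad, mass-covering candidate.
forward = {"cover": kl(p, q_cover), "mode": kl(p, q_mode)}

# Reverse KL(q || p) penalizes q for placing mass where p has little,
# so it favors the peaked, mode-seeking candidate.
reverse = {"cover": kl(q_cover, p), "mode": kl(q_mode, p)}

print("forward KL prefers:", min(forward, key=forward.get))
print("reverse KL prefers:", min(reverse, key=reverse.get))
```

Maximum-likelihood training on demonstrations corresponds to minimizing the forward direction, which is one reason supervised fine-tuning is commonly described as mass-covering.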

Title: LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

Authors: Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12019
Pdf URL: https://arxiv.org/pdf/2403.12019
Copy Paste: [[2403.12019]] LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation(https://arxiv.org/abs/2403.12019)
Keywords: diffusion, transformer, generative
Abstract: The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and 3D latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks.

Title: FlexCap: Generating Rich, Localized, and Flexible Captions in Images

Authors: Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12026
Pdf URL: https://arxiv.org/pdf/2403.12026
Copy Paste: [[2403.12026]] FlexCap: Generating Rich, Localized, and Flexible Captions in Images(https://arxiv.org/abs/2403.12026)
Keywords: large language model
Abstract: We introduce a versatile flexible-captioning vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that a localize-then-describe approach with FlexCap can be better at open-ended object detection than a describe-then-localize approach with other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io.

Title: From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

Authors: Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2403.12027
Pdf URL: https://arxiv.org/pdf/2403.12027
Copy Paste: [[2403.12027]] From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models(https://arxiv.org/abs/2403.12027)
Keywords: large language model
Abstract: Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models (LLMs), have revolutionized various natural language processing (NLP) tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. The paper begins by defining chart understanding, outlining problem formulations, and discussing fundamental building blocks crucial for studying chart understanding tasks. In the section on tasks and datasets, we explore various tasks within chart understanding and discuss their evaluation metrics and sources of both charts and textual inputs. Modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance of each task and discuss how we can improve the performance. Challenges and future directions are addressed in a dedicated section, highlighting issues such as domain-specific charts, lack of efforts in evaluation, and agent-oriented settings. This survey paper serves to provide valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies mentioned in this paper, along with emerging new research, will be continually updated at: https://github.com/khuangaf/Awesome-Chart-Understanding.

Title: Align and Distill: Unifying and Improving Domain Adaptive Object Detection

Authors: Justin Kay, Timm Haucke, Suzanne Stathatos, Siqi Deng, Erik Young, Pietro Perona, Sara Beery, Grant Van Horn
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12029
Pdf URL: https://arxiv.org/pdf/2403.12029
Copy Paste: [[2403.12029]] Align and Distill: Unifying and Improving Domain Adaptive Object Detection(https://arxiv.org/abs/2403.12029)
Keywords: fair
Abstract: Object detectors often perform poorly on data that differs from their training set. Domain adaptive object detection (DAOD) methods have recently demonstrated strong results on addressing this challenge. Unfortunately, we identify systemic benchmarking pitfalls that call past results into question and hamper further progress: (a) Overestimation of performance due to underpowered baselines, (b) Inconsistent implementation practices preventing transparent comparisons of methods, and (c) Lack of generality due to outdated backbones and lack of diversity in benchmarks. We address these problems by introducing: (1) A unified benchmarking and implementation framework, Align and Distill (ALDI), enabling comparison of DAOD methods and supporting future development, (2) A fair and modern training and evaluation protocol for DAOD that addresses benchmarking pitfalls, (3) A new DAOD benchmark dataset, CFC-DAOD, enabling evaluation on diverse real-world data, and (4) A new method, ALDI++, that achieves state-of-the-art results by a large margin. ALDI++ outperforms the previous state-of-the-art by +3.5 AP50 on Cityscapes to Foggy Cityscapes, +5.7 AP50 on Sim10k to Cityscapes (where ours is the only method to outperform a fair baseline), and +2.0 AP50 on CFC Kenai to Channel. Our framework, dataset, and state-of-the-art method offer a critical reset for DAOD and provide a strong foundation for future research. Code and data are available: https://github.com/justinkay/aldi and https://github.com/visipedia/caltech-fish-counting.

Title: ROUTERBENCH: A Benchmark for Multi-LLM Routing System

Authors: Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, Shriyash Kaustubh Upadhyay
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.12031
Pdf URL: https://arxiv.org/pdf/2403.12031
Copy Paste: [[2403.12031]] ROUTERBENCH: A Benchmark for Multi-LLM Routing System(https://arxiv.org/abs/2403.12031)
Keywords: large language model
Abstract: As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present ROUTERBENCH, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through ROUTERBENCH, highlighting their potentials and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at https://github.com/withmartian/routerbench.
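
Digest note: the quality-versus-cost trade-off that motivates LLM routing can be made concrete with a toy evaluation over precomputed inference outcomes. This is a minimal sketch under stated assumptions and does not use the ROUTERBENCH code or data format; the `Outcome` record, the `OUTCOMES` table, the model names, and the three routers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    model: str      # hypothetical model name
    quality: float  # 1.0 if the answer was judged correct, else 0.0
    cost: float     # cost of the call, in arbitrary units

# Hypothetical precomputed outcomes: every query was run on every model.
OUTCOMES = {
    "q1": [Outcome("small", 1.0, 0.1), Outcome("large", 1.0, 1.0)],
    "q2": [Outcome("small", 0.0, 0.1), Outcome("large", 1.0, 1.0)],
    "q3": [Outcome("small", 0.0, 0.1), Outcome("large", 0.0, 1.0)],
}

def always(model_name):
    """Baseline router that sends every query to one fixed model."""
    def route(query, options):
        return next(o for o in options if o.model == model_name)
    return route

def oracle(query, options):
    """Upper-bound router: the cheapest model among those with the best quality."""
    best = max(o.quality for o in options)
    return min((o for o in options if o.quality == best), key=lambda o: o.cost)

def evaluate(router, outcomes):
    """Return (mean quality, total cost) of a router over all queries."""
    chosen = [router(q, opts) for q, opts in outcomes.items()]
    mean_quality = sum(o.quality for o in chosen) / len(chosen)
    total_cost = sum(o.cost for o in chosen)
    return mean_quality, total_cost

for name, router in [("small", always("small")),
                     ("large", always("large")),
                     ("oracle", oracle)]:
    quality, cost = evaluate(router, OUTCOMES)
    print(f"{name:>6}: quality={quality:.2f} cost={cost:.2f}")
```

In this toy table the oracle matches the quality of always calling the large model at well under half the cost, which is the kind of gap a learned router would try to close without access to the true outcomes.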

Title: Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

Authors: Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Jiayuan Gu, Gordon Wetzstein, Hao Su, Leonidas Guibas
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2403.12032
Pdf URL: https://arxiv.org/pdf/2403.12032
Copy Paste: [[2403.12032]] Generic 3D Diffusion Adapter Using Controlled Multi-View Editing(https://arxiv.org/abs/2403.12032)
Keywords: diffusion
Abstract: Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in either 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.

Title: HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation

Authors: Ce Zhang, Simon Stepputtis, Joseph Campbell, Katia Sycara, Yaqi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12033
Pdf URL: https://arxiv.org/pdf/2403.12033
Copy Paste: [[2403.12033]] HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation(https://arxiv.org/abs/2403.12033)
Keywords: robust
Abstract: Being able to understand visual scenes is a precursor for many downstream tasks, including autonomous driving, robotics, and other vision-based approaches. A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG); however, many existing approaches assume undisturbed vision, i.e., the absence of real-world corruptions such as fog, snow, smoke, as well as non-uniform perturbations like sun glare or water drops. In this work, we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline for scene graph generation under such challenging settings. At its core, HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its predictions from coarse initial estimates to detailed predictions. In our extensive experiments, we show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is available at https://github.com/zhangce01/HiKER-SGG.

Title: VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

Authors: Junlin Han, Filippos Kokkinos, Philip Torr
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12034
Pdf URL: https://arxiv.org/pdf/2403.12034
Copy Paste: [[2403.12034]] VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models(https://arxiv.org/abs/2403.12034)
Keywords: diffusion, generative
Abstract: This paper presents a novel paradigm for building scalable 3D generative models utilizing pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, texts, or videos, 3D data are not readily accessible and are difficult to acquire. This results in a significant disparity in scale compared to the vast quantities of other types of data. To address this issue, we propose using a video diffusion model, trained with extensive volumes of text, images, and videos, as a knowledge source for 3D data. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model. The proposed model, VFusion3D, trained on nearly 3M synthetic multi-view data, can generate a 3D asset from a single image in seconds and achieves superior performance when compared to current SOTA feed-forward 3D generative models, with users preferring our results over 70% of the time.

Title: One-Step Image Translation with Text-to-Image Models

Authors: Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, Jun-Yan Zhu
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.12036
Pdf URL: https://arxiv.org/pdf/2403.12036
Copy Paste: [[2403.12036]] One-Step Image Translation with Text-to-Image Models(https://arxiv.org/abs/2403.12036)
Keywords: diffusion
Abstract: In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at https://github.com/GaParmar/img2img-turbo.

Title: MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

Authors: Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12037
Pdf URL: https://arxiv.org/pdf/2403.12037
Copy Paste: [[2403.12037]] MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control(https://arxiv.org/abs/2403.12037)
Keywords: diffusion, large language model
Abstract: It is a long-standing goal to design a generalist embodied agent that can follow diverse instructions in human-like ways. However, existing approaches often fail to steadily follow instructions due to difficulties in understanding abstract and sequential natural language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator with an innovative paradigm that enhances instruction-following ability in low-level control signal generation. Specifically, MineDreamer is developed on top of recent advances in Multimodal Large Language Models (MLLMs) and diffusion models, and we employ a Chain-of-Imagination (CoI) mechanism to envision the step-by-step process of executing instructions and translating imaginations into more precise visual prompts tailored to the current state; subsequently, the agent generates keyboard-and-mouse actions to efficiently achieve these imaginations, steadily following the instructions at each step. Extensive experiments demonstrate that MineDreamer follows single and multi-step instructions steadily, significantly outperforming the best generalist agent baseline and nearly doubling its performance. Moreover, qualitative analysis of the agent's imaginative ability reveals its generalization and comprehension of the open world.

Title: Zero-Shot Image Feature Consensus with Deep Functional Maps

Authors: Xinle Cheng, Congyue Deng, Adam Harley, Yixin Zhu, Leonidas Guibas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12038
Pdf URL: https://arxiv.org/pdf/2403.12038
Copy Paste: [[2403.12038]] Zero-Shot Image Feature Consensus with Deep Functional Maps(https://arxiv.org/abs/2403.12038)
Keywords: generative
Abstract: Correspondences emerge from large-scale vision models trained for generative and discriminative tasks. This has been revealed and benchmarked by computing correspondence maps between pairs of images, using nearest neighbors on the feature grids. Existing work has attempted to improve the quality of these correspondence maps by carefully mixing features from different sources, such as by combining the features of different layers or networks. We point out that a better correspondence strategy is available, which directly imposes structure on the correspondence field: the functional map. Wielding this simple mathematical tool, we lift the correspondence problem from the pixel space to the function space and directly optimize for mappings that are globally coherent. We demonstrate that our technique yields correspondences that are not only smoother but also more accurate, with the possibility of better reflecting the knowledge embedded in the large-scale vision models that we are studying. Our approach sets a new state-of-the-art on various dense correspondence tasks. We also demonstrate our effectiveness in keypoint correspondence and affordance map transfer.

Title: Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Authors: Zixin Zhu, Xuelu Feng, Dongdong Chen, Junsong Yuan, Chunming Qiao, Gang Hua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.12042
Pdf URL: https://arxiv.org/pdf/2403.12042
Copy Paste: [[2403.12042]] Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation(https://arxiv.org/abs/2403.12042)
Keywords: diffusion, transformer, generative, segmentation
Abstract: In this paper, we explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned from a pre-trained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. Our hypothesis is validated through the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed pre-trained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. In addition, instead of using standard Gaussian noise, we propose to predict video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and elevates segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pre-trained with discriminative image/video pre-tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code will be available at https://github.com/buxiangzhiren/VD-IT.