secure

security

Title: Generating Visually Realistic Adversarial Patch. (arXiv:2312.03030v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03030
Code URL: null
Copy Paste: [[2312.03030]] Generating Visually Realistic Adversarial Patch(http://arxiv.org/abs/2312.03030)
Summary:
Deep neural networks (DNNs) are vulnerable to various types of adversarial examples, bringing huge threats to security-critical applications. Among these, adversarial patches have drawn increasing attention due to their good applicability to fool DNNs in the physical world. However, existing works often generate patches with meaningless noise or patterns, making them conspicuous to humans. To address this issue, we explore how to generate visually realistic adversarial patches to fool DNNs. Firstly, we analyze that a high-quality adversarial patch should be realistic, position irrelevant, and printable to be deployed in the physical world. Based on this analysis, we propose an effective attack called VRAP, to generate visually realistic adversarial patches. Specifically, VRAP constrains the patch in the neighborhood of a real image to ensure visual realism, optimizes the patch at the poorest position for position irrelevance, and adopts Total Variance loss as well as gamma transformation to make the generated patch printable without losing information. Empirical evaluations on the ImageNet dataset demonstrate that the proposed VRAP exhibits outstanding attack performance in the digital world. Moreover, the generated adversarial patches can be disguised as the scrawl or logo in the physical world to fool the deep models without being detected, bringing significant threats to DNNs-enabled applications.
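
As a rough illustration of the two printability terms named in the summary, the sketch below computes a smoothness penalty over neighbouring pixels and applies a pointwise gamma curve. It is not the authors' implementation: the function names, the absolute-difference (anisotropic) form of the penalty, and the toy 2x2 patches are all assumptions made for the example.

```python
def total_variation(patch):
    """Sum of absolute differences between neighbouring pixels.

    `patch` is a 2D list of grayscale intensities in [0, 1].
    The anisotropic (absolute-difference) form is an assumption.
    """
    rows, cols = len(patch), len(patch[0])
    tv = 0.0
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:  # vertical neighbour
                tv += abs(patch[r + 1][c] - patch[r][c])
            if c + 1 < cols:  # horizontal neighbour
                tv += abs(patch[r][c + 1] - patch[r][c])
    return tv


def gamma_transform(patch, gamma):
    """Apply the pointwise power law v -> v ** gamma to every pixel."""
    return [[v ** gamma for v in row] for row in patch]


smooth = [[0.5, 0.5], [0.5, 0.5]]  # flat patch: no neighbour differences
noisy = [[0.0, 1.0], [1.0, 0.0]]   # checkerboard: every neighbour pair differs
```

A smooth patch scores lower than a noisy one under this penalty, which is why such a term is commonly used to discourage high-frequency patterns that printers reproduce poorly.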

Title: LiDAR-based Person Re-identification. (arXiv:2312.03033v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03033
Code URL: null
Copy Paste: [[2312.03033]] LiDAR-based Person Re-identification(http://arxiv.org/abs/2312.03033)
Summary:
Camera-based person re-identification (ReID) systems have been widely applied in the field of public security. However, cameras often lack the perception of 3D morphological information of human and are susceptible to various limitations, such as inadequate illumination, complex background, and personal privacy. In this paper, we propose a LiDAR-based ReID framework, ReID3D, that utilizes pre-training strategy to retrieve features of 3D body shape and introduces Graph-based Complementary Enhancement Encoder for extracting comprehensive features. Due to the lack of LiDAR datasets, we build LReID, the first LiDAR-based person ReID dataset, which is collected in several outdoor scenes with variations in natural conditions. Additionally, we introduce LReID-sync, a simulated pedestrian dataset designed for pre-training encoders with tasks of point cloud completion and shape parameter learning. Extensive experiments on LReID show that ReID3D achieves exceptional performance with a rank-1 accuracy of 94.0, highlighting the significant potential of LiDAR in addressing person ReID tasks. To the best of our knowledge, we are the first to propose a solution for LiDAR-based ReID. The code and datasets will be released soon.

Title: Securing Data Platforms: Strategic Masking Techniques for Privacy and Security for B2B Enterprise Data. (arXiv:2312.03293v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.03293
Code URL: null
Copy Paste: [[2312.03293]] Securing Data Platforms: Strategic Masking Techniques for Privacy and Security for B2B Enterprise Data(http://arxiv.org/abs/2312.03293)
Summary:
In today's digital age, the imperative to protect data privacy and security is a paramount concern, especially for business-to-business (B2B) enterprises that handle sensitive information. These enterprises are increasingly constructing data platforms, which are integrated suites of technology solutions architected for the efficient management, processing, storage, and data analysis. It has become critical to design these data platforms with mechanisms that inherently support data privacy and security, particularly as they encounter the added complexity of safeguarding unstructured data types such as log files and text documents. Within this context, data masking stands out as a vital feature of data platform architecture. It proactively conceals sensitive elements, ensuring data privacy while preserving the information's value for business operations and analytics. This protective measure entails a strategic two-fold process: firstly, accurately pinpointing the sensitive data that necessitates concealment, and secondly, applying sophisticated methods to disguise that data effectively within the data platform infrastructure. This research delves into the nuances of embedding advanced data masking techniques within the very fabric of data platforms and an in-depth exploration of how enterprises can adopt a comprehensive approach toward effective data masking implementation by exploring different identification and anonymization techniques.
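
The two-fold process described above (identify sensitive data, then disguise it) can be sketched for unstructured log text as follows. This is a minimal example under stated assumptions: the two regular expressions, the `identify` and `mask` helpers, and the placeholder format are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical patterns for illustration; real platforms need broader detectors.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CARD": re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"),
}


def identify(text):
    """Step 1: return (label, matched_text) pairs for every sensitive span found."""
    return [
        (label, match.group())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def mask(text):
    """Step 2: replace each sensitive span with a placeholder naming its type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


log_line = "user=alice@example.com paid with 4111 1111 1111 1111"
masked = mask(log_line)
```

Replacing a value with a typed placeholder keeps the log line readable for analytics while removing the sensitive value itself.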

Title: Behavioral Authentication for Security and Safety. (arXiv:2312.03429v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.03429
Code URL: null
Copy Paste: [[2312.03429]] Behavioral Authentication for Security and Safety(http://arxiv.org/abs/2312.03429)
Summary:
The issues of both system security and safety can be dissected integrally from the perspective of behavioral appropriateness. That is, whether a system is secure or safe can be judged by whether the behavior of certain agent(s) is appropriate or not. Specifically, a so-called appropriate behavior involves the right agent performing the right actions at the right time under certain conditions. Then, according to different levels of appropriateness and degrees of custodies, behavioral authentication can be graded into three levels, i.e., the authentication of behavioral Identity, Conformity, and Benignity. In a broad sense, for the security and safety issue, behavioral authentication is not only an innovative and promising method due to its inherent advantages but also a critical and fundamental problem due to the ubiquity of behavior generation and the necessity of behavior regulation in any system. By this classification, this review provides a comprehensive examination of the background and preliminaries of behavioral authentication. It further summarizes existing research based on their respective focus areas and characteristics. The challenges confronted by current behavioral authentication methods are analyzed, and potential research directions are discussed to promote the diversified and integrated development of behavioral authentication.

privacy

protect

defense

Title: Defense Against Adversarial Attacks using Convolutional Auto-Encoders. (arXiv:2312.03520v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03520
Code URL: null
Copy Paste: [[2312.03520]] Defense Against Adversarial Attacks using Convolutional Auto-Encoders(http://arxiv.org/abs/2312.03520)
Summary:
Deep learning models, while achieving state-of-the-art performance on many tasks, are susceptible to adversarial attacks that exploit inherent vulnerabilities in their architectures. Adversarial attacks manipulate the input data with imperceptible perturbations, causing the model to misclassify the data or produce erroneous outputs. This work is based on enhancing the robustness of targeted classifier models against adversarial attacks. To achieve this, a convolutional autoencoder-based approach is employed that effectively counters adversarial perturbations introduced to the input images. By generating images closely resembling the input images, the proposed methodology aims to restore the model's accuracy.

attack

Title: Clinical Notes Reveal Physician Fatigue. (arXiv:2312.03077v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03077
Code URL: null
Copy Paste: [[2312.03077]] Clinical Notes Reveal Physician Fatigue(http://arxiv.org/abs/2312.03077)
Summary:
Physicians write notes about patients. In doing so, they reveal much about themselves. Using data from 129,228 emergency room visits, we train a model to identify notes written by fatigued physicians -- those who worked 5 or more of the prior 7 days. In a hold-out set, the model accurately identifies notes written by these high-workload physicians, and also flags notes written in other high-fatigue settings: on overnight shifts, and after high patient volumes. Model predictions also correlate with worse decision-making on at least one important metric: yield of testing for heart attack is 18% lower with each standard deviation increase in model-predicted fatigue. Finally, the model indicates that notes written about Black and Hispanic patients have 12% and 21% higher predicted fatigue than Whites -- larger than overnight vs. daytime differences. These results have an important implication for large language models (LLMs). Our model indicates that fatigued doctors write more predictable notes. Perhaps unsurprisingly, because word prediction is the core of how LLMs work, we find that LLM-written notes have 17% higher predicted fatigue than real physicians' notes. This indicates that LLMs may introduce distortions in generated text that are not yet fully understood.

Title: Parallel Proof-of-Work with DAG-Style Voting and Targeted Reward Discounting. (arXiv:2312.03111v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.03111
Code URL: null
Copy Paste: [[2312.03111]] Parallel Proof-of-Work with DAG-Style Voting and Targeted Reward Discounting(http://arxiv.org/abs/2312.03111)
Summary:
We present parallel proof-of-work with DAG-style voting, a novel proof-of-work cryptocurrency protocol that, compared to Bitcoin, provides better consistency guarantees, higher transaction throughput, lower transaction confirmation latency, and higher resilience against incentive attacks. The superior consistency guarantees follow from implementing parallel proof-of-work, a recent consensus scheme that enforces a configurable number of proof-of-work votes per block. Our work is inspired by another recent protocol, Tailstorm, which structures the individual votes as a tree and mitigates incentive attacks by discounting the mining rewards proportionally to the depth of the tree. We propose to structure the votes as a directed acyclic graph (DAG) instead of a tree. This allows for a more targeted punishment of offending miners and, as we show through a reinforcement learning based attack search, makes the protocol even more resilient to incentive attacks. An interesting by-product of our analysis is that parallel proof-of-work without reward discounting is less resilient to incentive attacks than Bitcoin in some realistic network scenarios.

robust

Title: Few-Shot Anomaly Detection with Adversarial Loss for Robust Feature Representations. (arXiv:2312.03005v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03005
Code URL: null
Copy Paste: [[2312.03005]] Few-Shot Anomaly Detection with Adversarial Loss for Robust Feature Representations(http://arxiv.org/abs/2312.03005)
Summary:
Anomaly detection is a critical and challenging task that aims to identify data points deviating from normal patterns and distributions within a dataset. Various methods have been proposed using a one-class-one-model approach, but these techniques often face practical problems such as memory inefficiency and the requirement of sufficient data for training. In particular, few-shot anomaly detection presents significant challenges in industrial applications, where limited samples are available before mass production. In this paper, we propose a few-shot anomaly detection method that integrates adversarial training loss to obtain more robust and generalized feature representations. We utilize the adversarial loss previously employed in domain adaptation to align feature distributions between source and target domains, to enhance feature robustness and generalization in few-shot anomaly detection tasks. We hypothesize that adversarial loss is effective when applied to features that should have similar characteristics, such as those from the same layer in a Siamese network's parallel branches or input-output pairs of reconstruction-based methods. Experimental results demonstrate that the proposed method generally achieves better performance when utilizing the adversarial loss.

Title: Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey. (arXiv:2312.03014v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03014
Code URL: null
Copy Paste: [[2312.03014]] Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey(http://arxiv.org/abs/2312.03014)
Summary:
As artificial intelligence (AI) continues to rapidly evolve, the realm of Earth and atmospheric sciences is increasingly adopting data-driven models, powered by progressive developments in deep learning (DL). Specifically, DL techniques are extensively utilized to decode the chaotic and nonlinear aspects of Earth systems, and to address climate challenges via understanding weather and climate data. Cutting-edge performance on specific tasks within narrower spatio-temporal scales has been achieved recently through DL. The rise of large models, specifically large language models (LLMs), has enabled fine-tuning processes that yield remarkable outcomes across various downstream tasks, thereby propelling the advancement of general AI. However, we are still navigating the initial stages of crafting general AI for weather and climate. In this survey, we offer an exhaustive, timely overview of state-of-the-art AI methodologies specifically engineered for weather and climate data, with a special focus on time series and text data. Our primary coverage encompasses four critical aspects: types of weather and climate data, principal model architectures, model scopes and applications, and datasets for weather and climate. Furthermore, in relation to the creation and application of foundation models for weather and climate data understanding, we delve into the field's prevailing challenges, offer crucial insights, and propose detailed avenues for future research. This comprehensive approach equips practitioners with the requisite knowledge to make substantial progress in this domain. Our survey encapsulates the most recent breakthroughs in research on large, data-driven models for weather and climate data understanding, emphasizing robust foundations, current advancements, practical applications, crucial resources, and prospective research opportunities.

Title: SEVA: Leveraging sketches to evaluate alignment between human and machine visual abstraction. (arXiv:2312.03035v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03035
Code URL: https://github.com/cogtoolslab/visual_abstractions_benchmarking_public2023
Copy Paste: [[2312.03035]] SEVA: Leveraging sketches to evaluate alignment between human and machine visual abstraction(http://arxiv.org/abs/2312.03035)
Summary:
Sketching is a powerful tool for creating abstract images that are sparse but meaningful. Sketch understanding poses fundamental challenges for general-purpose vision algorithms because it requires robustness to the sparsity of sketches relative to natural visual inputs and because it demands tolerance for semantic ambiguity, as sketches can reliably evoke multiple meanings. While current vision algorithms have achieved high performance on a variety of visual tasks, it remains unclear to what extent they understand sketches in a human-like way. Here we introduce SEVA, a new benchmark dataset containing approximately 90K human-generated sketches of 128 object concepts produced under different time constraints, and thus systematically varying in sparsity. We evaluated a suite of state-of-the-art vision algorithms on their ability to correctly identify the target concept depicted in these sketches and to generate responses that are strongly aligned with human response patterns on the same sketch recognition task. We found that vision algorithms that better predicted human sketch recognition performance also better approximated human uncertainty about sketch meaning, but there remains a sizable gap between model and human response patterns. To explore the potential of models that emulate human visual abstraction in generative tasks, we conducted further evaluations of a recently developed sketch generation algorithm (Vinker et al., 2022) capable of generating sketches that vary in sparsity. We hope that public release of this dataset and evaluation protocol will catalyze progress towards algorithms with enhanced capacities for human-like visual abstraction.

Title: DiffusionPCR: Diffusion Models for Robust Multi-Step Point Cloud Registration. (arXiv:2312.03053v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03053
Code URL: null
Copy Paste: [[2312.03053]] DiffusionPCR: Diffusion Models for Robust Multi-Step Point Cloud Registration(http://arxiv.org/abs/2312.03053)
Summary:
Point Cloud Registration (PCR) estimates the relative rigid transformation between two point clouds. We propose formulating PCR as a denoising diffusion probabilistic process, mapping noisy transformations to the ground truth. However, using diffusion models for PCR has nontrivial challenges, such as adapting a generative model to a discriminative task and leveraging the estimated nonlinear transformation from the previous step. Instead of training a diffusion model to directly map pure noise to ground truth, we map the predictions of an off-the-shelf PCR model to ground truth. The predictions of off-the-shelf models are often imperfect, especially in challenging cases where the two point clouds have low overlap, and thus could be seen as noisy versions of the real rigid transformation. In addition, we transform the rotation matrix into a spherical linear space for interpolation between samples in the forward process, and convert rigid transformations into auxiliary information to implicitly exploit last-step estimations in the reverse process. As a result, conditioned on time step, the denoising model adapts to the increasing accuracy across steps and refines registrations. Our extensive experiments showcase the effectiveness of our DiffusionPCR, yielding state-of-the-art registration recall rates (95.3%/81.6%) on 3DMatch and 3DLoMatch. The code will be made public upon publication.
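
The summary mentions interpolating rotations in a spherical linear space. One common way to do this is spherical linear interpolation (slerp) between unit quaternions, sketched below. The quaternion parameterization and the `slerp` helper are assumptions for illustration; the summary does not say which parameterization the paper uses.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # take the shorter arc; q and -q encode the same rotation
        q1, dot = [-c for c in q1], -dot
    dot = min(1.0, dot)  # guard acos against rounding slightly above 1
    theta = math.acos(dot)
    if theta < 1e-8:  # nearly identical rotations: nothing to interpolate
        return list(q0)
    s = math.sin(theta)
    w0 = math.sin((1 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(q0, q1)]


identity = [1.0, 0.0, 0.0, 0.0]
# 90 degree rotation about the z axis.
quarter_turn_z = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
# Halfway along the arc: a 45 degree rotation about the z axis.
halfway = slerp(identity, quarter_turn_z, 0.5)
```

Unlike element-wise averaging of rotation matrices, slerp stays on the unit sphere, so every intermediate sample is itself a valid rotation.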

Title: ScAR: Scaling Adversarial Robustness for LiDAR Object Detection. (arXiv:2312.03085v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03085
Code URL: null
Copy Paste: [[2312.03085]] ScAR: Scaling Adversarial Robustness for LiDAR Object Detection(http://arxiv.org/abs/2312.03085)
Summary:
The adversarial robustness of a model is its ability to resist adversarial attacks in the form of small perturbations to input data. Universal adversarial attack methods such as Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) are popular for LiDAR object detection, but they are often deficient compared to task-specific adversarial attacks. Additionally, these universal methods typically require unrestricted access to the model's information, which is difficult to obtain in real-world applications. To address these limitations, we present a black-box Scaling Adversarial Robustness (ScAR) method for LiDAR object detection. By analyzing the statistical characteristics of 3D object detection datasets such as KITTI, Waymo, and nuScenes, we have found that the model's prediction is sensitive to scaling of 3D instances. We propose three black-box scaling adversarial attack methods based on the available information: model-aware attack, distribution-aware attack, and blind attack. We also introduce a strategy for generating scaling adversarial examples to improve the model's robustness against these three scaling adversarial attacks. Comparison with other methods on public datasets under different 3D object detection architectures demonstrates the effectiveness of our proposed method.
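
The geometric operation behind a scaling perturbation can be sketched as resizing an object's points about their centroid. This shows only the scaling step; how the three attacks in the summary choose the scale factor is not described there, and the `scale_instance` helper and the toy footprint are assumptions for illustration.

```python
def scale_instance(points, factor):
    """Scale a 3D object's points about their centroid by `factor`."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    return [
        tuple(centroid[i] + factor * (p[i] - centroid[i]) for i in range(3))
        for p in points
    ]


# Toy footprint of a 4 m x 2 m object (hypothetical data).
car = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 2.0, 0.0), (0.0, 2.0, 0.0)]
shrunk = scale_instance(car, 0.5)
```

Scaling about the centroid keeps the object in place, so only its apparent size changes and not its position in the scene.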

Title: Human Body Model based ID using Shape and Pose Parameters. (arXiv:2312.03227v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03227
Code URL: null
Copy Paste: [[2312.03227]] Human Body Model based ID using Shape and Pose Parameters(http://arxiv.org/abs/2312.03227)
Summary:
We present a Human Body Model based IDentification (HMID) system that is jointly trained for shape, pose and biometric identification. HMID is based on the Human Mesh Recovery (HMR) network and we propose additional losses to improve and stabilize shape estimation and biometric identification while maintaining the pose and shape output. We show that when our HMID network is trained using additional shape and pose losses, it shows a significant improvement in biometric identification performance when compared to an identical model that does not use such losses. The HMID model uses raw images instead of silhouettes and is able to perform robust recognition on images collected at range and altitude as many anthropometric properties are reasonably invariant to clothing, view and range. We show results on the USF dataset as well as the BRIAR dataset which includes probes with both clothing and view changes. Our approach (using body model losses) shows a significant improvement in Rank20 accuracy and True Accuracy Rate on the BRIAR evaluation dataset.

Title: Indirect Gradient Matching for Adversarial Robust Distillation. (arXiv:2312.03286v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03286
Code URL: null
Copy Paste: [[2312.03286]] Indirect Gradient Matching for Adversarial Robust Distillation(http://arxiv.org/abs/2312.03286)
Summary:
Adversarial training significantly improves adversarial robustness, but superior performance is primarily attained with large models. This substantial performance gap for smaller models has spurred active research into adversarial distillation (AD) to mitigate the difference. Existing AD methods leverage the teacher's logits as a guide. In contrast to these approaches, we aim to transfer another piece of knowledge from the teacher, the input gradient. In this paper, we propose a distillation module termed Indirect Gradient Distillation Module (IGDM) that indirectly matches the student's input gradient with that of the teacher. We hypothesize that students can better acquire the teacher's knowledge by matching the input gradient. Leveraging the observation that adversarial training renders the model locally linear on the input space, we employ Taylor approximation to effectively align gradients without directly calculating them. Experimental results show that IGDM seamlessly integrates with existing AD methods, significantly enhancing the performance of all AD methods. Particularly, utilizing IGDM on the CIFAR-100 dataset improves the AutoAttack accuracy from 28.06% to 30.32% with the ResNet-18 model and from 26.18% to 29.52% with the MobileNetV2 model when integrated into the SOTA method without additional data augmentation. The code will be made available.
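
A first-order Taylor expansion gives f(x + delta) - f(x) ≈ grad f(x) · delta, so comparing output differences under the same small perturbation constrains input gradients without computing them. The toy check below illustrates only that approximation; it is one reading of the summary, not the IGDM loss, and the scalar `teacher` function is a made-up stand-in for a network.

```python
def teacher(x):
    """Toy scalar 'teacher' on a 2D input; stands in for a network output."""
    return 3.0 * x[0] + 2.0 * x[1] ** 2


def output_difference(f, x, delta):
    """f(x + delta) - f(x): needs only forward passes, no gradient computation."""
    shifted = [xi + di for xi, di in zip(x, delta)]
    return f(shifted) - f(x)


x = [1.0, 1.0]
delta = [1e-4, -2e-4]

# Analytic input gradient of the toy teacher at x, used only to check the approximation.
grad = [3.0, 4.0 * x[1]]
directional = sum(g * d for g, d in zip(grad, delta))

diff = output_difference(teacher, x, delta)
```

Here `diff` and `directional` agree up to a second-order term in `delta`, which is the sense in which matching output differences stands in for matching gradients.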

Title: Class Incremental Learning for Adversarial Robustness. (arXiv:2312.03289v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03289
Code URL: null
Copy Paste: [[2312.03289]] Class Incremental Learning for Adversarial Robustness(http://arxiv.org/abs/2312.03289)
Summary:
Adversarial training integrates adversarial examples during model training to enhance robustness. However, its application in fixed dataset settings differs from real-world dynamics, where data accumulates incrementally. In this study, we investigate Adversarially Robust Class Incremental Learning (ARCIL), a method that combines adversarial robustness with incremental learning. We observe that combining incremental learning with naive adversarial training easily leads to a loss of robustness. We discover that this is attributed to the disappearance of the flatness of the loss function, a characteristic of adversarial training. To address this issue, we propose the Flatness Preserving Distillation (FPD) loss that leverages the output difference between adversarial and clean examples. Additionally, we introduce the Logit Adjustment Distillation (LAD) loss, which adapts the model's knowledge to perform well on new tasks. Experimental results demonstrate the superiority of our method over approaches that apply adversarial training to existing incremental learning methods, which provides a strong baseline for incremental learning on adversarial robustness in the future. Our method achieves AutoAttack accuracy that is 5.99%p, 5.27%p, and 3.90%p higher on average than the baseline on split CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively. The code will be made available.

Title: PointJEM: Self-supervised Point Cloud Understanding for Reducing Feature Redundancy via Joint Entropy Maximization. (arXiv:2312.03339v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03339
Code URL: null
Copy Paste: [[2312.03339]] PointJEM: Self-supervised Point Cloud Understanding for Reducing Feature Redundancy via Joint Entropy Maximization(http://arxiv.org/abs/2312.03339)
Summary:
Most deep learning-based point cloud processing methods are supervised and require large-scale labeled data. However, manual labeling of point cloud data is laborious and time-consuming. Self-supervised representation learning can address the aforementioned issue by learning robust and generalized representations from unlabeled datasets. Nevertheless, the embedded features obtained by representation learning usually contain redundant information, and most current methods reduce feature redundancy by linear correlation constraints. In this paper, we propose PointJEM, a self-supervised representation learning method applied to the point cloud field. PointJEM comprises an embedding scheme and a loss function based on joint entropy. The embedding scheme divides the embedding vector into different parts, each of which can learn a distinctive feature. To reduce redundant information in the features, PointJEM maximizes the joint entropy between the different parts, thereby rendering the learned feature variables pairwise independent. To validate the effectiveness of our method, we conducted experiments on multiple datasets. The results demonstrate that our method can significantly reduce feature redundancy beyond linear correlation. Furthermore, PointJEM achieves competitive performance in downstream tasks such as classification and segmentation.
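
To see why maximizing joint entropy discourages redundancy, the sketch below estimates the joint entropy H(A, B) of two discrete variables from paired samples. It is a toy illustration, not the PointJEM loss: the `joint_entropy` helper and the four-sample binary "parts" are assumptions made for the example.

```python
import math
from collections import Counter


def joint_entropy(a, b):
    """Shannon joint entropy H(A, B) in bits, estimated from paired discrete samples."""
    counts = Counter(zip(a, b))
    n = len(a)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


part1 = [0, 0, 1, 1]
redundant = [0, 0, 1, 1]    # an exact copy of part1: adds no new information
independent = [0, 1, 0, 1]  # varies independently of part1 in this sample

h_redundant = joint_entropy(part1, redundant)      # 1 bit
h_independent = joint_entropy(part1, independent)  # 2 bits
```

The redundant pair carries one bit in total, while the independent pair reaches the two-bit maximum for two binary variables, so pushing joint entropy up pushes the parts toward independence.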

Title: Online Vectorized HD Map Construction using Geometry. (arXiv:2312.03341v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03341
Code URL: https://github.com/cnzzx/gemap
Copy Paste: [[2312.03341]] Online Vectorized HD Map Construction using Geometry(http://arxiv.org/abs/2312.03341)
Summary:
The construction of online vectorized High-Definition (HD) maps is critical for downstream prediction and planning. Recent efforts have built strong baselines for this task; however, the shapes and relations of instances in urban road systems, such as parallelism, perpendicularity, and rectangular shapes, are still under-explored. In our work, we propose GeMap (Geometry Map), which end-to-end learns Euclidean shapes and relations of map instances beyond basic perception. Specifically, we design a geometric loss based on angle and distance clues, which is robust to rigid transformations. We also decouple self-attention to independently handle Euclidean shapes and relations. Our method achieves new state-of-the-art performance on the NuScenes and Argoverse 2 datasets. Remarkably, it reaches a 71.8% mAP on the large-scale Argoverse 2 dataset, outperforming MapTR V2 by +4.4% and surpassing the 70% mAP threshold for the first time. Code is available at https://github.com/cnzzx/GeMap
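
The claim that angle and distance cues are robust to rigid transformations can be checked numerically: segment lengths and turning angles of a polyline do not change when it is rotated and translated. The sketch below demonstrates that invariance only; it is not GeMap's loss, and the helper names and the L-shaped toy element are assumptions.

```python
import math


def rigid_transform(points, angle, shift):
    """Rotate 2D points by `angle` (radians) about the origin, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in points]


def edge_lengths_and_angles(polyline):
    """Return segment lengths and the signed turning angle between consecutive segments."""
    vecs = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(polyline, polyline[1:])]
    lengths = [math.hypot(vx, vy) for vx, vy in vecs]
    angles = [
        math.atan2(ax * by - ay * bx, ax * bx + ay * by)  # atan2(cross, dot)
        for (ax, ay), (bx, by) in zip(vecs, vecs[1:])
    ]
    return lengths, angles


corner = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)]  # an L-shaped map element
moved = rigid_transform(corner, angle=0.7, shift=(5.0, -3.0))

lengths_before, angles_before = edge_lengths_and_angles(corner)
lengths_after, angles_after = edge_lengths_and_angles(moved)
```

Raw coordinates differ completely between `corner` and `moved`, yet the lengths and angles match up to floating-point error, which is what makes such quantities a pose-independent description of shape.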

Title: RING-NeRF: A Versatile Architecture based on Residual Implicit Neural Grids. (arXiv:2312.03357v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03357
Code URL: null
Copy Paste: [[2312.03357]] RING-NeRF: A Versatile Architecture based on Residual Implicit Neural Grids(http://arxiv.org/abs/2312.03357)
Summary:
Since their introduction, Neural Fields have become very popular for 3D reconstruction and new view synthesis. Recent research has focused on accelerating the process, as well as improving the robustness to variation of the observation distance and limited number of supervised viewpoints. However, those approaches often led to dedicated solutions that cannot be easily combined. To tackle this issue, we introduce a new simple but efficient architecture named RING-NeRF, based on Residual Implicit Neural Grids, that provides control over the level of detail of the mapping function between the scene and the latent spaces. Associated with a distance-aware forward mapping mechanism and a continuous coarse-to-fine reconstruction process, our versatile architecture demonstrates both fast training and state-of-the-art performance in terms of: (1) anti-aliased rendering, (2) reconstruction quality from few supervised viewpoints, and (3) robustness in the absence of appropriate scene-specific initialization for SDF-based NeRFs. We also demonstrate that our architecture can dynamically add grids to increase the details of the reconstruction, opening the way to adaptive reconstruction.

Title: Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future. (arXiv:2312.03408v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03408
Code URL: https://github.com/opendrivelab/driveagi
Copy Paste: [[2312.03408]] Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future(http://arxiv.org/abs/2312.03408)
Summary:
With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem. Current autonomous driving datasets can broadly be categorized into two generations. The first-generation autonomous driving datasets are characterized by relatively simpler sensor modalities, smaller data scale, and a restriction to perception-level tasks. KITTI, introduced in 2012, serves as a prominent representative of this initial wave. In contrast, the second-generation datasets exhibit heightened complexity in sensor modalities, greater data scale and diversity, and an expansion of tasks from perception to encompass prediction and control. Leading examples of the second generation include nuScenes and Waymo, introduced around 2019. This comprehensive review, conducted in collaboration with esteemed colleagues from both academia and industry, systematically assesses over seventy open-source autonomous driving datasets from domestic and international sources. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets, the pivotal role of data engine systems, and the utilization of generative foundation models to facilitate scalable data generation. Furthermore, this review undertakes an exhaustive analysis and discourse regarding the characteristics and data scales that future third-generation autonomous driving datasets should possess. It also delves into the scientific and technical challenges that warrant resolution. These endeavors are pivotal in advancing autonomous innovation and fostering technological enhancement in critical domains. For further details, please refer to https://github.com/OpenDriveLab/DriveAGI.

Title: Enhancing Kinship Verification through Multiscale Retinex and Combined Deep-Shallow features. (arXiv:2312.03562v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03562
Code URL: null
Copy Paste: [[2312.03562]] Enhancing Kinship Verification through Multiscale Retinex and Combined Deep-Shallow features(http://arxiv.org/abs/2312.03562)
Summary:
The challenge of kinship verification from facial images represents a cutting-edge and formidable frontier in the realms of pattern recognition and computer vision. This area of study holds a myriad of potential applications, spanning from image annotation and forensic analysis to social media research. Our research stands out by integrating a preprocessing method named Multiscale Retinex (MSR), which elevates image quality and amplifies contrast, ultimately bolstering the end results. Strategically, our methodology capitalizes on the harmonious blend of deep and shallow texture descriptors, merging them proficiently at the score level through the Logistic Regression (LR) method. To elucidate, we employ the Local Phase Quantization (LPQ) descriptor to extract shallow texture characteristics. For deep feature extraction, we turn to the VGG16 model, a pre-trained convolutional neural network (CNN). The robustness and efficacy of our method have been put to the test through meticulous experiments on three rigorous kinship datasets, namely: Cornell Kin Face, UB Kin Face, and TS Kin Face.

Title: A Simple Framework to Enhance the Adversarial Robustness of Deep Learning-based Intrusion Detection System. (arXiv:2312.03245v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.03245
Code URL: null
Copy Paste: [[2312.03245]] A Simple Framework to Enhance the Adversarial Robustness of Deep Learning-based Intrusion Detection System(http://arxiv.org/abs/2312.03245)
Summary:
Deep learning based intrusion detection systems (DL-based IDS) have emerged as one of the best choices for providing security solutions against various network intrusion attacks. However, due to the emergence and development of adversarial deep learning technologies, it becomes challenging for the adoption of DL models into IDS. In this paper, we propose a novel IDS architecture that can enhance the robustness of IDS against adversarial attacks by combining conventional machine learning (ML) models and Deep Learning models. The proposed DLL-IDS consists of three components: DL-based IDS, adversarial example (AE) detector, and ML-based IDS. We first develop a novel AE detector based on the local intrinsic dimensionality (LID). Then, we exploit the low attack transferability between DL models and ML models to find a robust ML model that can assist us in determining the maliciousness of AEs. If the input traffic is detected as an AE, the ML-based IDS will predict the maliciousness of input traffic, otherwise the DL-based IDS will work for the prediction. The fusion mechanism can leverage the high prediction accuracy of DL models and low attack transferability between DL models and ML models to improve the robustness of the whole system. In our experiments, we observe a significant improvement in the prediction performance of the IDS when subjected to adversarial attack, achieving high accuracy with low resource consumption.
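
The routing rule described in the summary above (use the ML-based IDS when the detector flags the input as an adversarial example, otherwise use the DL-based IDS) can be sketched as follows. This is only an illustration of the control flow, not the authors' implementation: the function names, the feature-vector type, and the three toy stand-in models are all hypothetical, and the real detector is based on local intrinsic dimensionality, which is not reproduced here.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def fused_predict(
    x: Features,
    is_adversarial: Callable[[Features], bool],
    dl_predict: Callable[[Features], int],
    ml_predict: Callable[[Features], int],
) -> int:
    """Return 1 (malicious) or 0 (benign) for one traffic sample.

    Inputs flagged by the AE detector go to the ML-based IDS;
    everything else goes to the DL-based IDS.
    """
    if is_adversarial(x):
        return ml_predict(x)
    return dl_predict(x)


# Hypothetical toy stand-ins, used only to exercise the routing logic.
def toy_detector(x: Features) -> bool:
    return max(x) > 10.0          # pretend large values look adversarial


def toy_dl_ids(x: Features) -> int:
    return int(sum(x) > 5.0)      # pretend DL decision rule


def toy_ml_ids(x: Features) -> int:
    return int(x[0] > 1.0)        # pretend ML decision rule


clean_sample = [1.0, 2.0]         # not flagged, so the DL model decides
flagged_sample = [0.5, 20.0]      # flagged, so the ML model decides

clean_label = fused_predict(clean_sample, toy_detector, toy_dl_ids, toy_ml_ids)
flagged_label = fused_predict(flagged_sample, toy_detector, toy_dl_ids, toy_ml_ids)
```

On the flagged sample the two toy models disagree, so the example shows that the final label comes from the ML branch rather than the DL branch.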

Title: REST: Enhancing Group Robustness in DNNs through Reweighted Sparse Training. (arXiv:2312.03044v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03044
Code URL: https://github.com/zhao1402072392/rest
Copy Paste: [[2312.03044]] REST: Enhancing Group Robustness in DNNs through Reweighted Sparse Training(http://arxiv.org/abs/2312.03044)
Summary:
Deep neural networks (DNNs) have been proven effective in various domains. However, they often struggle to perform well on certain minority groups during inference, despite showing strong performance on the majority of data groups. This is because over-parameterized models learned bias attributes from a large number of bias-aligned training samples. These bias attributes are strongly spuriously correlated with the target variable, causing the models to be biased towards spurious correlations (i.e., bias-conflicting). To tackle this issue, we propose a novel reweighted sparse training framework, dubbed REST, which aims to enhance the performance of biased data while improving computation and memory efficiency. Our proposed REST framework has been experimentally validated on three datasets, demonstrating its effectiveness in exploring unbiased subnetworks. We found that REST reduces the reliance on spuriously correlated features, leading to better performance across a wider range of data groups with fewer training and inference resources. We highlight that the REST framework represents a promising approach for improving the performance of DNNs on biased data, while simultaneously improving computation and memory efficiency. By reducing the reliance on spurious correlations, REST has the potential to enhance the robustness of DNNs and improve their generalization capabilities. Code is released at https://github.com/zhao1402072392/REST

Title: Multitask Learning Can Improve Worst-Group Outcomes. (arXiv:2312.03151v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03151
Code URL: https://github.com/atharvajk98/mtl-group-robustness
Copy Paste: [[2312.03151]] Multitask Learning Can Improve Worst-Group Outcomes(http://arxiv.org/abs/2312.03151)
Summary:
In order to create machine learning systems that serve a variety of users well, it is vital to not only achieve high average performance but also ensure equitable outcomes across diverse groups. However, most machine learning methods are designed to improve a model's average performance on a chosen end task without consideration for their impact on worst group error. Multitask learning (MTL) is one such widely used technique. In this paper, we seek not only to understand the impact of MTL on worst-group accuracy but also to explore its potential as a tool to address the challenge of group-wise fairness. We primarily consider the common setting of fine-tuning a pre-trained model, where, following recent work (Gururangan et al., 2020; Dery et al., 2023), we multitask the end task with the pre-training objective constructed from the end task data itself. In settings with few or no group annotations, we find that multitasking often, but not always, achieves better worst-group accuracy than Just-Train-Twice (JTT; Liu et al. (2021)) -- a representative distributionally robust optimization (DRO) method. Leveraging insights from synthetic data experiments, we propose to modify standard MTL by regularizing the joint multitask representation space. We run a large number of fine-tuning experiments across computer vision and natural language and find that our regularized MTL approach consistently outperforms JTT on both worst and average group outcomes. Our official code can be found here: https://github.com/atharvajk98/MTL-group-robustness.
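
The distinction between average accuracy and worst-group accuracy that the summary above relies on can be made concrete with a small sketch. The function names and the toy labels, predictions, and group tags are illustrative assumptions, not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def group_accuracies(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> Dict[Hashable, float]:
    """Accuracy computed separately for each group label."""
    correct: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def average_and_worst_group_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (overall accuracy, accuracy of the worst-performing group)."""
    per_group = group_accuracies(y_true, y_pred, groups)
    average = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return average, min(per_group.values())


# Toy data: 8 majority-group samples (all correct), 2 minority-group samples (1 correct).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
groups = ["majority"] * 8 + ["minority"] * 2

average_acc, worst_acc = average_and_worst_group_accuracy(y_true, y_pred, groups)
```

Here the overall accuracy is 9 out of 10 while the minority group is right only half the time, which is the kind of gap that average-only evaluation hides.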

Title: Deep Learning for Fast Inference of Mechanistic Models' Parameters. (arXiv:2312.03166v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03166
Code URL: null
Copy Paste: [[2312.03166]] Deep Learning for Fast Inference of Mechanistic Models' Parameters(http://arxiv.org/abs/2312.03166)
Summary:
Inferring parameters of macro-kinetic growth models, typically represented by Ordinary Differential Equations (ODE), from the experimental data is a crucial step in bioprocess engineering. Conventionally, estimates of the parameters are obtained by fitting the mechanistic model to observations. Fitting, however, requires a significant computational power. Specifically, during the development of new bioprocesses that use previously unknown organisms or strains, efficient, robust, and computationally cheap methods for parameter estimation are of great value. In this work, we propose using Deep Neural Networks (NN) for directly predicting parameters of mechanistic models given observations. The approach requires spending computational resources for training a NN, nonetheless, once trained, such a network can provide parameter estimates orders of magnitude faster than conventional methods. We consider a training procedure that combines Neural Networks and mechanistic models. We demonstrate the performance of the proposed algorithms on data sampled from several mechanistic models used in bioengineering describing a typical industrial batch process and compare the proposed method, a typical gradient-based fitting procedure, and the combination of the two. We find that, while Neural Network estimates are slightly improved by further fitting, these estimates are measurably better than the fitting procedure alone.

Title: SDSRA: A Skill-Driven Skill-Recombination Algorithm for Efficient Policy Learning. (arXiv:2312.03216v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03216
Code URL: https://github.com/ericjiang18/sdsra
Copy Paste: [[2312.03216]] SDSRA: A Skill-Driven Skill-Recombination Algorithm for Efficient Policy Learning(http://arxiv.org/abs/2312.03216)
Summary:
In this paper, we introduce a novel algorithm - the Skill-Driven Skill Recombination Algorithm (SDSRA) - an innovative framework that significantly enhances the efficiency of achieving maximum entropy in reinforcement learning tasks. We find that SDSRA achieves faster convergence compared to the traditional Soft Actor-Critic (SAC) algorithm and produces improved policies. By integrating skill-based strategies within the robust Actor-Critic framework, SDSRA demonstrates remarkable adaptability and performance across a wide array of complex and diverse benchmarks.

Title: f-FERM: A Scalable Framework for Robust Fair Empirical Risk Minimization. (arXiv:2312.03259v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03259
Code URL: https://github.com/optimization-for-data-driven-science/f-ferm
Copy Paste: [[2312.03259]] f-FERM: A Scalable Framework for Robust Fair Empirical Risk Minimization(http://arxiv.org/abs/2312.03259)
Summary:
Training and deploying machine learning models that meet fairness criteria for protected groups are fundamental in modern artificial intelligence. While numerous constraints and regularization terms have been proposed in the literature to promote fairness in machine learning tasks, most of these methods are not amenable to stochastic optimization due to the complex and nonlinear structure of constraints and regularizers. Here, the term "stochastic" refers to the ability of the algorithm to work with small mini-batches of data. Motivated by the limitation of existing literature, this paper presents a unified stochastic optimization framework for fair empirical risk minimization based on f-divergence measures (f-FERM). The proposed stochastic algorithm enjoys theoretical convergence guarantees. In addition, our experiments demonstrate the superiority of fairness-accuracy tradeoffs offered by f-FERM for almost all batch sizes (ranging from full-batch to batch size of one). Moreover, we show that our framework can be extended to the case where there is a distribution shift from training to the test data. Our extension is based on a distributionally robust optimization reformulation of f-FERM objective under $L_p$ norms as uncertainty sets. Again, in this distributionally robust setting, f-FERM not only enjoys theoretical convergence guarantees but also outperforms other baselines in the literature in the tasks involving distribution shifts. An efficient stochastic implementation of $f$-FERM is publicly available.

Title: OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution. (arXiv:2312.03291v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03291
Code URL: null
Copy Paste: [[2312.03291]] OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution(http://arxiv.org/abs/2312.03291)
Summary:
We propose a novel model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs (including human-unrecognizable ones), which is crucial for AI safety and reliability. Unlike traditional data-centric evaluation based on pre-defined test sets, the test set in OmniInput is self-constructed by the model itself and the model quality is evaluated by investigating its output distribution. We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model, which, after selective annotation, can be used to estimate the model's precision and recall at different output values and a comprehensive precision-recall curve. Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models, especially when their performance is almost the same on pre-defined datasets, leading to new findings and insights for how to train more robust, generalizable models.

Title: Interpretable Mechanistic Representations for Meal-level Glycemic Control in the Wild. (arXiv:2312.03344v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03344
Code URL: null
Copy Paste: [[2312.03344]] Interpretable Mechanistic Representations for Meal-level Glycemic Control in the Wild(http://arxiv.org/abs/2312.03344)
Summary:
Diabetes encompasses a complex landscape of glycemic control that varies widely among individuals. However, current methods do not faithfully capture this variability at the meal level. On the one hand, expert-crafted features lack the flexibility of data-driven methods; on the other hand, learned representations tend to be uninterpretable which hampers clinical adoption. In this paper, we propose a hybrid variational autoencoder to learn interpretable representations of CGM and meal data. Our method grounds the latent space to the inputs of a mechanistic differential equation, producing embeddings that reflect physiological quantities, such as insulin sensitivity, glucose effectiveness, and basal glucose levels. Moreover, we introduce a novel method to infer the glucose appearance rate, making the mechanistic model robust to unreliable meal logs. On a dataset of CGM and self-reported meals from individuals with type-2 diabetes and pre-diabetes, our unsupervised representation discovers a separation between individuals proportional to their disease severity. Our embeddings produce clusters that are up to 4x better than naive, expert, black-box, and pure mechanistic features. Our method provides a nuanced, yet interpretable, embedding space to compare glycemic control within and across individuals, directly learnable from in-the-wild data.

Title: An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network. (arXiv:2312.03386v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03386
Code URL: null
Copy Paste: [[2312.03386]] An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network(http://arxiv.org/abs/2312.03386)
Summary:
The recent theoretical analysis of deep neural networks in their infinite-width limits has deepened our understanding of initialisation, feature learning, and training of those networks, and brought new practical techniques for finding appropriate hyperparameters, learning network weights, and performing inference. In this paper, we broaden this line of research by showing that this infinite-width analysis can be extended to the Jacobian of a deep neural network. We show that a multilayer perceptron (MLP) and its Jacobian at initialisation jointly converge to a Gaussian process (GP) as the widths of the MLP's hidden layers go to infinity and characterise this GP. We also prove that in the infinite-width limit, the evolution of the MLP under the so-called robust training (i.e., training with a regulariser on the Jacobian) is described by a linear first-order ordinary differential equation that is determined by a variant of the Neural Tangent Kernel. We experimentally show the relevance of our theoretical claims to wide finite networks, and empirically analyse the properties of kernel regression solution to obtain an insight into Jacobian regularisation.

biometric

steal

extraction

Title: Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving. (arXiv:2312.03661v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03661
Code URL: https://github.com/fudan-zvg/reason2drive
Copy Paste: [[2312.03661]] Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving(http://arxiv.org/abs/2312.03661)
Summary:
Large vision-language models (VLMs) have garnered increasing interest in autonomous driving areas, due to their advanced capabilities in complex reasoning tasks essential for highly autonomous vehicle behavior. Despite their potential, research in autonomous systems is hindered by the lack of datasets with annotated reasoning chains that explain the decision-making processes in driving. To bridge this gap, we present Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving environments. We distinctly characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps, and the question-answer pairs are automatically collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo and ONCE. Moreover, we introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems, addressing the semantic ambiguities of existing metrics such as BLEU and CIDEr. Based on the proposed benchmark, we conduct experiments to assess various existing VLMs, revealing insights into their reasoning capabilities. Additionally, we develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy. The code and dataset will be released.

Title: LLMs for Multi-Modal Knowledge Extraction and Analysis in Intelligence/Safety-Critical Applications. (arXiv:2312.03088v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03088
Code URL: null
Copy Paste: [[2312.03088]] LLMs for Multi-Modal Knowledge Extraction and Analysis in Intelligence/Safety-Critical Applications(http://arxiv.org/abs/2312.03088)
Summary:
Large Language Models have seen rapid progress in capability in recent years; this progress has been accelerating and their capabilities, measured by various benchmarks, are beginning to approach those of humans. There is a strong demand to use such models in a wide variety of applications but, due to unresolved vulnerabilities and limitations, great care needs to be used before applying them to intelligence and safety-critical applications. This paper reviews recent literature related to LLM assessment and vulnerabilities to synthesize the current research landscape and to help understand what advances are most critical to enable use of these technologies in intelligence and safety-critical applications. The vulnerabilities are broken down into ten high-level categories and overlaid onto a high-level life cycle of an LLM. Some general categories of mitigations are reviewed.

Title: Lazy-k: Decoding for Constrained Token Classification. (arXiv:2312.03367v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03367
Code URL: https://github.com/arthurdevnl/lazyk
Copy Paste: [[2312.03367]] Lazy-k: Decoding for Constrained Token Classification(http://arxiv.org/abs/2312.03367)
Summary:
We explore the possibility of improving probabilistic models in structured prediction. Specifically, we combine the models with constrained decoding approaches in the context of token classification for information extraction. The decoding methods search for constraint-satisfying label-assignments while maximizing the total probability. To do this, we evaluate several existing approaches, as well as propose a novel decoding method called Lazy-$k$. Our findings demonstrate that constrained decoding approaches can significantly improve the models' performances, especially when using smaller models. The Lazy-$k$ approach allows for more flexibility between decoding time and accuracy. The code for using Lazy-$k$ decoding can be found here: https://github.com/ArthurDevNL/lazyk.
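
The decoding objective stated above, finding a constraint-satisfying label assignment with maximal total probability, can be illustrated with a generic best-first enumeration. This sketch is not claimed to be the paper's Lazy-k algorithm; it is a standard heap-based search, and the function names, the BIO-style constraint, and the toy probabilities are assumptions made for the example.

```python
import heapq
import math
from typing import Callable, List, Optional, Sequence, Tuple


def _log(p: float) -> float:
    return math.log(p) if p > 0.0 else float("-inf")


def constrained_decode(
    probs: Sequence[Sequence[float]],
    labels: Sequence[str],
    is_valid: Callable[[List[str]], bool],
    max_candidates: int = 10_000,
) -> Optional[Tuple[List[str], float]]:
    """Return the most probable label sequence that satisfies is_valid.

    probs[i][j] is the model probability of labels[j] for token i.
    Candidates are visited in non-increasing order of total probability.
    """
    # For each token, label indices sorted from most to least probable.
    ranked = [sorted(range(len(labels)), key=lambda j: -row[j]) for row in probs]

    def score(state: Tuple[int, ...]) -> float:
        return sum(_log(probs[i][ranked[i][r]]) for i, r in enumerate(state))

    start = (0,) * len(probs)          # every token takes its top label
    heap = [(-score(start), start)]
    seen = {start}
    popped = 0
    while heap and popped < max_candidates:
        neg_score, state = heapq.heappop(heap)
        popped += 1
        assignment = [labels[ranked[i][r]] for i, r in enumerate(state)]
        if is_valid(assignment):
            return assignment, math.exp(-neg_score)
        # Successors: demote exactly one token to its next-best label.
        for i in range(len(state)):
            if state[i] + 1 < len(labels):
                nxt = state[:i] + (state[i] + 1,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-score(nxt), nxt))
    return None


def valid_bio(tags: List[str]) -> bool:
    """Toy constraint: an 'I' tag may not follow 'O' or start the sequence."""
    prev = "O"
    for tag in tags:
        if tag == "I" and prev == "O":
            return False
        prev = tag
    return True


labels = ["O", "B", "I"]
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.3, 0.5],
    [0.7, 0.2, 0.1],
]
result = constrained_decode(probs, labels, valid_bio)
```

In this toy case the unconstrained argmax is O, I, O, which violates the constraint; the search then returns O, B, O, the highest-probability sequence that satisfies it. The `max_candidates` cap is one simple way to trade decoding time against the chance of finding a valid sequence.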

membership infer

federate

Title: Who Leaked the Model? Tracking IP Infringers in Accountable Federated Learning. (arXiv:2312.03205v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.03205
Code URL: null
Copy Paste: [[2312.03205]] Who Leaked the Model? Tracking IP Infringers in Accountable Federated Learning(http://arxiv.org/abs/2312.03205)
Summary:
Federated learning (FL) emerges as an effective collaborative learning framework to coordinate data and computation resources from massive and distributed clients in training. Such collaboration results in non-trivial intellectual property (IP) represented by the model parameters that should be protected and shared by the whole party rather than an individual user. Meanwhile, the distributed nature of FL endorses a malicious client the convenience to compromise IP through illegal model leakage to unauthorized third parties. To block such IP leakage, it is essential to make the IP identifiable in the shared model and locate the anonymous infringer who first leaks it. The collective challenges call for accountable federated learning, which requires verifiable ownership of the model and is capable of revealing the infringer's identity upon leakage. In this paper, we propose Decodable Unique Watermarking (DUW) for complying with the requirements of accountable FL. Specifically, before a global model is sent to a client in an FL round, DUW encodes a client-unique key into the model by leveraging a backdoor-based watermark injection. To identify the infringer of a leaked model, DUW examines the model and checks if the triggers can be decoded as the corresponding keys. Extensive empirical results show that DUW is highly effective and robust, achieving over 99% watermark success rate for Digits, CIFAR-10, and CIFAR-100 datasets under heterogeneous FL settings, and identifying the IP infringer with 100% accuracy even after common watermark removal attempts.

Title: Fed-urlBERT: Client-side Lightweight Federated Transformers for URL Threat Analysis. (arXiv:2312.03636v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.03636
Code URL: null
Copy Paste: [[2312.03636]] Fed-urlBERT: Client-side Lightweight Federated Transformers for URL Threat Analysis(http://arxiv.org/abs/2312.03636)
Summary:
In evolving cyber landscapes, the detection of malicious URLs calls for cooperation and knowledge sharing across domains. However, collaboration is often hindered by concerns over privacy and business sensitivities. Federated learning addresses these issues by enabling multi-client collaboration without direct data exchange. Unfortunately, if highly expressive Transformer models are used, clients may face intolerable computational burdens, and the exchange of weights could quickly deplete network bandwidth. In this paper, we propose Fed-urlBERT, a federated URL pre-trained model designed to address both privacy concerns and the need for cross-domain collaboration in cybersecurity. Fed-urlBERT leverages split learning to divide the pre-training model into client and server parts, so that the client part consumes fewer computation resources and less bandwidth. Our approach achieves performance comparable to the centralized model under both independently and identically distributed (IID) and two non-IID data scenarios. Significantly, our federated model shows about a 7% decrease in the FPR compared to the centralized model. Additionally, we implement an adaptive local aggregation strategy that mitigates heterogeneity among clients, demonstrating promising performance improvements. Overall, our study validates the applicability of the proposed Transformer federated learning for URL threat analysis, establishing a foundation for real-world collaborative cybersecurity efforts. The source code is accessible at https://github.com/Davidup1/FedURLBERT.

Title: The Landscape of Modern Machine Learning: A Review of Machine, Distributed and Federated Learning. (arXiv:2312.03120v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03120
Code URL: null
Copy Paste: [[2312.03120]] The Landscape of Modern Machine Learning: A Review of Machine, Distributed and Federated Learning(http://arxiv.org/abs/2312.03120)
Summary:
With the advance of the powerful heterogeneous, parallel and distributed computing systems and ever increasing immense amount of data, machine learning has become an indispensable part of cutting-edge technology, scientific research and consumer products. In this study, we present a review of modern machine and deep learning. We provide a high-level overview for the latest advanced machine learning algorithms, applications, and frameworks. Our discussion encompasses parallel distributed learning, deep learning as well as federated learning. As a result, our work serves as an introductory text to the vast field of modern machine learning.

fair

Title: Rethinking Object Saliency Ranking: A Novel Whole-flow Processing Paradigm. (arXiv:2312.03226v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03226
Code URL: https://github.com/mengkesong/saliency-ranking-paradigm
Copy Paste: [[2312.03226]] Rethinking Object Saliency Ranking: A Novel Whole-flow Processing Paradigm(http://arxiv.org/abs/2312.03226)
Summary:
Existing salient object detection methods are capable of predicting binary maps that highlight visually salient regions. However, these methods are limited in their ability to differentiate the relative importance of multiple objects and the relationships among them, which can lead to errors and reduced accuracy in downstream tasks that depend on the relative importance of multiple objects. To address this, this paper proposes a new paradigm for saliency ranking, which aims to completely focus on ranking salient objects by their "importance order". While previous works have shown promising performance, they still face ill-posed problems. First, the methods for generating saliency ranking ground truth (GT) orders are unreasonable since determining the correct ranking order is not well-defined, resulting in false alarms. Second, training a ranking model remains challenging because most saliency ranking methods follow the multi-task paradigm, leading to conflicts and trade-offs among different tasks. Third, existing regression-based saliency ranking methods are complex for saliency ranking models due to their reliance on instance mask-based saliency ranking orders. These methods require a significant amount of data to perform accurately and can be challenging to implement effectively. To solve these problems, this paper conducts an in-depth analysis of the causes and proposes a whole-flow processing paradigm of saliency ranking task from the perspective of "GT data generation", "network structure design" and "training protocol". The proposed approach outperforms existing state-of-the-art methods on the widely-used SALICON set, as demonstrated by extensive experiments with fair and reasonable comparisons. The saliency ranking task is still in its infancy, and our proposed unified framework can serve as a fundamental strategy to guide future work.

Title: Seller-side Outcome Fairness in Online Marketplaces. (arXiv:2312.03253v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03253
Code URL: null
Copy Paste: [[2312.03253]] Seller-side Outcome Fairness in Online Marketplaces(http://arxiv.org/abs/2312.03253)
Summary:
This paper aims to investigate and achieve seller-side fairness within online marketplaces, where many sellers and their items are not sufficiently exposed to customers in an e-commerce platform. This phenomenon raises concerns regarding the potential loss of revenue associated with less exposed items as well as less marketplace diversity. We introduce the notion of seller-side outcome fairness and build an optimization model to balance collected recommendation rewards and the fairness metric. We then propose a gradient-based data-driven algorithm based on the duality and bandit theory. Our numerical experiments on real e-commerce data sets show that our algorithm can lift seller fairness measures while not hurting metrics like collected Gross Merchandise Value (GMV) and total purchases.

interpretability

Title: FlexModel: A Framework for Interpretability of Distributed Large Language Models. (arXiv:2312.03140v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03140
Code URL: https://github.com/vectorinstitute/flex_model
Copy Paste: [[2312.03140]] FlexModel: A Framework for Interpretability of Distributed Large Language Models(http://arxiv.org/abs/2312.03140)
Summary:
With the growth of large language models, now incorporating billions of parameters, the hardware prerequisites for their training and deployment have seen a corresponding increase. Although existing tools facilitate model parallelization and distributed training, deeper model interactions, crucial for interpretability and responsible AI techniques, still demand thorough knowledge of distributed computing. This often hinders contributions from researchers with machine learning expertise but limited distributed computing background. Addressing this challenge, we present FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations. The library is compatible with existing model distribution libraries and encapsulates PyTorch models. It exposes user-registerable HookFunctions to facilitate straightforward interaction with distributed model internals, bridging the gap between distributed and single-device model paradigms. Primarily, FlexModel enhances accessibility by democratizing model interactions and promotes more inclusive research in the domain of large-scale neural networks. The package is found at https://github.com/VectorInstitute/flex_model.

Title: Interpretability Illusions in the Generalization of Simplified Models. (arXiv:2312.03656v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03656
Code URL: null
Copy Paste: [[2312.03656]] Interpretability Illusions in the Generalization of Simplified Models(http://arxiv.org/abs/2312.03656)
Summary:
A common method to study deep learning systems is to use simplified model representations -- for example, using singular value decomposition to visualize the model's hidden states in a lower dimensional space. This approach assumes that the results of these simplified representations are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations can accurately approximate the full model on the training set, they may fail to accurately capture the model's behavior out of distribution -- the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits. First, we train models on the Dyck balanced-parenthesis languages. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how these simplified proxies match the behavior of the original model on various out-of-distribution test sets. We find that the simplified proxies are generally less faithful out of distribution. In cases where the original model generalizes to novel structures or deeper depths, the simplified versions may fail, or generalize better. This finding holds even if the simplified representations do not directly depend on the training distribution. Next, we study a more naturalistic task: predicting the next character in a dataset of computer code. We find similar generalization gaps between the original model and simplified proxies, and conduct further analysis to investigate which aspects of the code completion task are associated with the largest gaps. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.

Title: Generating Interpretable Networks using Hypernetworks. (arXiv:2312.03051v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03051
Code URL: null
Copy Paste: [[2312.03051]] Generating Interpretable Networks using Hypernetworks(http://arxiv.org/abs/2312.03051)
Summary:
An essential goal in mechanistic interpretability is to decode a network, i.e., to convert a neural network's raw weights to an interpretable algorithm. Given the difficulty of the decoding problem, progress has been made to understand the easier encoding problem, i.e., to convert an interpretable algorithm into network weights. Previous works focus on encoding existing algorithms into networks, which are interpretable by definition. However, focusing on encoding limits the possibility of discovering new algorithms that humans have never stumbled upon, but that are nevertheless interpretable. In this work, we explore the possibility of using hypernetworks to generate interpretable networks whose underlying algorithms are not yet known. The hypernetwork is carefully designed such that it can control network complexity, leading to a diverse family of interpretable algorithms ranked by their complexity. All of them are interpretable in hindsight, although some of them are less intuitive to humans, hence providing new insights regarding how to "think" like a neural network. For the task of computing L1 norms, hypernetworks find three algorithms: (a) the double-sided algorithm, (b) the convexity algorithm, (c) the pudding algorithm, although only the first algorithm was expected by the authors before experiments. We automatically classify these algorithms and analyze how these algorithmic phases develop during training, as well as how they are affected by complexity control. Furthermore, we show that a trained hypernetwork can correctly construct models for input dimensions not seen in training, demonstrating systematic generalization.
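
As a small illustration of what "an interpretable algorithm for computing L1 norms" can look like in a ReLU network, the sketch below uses the standard identity |x| = ReLU(x) + ReLU(-x), with two hidden units per input coordinate. Whether this corresponds exactly to the "double-sided algorithm" named in the summary is an assumption; the summary does not define the three algorithms, and the function names here are hypothetical.

```python
from typing import Sequence


def relu(v: float) -> float:
    return max(0.0, v)


def l1_norm_relu(x: Sequence[float]) -> float:
    """L1 norm written as a one-hidden-layer ReLU computation.

    Each coordinate feeds two hidden units with weights +1 and -1;
    the output layer sums all hidden activations with weight 1.
    """
    return sum(relu(v) + relu(-v) for v in x)


example = [1.5, -2.0, 0.0]
norm_via_relu = l1_norm_relu(example)
norm_reference = sum(abs(v) for v in example)
```

The two values agree on the example, which is the sense in which the network weights encode a human-readable algorithm.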

Title: Incidental Polysemanticity. (arXiv:2312.03096v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03096
Code URL: null
Copy Paste: [[2312.03096]] Incidental Polysemanticity(http://arxiv.org/abs/2312.03096)
Summary:
Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle towards interpretability of task-optimized deep networks, with implications for AI safety. The classic origin story of polysemanticity is that the data contains more "features" than neurons, such that learning to perform a task forces the network to co-allocate multiple unrelated features to the same neuron, endangering our ability to understand the network's internal processing. In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, using a combination of theory and experiments. This second type of polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Due to its origin, we term this incidental polysemanticity.

explainability

Title: Gravitational cell detection and tracking in fluorescence microscopy data. (arXiv:2312.03509v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03509
Code URL: null
Copy Paste: [[2312.03509]] Gravitational cell detection and tracking in fluorescence microscopy data(http://arxiv.org/abs/2312.03509)
Summary:
Automatic detection and tracking of cells in microscopy images are major applications of computer vision technologies in both biomedical research and clinical practice. Though machine learning methods are increasingly common in these fields, classical algorithms still offer significant advantages for both tasks, including better explainability, faster computation, lower hardware requirements and more consistent performance. In this paper, we present a novel approach based on gravitational force fields that can compete with, and potentially outperform, modern machine learning models when applied to fluorescence microscopy images. This method includes detection, segmentation, and tracking elements, with the results demonstrated on a Cell Tracking Challenge dataset.

watermark

diffusion

Title: Diff-GO: Diffusion Goal-Oriented Communications to Achieve Ultra-High Spectrum Efficiency. (arXiv:2312.02984v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02984
Code URL: null
Copy Paste: [[2312.02984]] Diff-GO: Diffusion Goal-Oriented Communications to Achieve Ultra-High Spectrum Efficiency(http://arxiv.org/abs/2312.02984)
Summary:
The latest advances in artificial intelligence (AI) present many unprecedented opportunities to achieve much improved bandwidth saving in communications. Unlike conventional communication systems focusing on packet transport, rich datasets and AI make it possible to efficiently transfer only the information most critical to the goals of message recipients. One of the most exciting advances in generative AI, known as the diffusion model, presents a unique opportunity for designing ultra-fast communication systems well beyond language-based messages. This work presents an ultra-efficient communication design by utilizing generative AI based on diffusion models as a specific example of the general goal-oriented communication framework. To better control the regenerated message at the receiver output, our diffusion system design includes a local regeneration module with finite dimensional noise latent. The critical significance of noise latent control and sharing residing on our Diff-GO is the ability to introduce the concept of "local generative feedback" (Local-GF), which enables the transmitter to monitor and gauge the quality or accuracy of the message recovery at the semantic system receiver. To this end, we propose a new low-dimensional noise space for the training of diffusion models, which significantly reduces the communication overhead and achieves satisfactory message recovery performance. Our experimental results demonstrate that the proposed noise space and the diffusion-based generative model achieve ultra-high spectrum efficiency and accurate recovery of transmitted image signals. By trading off computation for bandwidth efficiency (C4BE), this new framework provides an important avenue to achieve exceptional computation-bandwidth tradeoff.

Title: DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance. (arXiv:2312.03018v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03018
Code URL: null
Copy Paste: [[2312.03018]] DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance(http://arxiv.org/abs/2312.03018)
Summary:
Image-to-video generation, which aims to generate a video starting from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models to image-guided video generation models. Nevertheless, these methods often result in either low fidelity or flickering over time due to their limitation to shallow image guidance and poor temporal consistency. To tackle these problems, we propose a high-fidelity image-to-video generation method by devising a frame retention branch on the basis of a pre-trained video diffusion model, named DreamVideo. Instead of integrating the reference image into the diffusion process at a semantic level, our DreamVideo perceives the reference image via convolution layers and concatenates the features with the noisy latents as model input. By this means, the details of the reference image can be preserved to the greatest extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed to videos of different actions by providing varying prompt texts. This has significant implications for controllable video generation and holds broad application prospects. We conduct comprehensive experiments on the public dataset; both quantitative and qualitative results indicate that our method outperforms the state-of-the-art method. Especially for fidelity, our model has powerful image retention ability and results in high FVD in UCF101 compared to other image-to-video models. Also, precise control can be achieved by giving different text prompts. Further details and comprehensive results of our model will be presented in https://anonymous0769.github.io/DreamVideo/.

Title: Stable Diffusion Exposed: Gender Bias from Prompt to Image. (arXiv:2312.03027v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03027
Code URL: null
Copy Paste: [[2312.03027]] Stable Diffusion Exposed: Gender Bias from Prompt to Image(http://arxiv.org/abs/2312.03027)
Summary:
Recent studies have highlighted biases in generative models, shedding light on their predisposition towards gender-based stereotypes and imbalances. This paper contributes to this growing body of research by introducing an evaluation protocol designed to automatically analyze the impact of gender indicators on Stable Diffusion images. Leveraging insights from prior work, we explore how gender indicators not only affect gender presentation but also the representation of objects and layouts within the generated images. Our findings include the existence of differences in the depiction of objects, such as instruments tailored for specific genders, and shifts in overall layouts. We also reveal that neutral prompts tend to produce images more aligned with masculine prompts than their feminine counterparts, providing valuable insights into the nuanced gender biases inherent in Stable Diffusion.

Title: Customization Assistant for Text-to-image Generation. (arXiv:2312.03045v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03045
Code URL: null
Copy Paste: [[2312.03045]] Customization Assistant for Text-to-image Generation(http://arxiv.org/abs/2312.03045)
Summary:
Customizing pre-trained text-to-image generation models has attracted massive research interest recently, due to its huge potential in real-world applications. Although existing methods are able to generate creative content for a novel concept contained in a single user-input image, their capabilities are still far from perfect. Specifically, most existing methods require fine-tuning the generative model on testing images. Some existing methods do not require fine-tuning, but their performance is unsatisfactory. Furthermore, the interaction between users and models is still limited to directive and descriptive prompts such as instructions and captions. In this work, we build a customization assistant based on a pre-trained large language model and diffusion model, which can not only perform customized generation in a tuning-free manner, but also enable more user-friendly interactions: users can chat with the assistant and input either ambiguous text or clear instructions. Specifically, we propose a new framework consisting of a new model design and a novel training strategy. The resulting assistant can perform customized generation in 2-5 seconds without any test-time fine-tuning. Extensive experiments are conducted, and competitive results have been obtained across different domains, illustrating the effectiveness of the proposed method.

Title: MagicStick: Controllable Video Editing via Control Handle Transformations. (arXiv:2312.03047v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03047
Code URL: https://github.com/mayuelala/magicstick
Copy Paste: [[2312.03047]] MagicStick: Controllable Video Editing via Control Handle Transformations(http://arxiv.org/abs/2312.03047)
Summary:
Text-based video editing has recently attracted considerable interest in changing the style or replacing the objects with a similar structure. Beyond this, we demonstrate that properties such as shape, size, location, motion, etc., can also be edited in videos. Our key insight is that the keyframe transformations of the specific internal feature (e.g., edge maps of objects or human pose) can easily propagate to other frames to provide generation guidance. We thus propose MagicStick, a controllable video editing method that edits the video properties by utilizing the transformation on the extracted internal control signals. In detail, to keep the appearance, we inflate both the pretrained image diffusion model and ControlNet to the temporal dimension and train low-rank adaptation (LoRA) layers to fit the specific scenes. Then, in editing, we perform an inversion and editing framework. Differently, finetuned ControlNet is introduced in both inversion and generation for attention guidance with the proposed attention remix between the spatial attention maps of inversion and editing. Although succinct, our method is the first method to show the ability of video property editing from the pre-trained text-to-image model. We present experiments on numerous examples within our unified framework. We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared with previous works. The code and models will be made publicly available.

Title: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control. (arXiv:2312.03048v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03048
Code URL: null
Copy Paste: [[2312.03048]] DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control(http://arxiv.org/abs/2312.03048)
Summary:
Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps. However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? We investigate this question in the context of autonomous driving, and answer it with a resounding "yes". We propose an efficient data generation pipeline termed DGInStyle. First, we examine the problem of specializing a pretrained LDM to semantically-controlled generation within a narrow domain. Second, we design a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects. Third, we propose a Style Swap technique to endow the rich generative prior with the learned semantic control. Using DGInStyle, we generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets. Our approach consistently increases the performance of several domain generalization methods, in some cases by +2.5 mIoU compared to the previous state-of-the-art method without our generative augmentation scheme. Source code and dataset are available at https://dginstyle.github.io .

Title: LooseControl: Lifting ControlNet for Generalized Depth Conditioning. (arXiv:2312.03079v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03079
Code URL: null
Copy Paste: [[2312.03079]] LooseControl: Lifting ControlNet for Generalized Depth Conditioning(http://arxiv.org/abs/2312.03079)
Summary:
We present LooseControl to allow generalized depth conditioning for diffusion-based image generation. ControlNet, the SOTA for depth-conditioned image generation, produces remarkable results but relies on having access to detailed depth maps for guidance. Creating such exact depth maps, in many scenarios, is challenging. This paper introduces a generalized version of depth conditioning that enables many new content-creation workflows. Specifically, we allow (C1) scene boundary control for loosely specifying scenes with only boundary conditions, and (C2) 3D box control for specifying layout locations of the target objects rather than the exact shape and appearance of the objects. Using LooseControl, along with text guidance, users can create complex environments (e.g., rooms, street views, etc.) by specifying only scene boundaries and locations of primary objects. Further, we provide two editing mechanisms to refine the results: (E1) 3D box editing enables the user to refine images by changing, adding, or removing boxes while freezing the style of the image. This yields minimal changes apart from changes induced by the edited boxes. (E2) Attribute editing proposes possible editing directions to change one particular aspect of the scene, such as the overall object density or a particular object. Extensive tests and comparisons with baselines demonstrate the generality of our method. We believe that LooseControl can become an important design tool for easily creating complex environments and be extended to other forms of guidance channels. Code and more information are available at https://shariqfarooq123.github.io/loose-control/ .

Title: ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet. (arXiv:2312.03154v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03154
Code URL: null
Copy Paste: [[2312.03154]] ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet(http://arxiv.org/abs/2312.03154)
Summary:
This paper introduces ViscoNet, a novel method that enhances text-to-image human generation models with visual prompting. Unlike existing methods that rely on lengthy text descriptions to control the image structure, ViscoNet allows users to specify the visual appearance of the target object with a reference image. ViscoNet disentangles the object's appearance from the image background and injects it into a pre-trained latent diffusion model (LDM) via a ControlNet branch. This way, ViscoNet mitigates the style mode collapse problem and enables precise and flexible visual control. We demonstrate the effectiveness of ViscoNet on human image generation, where it can manipulate visual attributes and artistic styles with text and image prompts. We also show that ViscoNet can learn visual conditioning from small and specific object domains while preserving the generative power of the LDM backbone.

Title: Cache Me if You Can: Accelerating Diffusion Models through Block Caching. (arXiv:2312.03209v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03209
Code URL: null
Copy Paste: [[2312.03209]] Cache Me if You Can: Accelerating Diffusion Models through Block Caching(http://arxiv.org/abs/2312.03209)
Summary:
Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce block caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation and qualitative analysis that Block Caching allows generating images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
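
The idea of reusing block outputs across denoising steps, with a schedule derived from how much each block changes, can be sketched roughly as below. This is a toy illustration under stated assumptions and not the paper's implementation: `CachedBlock`, `make_schedule`, the accumulated-change rule, and the `threshold` value are all hypothetical.

```python
class CachedBlock:
    """Wrap a block function and recompute it only at scheduled steps."""

    def __init__(self, fn, schedule):
        self.fn = fn
        self.schedule = set(schedule)  # steps at which the block is recomputed
        self.cache = None
        self.calls = 0                 # number of real evaluations, for inspection

    def __call__(self, x, step):
        if self.cache is None or step in self.schedule:
            self.cache = self.fn(x)
            self.calls += 1
        return self.cache              # otherwise reuse the last computed output


def make_schedule(changes, threshold):
    """Recompute whenever the change accumulated since the last recompute exceeds threshold.

    changes[t] is an assumed, pre-measured relative change of the block's
    output between step t-1 and step t (changes[0] is unused).
    """
    schedule = [0]
    accumulated = 0.0
    for t in range(1, len(changes)):
        accumulated += changes[t]
        if accumulated > threshold:
            schedule.append(t)
            accumulated = 0.0
    return schedule


changes = [0.0, 0.02, 0.03, 0.2, 0.01, 0.01]
schedule = make_schedule(changes, threshold=0.1)
block = CachedBlock(lambda x: x * 2, schedule)
outputs = [block(float(t), t) for t in range(6)]
```

Here the stand-in block is evaluated only at the scheduled steps and its cached output is returned at every other step, which is where the saving in compute would come from.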

Title: DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction. (arXiv:2312.03298v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03298
Code URL: null
Copy Paste: [[2312.03298]] DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction(http://arxiv.org/abs/2312.03298)
Summary:
Point cloud streaming is becoming increasingly popular, evolving into the norm for interactive service delivery and the future Metaverse. However, the substantial volume of data associated with point clouds presents numerous challenges, particularly in terms of high bandwidth consumption and large storage capacity. Despite various solutions proposed thus far, with a focus on point cloud compression, upsampling, and completion, these reconstruction-related methods continue to fall short in delivering high fidelity point cloud output. As a solution, in DiffPMAE, we propose an effective point cloud reconstruction architecture. Inspired by self-supervised learning concepts, we combine Masked Auto-Encoding and Diffusion Model mechanism to remotely reconstruct point cloud data. By the nature of this reconstruction process, DiffPMAE can be extended to many related downstream tasks including point cloud compression, upsampling and completion. Leveraging ShapeNet-55 and ModelNet datasets with over 60000 objects, we validate the performance of DiffPMAE exceeding many state-of-the-art methods in terms of auto-encoding and downstream tasks considered.

Title: F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis. (arXiv:2312.03459v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03459
Code URL: null
Copy Paste: [[2312.03459]] F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis(http://arxiv.org/abs/2312.03459)
Summary:
Recently Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inferring such large models incurs huge costs. Previous inference acceleration works either require costly retraining or are model-specific. To address this issue, instead of retraining we explore the inference process of two mainstream T2V models using transformers and diffusion models. The exploration reveals the redundancy in temporal attention modules of both models, which are commonly utilized to establish temporal relations among frames. Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights. Specifically, when aggregate temporal attention values are ranked below a certain ratio, corresponding weights will be pruned. Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning in inference acceleration, quality assurance and broad applicability.
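
The ranking rule stated above (prune what ranks below a certain ratio) can be sketched as follows. This is only an illustration of rank-based selection under stated assumptions: the function `select_pruned`, the per-module granularity, the module names, and the scores are hypothetical and not taken from the paper.

```python
def select_pruned(scores, ratio):
    """Return the names whose aggregate attention score ranks in the lowest `ratio` fraction."""
    n_prune = int(len(scores) * ratio)
    ranked = sorted(scores, key=scores.get)  # ascending by aggregate score
    return set(ranked[:n_prune])

# Hypothetical aggregate temporal-attention values per module.
scores = {"temporal_0": 0.9, "temporal_1": 0.1, "temporal_2": 0.4, "temporal_3": 0.05}
pruned = select_pruned(scores, ratio=0.5)
```

Because the selection depends only on values observed at inference time, a rule of this shape needs no retraining, which is consistent with the training-free claim in the abstract.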

Title: Kandinsky 3.0 Technical Report. (arXiv:2312.03511v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03511
Code URL: https://github.com/ai-forever/movqgan
Copy Paste: [[2312.03511]] Kandinsky 3.0 Technical Report(http://arxiv.org/abs/2312.03511)
Summary:
We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. Compared to previous versions of Kandinsky 2.x, Kandinsky 3.0 leverages a two times larger U-Net backbone, a ten times larger text encoder and removes diffusion mapping. We describe the architecture of the model, the data collection procedure, the training technique, and the production system of user interaction. We focus on the key components that, as identified through a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. By our side-by-side comparisons, Kandinsky becomes better in text understanding and works better on specific domains. Project page: https://ai-forever.github.io/Kandinsky-3

Title: FRDiff: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models. (arXiv:2312.03517v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03517
Code URL: null
Copy Paste: [[2312.03517]] FRDiff: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models(http://arxiv.org/abs/2312.03517)
Summary:
The substantial computational costs of diffusion models, particularly due to the repeated denoising steps crucial for high-quality image generation, present a major obstacle to their widespread adoption. While several studies have attempted to address this issue by reducing the number of score function evaluations using advanced ODE solvers without fine-tuning, the decreased number of denoising iterations misses the opportunity to update fine details, resulting in noticeable quality degradation. In our work, we introduce an advanced acceleration technique that leverages the temporal redundancy inherent in diffusion models. Reusing feature maps with high temporal similarity opens up a new opportunity to save computation without sacrificing output quality. To realize the practical benefits of this intuition, we conduct an extensive analysis and propose a novel method, FRDiff. FRDiff is designed to harness the advantages of both reduced NFE and feature reuse, achieving a Pareto frontier that balances fidelity and latency trade-offs in various generative tasks.

Title: FoodFusion: A Latent Diffusion Model for Realistic Food Image Generation. (arXiv:2312.03540v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03540
Code URL: null
Copy Paste: [[2312.03540]] FoodFusion: A Latent Diffusion Model for Realistic Food Image Generation(http://arxiv.org/abs/2312.03540)
Summary:
Current state-of-the-art image generation models such as Latent Diffusion Models (LDMs) have demonstrated the capacity to produce visually striking food-related images. However, these generated images often exhibit an artistic or surreal quality that diverges from the authenticity of real-world food representations. This inadequacy renders them impractical for applications requiring realistic food imagery, such as training models for image-based dietary assessment. To address these limitations, we introduce FoodFusion, a Latent Diffusion model engineered specifically for the faithful synthesis of realistic food images from textual descriptions. The development of the FoodFusion model involves harnessing an extensive array of open-source food datasets, resulting in over 300,000 curated image-caption pairs. Additionally, we propose and employ two distinct data cleaning methodologies to ensure that the resulting image-text pairs maintain both realism and accuracy. The FoodFusion model, thus trained, demonstrates a remarkable ability to generate food images that exhibit a significant improvement in terms of both realism and diversity over the publicly available image generation models. We openly share the dataset and fine-tuned models to support advancements in this critical field of food image synthesis at https://bit.ly/genai4good.

Title: Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention. (arXiv:2312.03556v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03556
Code URL: null
Copy Paste: [[2312.03556]] Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention(http://arxiv.org/abs/2312.03556)
Summary:
Face inpainting is important in various applications, such as photo restoration, image editing, and virtual reality. Despite the significant advances in face generative models, ensuring that a person's unique facial identity is maintained during the inpainting process is still an elusive goal. Current state-of-the-art techniques, exemplified by MyStyle, necessitate resource-intensive fine-tuning and a substantial number of images for each new identity. Furthermore, existing methods often fall short in accommodating user-specified semantic attributes, such as beard or expression. To improve inpainting results, and reduce the computational complexity during inference, this paper proposes the use of Parallel Visual Attention (PVA) in conjunction with diffusion models. Specifically, we insert parallel attention matrices into each cross-attention module in the denoising network, which attend to features extracted from reference images by an identity encoder. We train the added attention modules and identity encoder on CelebAHQ-IDI, a dataset proposed for identity-preserving face inpainting. Experiments demonstrate that PVA attains unparalleled identity resemblance in both face inpainting and face inpainting with language guidance tasks, in comparison to various benchmarks, including MyStyle, Paint by Example, and Custom Diffusion. Our findings reveal that PVA ensures good identity preservation while offering effective language-controllability. Additionally, in contrast to Custom Diffusion, PVA requires just 40 fine-tuning steps for each new identity, which translates to a significant speed increase of over 20 times.

Title: Context Diffusion: In-Context Aware Image Generation. (arXiv:2312.03584v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03584
Code URL: null
Copy Paste: [[2312.03584]] Context Diffusion: In-Context Aware Image Generation(http://arxiv.org/abs/2312.03584)
Summary:
We propose Context Diffusion, a diffusion-based framework that enables image generation models to learn from visual examples presented in context. Recent work tackles such in-context learning for image generation, where a query image is provided alongside context examples and text prompts. However, the quality and fidelity of the generated images deteriorate when the prompt is not present, demonstrating that these models are unable to truly learn from the visual context. To address this, we propose a novel framework that separates the encoding of the visual context from the preservation of the structure of the query images. This results in the ability to learn not only from the visual context and text prompts together, but also from either one of them. Furthermore, we enable our model to handle few-shot settings, to effectively address diverse in-context learning scenarios. Our experiments and user study demonstrate that Context Diffusion excels in both in-domain and out-of-domain tasks, resulting in an overall enhancement in image quality and fidelity compared to counterpart models.

Title: DiffusionSat: A Generative Foundation Model for Satellite Imagery. (arXiv:2312.03606v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03606
Code URL: null
Copy Paste: [[2312.03606]] DiffusionSat: A Generative Foundation Model for Satellite Imagery(http://arxiv.org/abs/2312.03606)
Summary:
Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets. As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, superresolution given multi-spectral inputs and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale generative foundation model for satellite imagery.

Title: DreamComposer: Controllable 3D Object Generation via Multi-View Conditions. (arXiv:2312.03611v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03611
Code URL: null
Copy Paste: [[2312.03611]] DreamComposer: Controllable 3D Object Generation via Multi-View Conditions(http://arxiv.org/abs/2312.03611)
Summary:
Utilizing pre-trained 2D large-scale generative models, recent works are capable of generating high-quality novel views from a single in-the-wild image. However, due to the lack of information from multiple views, these works encounter difficulties in generating controllable novel views. In this paper, we present DreamComposer, a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then, it renders the latent features of the target view from 3D representations with the multi-view feature fusion module. Finally, the target view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis, further enhancing them to generate high-fidelity novel view images with multi-view conditions, ready for controllable 3D object reconstruction and various other applications.

Title: TokenCompose: Grounding Diffusion with Token-level Supervision. (arXiv:2312.03626v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03626
Code URL: null
Copy Paste: [[2312.03626]] TokenCompose: Grounding Diffusion with Token-level Supervision(http://arxiv.org/abs/2312.03626)
Summary:
We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only, absent explicit constraint for the consistency between the text prompts and the image contents, leading to unsatisfactory results for composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images.

Title: WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on. (arXiv:2312.03667v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03667
Code URL: null
Copy Paste: [[2312.03667]] WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on(http://arxiv.org/abs/2312.03667)
Summary:
Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image onto a target person. While existing methods focus on warping the garment to fit the body pose, they often overlook the synthesis quality around the garment-skin boundary and realistic effects like wrinkles and shadows on the warped garments. These limitations greatly reduce the realism of the generated results and hinder the practical application of VITON techniques. Leveraging the notable success of diffusion-based models in cross-modal image synthesis, some recent diffusion-based methods have ventured to tackle this issue. However, they tend to either consume a significant amount of training resources or struggle to achieve realistic try-on effects and retain garment details. For efficient and high-fidelity VITON, we propose WarpDiffusion, which bridges the warping-based and diffusion-based paradigms via a novel informative and local garment feature attention mechanism. Specifically, WarpDiffusion incorporates local texture attention to reduce resource consumption and uses a novel auto-mask module that effectively retains only the critical areas of the warped garment while disregarding unrealistic or erroneous portions. Notably, WarpDiffusion can be integrated as a plug-and-play component into existing VITON methodologies, elevating their synthesis quality. Extensive experiments on high-resolution VITON benchmarks and an in-the-wild test set demonstrate the superiority of WarpDiffusion, surpassing state-of-the-art methods both qualitatively and quantitatively.

Title: Self-conditioned Image Generation via Generating Representations. (arXiv:2312.03701v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03701
Code URL: https://github.com/LTH14/rcg
Copy Paste: [[2312.03701]] Self-conditioned Image Generation via Generating Representations(http://arxiv.org/abs/2312.03701)
Summary:
This paper presents Representation-Conditioned image Generation (RCG), a simple yet effective image generation framework which sets a new benchmark in class-unconditional image generation. RCG does not condition on any human annotations. Instead, it conditions on a self-supervised representation distribution which is mapped from the image distribution using a pre-trained encoder. During generation, RCG samples from such representation distribution using a representation diffusion model (RDM), and employs a pixel generator to craft image pixels conditioned on the sampled representation. Such a design provides substantial guidance during the generative process, resulting in high-quality image generation. Tested on ImageNet 256$\times$256, RCG achieves a Fréchet Inception Distance (FID) of 3.31 and an Inception Score (IS) of 253.4. These results not only significantly improve the state-of-the-art of class-unconditional image generation but also rival the current leading methods in class-conditional image generation, bridging the long-standing performance gap between these two tasks. Code is available at https://github.com/LTH14/rcg.

Title: Generalized Contrastive Divergence: Joint Training of Energy-Based Model and Diffusion Model through Inverse Reinforcement Learning. (arXiv:2312.03397v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03397
Code URL: null
Copy Paste: [[2312.03397]] Generalized Contrastive Divergence: Joint Training of Energy-Based Model and Diffusion Model through Inverse Reinforcement Learning(http://arxiv.org/abs/2312.03397)
Summary:
We present Generalized Contrastive Divergence (GCD), a novel objective function for training an energy-based model (EBM) and a sampler simultaneously. GCD generalizes Contrastive Divergence (Hinton, 2002), a celebrated algorithm for training EBM, by replacing Markov Chain Monte Carlo (MCMC) distribution with a trainable sampler, such as a diffusion model. In GCD, the joint training of EBM and a diffusion model is formulated as a minimax problem, which reaches an equilibrium when both models converge to the data distribution. The minimax learning with GCD bears interesting equivalence to inverse reinforcement learning, where the energy corresponds to a negative reward, the diffusion model is a policy, and the real data is expert demonstrations. We present preliminary yet promising results showing that joint training is beneficial for both EBM and a diffusion model. GCD enables EBM training without MCMC while improving the sample quality of a diffusion model.

Title: Molecule Joint Auto-Encoding: Trajectory Pretraining with 2D and 3D Diffusion. (arXiv:2312.03475v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03475
Code URL: null
Copy Paste: [[2312.03475]] Molecule Joint Auto-Encoding: Trajectory Pretraining with 2D and 3D Diffusion(http://arxiv.org/abs/2312.03475)
Summary:
Recently, artificial intelligence for drug discovery has raised increasing interest in both machine learning and chemistry domains. The fundamental building block for drug discovery is molecule geometry and thus, the molecule's geometrical representation is the main bottleneck to better utilize machine learning techniques for drug discovery. In this work, we propose a pretraining method for molecule joint auto-encoding (MoleculeJAE). MoleculeJAE can learn both the 2D bond (topology) and 3D conformation (geometry) information, and a diffusion process model is applied to mimic the augmented trajectories of such two modalities, based on which, MoleculeJAE will learn the inherent chemical structure in a self-supervised manner. Thus, the pretrained geometrical representation in MoleculeJAE is expected to benefit downstream geometry-related tasks. Empirically, MoleculeJAE proves its effectiveness by reaching state-of-the-art performance on 15 out of 20 tasks by comparing it with 12 competitive baselines.

noise learning

data-free

transformer

Title: Uni3DL: Unified Model for 3D and Language Understanding. (arXiv:2312.03026v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03026
Code URL: null
Copy Paste: [[2312.03026]] Uni3DL: Unified Model for 3D and Language Understanding(http://arxiv.org/abs/2312.03026)
Summary:
In this work, we present Uni3DL, a unified model for 3D and Language understanding. Distinct from existing unified vision-language models in 3D which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tasks in 3D. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively generate task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D and language understanding. Project page: https://uni3dl.github.io.

Title: STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition. (arXiv:2312.03288v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03288
Code URL: https://github.com/maclong01/STEP-CATFormer
Copy Paste: [[2312.03288]] STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition(http://arxiv.org/abs/2312.03288)
Summary:
Graph convolutional networks (GCNs) have been widely used and have achieved remarkable results in skeleton-based action recognition. We think the key to skeleton-based action recognition is a skeleton changing across frames, so we focus on how graph convolutional networks learn different topologies and effectively aggregate joint features at the global and local temporal scales. In this work, we propose three Channel-wise Topology Graph Convolutions based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN). Combining CTR-GCN with two joint cross-attention modules can capture skeleton features of the upper-lower body part and hand-foot relationships. After that, to capture features of human skeletons changing across frames, we design Temporal Attention Transformers, which learn the temporal features of human skeleton sequences. Finally, we fuse the temporal feature outputs with an MLP for classification. We develop a powerful graph convolutional network named Spatial Temporal Effective Body-part Cross Attention Transformer, which achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets. Our code and models are available at https://github.com/maclong01/STEP-CATFormer

Title: When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology. (arXiv:2312.03558v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03558
Code URL: null
Copy Paste: [[2312.03558]] When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology(http://arxiv.org/abs/2312.03558)
Summary:
This technical report presents LongViT, a vision Transformer that can process gigapixel images in an end-to-end manner. Specifically, we split the gigapixel image into a sequence of millions of patches and project them linearly into embeddings. LongNet is then employed to model the extremely long sequence, generating representations that capture both short-range and long-range dependencies. The linear computation complexity of LongNet, along with its distributed algorithm, enables us to overcome the constraints of both computation and memory. We apply LongViT in the field of computational pathology, aiming for cancer diagnosis and prognosis within gigapixel whole-slide images. Experimental results demonstrate that LongViT effectively encodes gigapixel images and outperforms previous state-of-the-art methods on cancer subtyping and survival prediction. Code and models will be available at https://aka.ms/LongViT.
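
To make the sequence lengths in this summary concrete, the following minimal sketch counts the patches produced by splitting an image into non-overlapping squares. The image side (32,768 pixels) and patch size (32 pixels) are illustrative assumptions chosen to match the 1,024 x 1,024 grid in the title; the abstract does not state them, and the helper names are hypothetical.

```python
def patch_grid(height, width, patch):
    """Return (rows, cols, sequence_length) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


def full_attention_pairs(seq_len):
    """Number of query-key pairs that dense self-attention would score."""
    return seq_len * seq_len


# Assumed sizes: a 32,768 x 32,768 image cut into 32 x 32 patches.
rows, cols, n = patch_grid(32_768, 32_768, 32)
print(rows, cols, n)            # 1024 1024 1048576
print(full_attention_pairs(n))  # 1099511627776
```

Under these assumed sizes the sequence already holds about a million tokens, and dense attention would score roughly 10^12 pairs, which is why the summary stresses a sequence model with linear computation complexity.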

Title: DocBinFormer: A Two-Level Transformer Network for Effective Document Image Binarization. (arXiv:2312.03568v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03568
Code URL: null
Copy Paste: [[2312.03568]] DocBinFormer: A Two-Level Transformer Network for Effective Document Image Binarization(http://arxiv.org/abs/2312.03568)
Summary:
In real life, various degradation scenarios exist that might damage document images, making it harder to recognize and analyze them; thus, binarization is a fundamental and crucial step for achieving optimal performance in any document analysis task. We propose DocBinFormer (Document Binarization Transformer), a novel two-level vision transformer (TL-ViT) architecture based on vision transformers for effective document image binarization. The presented architecture employs a two-level transformer encoder to effectively capture both global and local feature representation from the input images. These complementary bi-level features are exploited for efficient document image binarization, resulting in improved results for system-generated as well as handwritten document images in a comprehensive approach. With the absence of convolutional layers, the transformer encoder uses the pixel patches and sub-patches along with their positional information to operate directly on them, while the decoder generates a clean (binarized) output image from the latent representation of the patches. Instead of using a simple vision transformer block to extract information from the image patches, the proposed architecture uses two transformer blocks for greater coverage of the extracted feature space on a global and local scale. The encoded feature representation is used by the decoder block to generate the corresponding binarized output. Extensive experiments on a variety of DIBCO and H-DIBCO benchmarks show that the proposed model outperforms state-of-the-art techniques on four metrics. The source code will be made available at https://github.com/RisabBiswas/DocBinFormer.

Title: KhabarChin: Automatic Detection of Important News in the Persian Language. (arXiv:2312.03361v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03361
Code URL: null
Copy Paste: [[2312.03361]] KhabarChin: Automatic Detection of Important News in the Persian Language(http://arxiv.org/abs/2312.03361)
Summary:
Being aware of important news is crucial for staying informed and making well-informed decisions efficiently. Natural Language Processing (NLP) approaches can significantly automate this process. This paper introduces the detection of important news, in a previously unexplored area, and presents a new benchmarking dataset (Khabarchin) for detecting important news in the Persian language. We define important news articles as those deemed significant for a considerable portion of society, capable of influencing their mindset or decision-making. The news articles are obtained from seven different prominent Persian news agencies, resulting in the annotation of 7,869 samples and the creation of the dataset. Two challenges of high disagreement and imbalance between classes were faced, and solutions were provided for them. We also propose several learning-based models, ranging from conventional machine learning to state-of-the-art transformer models, to tackle this task. Furthermore, we introduce the second task of important sentence detection in news articles, as they often come with a significant contextual length that makes it challenging for readers to identify important information. We identify these sentences in a weakly supervised manner.

Title: A Text-to-Text Model for Multilingual Offensive Language Identification. (arXiv:2312.03379v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03379
Code URL: null
Copy Paste: [[2312.03379]] A Text-to-Text Model for Multilingual Offensive Language Identification(http://arxiv.org/abs/2312.03379)
Summary:
The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g. hate speech, cyberbullying, and cyberaggression). However, the majority of these models are limited in their capabilities due to their encoder-only architecture, which restricts the number and types of labels in downstream tasks. Addressing these limitations, this study presents the first pre-trained model with encoder-decoder architecture for offensive language identification with text-to-text transformers (T5) trained on two large offensive language identification datasets: SOLID and CCTK. We investigate the effectiveness of combining two datasets and selecting an optimal threshold in semi-supervised instances in SOLID in the T5 retraining step. Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks. Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six different languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual model achieves a new state-of-the-art on all the above datasets, showing its usefulness in multilingual scenarios. Our proposed T5-based models will be made freely available to the community.

Title: Compressed Context Memory For Online Language Model Interaction. (arXiv:2312.03414v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03414
Code URL: https://github.com/snu-mllab/context-memory
Copy Paste: [[2312.03414]] Compressed Context Memory For Online Language Model Interaction(http://arxiv.org/abs/2312.03414)
Summary:
This paper presents a novel context compression method for Transformer language models in online scenarios such as ChatGPT, where the context continually expands. As the context lengthens, the attention process requires more memory and computational resources, which in turn reduces the throughput of the language model. To this end, we propose a compressed context memory system that continually compresses the growing context into a compact memory space. The compression process simply involves integrating a lightweight conditional LoRA into the language model's forward pass during inference. Based on the compressed context memory, the language model can perform inference with reduced memory and attention operations. Through evaluations on conversation, personalization, and multi-task learning, we demonstrate that our approach achieves the performance level of a full context model with $5\times$ smaller context memory space. Codes are available at https://github.com/snu-mllab/context-memory.

Title: Exploring Answer Information Methods for Question Generation with Transformers. (arXiv:2312.03483v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03483
Code URL: null
Copy Paste: [[2312.03483]] Exploring Answer Information Methods for Question Generation with Transformers(http://arxiv.org/abs/2312.03483)
Summary:
Much work in question generation has employed different methods of providing target answers as input. This experimentation has mostly been carried out for RNN-based models. We use three different methods and their combinations for incorporating answer information and explore their effect on several automatic evaluation metrics. The methods used are answer prompting, a custom product method using answer embeddings and encoder outputs, choosing sentences from the input paragraph that have answer-related information, and a separate cross-attention block in the decoder which attends to the answer. We observe that answer prompting without any additional modes obtains the best scores across ROUGE and METEOR. Additionally, we use a custom metric to calculate how many of the generated questions have the same answer as the one used to generate them.

Title: XAIQA: Explainer-Based Data Augmentation for Extractive Question Answering. (arXiv:2312.03567v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03567
Code URL: null
Copy Paste: [[2312.03567]] XAIQA: Explainer-Based Data Augmentation for Extractive Question Answering(http://arxiv.org/abs/2312.03567)
Summary:
Extractive question answering (QA) systems can enable physicians and researchers to query medical records, a foundational capability for designing clinical studies and understanding patient medical history. However, building these systems typically requires expert-annotated QA pairs. Large language models (LLMs), which can perform extractive QA, depend on high quality data in their prompts, specialized for the application domain. We introduce a novel approach, XAIQA, for generating synthetic QA pairs at scale from data naturally available in electronic health records. Our method uses the idea of a classification model explainer to generate questions and answers about medical concepts corresponding to medical codes. In an expert evaluation with two physicians, our method identifies $2.2\times$ more semantic matches and $3.8\times$ more clinical abbreviations than two popular approaches that use sentence transformers to create QA pairs. In an ML evaluation, adding our QA pairs improves performance of GPT-4 as an extractive QA model, including on difficult questions. In both the expert and ML evaluations, we examine trade-offs between our method and sentence transformers for QA pair generation depending on question difficulty.

Title: The mechanistic basis of data dependence and abrupt learning in an in-context classification task. (arXiv:2312.03002v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03002
Code URL: null
Copy Paste: [[2312.03002]] The mechanistic basis of data dependence and abrupt learning in an in-context classification task(http://arxiv.org/abs/2312.03002)
Summary:
Transformer models exhibit in-context learning: the ability to accurately predict the response to a novel query based on illustrative examples in the input sequence. In-context learning contrasts with traditional in-weights learning of query-output relationships. What aspects of the training data distribution and architecture favor in-context vs in-weights learning? Recent work has shown that specific distributional properties inherent in language, such as burstiness, large dictionaries and skewed rank-frequency distributions, control the trade-off or simultaneous appearance of these two forms of learning. We first show that these results are recapitulated in a minimal attention-only network trained on a simplified dataset. In-context learning (ICL) is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning. By identifying progress measures that precede in-context learning and targeted experiments, we construct a two-parameter model of an induction head which emulates the full data distributional dependencies displayed by the attention-based network. A phenomenological model of induction head formation traces its abrupt emergence to the sequential learning of three nested logits enabled by an intrinsic curriculum. We propose that the sharp transitions in attention-based networks arise due to a specific chain of multi-layer operations necessary to achieve ICL, which is implemented by nested nonlinearities sequentially learned during training.

Title: Sample-based Dynamic Hierarchical Transformer with Layer and Head Flexibility via Contextual Bandit. (arXiv:2312.03038v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03038
Code URL: null
Copy Paste: [[2312.03038]] Sample-based Dynamic Hierarchical Transformer with Layer and Head Flexibility via Contextual Bandit(http://arxiv.org/abs/2312.03038)
Summary:
Transformers require a fixed number of layers and heads, which makes them inflexible to the complexity of individual samples and expensive in training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) model whose layers and heads can be dynamically configured with single data samples via solving contextual bandit problems. To determine the number of layers and heads, we use the Uniform Confidence Bound while we deploy combinatorial Thompson Sampling in order to select specific head combinations given their number. Different from previous work that focuses on compressing trained networks for inference only, DHT is not only advantageous for adaptively optimizing the underlying network architecture during training but also has a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer without any additional auxiliary neural networks that implement the dynamic system. According to the experiment results, we achieve up to 74% computational savings for both training and inference with a minimal loss of accuracy.
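
As a rough illustration of a confidence-bound bandit choosing among a few discrete options (for example, candidate layer counts), here is a minimal sketch of the textbook UCB1 rule. It is not the authors' algorithm: the abstract names a "Uniform Confidence Bound" and a contextual setting, whereas this sketch is context-free, and the arm rewards and the function names `ucb1_select` and `run_bandit` are hypothetical.

```python
import math


def ucb1_select(counts, totals, c=2.0):
    """Pick the arm with the highest UCB1 score (mean reward + exploration bonus)."""
    # Play every arm once before trusting any estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    t = sum(counts)
    scores = [
        totals[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(reward_fn, n_arms, rounds):
    """Run UCB1 for a fixed number of rounds and return per-arm pull counts."""
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for _ in range(rounds):
        arm = ucb1_select(counts, totals)
        counts[arm] += 1
        totals[arm] += reward_fn(arm)
    return counts


# Hypothetical arms: three candidate layer counts with made-up fixed rewards
# (e.g. accuracy minus a compute penalty).
rewards = [0.2, 0.9, 0.5]
pulls = run_bandit(lambda arm: rewards[arm], n_arms=3, rounds=200)
print(pulls)
```

The exploration bonus shrinks as an arm is pulled more often, so most of the budget ends up on the best-scoring configuration while weaker ones are still sampled occasionally.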

Title: Transformer-Based Deep Learning Model for Bored Pile Load-Deformation Prediction in Bangkok Subsoil. (arXiv:2312.03041v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03041
Code URL: null
Copy Paste: [[2312.03041]] Transformer-Based Deep Learning Model for Bored Pile Load-Deformation Prediction in Bangkok Subsoil(http://arxiv.org/abs/2312.03041)
Summary:
This paper presents a novel deep learning model based on the transformer architecture to predict the load-deformation behavior of large bored piles in Bangkok subsoil. The model encodes the soil profile and pile features as tokenized input, and generates the load-deformation curve as output. The model also incorporates the previous sequential data of the load-deformation curve into the decoder to improve the prediction accuracy. The model shows a satisfactory accuracy and generalization ability for the load-deformation curve prediction, with a mean absolute error of 5.72% for the test data. The model could also be used for parametric analysis and design optimization of piles under different soil and pile conditions, pile cross section, pile length and type of pile.

Title: Transformer-Powered Surrogates Close the ICF Simulation-Experiment Gap with Extremely Limited Data. (arXiv:2312.03642v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03642
Code URL: null
Copy Paste: [[2312.03642]] Transformer-Powered Surrogates Close the ICF Simulation-Experiment Gap with Extremely Limited Data(http://arxiv.org/abs/2312.03642)
Summary:
Recent advances in machine learning, specifically transformer architecture, have led to significant advancements in commercial domains. These powerful models have demonstrated superior capability to learn complex relationships and often generalize better to new data and problems. This paper presents a novel transformer-powered approach for enhancing prediction accuracy in multi-modal output scenarios, where sparse experimental data is supplemented with simulation data. The proposed approach integrates transformer-based architecture with a novel graph-based hyper-parameter optimization technique. The resulting system not only effectively reduces simulation bias, but also achieves superior prediction accuracy compared to the prior method. We demonstrate the efficacy of our approach on inertial confinement fusion experiments, where only 10 shots of real-world data are available, as well as synthetic versions of these experiments.

Title: What Planning Problems Can A Relational Neural Network Solve?. (arXiv:2312.03682v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03682
Code URL: https://github.com/concepts-ai/goal-regression-width
Copy Paste: [[2312.03682]] What Planning Problems Can A Relational Neural Network Solve?(http://arxiv.org/abs/2312.03682)
Summary:
Goal-conditioned policies are generally understood to be "feed-forward" circuits, in the form of neural networks that map from the current state and the goal specification to the next action to take. However, under what circumstances such a policy can be learned and how efficient the policy will be are not well understood. In this paper, we present a circuit complexity analysis for relational neural networks (such as graph neural networks and transformers) representing policies for planning problems, by drawing connections with serialized goal regression search (S-GRS). We show that there are three general classes of planning problems, in terms of the growth of circuit width and depth as a function of the number of objects and planning horizon, providing constructive proofs. We also illustrate the utility of this analysis for designing neural networks for policy learning.

generative

Title: FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction. (arXiv:2312.03187v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03187
Code URL: https://github.com/shuangquanfeng/fergi
Copy Paste: [[2312.03187]] FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction(http://arxiv.org/abs/2312.03187)
Summary:
Researchers have proposed to use data of human preference feedback to fine-tune text-to-image generative models. However, the scalability of human feedback collection has been limited by its reliance on manual annotation. Therefore, we develop and test a method to automatically annotate user preferences from their spontaneous facial expression reaction to the generated images. We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images. Specifically, AU4 (brow lowerer) is most consistently reflective of negative evaluations of the generated image. This can be useful in two ways. Firstly, we can automatically annotate user preferences between image pairs with substantial difference in AU4 responses to them with an accuracy significantly outperforming state-of-the-art scoring models. Secondly, directly integrating the AU4 responses with the scoring models improves their consistency with human preferences. Additionally, the AU4 response best reflects the user's evaluation of the image fidelity, making it complementary to the state-of-the-art scoring models, which are generally better at reflecting image-text alignment. Finally, this method of automatic annotation with facial expression analysis can be potentially generalized to other generation tasks. The code is available at https://github.com/ShuangquanFeng/FERGI, and the dataset is also available at the same link for research purposes.

Title: Data-driven Crop Growth Simulation on Time-varying Generated Images using Multi-conditional Generative Adversarial Networks. (arXiv:2312.03443v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03443
Code URL: null
Copy Paste: [[2312.03443]] Data-driven Crop Growth Simulation on Time-varying Generated Images using Multi-conditional Generative Adversarial Networks(http://arxiv.org/abs/2312.03443)
Summary:
Image-based crop growth modeling can substantially contribute to precision agriculture by revealing spatial crop development over time, which allows an early and location-specific estimation of relevant future plant traits, such as leaf area or biomass. A prerequisite for realistic and sharp crop image generation is the integration of multiple growth-influencing conditions in a model, such as an image of an initial growth stage, the associated growth time, and further information about the field treatment. We present a two-stage framework consisting first of an image prediction model and second of a growth estimation model, which both are independently trained. The image prediction model is a conditional Wasserstein generative adversarial network (CWGAN). In the generator of this model, conditional batch normalization (CBN) is used to integrate different conditions along with the input image. This allows the model to generate time-varying artificial images dependent on multiple influencing factors of different kinds. These images are used by the second part of the framework for plant phenotyping by deriving plant-specific traits and comparing them with those of non-artificial (real) reference images. For various crop datasets, the framework allows realistic, sharp image predictions with a slight loss of quality from short-term to long-term predictions. Simulations of varying growth-influencing conditions performed with the trained framework provide valuable insights into how such factors relate to crop appearances, which is particularly useful in complex, less explored crop mixture systems. Further results show that adding process-based simulated biomass as a condition increases the accuracy of the derived phenotypic traits from the predicted images. This demonstrates the potential of our framework to serve as an interface between an image- and process-based crop growth model.

Title: MMM: Generative Masked Motion Model. (arXiv:2312.03596v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03596
Code URL: null
Copy Paste: [[2312.03596]] MMM: Generative Masked Motion Model(http://arxiv.org/abs/2312.03596)
Summary:
Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at https://exitudio.github.io/MMM-page.
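
The parallel, iterative decoding described above can be sketched generically as follows. This is a minimal illustration of confidence-based masked decoding, not the paper's implementation: the cosine schedule, the `predict` interface, and the toy predictor are all assumptions introduced for this example.

```python
import math

MASK = None  # placeholder for a masked token position


def iterative_decode(length, predict, steps=4):
    """Fill a fully masked sequence over a fixed number of parallel steps.

    predict(tokens) must return {position: (token, confidence)} for every
    masked position; it stands in for a conditional masked transformer.
    """
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        proposals = predict(tokens)
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(math.floor(length * math.cos(math.pi / 2 * step / steps)))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident proposals, leave the rest masked.
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens


def toy_predict(tokens):
    """Hypothetical predictor: token = 10 * position, earlier = more confident."""
    return {i: (i * 10, 1.0 / (1 + i)) for i, t in enumerate(tokens) if t is MASK}


decoded = iterative_decode(8, toy_predict)
```

Editing fits the same loop: starting from an existing sequence with mask tokens placed only at the positions to change would let the predictor fill just those gaps.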

Title: MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations. (arXiv:2312.03631v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03631
Code URL: null
Copy Paste: [[2312.03631]] MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations(http://arxiv.org/abs/2312.03631)
Summary:
While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, the generation of spurious details that cannot be inferred from the given image. Dedicated methods for reducing hallucinations in image captioning largely focus on closed-vocabulary object tokens, ignoring most types of hallucinations that occur in practice. In this work, we propose MOCHa, an approach that harnesses advancements in reinforcement learning (RL) to address the sequence-level nature of hallucinations in an open-world setup. To optimize for caption fidelity to the input image, we leverage ground-truth reference captions as proxies to measure the logical consistency of generated captions. However, optimizing for caption fidelity alone fails to preserve the semantic adequacy of generations; therefore, we propose a multi-objective reward function that jointly targets these qualities, without requiring any strong supervision. We demonstrate that these goals can be simultaneously optimized with our framework, enhancing performance for various captioning models of different scales. Our qualitative and quantitative results demonstrate MOCHa's superior performance across various established metrics. We also demonstrate the benefit of our method in the open-vocabulary setting. To this end, we contribute OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations in image captioning models, constructed using generative foundation models. We will release our code, benchmark, and trained models.

Title: Memory Triggers: Unveiling Memorization in Text-To-Image Generative Models through Word-Level Duplication. (arXiv:2312.03692v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.03692
Code URL: null
Copy Paste: [[2312.03692]] Memory Triggers: Unveiling Memorization in Text-To-Image Generative Models through Word-Level Duplication(http://arxiv.org/abs/2312.03692)
Summary:
Diffusion-based models, such as the Stable Diffusion model, have revolutionized text-to-image synthesis with their ability to produce high-quality, high-resolution images. These advancements have prompted significant progress in image generation and editing tasks. However, these models also raise concerns due to their tendency to memorize and potentially replicate exact training samples, posing privacy risks and enabling adversarial attacks. Duplication in training datasets is recognized as a major factor contributing to memorization, and various forms of memorization have been studied so far. This paper focuses on two distinct and underexplored types of duplication that lead to replication during inference in diffusion-based models, particularly in the Stable Diffusion model. We delve into these lesser-studied duplication phenomena and their implications through two case studies, aiming to contribute to the safer and more responsible use of generative models in various applications.

Title: ZTCloudGuard: Zero Trust Context-Aware Access Management Framework to Avoid Misuse Cases in the Era of Generative AI and Cloud-based Health Information Ecosystem. (arXiv:2312.02993v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.02993
Code URL: null
Copy Paste: [[2312.02993]] ZTCloudGuard: Zero Trust Context-Aware Access Management Framework to Avoid Misuse Cases in the Era of Generative AI and Cloud-based Health Information Ecosystem(http://arxiv.org/abs/2312.02993)
Summary:
Managing access between large numbers of distributed medical devices has become a crucial aspect of modern healthcare systems, enabling the establishment of smart hospitals and telehealth infrastructure. However, as telehealth technology continues to evolve and Internet of Things (IoT) devices become more widely used, they are also becoming increasingly exposed to various types of vulnerabilities and medical errors. In healthcare information systems, about 90% of vulnerabilities emerged from misuse cases and human errors. As a result, there is a need for additional research and development of security tools to prevent such attacks. This article proposes a zero-trust-based context-aware framework for managing access to the main components of the cloud ecosystem, including users, devices and output data. The main goal and benefit of the proposed framework is to build a scoring system to prevent or alleviate misuse cases while using distributed medical devices in cloud-based healthcare information systems. The framework has two main scoring schemas to maintain the chain of trust. First, it proposes a critical trust score based on cloud-native micro-services of authentication, encryption, logging, and authorization. Second, it creates a bond trust score to assess the real-time semantic and syntactic analysis of attributes stored in a healthcare information system. The analysis is based on a pre-trained machine learning model to generate the semantic and syntactic scores. The framework also takes into account regulatory compliance and user consent to create a scoring system. The advantage of this method is that it is applicable to any language and adapts to all attributes as it relies on a language model, not just a set of predefined and limited attributes. The results show a high F1 score of 93.5%, which proves that it is valid for detecting misuse cases.

Title: Synthesizing Physical Backdoor Datasets: An Automated Framework Leveraging Deep Generative Models. (arXiv:2312.03419v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.03419
Code URL: null
Copy Paste: [[2312.03419]] Synthesizing Physical Backdoor Datasets: An Automated Framework Leveraging Deep Generative Models(http://arxiv.org/abs/2312.03419)
Summary:
Backdoor attacks, representing an emerging threat to the integrity of deep neural networks, have garnered significant attention due to their ability to compromise deep learning systems clandestinely. While numerous backdoor attacks occur within the digital realm, their practical implementation in real-world prediction systems remains limited and vulnerable to disturbances in the physical world. Consequently, this limitation has given rise to the development of physical backdoor attacks, where trigger objects manifest as physical entities within the real world. However, creating the requisite dataset to train or evaluate a physical backdoor model is a daunting task, limiting the backdoor researchers and practitioners from studying such physical attack scenarios. This paper unleashes a recipe that empowers backdoor researchers to effortlessly create a malicious, physical backdoor dataset based on advances in generative modeling. Particularly, this recipe involves 3 automatic modules: suggesting the suitable physical triggers, generating the poisoned candidate samples (either by synthesizing new samples or editing existing clean samples), and finally refining for the most plausible ones. As such, it effectively mitigates the perceived complexity associated with creating a physical backdoor dataset, transforming it from a daunting task into an attainable objective. Extensive experiment results show that datasets created by our "recipe" enable adversaries to achieve an impressive attack success rate on real physical world data and exhibit similar properties compared to previous physical backdoor attack studies. This paper offers researchers a valuable toolkit for studies of physical backdoors, all within the confines of their laboratories.

Title: MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment. (arXiv:2312.03644v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03644
Code URL: null
Copy Paste: [[2312.03644]] MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment(http://arxiv.org/abs/2312.03644)
Summary:
Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges due to partial observability and emergent behavior. Directly transferring the online credit assignment method to offline settings results in suboptimal outcomes due to the absence of real-time feedback and intricate agent interactions. Our approach, MACCA, characterizing the generative process as a Dynamic Bayesian Network, captures relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA can learn each agent's contribution by analyzing the causal relationship of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to seamlessly integrate with various offline MARL methods. Theoretically, we proved that under the setting of the offline dataset, the underlying causal structure and the function for generating the individual rewards of agents are identifiable, which laid the foundation for the correctness of our modeling. Experimentally, we tested MACCA in two environments, including discrete and continuous action settings. The results show that MACCA outperforms SOTA methods and improves performance upon their backbones.

Title: On the Role of Edge Dependency in Graph Generative Models. (arXiv:2312.03691v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03691
Code URL: null
Copy Paste: [[2312.03691]] On the Role of Edge Dependency in Graph Generative Models(http://arxiv.org/abs/2312.03691)
Summary:
In this work, we introduce a novel evaluation framework for generative models of graphs, emphasizing the importance of model-generated graph overlap (Chanpuriya et al., 2021) to ensure both accuracy and edge-diversity. We delineate a hierarchy of graph generative models categorized into three levels of complexity: edge independent, node independent, and fully dependent models. This hierarchy encapsulates a wide range of prevalent methods. We derive theoretical bounds on the number of triangles and other short-length cycles producible by each level of the hierarchy, contingent on the model overlap. We provide instances demonstrating the asymptotic optimality of our bounds. Furthermore, we introduce new generative models for each of the three hierarchical levels, leveraging dense subgraph discovery (Gionis & Tsourakakis, 2015). Our evaluation, conducted on real-world datasets, focuses on assessing the output quality and overlap of our proposed models in comparison to other popular models. Our results indicate that our simple, interpretable models provide competitive baselines to popular generative models. Through this investigation, we aim to propel the advancement of graph generative models by offering a structured framework and robust evaluation metrics, thereby facilitating the development of models capable of generating accurate and edge-diverse graphs.
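
To make the relationship between overlap and triangle counts concrete, the sketch below computes both quantities for an edge-independent model given as a symmetric matrix of edge probabilities. It assumes overlap is the expected number of edges shared by two independent samples divided by the expected number of edges; that reading of the cited definition, and the function names, are assumptions of this example.

```python
from itertools import combinations


def overlap(p):
    """Expected shared edges between two samples over expected edge count.

    p is a symmetric matrix of independent edge probabilities.
    """
    pairs = list(combinations(range(len(p)), 2))
    expected_edges = sum(p[i][j] for i, j in pairs)
    expected_shared = sum(p[i][j] ** 2 for i, j in pairs)
    return expected_shared / expected_edges


def expected_triangles(p):
    """Expected number of triangles when all edges are sampled independently."""
    return sum(
        p[i][j] * p[j][k] * p[i][k]
        for i, j, k in combinations(range(len(p)), 3)
    )


# A uniform random graph on 4 nodes with edge probability 0.5.
uniform = [[0.5] * 4 for _ in range(4)]
# A deterministic model that always emits the same triangle.
fixed = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

uniform_stats = (overlap(uniform), expected_triangles(uniform))
fixed_stats = (overlap(fixed), expected_triangles(fixed))
```

The deterministic model reproduces its triangle every time but has overlap 1, meaning it only ever emits one graph, while the uniform model has lower overlap and fewer expected triangles. This is the tension that the bounds in the summary quantify.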

large language model

Title: Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models. (arXiv:2312.03052v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03052
Code URL: null
Copy Paste: [[2312.03052]] Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models(http://arxiv.org/abs/2312.03052)
Summary:
Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.
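
The sample, execute, and verify step can be pictured with the toy sketch below. Candidate programs are plain Python callables standing in for LLM-generated programs, and verification is assumed to mean matching a known answer; both choices, and every name in the code, are assumptions of this example rather than details from the paper.

```python
def select_verified_program(candidates, inputs, expected):
    """Return the name of the first candidate whose output matches `expected`.

    candidates: list of (name, callable) pairs; each callable takes `inputs`.
    Programs that raise an exception are treated as failed and skipped.
    """
    for name, program in candidates:
        try:
            result = program(inputs)
        except Exception:
            continue  # a crashing program cannot be verified
        if result == expected:
            return name
    return None


# Hypothetical task: count the objects labelled "cat" in a detection list.
detections = ["cat", "dog", "cat"]
candidates = [
    ("crashes", lambda d: d["cat"]),            # wrong data access, raises
    ("counts_all", lambda d: len(d)),           # runs, but answers 3
    ("counts_cats", lambda d: d.count("cat")),  # runs and answers 2
]
chosen = select_verified_program(candidates, detections, expected=2)
```

Only the verified program would then be turned into a reasoning description for distillation, which filters out programs that omit steps or fail at runtime.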

Title: GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models. (arXiv:2312.03543v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03543
Code URL: https://github.com/petrichor625/talk2car_cavg
Copy Paste: [[2312.03543]] GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models(http://arxiv.org/abs/2312.03543)
Summary:
In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders (Text, Image, Context, and Cross-Modal) with a Multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments. The code for the proposed model is available at our GitHub.

Title: OneLLM: One Framework to Align All Modalities with Language. (arXiv:2312.03700v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03700
Code URL: https://github.com/csuhan/onellm
Copy Paste: [[2312.03700]] OneLLM: One Framework to Align All Modalities with Language(http://arxiv.org/abs/2312.03700)
Summary:
Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM

Title: Inherent limitations of LLMs regarding spatial information. (arXiv:2312.03042v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03042
Code URL: null
Copy Paste: [[2312.03042]] Inherent limitations of LLMs regarding spatial information(http://arxiv.org/abs/2312.03042)
Summary:
Despite the significant advancements in natural language processing capabilities demonstrated by large language models such as ChatGPT, their proficiency in comprehending and processing spatial information, especially within the domains of 2D and 3D route planning, remains notably underdeveloped. This paper investigates the inherent limitations of ChatGPT and similar models in spatial reasoning and navigation-related tasks, an area critical for applications ranging from autonomous vehicle guidance to assistive technologies for the visually impaired. In this paper, we introduce a novel evaluation framework complemented by a baseline dataset, meticulously crafted for this study. This dataset is structured around three key tasks: plotting spatial points, planning routes in two-dimensional (2D) spaces, and devising pathways in three-dimensional (3D) environments. We specifically developed this dataset to assess the spatial reasoning abilities of ChatGPT. Our evaluation reveals key insights into the model's capabilities and limitations in spatial understanding.

Title: Assertion Enhanced Few-Shot Learning: Instructive Technique for Large Language Models to Generate Educational Explanations. (arXiv:2312.03122v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03122
Code URL: null
Copy Paste: [[2312.03122]] Assertion Enhanced Few-Shot Learning: Instructive Technique for Large Language Models to Generate Educational Explanations(http://arxiv.org/abs/2312.03122)
Summary:
Human educators possess an intrinsic ability to anticipate and seek educational explanations from students, which drives them to pose thought-provoking questions when students cannot articulate these explanations independently. We aim to imbue Intelligent Tutoring Systems with this ability using the few-shot learning capability of Large Language Models. Our work proposes a novel prompting technique, Assertion Enhanced Few-Shot Learning, to facilitate the generation of accurate, detail-oriented educational explanations. Our central hypothesis is that, in the educational domain, few-shot demonstrations are a necessary but not a sufficient condition for quality explanation generation. We conducted a study involving 12 in-service teachers, comparing our approach to Traditional Few-Shot Learning. The results show that Assertion Enhanced Few-Shot Learning improves explanation accuracy by 15% and yields higher-quality explanations, as evaluated by teachers. We also conduct a qualitative ablation study to factor the impact of assertions to provide educator-friendly prompting guidelines for generating explanations in their domain of interest.

Title: Teaching Specific Scientific Knowledge into Large Language Models through Additional Training. (arXiv:2312.03360v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03360
Code URL: null
Copy Paste: [[2312.03360]] Teaching Specific Scientific Knowledge into Large Language Models through Additional Training(http://arxiv.org/abs/2312.03360)
Summary:
Through additional training, we explore embedding specialized scientific knowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that effective knowledge integration requires reading texts from multiple perspectives, especially in instructional formats. We utilize text augmentation to tackle the scarcity of specialized texts, including style conversions and translations. Hyperparameter optimization proves crucial, with different size models (7b, 13b, and 70b) reasonably undergoing additional training. Validating our methods, we construct a dataset of 65,000 scientific papers. Although we have succeeded in partially embedding knowledge, the study highlights the complexities and limitations of incorporating specialized information into LLMs, suggesting areas for further improvement.

Title: Think from Words(TFW): Initiating Human-Like Cognition in Large Language Models Through Think from Words for Japanese Text-level Classification. (arXiv:2312.03458v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03458
Code URL: null
Copy Paste: [[2312.03458]] Think from Words(TFW): Initiating Human-Like Cognition in Large Language Models Through Think from Words for Japanese Text-level Classification(http://arxiv.org/abs/2312.03458)
Summary:
The proliferation of Large Language Models (LLMs) has spurred extensive research into LLM-related Prompt investigations, such as Instruction Learning (IL), In-context Learning (ICL), and Chain-of-Thought (CoT). These approaches aim to improve LLMs' responses by enabling them to provide concise statements or examples for deeper contemplation when addressing questions. However, independent thinking by LLMs can introduce variability in their thought processes, leading to potential inaccuracies. In response, our study seeks to bridge the gap between LLM and human-like thinking processes, recognizing that text comprehension begins with understanding individual words. To tackle this challenge, we have expanded the CoT method to cater to a specific domain. Our approach, known as "Think from Words" (TFW), initiates the comprehension process at the word level and then extends it to encompass the entire text. We also propose "TFW with Extra word-level information" (TFW Extra), augmenting comprehension with additional word-level data. To assess our methods, we employ text classification on six Japanese datasets comprising text-level and word-level elements. Our findings not only validate the effectiveness of TFW but also shed light on the impact of various word-level information types on LLMs' text comprehension, offering insights into their potential to cause misinterpretations and errors in the overall comprehension of the final text.

Title: DBCopilot: Scaling Natural Language Querying to Massive Databases. (arXiv:2312.03463v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03463
Code URL: https://github.com/tshu-w/dbcopilot
Copy Paste: [[2312.03463]] DBCopilot: Scaling Natural Language Querying to Massive Databases(http://arxiv.org/abs/2312.03463)
Summary:
Text-to-SQL simplifies database interactions by enabling non-experts to convert their natural language (NL) questions into Structured Query Language (SQL) queries. While recent advances in large language models (LLMs) have improved the zero-shot text-to-SQL paradigm, existing methods face scalability challenges when dealing with massive, dynamically changing databases. This paper introduces DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing across massive databases. Specifically, DBCopilot decouples the text-to-SQL process into schema routing and SQL generation, leveraging a lightweight sequence-to-sequence neural network-based router to formulate database connections and navigate natural language questions through databases and tables. The routed schemas and questions are then fed into LLMs for efficient SQL generation. Furthermore, DBCopilot also introduced a reverse schema-to-question generation paradigm, which can learn and adapt the router over massive databases automatically without requiring manual intervention. Experimental results demonstrate that DBCopilot is a scalable and effective solution for real-world text-to-SQL tasks, providing a significant advancement in handling large-scale schemas.
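
The decoupling into schema routing and SQL generation can be illustrated with the toy sketch below. It swaps the paper's sequence-to-sequence router for a simple keyword-overlap heuristic, and the schema layout, function names, and prompt wording are all hypothetical; the point is only to show the two-stage shape of the pipeline.

```python
def route_schema(question, schemas):
    """Pick the (database, table) whose names best overlap the question words.

    schemas: {database: {table: [column, ...]}}. Keyword overlap is a
    stand-in for a learned router, not the method described in the paper.
    """
    words = set(question.lower().split())
    best, best_score = None, -1
    for db, tables in schemas.items():
        for table, columns in tables.items():
            terms = {table.lower()} | {c.lower() for c in columns}
            score = len(words & terms)
            if score > best_score:
                best, best_score = (db, table), score
    return best


def build_sql_prompt(question, schemas, route):
    """Assemble the text that would be passed to an LLM for SQL generation."""
    db, table = route
    columns = ", ".join(schemas[db][table])
    return f"Database: {db}\nTable: {table}({columns})\nQuestion: {question}\nSQL:"


schemas = {
    "hr": {"employee": ["name", "salary"]},
    "shop": {"orders": ["id", "total"]},
}
question = "list the name and salary of every employee"
route = route_schema(question, schemas)
prompt = build_sql_prompt(question, schemas, route)
```

Because the LLM only sees the routed schema, the prompt stays small no matter how many databases exist, which is the scalability argument made in the summary.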

Title: Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment. (arXiv:2312.03549v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03549
Code URL: null
Copy Paste: [[2312.03549]] Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment(http://arxiv.org/abs/2312.03549)
Summary:
Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. However, training these models can incur significant expenses, often requiring tens of thousands of GPUs for months of continuous operation. Typically, this training is carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). The acquisition and maintenance of such dedicated clusters is challenging. Current LLM training frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on optimizing training within homogeneous cluster settings. In this paper, we introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over the heterogeneous NIC environment. Our primary technical contribution lies in a novel scheduling method that intelligently allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. Furthermore, our proposed framework, utilizing pipeline parallel techniques, demonstrates scalability to multiple GPU clusters, even in scenarios without high-speed interconnects between nodes in distinct clusters. We conducted comprehensive experiments that involved various scenarios in the heterogeneous NIC environment. In most cases, our framework achieves performance levels close to those achievable with homogeneous RDMA-capable networks (InfiniBand or RoCE), significantly exceeding training efficiency within the pure Ethernet environment. Additionally, we verified that our framework outperforms other mainstream LLM frameworks under heterogeneous NIC environment in terms of training efficiency and can be seamlessly integrated with them.

Title: Not All Large Language Models (LLMs) Succumb to the "Reversal Curse": A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models. (arXiv:2312.03633v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03633
Code URL: null
Copy Paste: [[2312.03633]] Not All Large Language Models (LLMs) Succumb to the "Reversal Curse": A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models(http://arxiv.org/abs/2312.03633)
Summary:
The "Reversal Curse" refers to the scenario where auto-regressive decoder large language models (LLMs), such as ChatGPT, trained on "A is B" fail to learn "B is A", demonstrating a basic failure of logical deduction. This raises a red flag in the use of GPT models for certain general tasks such as constructing knowledge graphs, considering their adherence to this symmetric principle. In our study, we examined a bidirectional LLM, BERT, and found that it is immune to the reversal curse. Driven by ongoing efforts to construct biomedical knowledge graphs with LLMs, we also embarked on evaluating more complex but essential deductive reasoning capabilities. This process included first training encoder and decoder language models to master the intersection ($\cap$) and union ($\cup$) operations on two sets and then moving on to assess their capability to infer different combinations of union ($\cup$) and intersection ($\cap$) operations on three newly created sets. The findings showed that while both encoder and decoder language models, trained for tasks involving two sets (union/intersection), were proficient in such scenarios, they encountered difficulties when dealing with operations that included three sets (various combinations of union and intersection). Our research highlights the distinct characteristics of encoder and decoder models in simple and complex logical reasoning. In practice, the choice between BERT and GPT should be guided by the specific requirements and nature of the task at hand, leveraging their respective strengths in bidirectional context comprehension and sequence prediction.
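
For reference, the three-set targets described above can be enumerated with a short sketch. It computes ground-truth answers for every combination of union and intersection over three sets in the form (A op B) op C; this fixed bracketing, the function name, and the example sets are assumptions of this illustration, not the paper's data-generation procedure.

```python
from itertools import product


def three_set_expressions(a, b, c):
    """Evaluate (A op1 B) op2 C for every choice of union and intersection."""
    ops = {"union": set.union, "intersection": set.intersection}
    results = {}
    for (name1, op1), (name2, op2) in product(ops.items(), repeat=2):
        results[f"(A {name1} B) {name2} C"] = op2(op1(a, b), c)
    return results


answers = three_set_expressions({1, 2}, {2, 3}, {3, 4})
for expression, value in sorted(answers.items()):
    print(expression, "=", sorted(value))
```

A model trained only on two-set operations has to compose two of them to match these targets, which is the step the summary reports as difficult for both encoder and decoder models.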

segmentation

Title: PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation. (arXiv:2312.03015v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03015
Code URL: https://github.com/zyc00/partslip2
Copy Paste: [[2312.03015]] PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation(http://arxiv.org/abs/2312.03015)
Summary:
Open-world 3D part segmentation is pivotal in diverse applications such as robotics and AR/VR. Traditional supervised methods often grapple with limited 3D data availability and struggle to generalize to unseen object categories. PartSLIP, a recent advancement, has made significant strides in zero- and few-shot 3D part segmentation. This is achieved by harnessing the capabilities of the 2D open-vocabulary detection module, GLIP, and introducing a heuristic method for converting and lifting multi-view 2D bounding box predictions into 3D segmentation masks. In this paper, we introduce PartSLIP++, an enhanced version designed to overcome the limitations of its predecessor. Our approach incorporates two major improvements. First, we utilize a pre-trained 2D segmentation model, SAM, to produce pixel-wise 2D segmentations, yielding more precise and accurate annotations than the 2D bounding boxes used in PartSLIP. Second, PartSLIP++ replaces the heuristic 3D conversion process with an innovative modified Expectation-Maximization algorithm. This algorithm conceptualizes 3D instance segmentation as unobserved latent variables, and then iteratively refines them through an alternating process of 2D-3D matching and optimization with gradient descent. Through extensive evaluations, we show that PartSLIP++ demonstrates better performance over PartSLIP in both low-shot 3D semantic and instance-based object part segmentation tasks. Code released at https://github.com/zyc00/PartSLIP2.

Title: AI-SAM: Automatic and Interactive Segment Anything Model. (arXiv:2312.03119v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03119
Code URL: null
Copy Paste: [[2312.03119]] AI-SAM: Automatic and Interactive Segment Anything Model(http://arxiv.org/abs/2312.03119)
Summary:
Semantic segmentation is a core task in computer vision. Existing methods are generally divided into two categories: automatic and interactive. Interactive approaches, exemplified by the Segment Anything Model (SAM), have shown promise as pre-trained models. However, current adaptation strategies for these models tend to lean towards either automatic or interactive approaches. Interactive methods depend on user-provided prompts to operate, while automatic ones bypass interactive promptability entirely. Addressing these limitations, we introduce a novel paradigm and its first model: the Automatic and Interactive Segment Anything Model (AI-SAM). In this paradigm, we conduct a comprehensive analysis of prompt quality and introduce the pioneering Automatic and Interactive Prompter (AI-Prompter) that automatically generates initial point prompts while accepting additional user inputs. Our experimental results demonstrate AI-SAM's effectiveness in the automatic setting, achieving state-of-the-art performance. Significantly, it offers the flexibility to incorporate additional user prompts, thereby further enhancing its performance. The project page is available at https://github.com/ymp5078/AI-SAM.

Title: Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields. (arXiv:2312.03203v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03203
Code URL: null
Copy Paste: [[2312.03203]] Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields(http://arxiv.org/abs/2312.03203)
Summary:
3D scene representations have gained immense popularity in recent years. Methods that use Neural Radiance fields are versatile for traditional tasks such as novel view synthesis. In recent times, some work has emerged that aims to extend the functionality of NeRF beyond view synthesis, for semantically aware tasks such as editing and segmentation using 3D feature field distillation from 2D foundation models. However, these methods have two major limitations: (a) they are limited by the rendering speed of NeRF pipelines, and (b) implicitly represented feature fields suffer from continuity artifacts reducing feature quality. Recently, 3D Gaussian Splatting has shown state-of-the-art performance on real-time radiance field rendering. In this work, we go one step further: in addition to radiance field rendering, we enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D foundation model distillation. This translation is not straightforward: naively incorporating feature fields in the 3DGS framework leads to warp-level divergence. We propose architectural and training changes to efficiently avert this problem. Our proposed method is general, and our experiments showcase novel view semantic segmentation, language-guided editing and segment anything through learning feature fields from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across experiments, our distillation method is able to provide comparable or better results, while being significantly faster to both train and render. Additionally, to the best of our knowledge, we are the first method to enable point and bounding-box prompting for radiance field manipulation, by leveraging the SAM model. Project website at: https://feature-3dgs.github.io/

Title: Background Clustering Pre-training for Few-shot Segmentation. (arXiv:2312.03322v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03322
Code URL: null
Copy Paste: [[2312.03322]] Background Clustering Pre-training for Few-shot Segmentation(http://arxiv.org/abs/2312.03322)
Summary:
Recent few-shot segmentation (FSS) methods introduce an extra pre-training stage before meta-training to obtain a stronger backbone, which has become a standard step in few-shot learning. Despite its effectiveness, the current pre-training scheme suffers from the merged background problem: only base classes are labelled as foreground, making it hard to distinguish between novel classes and actual background. In this paper, we propose a new pre-training scheme for FSS that decouples the novel classes from the background, called Background Clustering Pre-Training (BCPT). Specifically, we apply online clustering to the pixel embeddings of the merged background to explore the underlying semantic structures, bridging the gap between pre-training and adaptation to novel classes. Given the clustering results, we further propose the background mining loss and leverage base classes to guide the clustering process, improving the quality and stability of clustering results. Experiments on PASCAL-5i and COCO-20i show that BCPT yields advanced performance. Code will be available.

Title: PointMoment: Mixed-Moment-based Self-Supervised Representation Learning for 3D Point Clouds. (arXiv:2312.03350v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03350
Code URL: null
Copy Paste: [[2312.03350]] PointMoment: Mixed-Moment-based Self-Supervised Representation Learning for 3D Point Clouds(http://arxiv.org/abs/2312.03350)
Summary:
Large and rich data is a prerequisite for effective training of deep neural networks. However, the irregularity of point cloud data makes manual annotation time-consuming and laborious. Self-supervised representation learning, which leverages the intrinsic structure of large-scale unlabelled data to learn meaningful feature representations, has attracted increasing attention in the field of point cloud research. However, self-supervised representation learning often suffers from model collapse, resulting in reduced information and diversity of the learned representation, and consequently degrading the performance of downstream tasks. To address this problem, we propose PointMoment, a novel framework for point cloud self-supervised representation learning that utilizes a high-order mixed moment loss function rather than the conventional contrastive loss function. Moreover, our framework does not require any special techniques such as asymmetric network architectures, gradient stopping, etc. Specifically, we calculate the high-order mixed moments of the feature variables and force them to decompose into products of their individual moments, thereby making multiple variables more independent and minimizing the feature redundancy. We also incorporate a contrastive learning approach to maximize the feature invariance under different data augmentations of the same point cloud. Experimental results show that our approach outperforms previous unsupervised learning methods on the downstream task of 3D point cloud classification and segmentation.
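
The idea of forcing mixed moments to factor into individual moments can be illustrated with a small penalty: for each group of feature dimensions, compare the mean of their product with the product of their means. The abstract does not give the exact loss, so the squared-difference form, the function name `mixed_moment_penalty`, and the `order` parameter are assumptions made for illustration, not the authors' formulation.

```python
from itertools import combinations
from math import prod


def mixed_moment_penalty(features, order=2):
    """Sum of squared gaps between E[z_i * z_j * ...] and E[z_i] * E[z_j] * ...

    features: list of equal-length feature vectors (one per sample).
    order: how many distinct feature dimensions enter each mixed moment.
    Assumption: the squared gap is a stand-in for the paper's loss.
    """
    n = len(features)
    dim = len(features[0])
    # First moment (mean) of every feature dimension.
    means = [sum(row[d] for row in features) / n for d in range(dim)]
    penalty = 0.0
    for idx in combinations(range(dim), order):
        # Mixed moment: batch average of the product of the chosen dimensions.
        joint = sum(prod(row[d] for d in idx) for row in features) / n
        # Value the mixed moment would take if the dimensions were independent.
        factored = prod(means[d] for d in idx)
        penalty += (joint - factored) ** 2
    return penalty


if __name__ == "__main__":
    independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
    redundant = [(0, 0), (1, 1)]
    print(mixed_moment_penalty(independent))  # dimensions vary independently
    print(mixed_moment_penalty(redundant))    # second dimension copies the first
```

With `order=2` the gap is the covariance between two dimensions; higher orders extend the same check to triples and beyond.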

Title: DeepPyramid+: Medical Image Segmentation using Pyramid View Fusion and Deformable Pyramid Reception. (arXiv:2312.03409v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03409
Code URL: null
Copy Paste: [[2312.03409]] DeepPyramid+: Medical Image Segmentation using Pyramid View Fusion and Deformable Pyramid Reception(http://arxiv.org/abs/2312.03409)
Summary:
Semantic segmentation plays a pivotal role in many applications related to medical image and video analysis. However, designing a neural network architecture for medical image and surgical video segmentation is challenging due to the diverse features of relevant classes, including heterogeneity, deformability, transparency, blunt boundaries, and various distortions. We propose a network architecture, DeepPyramid+, which addresses diverse challenges encountered in medical image and surgical video segmentation. The proposed DeepPyramid+ incorporates two major modules, namely "Pyramid View Fusion" (PVF) and "Deformable Pyramid Reception" (DPR), to address the outlined challenges. PVF replicates a deduction process within the neural network, aligning with the human visual system, thereby enhancing the representation of relative information at each pixel position. Complementarily, DPR introduces shape- and scale-adaptive feature extraction techniques using dilated deformable convolutions, enhancing accuracy and robustness in handling heterogeneous classes and deformable shapes. Extensive experiments conducted on diverse datasets, including endometriosis videos, MRI images, OCT scans, and cataract and laparoscopy videos, demonstrate the effectiveness of DeepPyramid+ in handling various challenges such as shape and scale variation, reflection, and blur degradation. DeepPyramid+ demonstrates significant improvements in segmentation performance, achieving up to a 3.65% increase in Dice coefficient for intra-domain segmentation and up to a 17% increase in Dice coefficient for cross-domain segmentation. DeepPyramid+ consistently outperforms state-of-the-art networks across diverse modalities considering different backbone networks, showcasing its versatility.
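
For readers unfamiliar with the reported metric, the Dice coefficient of a predicted mask P and a ground-truth mask G is 2|P ∩ G| / (|P| + |G|). The sketch below computes it for flattened binary masks; the function name `dice_coefficient` and the smoothing term `eps` are illustrative choices, not taken from the paper.

```python
def dice_coefficient(pred, target, eps=1e-8):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks.

    pred, target: flat sequences of 0/1 values with the same length.
    eps: small constant (an assumption here) that avoids division by zero
    when both masks are empty, in which case the score is 1.0.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (total + eps)


if __name__ == "__main__":
    # Two predicted pixels, one of which overlaps the single true pixel.
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```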

Title: ShareCMP: Polarization-Aware RGB-P Semantic Segmentation. (arXiv:2312.03430v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03430
Code URL: null
Copy Paste: [[2312.03430]] ShareCMP: Polarization-Aware RGB-P Semantic Segmentation(http://arxiv.org/abs/2312.03430)
Summary:
Multimodal semantic segmentation is developing rapidly, but the modality of RGB-Polarization remains underexplored. To delve into this problem, we construct a UPLight RGB-P segmentation benchmark with 12 typical underwater semantic classes which provides data support for Autonomous Underwater Vehicles (AUVs) to perform special perception tasks. In this work, we design the ShareCMP, an RGB-P semantic segmentation framework with a shared dual-branch architecture, which reduces the number of parameters by about 26-33% compared to previous dual-branch models. It encompasses a Polarization Generate Attention (PGA) module designed to generate polarization modal images with richer polarization properties for the encoder. In addition, we introduce the Class Polarization-Aware Loss (CPALoss) to improve the learning and understanding of the encoder for polarization modal information and to optimize the PGA module. With extensive experiments on a total of three RGB-P benchmarks, our ShareCMP achieves state-of-the-art performance in mIoU with fewer parameters on the UPLight (92.45%), ZJU (92.7%), and MCubeS (50.99%) datasets. The code is available at https://github.com/LEFTeyex/ShareCMP.
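
The mIoU figures quoted above average the per-class intersection-over-union. A minimal sketch follows, assuming flat lists of integer class labels and skipping classes absent from both prediction and ground truth; benchmark toolkits usually accumulate counts over a whole dataset, and the function name `mean_iou` is hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over classes present in pred or target.

    pred, target: flat sequences of integer class labels, same length.
    Assumption: classes that appear in neither sequence are ignored.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```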

Title: Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation. (arXiv:2312.03502v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03502
Code URL: https://github.com/zhang-haojie/wesam
Copy Paste: [[2312.03502]] Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation(http://arxiv.org/abs/2312.03502)
Summary:
The success of large language models has inspired the computer vision community to explore image segmentation foundation models that can generalize in zero/few-shot settings through prompt engineering. Segment-Anything (SAM), among others, is the state-of-the-art image segmentation foundation model demonstrating strong zero/few-shot generalization. Despite this success, recent studies reveal the weakness of SAM under strong distribution shift. In particular, SAM performs poorly on corrupted natural images, camouflaged images, medical images, etc. Motivated by these observations, we aim to develop a self-training based strategy to adapt SAM to the target distribution. Given the unique challenges of a large source dataset, high computation cost and incorrect pseudo labels, we propose a weakly supervised self-training architecture with anchor regularization and low-rank finetuning to improve the robustness and computation efficiency of adaptation. We validate the effectiveness on 5 types of downstream segmentation tasks including natural clean/corrupted images, medical images, camouflaged images and robotic images. Our proposed method is task-agnostic in nature and outperforms pre-trained SAM and state-of-the-art domain adaptation methods on almost all downstream tasks with the same testing prompt inputs.
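
The abstract does not spell out its low-rank finetuning scheme. Assuming a LoRA-style update, where a frozen weight matrix W is adapted as W + scale * (up @ down) with two thin trainable matrices, the sketch below shows the update and the resulting drop in trainable parameters. All names (`low_rank_update`, `trainable_params`, `up`, `down`, `scale`) are hypothetical.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def low_rank_update(weight, up, down, scale=1.0):
    """Return weight + scale * (up @ down), leaving `weight` itself untouched.

    weight: (out x in) frozen matrix; up: (out x r); down: (r x in).
    Assumption: LoRA-style additive update, not confirmed by the abstract.
    """
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def trainable_params(out_dim, in_dim, rank):
    """Parameters in the two thin matrices, versus out_dim * in_dim for full finetuning."""
    return rank * (out_dim + in_dim)


if __name__ == "__main__":
    adapted = low_rank_update([[1, 0], [0, 1]], up=[[1], [2]], down=[[3, 4]])
    print(adapted)
    print(trainable_params(1024, 1024, rank=4), "vs", 1024 * 1024)
```

Because only the thin matrices receive gradients, the adaptation cost scales with the rank rather than with the full weight size.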

Title: Foundation Model Assisted Weakly Supervised Semantic Segmentation. (arXiv:2312.03585v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03585
Code URL: null
Copy Paste: [[2312.03585]] Foundation Model Assisted Weakly Supervised Semantic Segmentation(http://arxiv.org/abs/2312.03585)
Summary:
This work aims to leverage pre-trained foundation models, such as contrastive language-image pre-training (CLIP) and segment anything model (SAM), to address weakly supervised semantic segmentation (WSSS) using image-level labels. To this end, we propose a coarse-to-fine framework based on CLIP and SAM for generating high-quality segmentation seeds. Specifically, we construct an image classification task and a seed segmentation task, which are jointly performed by CLIP with frozen weights and two sets of learnable task-specific prompts. A SAM-based seeding (SAMS) module is designed and applied to each task to produce either coarse or fine seed maps. Moreover, we design a multi-label contrastive loss supervised by image-level labels and a CAM activation loss supervised by the generated coarse seed map. These losses are used to learn the prompts, which are the only parts that need to be learned in our framework. Once the prompts are learned, we input each image along with the learned segmentation-specific prompts into CLIP and the SAMS module to produce high-quality segmentation seeds. These seeds serve as pseudo labels to train an off-the-shelf segmentation network, as in other two-stage WSSS methods. Experiments show that our method achieves state-of-the-art performance on PASCAL VOC 2012 and competitive results on MS COCO 2014.