secure

Title: Compress & Align: Curating Image-Text Data with Human Knowledge. (arXiv:2312.06726v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06726
Code URL: null
Copy Paste: [[2312.06726]] Compress & Align: Curating Image-Text Data with Human Knowledge(http://arxiv.org/abs/2312.06726)
Summary:
The massive growth of image-text data through web crawling inherently presents the challenge of variability in data quality. This paper introduces a novel algorithm, rooted in human knowledge, to compress this vast corpus of web-crawled image-text datasets to a compact and high-quality form. Our method unfolds in three major steps. First, we collect an image-text dataset, wherein each image is associated with multiple captions sourced from diverse origins. Then, to systemically capture human preferences regarding the best caption paired with each image, we establish a comprehensive set of both subjective and objective criteria for critically guiding the alignment assessment from labelers. Lastly, we train a reward model on the annotated dataset to internalize the nuanced human understanding of image-text alignment. The resulting reward model thus can act as a human-like referee to filter misaligned/low-quality image-text pairs. Extensive experiments demonstrate that we are able to secure (or even improve) model performance by compressing the image-text datasets up to ~90%. An impressive example is that, by aggressively reducing the total training sample from 130M to 15.5M (e.g., ~9x smaller), our BLIP-B/16 models still consistently show superior performance compared with the full-size-dataset counterpart on image-text retrieval (Flickr30K, COCO) by ~2.5% in Recall@1, and on image-captioning (Nocaps, COCO) by ~10.0% in CIDEr and ~2.7% in SPICE.

Title: Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training. (arXiv:2312.07067v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07067
Code URL: null
Copy Paste: [[2312.07067]] Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training(http://arxiv.org/abs/2312.07067)
Summary:
Adversarial training is often formulated as a min-max problem, however, concentrating only on the worst adversarial examples causes alternating repetitive confusion of the model, i.e., previously defended or correctly classified samples are not defensible or accurately classifiable in subsequent adversarial training. We characterize such non-ignorable samples as "hiders", which reveal the hidden high-risk regions within the secure area obtained through adversarial training and prevent the model from finding the real worst cases. We demand the model to prevent hiders when defending against adversarial examples for improving accuracy and robustness simultaneously. By rethinking and redefining the min-max optimization problem for adversarial training, we propose a generalized adversarial training algorithm called Hider-Focused Adversarial Training (HFAT). HFAT introduces the iterative evolution optimization strategy to simplify the optimization problem and employs an auxiliary model to reveal hiders, effectively combining the optimization directions of standard adversarial training and prevention hiders. Furthermore, we introduce an adaptive weighting mechanism that facilitates the model in adaptively adjusting its focus between adversarial examples and hiders during different training periods. We demonstrate the effectiveness of our method based on extensive experiments, and ensure that HFAT can provide higher robustness and accuracy.

security

Title: Toward Real Text Manipulation Detection: New Dataset and New Solution. (arXiv:2312.06934v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06934
Code URL: null
Copy Paste: [[2312.06934]] Toward Real Text Manipulation Detection: New Dataset and New Solution(http://arxiv.org/abs/2312.06934)
Summary:
With the surge in realistic text tampering, detecting fraudulent text in images has gained prominence for maintaining information security. However, the high costs associated with professional text manipulation and annotation limit the availability of real-world datasets, with most relying on synthetic tampering, which inadequately replicates real-world tampering attributes. To address this issue, we present the Real Text Manipulation (RTM) dataset, encompassing 14,250 text images, which include 5,986 manually and 5,258 automatically tampered images, created using a variety of techniques, alongside 3,006 unaltered text images for evaluating solution stability. Our evaluations indicate that existing methods falter in text forgery detection on the RTM dataset. We propose a robust baseline solution featuring a Consistency-aware Aggregation Hub and a Gated Cross Neighborhood-attention Fusion module for efficient multi-modal information fusion, supplemented by a Tampered-Authentic Contrastive Learning module during training, enriching feature representation distinction. This framework, extendable to other dual-stream architectures, demonstrated notable localization performance improvements of 7.33% and 6.38% on manual and overall manipulations, respectively. Our contributions aim to propel advancements in real-world text tampering detection. Code and dataset will be made available at https://github.com/DrLuo/RTM

Title: Noised Autoencoders for Point Annotation Restoration in Object Counting. (arXiv:2312.07190v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07190
Code URL: null
Copy Paste: [[2312.07190]] Noised Autoencoders for Point Annotation Restoration in Object Counting(http://arxiv.org/abs/2312.07190)
Summary:
Object counting is a field of growing importance in domains such as security surveillance, urban planning, and biology. The annotation is usually provided in terms of 2D points. However, the complexity of object shapes and subjective of annotators may lead to annotation inconsistency, potentially confusing the model during training. To alleviate this issue, we introduce the Noised Autoencoders (NAE) methodology, which extracts general positional knowledge from all annotations. The method involves adding random offsets to initial point annotations, followed by a UNet to restore them to their original positions. Similar to MAE, NAE faces challenges in restoring non-generic points, necessitating reliance on the most common positions inferred from general knowledge. This reliance forms the cornerstone of our method's effectiveness. Different from existing noise-resistance methods, our approach focus on directly improving initial point annotations. Extensive experiments show that NAE yields more consistent annotations compared to the original ones, steadily enhancing the performance of advanced models trained with these revised annotations. \textbf{Remarkably, the proposed approach helps to set new records in nine datasets}. We will make the NAE codes and refined point annotations available.

Title: LLMs Perform Poorly at Concept Extraction in Cyber-security Research Literature. (arXiv:2312.07110v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07110
Code URL: null
Copy Paste: [[2312.07110]] LLMs Perform Poorly at Concept Extraction in Cyber-security Research Literature(http://arxiv.org/abs/2312.07110)
Summary:
The cybersecurity landscape evolves rapidly and poses threats to organizations. To enhance resilience, one needs to track the latest developments and trends in the domain. It has been demonstrated that standard bibliometrics approaches show their limits in such a fast-evolving domain. For this purpose, we use large language models (LLMs) to extract relevant knowledge entities from cybersecurity-related texts. We use a subset of arXiv preprints on cybersecurity as our data and compare different LLMs in terms of entity recognition (ER) and relevance. The results suggest that LLMs do not produce good knowledge entities that reflect the cybersecurity context, but our results show some potential for noun extractors. For this reason, we developed a noun extractor boosted with some statistical analysis to extract specific and relevant compound nouns from the domain. Later, we tested our model to identify trends in the LLM domain. We observe some limitations, but it offers promising results to monitor the evolution of emergent trends.

Title: Introduction to IoT. (arXiv:2312.06689v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.06689
Code URL: null
Copy Paste: [[2312.06689]] Introduction to IoT(http://arxiv.org/abs/2312.06689)
Summary:
The Internet of Things (IoT) has rapidly transformed the 21st century, enhancing decision-making processes and introducing innovative consumer services such as pay-as-you-use models. This integration of smart devices and automation technologies has revolutionized our lives. However, it is essential to recognize the significant concerns surrounding security, privacy, intellectual property rights, safety, and trust in this technological landscape, which require further exploration. This chapter serves as a comprehensive guide to newcomers interested in the IoT domain, providing a foundation for making future contributions. It begins by explaining the core concept of IoT, its historical evolution, network components, and key characteristics. The advantages of IoT deployment across various domains are discussed, along with an introduction to the foundational layered architectures supporting IoT. The chapter also delves into the taxonomy of IoT technologies, encompassing hardware, software, wireless communication technologies, hardware platforms, and cloud solutions. Existing applications in major domains like smart cities, healthcare, and agriculture are explored, offering a broad perspective on the IoT's application potential. In addressing prevalent issues and challenges in designing and deploying IoT applications, the chapter examines security threats across architectural layers, ethical considerations, user privacy concerns, and trust-related issues. This discussion equips researchers with a solid understanding of diverse IoT aspects, providing a strong foundation for both practical IoT projects and the development of novel theoretical approaches within the IoT field.

Title: On the Feasibility of Fingerprinting Collaborative Robot Traffic. (arXiv:2312.06802v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.06802
Code URL: null
Copy Paste: [[2312.06802]] On the Feasibility of Fingerprinting Collaborative Robot Traffic(http://arxiv.org/abs/2312.06802)
Summary:
This study examines privacy risks in collaborative robotics, focusing on the potential for traffic analysis in encrypted robot communications. While previous research has explored low-level command recovery, our work investigates high-level motion recovery from command message sequences. We evaluate the efficacy of traditional website fingerprinting techniques (k-FP, KNN, and CUMUL) and their limitations in accurately identifying robotic actions due to their inability to capture detailed temporal relationships. To address this, we introduce a traffic classification approach using signal processing techniques, demonstrating high accuracy in action identification and highlighting the vulnerability of encrypted communications to privacy breaches. Additionally, we explore defenses such as packet padding and timing manipulation, revealing the challenges in balancing traffic analysis resistance with network efficiency. Our findings emphasize the need for continued development of practical defenses in robotic privacy and security.

Title: Blockchain-Based Security Architecture for Unmanned Aerial Vehicles in B5G/6G Services and Beyond: A Comprehensive Approach. (arXiv:2312.06928v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.06928
Code URL: null
Copy Paste: [[2312.06928]] Blockchain-Based Security Architecture for Unmanned Aerial Vehicles in B5G/6G Services and Beyond: A Comprehensive Approach(http://arxiv.org/abs/2312.06928)
Summary:
Unmanned Aerial Vehicles (UAVs), previously favored by enthusiasts, have evolved into indispensable tools for effectively managing disasters and responding to emergencies. For example, one of their most critical applications is to provide seamless wireless communication services in remote rural areas. Thus, it is substantial to identify and consider the different security challenges in the research and development associated with advanced UAV-based B5G/6G architectures. Following this requirement, the present study thoroughly examines the security considerations about UAVs in relation to the architectural framework of the 5G/6G system, the technologies that facilitate its operation, and the concerns surrounding privacy. It exhibits security integration at all the protocol stack layers and analyzes the existing mechanisms to secure UAV-based B5G/6G communications and its energy and power optimization factors. Last, this article also summarizes modern technological trends for establishing security and protecting UAV-based systems, along with the open challenges and strategies for future research work.

Title: A new lightweight additive homomorphic encryption algorithm. (arXiv:2312.06987v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.06987
Code URL: null
Copy Paste: [[2312.06987]] A new lightweight additive homomorphic encryption algorithm(http://arxiv.org/abs/2312.06987)
Summary:
This article describes a lightweight additive homomorphic algorithm with the same encryption and decryption keys. Compared to standard additive homomorphic algorithms like Paillier, this algorithm reduces the computational cost of encryption and decryption from modular exponentiation to modular multiplication, and reduces the computational cost of ciphertext addition from modular multiplication to modular addition. This algorithm is based on a new mathematical problem: in two division operations, whether it is possible to infer the remainder or divisor based on the dividend when two remainders are related. Currently, it is not obvious how to break this problem, but further exploration is needed to determine if it is sufficiently difficult. In addition to this mathematical problem, we have also designed two interesting mathematical structures for decryption, which are used in the two algorithms mentioned in the main text. It is possible that the decryption structure of Algorithm 2 introduces new security vulnerabilities, but we have not investigated this issue thoroughly.

privacy

Title: Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning Against Attribute Inference Attacks. (arXiv:2312.06989v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.06989
Code URL: null
Copy Paste: [[2312.06989]] Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning Against Attribute Inference Attacks(http://arxiv.org/abs/2312.06989)
Summary:
Federated learning (FL) has been widely studied recently due to its property to collaboratively train data from different devices without sharing the raw data. Nevertheless, recent studies show that an adversary can still be possible to infer private information about devices' data, e.g., sensitive attributes such as income, race, and sexual orientation. To mitigate the attribute inference attacks, various existing privacy-preserving FL methods can be adopted/adapted. However, all these existing methods have key limitations: they need to know the FL task in advance, or have intolerable computational overheads or utility losses, or do not have provable privacy guarantees.

We address these issues and design a task-agnostic privacy-preserving presentation learning method for FL ({\bf TAPPFL}) against attribute inference attacks. TAPPFL is formulated via information theory. Specifically, TAPPFL has two mutual information goals, where one goal learns task-agnostic data representations that contain the least information about the private attribute in each device's data, and the other goal ensures the learnt data representations include as much information as possible about the device data to maintain FL utility. We also derive privacy guarantees of TAPPFL against worst-case attribute inference attacks, as well as the inherent tradeoff between utility preservation and privacy protection. Extensive results on multiple datasets and applications validate the effectiveness of TAPPFL to protect data privacy, maintain the FL utility, and be efficient as well. Experimental results also show that TAPPFL outperforms the existing defenses\footnote{Source code and full version: \url{https://github.com/TAPPFL}}.

Title: Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions. (arXiv:2312.07055v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.07055
Code URL: https://github.com/gericko/grouprandomizedresponse
Copy Paste: [[2312.07055]] Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions(http://arxiv.org/abs/2312.07055)
Summary:
We suggest the use of hash functions to cut down the communication costs when counting subgraphs under edge local differential privacy. While various algorithms exist for computing graph statistics, including the count of subgraphs, under the edge local differential privacy, many suffer with high communication costs, making them less efficient for large graphs. Though data compression is a typical approach in differential privacy, its application in local differential privacy requires a form of compression that every node can reproduce. In our study, we introduce linear congruence hashing. With a sampling rate of $s$, our method can cut communication costs by a factor of $s^2$, albeit at the cost of increasing variance in the published graph statistic by a factor of $s$. The experimental results indicate that, when matched for communication costs, our method achieves a reduction in the $\ell_2$-error for triangle counts by up to 1000 times compared to the performance of leading algorithms.

Title: Practical considerations on using private sampling for synthetic data. (arXiv:2312.07139v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.07139
Code URL: null
Copy Paste: [[2312.07139]] Practical considerations on using private sampling for synthetic data(http://arxiv.org/abs/2312.07139)
Summary:
Artificial intelligence and data access are already mainstream. One of the main challenges when designing an artificial intelligence or disclosing content from a database is preserving the privacy of individuals who participate in the process. Differential privacy for synthetic data generation has received much attention due to the ability of preserving privacy while freely using the synthetic data. Private sampling is the first noise-free method to construct differentially private synthetic data with rigorous bounds for privacy and accuracy. However, this synthetic data generation method comes with constraints which seem unrealistic and not applicable for real-world datasets. In this paper, we provide an implementation of the private sampling algorithm and discuss the realism of its constraints in practical cases.

Title: Privacy-Aware Energy Consumption Modeling of Connected Battery Electric Vehicles using Federated Learning. (arXiv:2312.07371v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07371
Code URL: null
Copy Paste: [[2312.07371]] Privacy-Aware Energy Consumption Modeling of Connected Battery Electric Vehicles using Federated Learning(http://arxiv.org/abs/2312.07371)
Summary:
Battery Electric Vehicles (BEVs) are increasingly significant in modern cities due to their potential to reduce air pollution. Precise and real-time estimation of energy consumption for them is imperative for effective itinerary planning and optimizing vehicle systems, which can reduce driving range anxiety and decrease energy costs. As public awareness of data privacy increases, adopting approaches that safeguard data privacy in the context of BEV energy consumption modeling is crucial. Federated Learning (FL) is a promising solution mitigating the risk of exposing sensitive information to third parties by allowing local data to remain on devices and only sharing model updates with a central server. Our work investigates the potential of using FL methods, such as FedAvg, and FedPer, to improve BEV energy consumption prediction while maintaining user privacy. We conducted experiments using data from 10 BEVs under simulated real-world driving conditions. Our results demonstrate that the FedAvg-LSTM model achieved a reduction of up to 67.84\% in the MAE value of the prediction results. Furthermore, we explored various real-world scenarios and discussed how FL methods can be employed in those cases. Our findings show that FL methods can effectively improve the performance of BEV energy consumption prediction while maintaining user privacy.

protect

defense

Title: EdgePruner: Poisoned Edge Pruning in Graph Contrastive Learning. (arXiv:2312.07022v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2312.07022
Code URL: null
Copy Paste: [[2312.07022]] EdgePruner: Poisoned Edge Pruning in Graph Contrastive Learning(http://arxiv.org/abs/2312.07022)
Summary:
Graph Contrastive Learning (GCL) is unsupervised graph representation learning that can obtain useful representation of unknown nodes. The node representation can be utilized as features of downstream tasks. However, GCL is vulnerable to poisoning attacks as with existing learning models. A state-of-the-art defense cannot sufficiently negate adverse effects by poisoned graphs although such a defense introduces adversarial training in the GCL. To achieve further improvement, pruning adversarial edges is important. To the best of our knowledge, the feasibility remains unexplored in the GCL domain. In this paper, we propose a simple defense for GCL, EdgePruner. We focus on the fact that the state-of-the-art poisoning attack on GCL tends to mainly add adversarial edges to create poisoned graphs, which means that pruning edges is important to sanitize the graphs. Thus, EdgePruner prunes edges that contribute to minimizing the contrastive loss based on the node representation obtained after training on poisoned graphs by GCL. Furthermore, we focus on the fact that nodes with distinct features are connected by adversarial edges in poisoned graphs. Thus, we introduce feature similarity between neighboring nodes to help more appropriately determine adversarial edges. This similarity is helpful in further eliminating adverse effects from poisoned graphs on various datasets. Finally, EdgePruner outputs a graph that yields the minimum contrastive loss as the sanitized graph. Our results demonstrate that pruning adversarial edges is feasible on six datasets. EdgePruner can improve the accuracy of node classification under the attack by up to 5.55% compared with that of the state-of-the-art defense. Moreover, we show that EdgePruner is immune to an adaptive attack.

attack

Title: Attacking the Loop: Adversarial Attacks on Graph-based Loop Closure Detection. (arXiv:2312.06991v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06991
Code URL: null
Copy Paste: [[2312.06991]] Attacking the Loop: Adversarial Attacks on Graph-based Loop Closure Detection(http://arxiv.org/abs/2312.06991)
Summary:
With the advancement in robotics, it is becoming increasingly common for large factories and warehouses to incorporate visual SLAM (vSLAM) enabled automated robots that operate closely next to humans. This makes any adversarial attacks on vSLAM components potentially detrimental to humans working alongside them. Loop Closure Detection (LCD) is a crucial component in vSLAM that minimizes the accumulation of drift in mapping, since even a small drift can accumulate into a significant drift over time. A prior work by Kim et al., SymbioLCD2, unified visual features and semantic objects into a single graph structure for finding loop closure candidates. While this provided a performance improvement over visual feature-based LCD, it also created a single point of vulnerability for potential graph-based adversarial attacks. Unlike previously reported visual-patch based attacks, small graph perturbations are far more challenging to detect, making them a more significant threat. In this paper, we present Adversarial-LCD, a novel black-box evasion attack framework that employs an eigencentrality-based perturbation method and an SVM-RBF surrogate model with a Weisfeiler-Lehman feature extractor for attacking graph-based LCD. Our evaluation shows that the attack performance of Adversarial-LCD with the SVM-RBF surrogate model was superior to that of other machine learning surrogate algorithms, including SVM-linear, SVM-polynomial, and Bayesian classifier, demonstrating the effectiveness of our attack framework. Furthermore, we show that our eigencentrality-based perturbation method outperforms other algorithms, such as Random-walk and Shortest-path, highlighting the efficiency of Adversarial-LCD's perturbation selection method.

Title: DTA: Distribution Transform-based Attack for Query-Limited Scenario. (arXiv:2312.07245v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07245
Code URL: null
Copy Paste: [[2312.07245]] DTA: Distribution Transform-based Attack for Query-Limited Scenario(http://arxiv.org/abs/2312.07245)
Summary:
In generating adversarial examples, the conventional black-box attack methods rely on sufficient feedback from the to-be-attacked models by repeatedly querying until the attack is successful, which usually results in thousands of trials during an attack. This may be unacceptable in real applications since Machine Learning as a Service Platform (MLaaS) usually only returns the final result (i.e., hard-label) to the client and a system equipped with certain defense mechanisms could easily detect malicious queries. By contrast, a feasible way is a hard-label attack that simulates an attacked action being permitted to conduct a limited number of queries. To implement this idea, in this paper, we bypass the dependency on the to-be-attacked model and benefit from the characteristics of the distributions of adversarial examples to reformulate the attack problem in a distribution transform manner and propose a distribution transform-based attack (DTA). DTA builds a statistical mapping from the benign example to its adversarial counterparts by tackling the conditional likelihood under the hard-label black-box settings. In this way, it is no longer necessary to query the target model frequently. A well-trained DTA model can directly and efficiently generate a batch of adversarial examples for a certain input, which can be used to attack un-seen models based on the assumed transferability. Furthermore, we surprisingly find that the well-trained DTA model is not sensitive to the semantic spaces of the training dataset, meaning that the model yields acceptable attack performance on other datasets. Extensive experiments validate the effectiveness of the proposed idea and the superiority of DTA over the state-of-the-art.

Title: SSTA: Salient Spatially Transformed Attack. (arXiv:2312.07258v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07258
Code URL: null
Copy Paste: [[2312.07258]] SSTA: Salient Spatially Transformed Attack(http://arxiv.org/abs/2312.07258)
Summary:
Extensive studies have demonstrated that deep neural networks (DNNs) are vulnerable to adversarial attacks, which brings a huge security risk to the further application of DNNs, especially for the AI models developed in the real world. Despite the significant progress that has been made recently, existing attack methods still suffer from the unsatisfactory performance of escaping from being detected by naked human eyes due to the formulation of adversarial example (AE) heavily relying on a noise-adding manner. Such mentioned challenges will significantly increase the risk of exposure and result in an attack to be failed. Therefore, in this paper, we propose the Salient Spatially Transformed Attack (SSTA), a novel framework to craft imperceptible AEs, which enhance the stealthiness of AEs by estimating a smooth spatial transform metric on a most critical area to generate AEs instead of adding external noise to the whole image. Compared to state-of-the-art baselines, extensive experiments indicated that SSTA could effectively improve the imperceptibility of the AEs while maintaining a 100\% attack success rate.

Title: Eroding Trust In Aerial Imagery: Comprehensive Analysis and Evaluation Of Adversarial Attacks In Geospatial Systems. (arXiv:2312.07389v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07389
Code URL: null
Copy Paste: [[2312.07389]] Eroding Trust In Aerial Imagery: Comprehensive Analysis and Evaluation Of Adversarial Attacks In Geospatial Systems(http://arxiv.org/abs/2312.07389)
Summary:
In critical operations where aerial imagery plays an essential role, the integrity and trustworthiness of data are paramount. The emergence of adversarial attacks, particularly those that exploit control over labels or employ physically feasible trojans, threatens to erode that trust, making the analysis and mitigation of these attacks a matter of urgency. We demonstrate how adversarial attacks can degrade confidence in geospatial systems, specifically focusing on scenarios where the attacker's control over labels is restricted and the use of realistic threat vectors. Proposing and evaluating several innovative attack methodologies, including those tailored to overhead images, we empirically show their threat to remote sensing systems using high-quality SpaceNet datasets. Our experimentation reflects the unique challenges posed by aerial imagery, and these preliminary results not only reveal the potential risks but also highlight the non-trivial nature of the problem compared to recent works.

Title: Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack. (arXiv:2312.06924v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.06924
Code URL: null
Copy Paste: [[2312.06924]] Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack(http://arxiv.org/abs/2312.06924)
Summary:
Recent developments in balancing the usefulness and safety of Large Language Models (LLMs) have raised a critical question: Are mainstream NLP tasks adequately aligned with safety consideration? Our study, focusing on safety-sensitive documents obtained through adversarial attacks, reveals significant disparities in the safety alignment of various NLP tasks. For instance, LLMs can effectively summarize malicious long documents but often refuse to translate them. This discrepancy highlights a previously unidentified vulnerability: attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integraty of tasks traditionally deemed more robust, such as translation and question-answering (QA). Moreover, the concurrent use of multiple NLP tasks with lesser safety alignment increases the risk of LLMs inadvertently processing harmful content. We demonstrate these vulnerabilities in various safety-aligned LLMs, particularly Llama2 models and GPT-4, indicating an urgent need for strengthening safety alignments across a broad spectrum of NLP tasks.

Title: Adversarial Estimation of Topological Dimension with Harmonic Score Maps. (arXiv:2312.06869v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06869
Code URL: null
Copy Paste: [[2312.06869]] Adversarial Estimation of Topological Dimension with Harmonic Score Maps(http://arxiv.org/abs/2312.06869)
Summary:
Quantification of the number of variables needed to locally explain complex data is often the first step to better understanding it. Existing techniques from intrinsic dimension estimation leverage statistical models to glean this information from samples within a neighborhood. However, existing methods often rely on well-picked hyperparameters and ample data as manifold dimension and curvature increases. Leveraging insight into the fixed point of the score matching objective as the score map is regularized by its Dirichlet energy, we show that it is possible to retrieve the topological dimension of the manifold learned by the score map. We then introduce a novel method to measure the learned manifold's topological dimension (i.e., local intrinsic dimension) using adversarial attacks, thereby generating useful interpretations of the learned manifold.

robust

Title: SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction. (arXiv:2312.06704v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06704
Code URL: null
Copy Paste: [[2312.06704]] SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction(http://arxiv.org/abs/2312.06704)
Summary:
Creating high-quality 3D models of clothed humans from single images for real-world applications is crucial. Despite recent advancements, accurately reconstructing humans in complex poses or with loose clothing from in-the-wild images, along with predicting textures for unseen areas, remains a significant challenge. A key limitation of previous methods is their insufficient prior guidance in transitioning from 2D to 3D and in texture prediction. In response, we introduce SIFU (Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction), a novel approach combining a Side-view Decoupling Transformer with a 3D Consistent Texture Refinement pipeline.SIFU employs a cross-attention mechanism within the transformer, using SMPL-X normals as queries to effectively decouple side-view features in the process of mapping 2D features to 3D. This method not only improves the precision of the 3D models but also their robustness, especially when SMPL-X estimates are not perfect. Our texture refinement process leverages text-to-image diffusion-based prior to generate realistic and consistent textures for invisible views. Through extensive experiments, SIFU surpasses SOTA methods in both geometry and texture reconstruction, showcasing enhanced robustness in complex scenarios and achieving an unprecedented Chamfer and P2S measurement. Our approach extends to practical applications such as 3D printing and scene building, demonstrating its broad utility in real-world scenarios. Project page https://river-zhang.github.io/SIFU-projectpage/ .

Title: Gaussian Splatting SLAM. (arXiv:2312.06741v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06741
Code URL: null
Copy Paste: [[2312.06741]] Gaussian Splatting SLAM(http://arxiv.org/abs/2312.06741)
Summary:
We present the first application of 3D Gaussian Splatting to incremental 3D reconstruction using a single moving monocular or RGB-D camera. Our Simultaneous Localisation and Mapping (SLAM) method, which runs live at 3fps, utilises Gaussians as the only 3D representation, unifying the required representation for accurate, efficient tracking, mapping, and high-quality rendering. Several innovations are required to continuously reconstruct 3D scenes with high fidelity from a live camera. First, to move beyond the original 3DGS algorithm, which requires accurate poses from an offline Structure from Motion (SfM) system, we formulate camera tracking for 3DGS using direct optimisation against the 3D Gaussians, and show that this enables fast and robust tracking with a wide basin of convergence. Second, by utilising the explicit nature of the Gaussians, we introduce geometric verification and regularisation to handle the ambiguities occurring in incremental 3D dense reconstruction. Finally, we introduce a full SLAM system which not only achieves state-of-the-art results in novel view synthesis and trajectory estimation, but also reconstruction of tiny and even transparent objects.

Title: Honeybee: Locality-enhanced Projector for Multimodal LLM. (arXiv:2312.06742v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06742
Code URL: https://github.com/kakaobrain/honeybee
Copy Paste: [[2312.06742]] Honeybee: Locality-enhanced Projector for Multimodal LLM(http://arxiv.org/abs/2312.06742)
Summary:
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual understanding while harnessing the LLMs' robust capabilities. Despite the importance of the visual projector, it has been relatively less explored. In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, crucial for MLLMs' overall efficiency, and (ii) preservation of local context from visual features, vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying the two desirable properties. Additionally, we present comprehensive strategies to effectively utilize multiple and multifaceted instruction datasets. Through extensive experiments, we examine the impact of individual design choices. Finally, our proposed MLLM, Honeybee, remarkably outperforms previous state-of-the-art methods across various benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench, achieving significantly higher efficiency. Code and models are available at https://github.com/kakaobrain/honeybee.

Title: Improving the Robustness of 3D Human Pose Estimation: A Benchmark and Learning from Noisy Input. (arXiv:2312.06797v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06797
Code URL: null
Copy Paste: [[2312.06797]] Improving the Robustness of 3D Human Pose Estimation: A Benchmark and Learning from Noisy Input(http://arxiv.org/abs/2312.06797)
Summary:
Despite the promising performance of current 3D human pose estimation techniques, understanding and enhancing their generalization on challenging in-the-wild videos remain an open problem. In this work, we focus on the robustness of 2D-to-3D pose lifters. To this end, we develop two benchmark datasets, namely Human3.6M-C and HumanEva-I-C, to examine the robustness of video-based 3D pose lifters to a wide range of common video corruptions including temporary occlusion, motion blur, and pixel-level noise. We observe the poor generalization of state-of-the-art 3D pose lifters in the presence of corruption and establish two techniques to tackle this issue. First, we introduce Temporal Additive Gaussian Noise (TAGN) as a simple yet effective 2D input pose data augmentation. Additionally, to incorporate the confidence scores output by the 2D pose detectors, we design a confidence-aware convolution (CA-Conv) block. Extensively tested on corrupted videos, the proposed strategies consistently boost the robustness of 3D pose lifters and serve as new baselines for future research.

Title: ADOD: Adaptive Domain-Aware Object Detection with Residual Attention for Underwater Environments. (arXiv:2312.06801v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06801
Code URL: null
Copy Paste: [[2312.06801]] ADOD: Adaptive Domain-Aware Object Detection with Residual Attention for Underwater Environments(http://arxiv.org/abs/2312.06801)
Summary:
This research presents ADOD, a novel approach to address domain generalization in underwater object detection. Our method enhances the model's ability to generalize across diverse and unseen domains, ensuring robustness in various underwater environments. The first key contribution is Residual Attention YOLOv3, a novel variant of the YOLOv3 framework empowered by residual attention modules. These modules enable the model to focus on informative features while suppressing background noise, leading to improved detection accuracy and adaptability to different domains. The second contribution is the attention-based domain classification module, vital during training. This module helps the model identify domain-specific information, facilitating the learning of domain-invariant features. Consequently, ADOD can generalize effectively to underwater environments with distinct visual characteristics. Extensive experiments on diverse underwater datasets demonstrate ADOD's superior performance compared to state-of-the-art domain generalization methods, particularly in challenging scenarios. The proposed model achieves exceptional detection performance in both seen and unseen domains, showcasing its effectiveness in handling domain shifts in underwater object detection tasks. ADOD represents a significant advancement in adaptive object detection, providing a promising solution for real-world applications in underwater environments. With the prevalence of domain shifts in such settings, the model's strong generalization ability becomes a valuable asset for practical underwater surveillance and marine research endeavors.

Title: Encoding Surgical Videos as Latent Spatiotemporal Graphs for Object and Anatomy-Driven Reasoning. (arXiv:2312.06829v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06829
Code URL: https://github.com/camma-public/surglatentgraph
Copy Paste: [[2312.06829]] Encoding Surgical Videos as Latent Spatiotemporal Graphs for Object and Anatomy-Driven Reasoning(http://arxiv.org/abs/2312.06829)
Summary:
Recently, spatiotemporal graphs have emerged as a concise and elegant manner of representing video clips in an object-centric fashion, and have shown to be useful for downstream tasks such as action recognition. In this work, we investigate the use of latent spatiotemporal graphs to represent a surgical video in terms of the constituent anatomical structures and tools and their evolving properties over time. To build the graphs, we first predict frame-wise graphs using a pre-trained model, then add temporal edges between nodes based on spatial coherence and visual and semantic similarity. Unlike previous approaches, we incorporate long-term temporal edges in our graphs to better model the evolution of the surgical scene and increase robustness to temporary occlusions. We also introduce a novel graph-editing module that incorporates prior knowledge and temporal coherence to correct errors in the graph, enabling improved downstream task performance. Using our graph representations, we evaluate two downstream tasks, critical view of safety prediction and surgical phase recognition, obtaining strong results that demonstrate the quality and flexibility of the learned representations. Code is available at github.com/CAMMA-public/SurgLatentGraph.

Title: Exploring Novel Object Recognition and Spontaneous Location Recognition Machine Learning Analysis Techniques in Alzheimer's Mice. (arXiv:2312.06914v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06914
Code URL: https://github.com/bafanas/dlc-object-recognition-analysis
Copy Paste: [[2312.06914]] Exploring Novel Object Recognition and Spontaneous Location Recognition Machine Learning Analysis Techniques in Alzheimer's Mice(http://arxiv.org/abs/2312.06914)
Summary:
Understanding object recognition patterns in mice is crucial for advancing behavioral neuroscience and has significant implications for human health, particularly in the realm of Alzheimer's research. This study is centered on the development, application, and evaluation of a state-of-the-art computational pipeline designed to analyze such behaviors, specifically focusing on Novel Object Recognition (NOR) and Spontaneous Location Recognition (SLR) tasks. The pipeline integrates three advanced computational models: Any-Maze for initial data collection, DeepLabCut for detailed pose estimation, and Convolutional Neural Networks (CNNs) for nuanced behavioral classification. Employed across four distinct mouse groups, this pipeline demonstrated high levels of accuracy and robustness. Despite certain challenges like video quality limitations and the need for manual calculations, the results affirm the pipeline's efficacy and potential for scalability. The study serves as a proof of concept for a multidimensional computational approach to behavioral neuroscience, emphasizing the pipeline's versatility and readiness for future, more complex analyses.

Title: Adjustable Robust Transformer for High Myopia Screening in Optical Coherence Tomography. (arXiv:2312.07052v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07052
Code URL: https://github.com/maxiao0234/artran
Copy Paste: [[2312.07052]] Adjustable Robust Transformer for High Myopia Screening in Optical Coherence Tomography(http://arxiv.org/abs/2312.07052)
Summary:
Myopia is a manifestation of visual impairment caused by an excessively elongated eyeball. Image data is critical material for studying high myopia and pathological myopia. Measurements of spherical equivalent and axial length are the gold standards for identifying high myopia, but the available image data for matching them is scarce. In addition, the criteria for defining high myopia vary from study to study, and therefore the inclusion of samples in automated screening efforts requires an appropriate assessment of interpretability. In this work, we propose a model called adjustable robust transformer (ARTran) for high myopia screening of optical coherence tomography (OCT) data. Based on vision transformer, we propose anisotropic patch embedding (APE) to capture more discriminative features of high myopia. To make the model effective under variable screening conditions, we propose an adjustable class embedding (ACE) to replace the fixed class token, which changes the output to adapt to different conditions. Considering the confusion of the data at high myopia and low myopia threshold, we introduce the label noise learning strategy and propose a shifted subspace transition matrix (SST) to enhance the robustness of the model. Besides, combining the two structures proposed above, the model can provide evidence for uncertainty evaluation. The experimental results demonstrate the effectiveness and reliability of the proposed method. Code is available at: https://github.com/maxiao0234/ARTran.

Title: Collapse-Oriented Adversarial Training with Triplet Decoupling for Robust Image Retrieval. (arXiv:2312.07364v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07364
Code URL: null
Copy Paste: [[2312.07364]] Collapse-Oriented Adversarial Training with Triplet Decoupling for Robust Image Retrieval(http://arxiv.org/abs/2312.07364)
Summary:
Adversarial training has achieved substantial performance in defending image retrieval systems against adversarial examples. However, existing studies still suffer from two major limitations: model collapse and weak adversary. This paper addresses these two limitations by proposing collapse-oriented (COLO) adversarial training with triplet decoupling (TRIDE). Specifically, COLO prevents model collapse by temporally orienting the perturbation update direction with a new collapse metric, while TRIDE yields a strong adversary by spatially decoupling the update targets of perturbation into the anchor and the two candidates of a triplet. Experimental results demonstrate that our COLO-TRIDE outperforms the current state of the art by 7% on average over 10 robustness metrics and across 3 popular datasets. In addition, we identify the fairness limitations of commonly used robustness metrics in image retrieval and propose a new metric for more meaningful robustness evaluation. Codes will be made publicly available on GitHub.

Title: Unsupervised Temporal Action Localization via Self-paced Incremental Learning. (arXiv:2312.07384v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07384
Code URL: null
Copy Paste: [[2312.07384]] Unsupervised Temporal Action Localization via Self-paced Incremental Learning(http://arxiv.org/abs/2312.07384)
Summary:
Recently, temporal action localization (TAL) has garnered significant interest in information retrieval community. However, existing supervised/weakly supervised methods are heavily dependent on extensive labeled temporal boundaries and action categories, which is labor-intensive and time-consuming. Although some unsupervised methods have utilized the ``iteratively clustering and localization'' paradigm for TAL, they still suffer from two pivotal impediments: 1) unsatisfactory video clustering confidence, and 2) unreliable video pseudolabels for model training. To address these limitations, we present a novel self-paced incremental learning model to enhance clustering and localization training simultaneously, thereby facilitating more effective unsupervised TAL. Concretely, we improve the clustering confidence through exploring the contextual feature-robust visual information. Thereafter, we design two (constant- and variable- speed) incremental instance learning strategies for easy-to-hard model training, thus ensuring the reliability of these video pseudolabels and further improving overall localization performance. Extensive experiments on two public datasets have substantiated the superiority of our model over several state-of-the-art competitors.

Title: A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames. (arXiv:2312.07395v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07395
Code URL: null
Copy Paste: [[2312.07395]] A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames(http://arxiv.org/abs/2312.07395)
Summary:
Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion. However, we expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. Surprisingly, simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our simple approach for training long video-to-text models, which scales to 1B parameters, does not add new architectural complexity and is able to outperform the popular paradigm of using much larger LLMs as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2, EgoSchema).

Title: How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation. (arXiv:2312.07424v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07424
Code URL: null
Copy Paste: [[2312.07424]] How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation(http://arxiv.org/abs/2312.07424)
Summary:
In machine learning, generalization against distribution shifts -- where deployment conditions diverge from the training scenarios -- is crucial, particularly in fields like climate modeling, biomedicine, and autonomous driving. The emergence of foundation models, distinguished by their extensive pretraining and task versatility, has led to an increased interest in their adaptability to distribution shifts. GPT-4V(ision) acts as the most advanced publicly accessible multimodal foundation model, with extensive applications across various domains, including anomaly detection, video understanding, image generation, and medical diagnosis. However, its robustness against data distributions remains largely underexplored. Addressing this gap, this study rigorously evaluates GPT-4V's adaptability and generalization capabilities in dynamic environments, benchmarking against prominent models like CLIP and LLaVA. We delve into GPT-4V's zero-shot generalization across 13 diverse datasets spanning natural, medical, and molecular domains. We further investigate its adaptability to controlled data perturbations and examine the efficacy of in-context learning as a tool to enhance its adaptation. Our findings delineate GPT-4V's capability boundaries in distribution shifts, shedding light on its strengths and limitations across various scenarios. Importantly, this investigation contributes to our understanding of how AI foundation models generalize to distribution shifts, offering pivotal insights into their adaptability and robustness. Code is publicly available at https://github.com/jameszhou-gl/gpt-4v-distribution-shift.

Title: Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval. (arXiv:2312.07435v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07435
Code URL: null
Copy Paste: [[2312.07435]] Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval(http://arxiv.org/abs/2312.07435)
Summary:
Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities. Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences. We question whether the same problem also exists in the video-text domain with an auxiliary need to preserve both spatial and temporal information. Thus, we evaluate a recently proposed solution involving the addition of an asymmetric co-attention network for video grounding tasks. Additionally, we incorporate momentum contrastive loss for robust, discriminative representation learning in both modalities. We note that the integration of these supplementary modules yields better performance compared to state-of-the-art models on the TACoS dataset and comparable results on ActivityNet Captions, all while utilizing significantly fewer parameters with respect to baseline.

Title: Efficient Object Detection in Autonomous Driving using Spiking Neural Networks: Performance, Energy Consumption Analysis, and Insights into Open-set Object Discovery. (arXiv:2312.07466v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07466
Code URL: https://github.com/aitor-martinez-seras/snn-automotive-object-detection
Copy Paste: [[2312.07466]] Efficient Object Detection in Autonomous Driving using Spiking Neural Networks: Performance, Energy Consumption Analysis, and Insights into Open-set Object Discovery(http://arxiv.org/abs/2312.07466)
Summary:
Besides performance, efficiency is a key design driver of technologies supporting vehicular perception. Indeed, a well-balanced trade-off between performance and energy consumption is crucial for the sustainability of autonomous vehicles. In this context, the diversity of real-world contexts in which autonomous vehicles can operate motivates the need for empowering perception models with the capability to detect, characterize and identify newly appearing objects by themselves. In this manuscript we elaborate on this threefold conundrum (performance, efficiency and open-world learning) for object detection modeling tasks over image data collected from vehicular scenarios. Specifically, we show that well-performing and efficient models can be realized by virtue of Spiking Neural Networks (SNNs), reaching competitive levels of detection performance when compared to their non-spiking counterparts at dramatic energy consumption savings (up to 85%) and a slightly improved robustness against image noise. Our experiments herein offered also expose qualitatively the complexity of detecting new objects based on the preliminary results of a simple approach to discriminate potential object proposals in the captured image.

Title: NearbyPatchCL: Leveraging Nearby Patches for Self-Supervised Patch-Level Multi-Class Classification in Whole-Slide Images. (arXiv:2312.07489v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07489
Code URL: null
Copy Paste: [[2312.07489]] NearbyPatchCL: Leveraging Nearby Patches for Self-Supervised Patch-Level Multi-Class Classification in Whole-Slide Images(http://arxiv.org/abs/2312.07489)
Summary:
Whole-slide image (WSI) analysis plays a crucial role in cancer diagnosis and treatment. In addressing the demands of this critical task, self-supervised learning (SSL) methods have emerged as a valuable resource, leveraging their efficiency in circumventing the need for a large number of annotations, which can be both costly and time-consuming to deploy supervised methods. Nevertheless, patch-wise representation may exhibit instability in performance, primarily due to class imbalances stemming from patch selection within WSIs. In this paper, we introduce Nearby Patch Contrastive Learning (NearbyPatchCL), a novel self-supervised learning method that leverages nearby patches as positive samples and a decoupled contrastive loss for robust representation learning. Our method demonstrates a tangible enhancement in performance for downstream tasks involving patch-level multi-class classification. Additionally, we curate a new dataset derived from WSIs sourced from the Canine Cutaneous Cancer Histology, thus establishing a benchmark for the rigorous evaluation of patch-level multi-class classification methodologies. Intensive experiments show that our method significantly outperforms the supervised baseline and state-of-the-art SSL methods with top-1 classification accuracy of 87.56%. Our method also achieves comparable results while utilizing a mere 1% of labeled data, a stark contrast to the 100% labeled data requirement of other approaches. Source code: https://github.com/nvtien457/NearbyPatchCL

Title: Intelligent Virtual Assistants with LLM-based Process Automation. (arXiv:2312.06677v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06677
Code URL: null
Copy Paste: [[2312.06677]] Intelligent Virtual Assistants with LLM-based Process Automation(http://arxiv.org/abs/2312.06677)
Summary:
While intelligent virtual assistants like Siri, Alexa, and Google Assistant have become ubiquitous in modern life, they still face limitations in their ability to follow multi-step instructions and accomplish complex goals articulated in natural language. However, recent breakthroughs in large language models (LLMs) show promise for overcoming existing barriers by enhancing natural language processing and reasoning capabilities. Though promising, applying LLMs to create more advanced virtual assistants still faces challenges like ensuring robust performance and handling variability in real-world user commands. This paper proposes a novel LLM-based virtual assistant that can automatically perform multi-step operations within mobile apps based on high-level user requests. The system represents an advance in assistants by providing an end-to-end solution for parsing instructions, reasoning about goals, and executing actions. LLM-based Process Automation (LLMPA) has modules for decomposing instructions, generating descriptions, detecting interface elements, predicting next actions, and error checking. Experiments demonstrate the system completing complex mobile operation tasks in Alipay based on natural language instructions. This showcases how large language models can enable automated assistants to accomplish real-world tasks. The main contributions are the novel LLMPA architecture optimized for app process automation, the methodology for applying LLMs to mobile apps, and demonstrations of multi-step task completion in a real-world environment. Notably, this work represents the first real-world deployment and extensive evaluation of a large language model-based virtual assistant in a widely used mobile application with an enormous user base numbering in the hundreds of millions.

Title: Dynamic Corrective Self-Distillation for Better Fine-Tuning of Pretrained Models. (arXiv:2312.07028v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07028
Code URL: null
Copy Paste: [[2312.07028]] Dynamic Corrective Self-Distillation for Better Fine-Tuning of Pretrained Models(http://arxiv.org/abs/2312.07028)
Summary:
We tackle the challenging issue of aggressive fine-tuning encountered during the process of transfer learning of pre-trained language models (PLMs) with limited labeled downstream data. This problem primarily results in a decline in performance on the subsequent task. Inspired by the adaptive boosting method in traditional machine learning, we present an effective dynamic corrective self-distillation (DCS) approach to improve the fine-tuning of the PLMs. Our technique involves performing a self-distillation mechanism where, at each iteration, the student model actively adapts and corrects itself by dynamically adjusting the weights assigned to individual data points. This iterative self-correcting process significantly enhances the overall fine-tuning capability of PLMs, leading to improved performance and robustness. We conducted comprehensive evaluations using the GLUE benchmark demonstrating the efficacy of our method in enhancing the fine-tuning process for various PLMs across diverse downstream tasks.

Title: Predictive variational autoencoder for learning robust representations of time-series data. (arXiv:2312.06932v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06932
Code URL: null
Copy Paste: [[2312.06932]] Predictive variational autoencoder for learning robust representations of time-series data(http://arxiv.org/abs/2312.06932)
Summary:
Variational autoencoders (VAEs) have been used extensively to discover low-dimensional latent factors governing neural activity and animal behavior. However, without careful model selection, the uncovered latent factors may reflect noise in the data rather than true underlying features, rendering such representations unsuitable for scientific interpretation. Existing solutions to this problem involve introducing additional measured variables or data augmentations specific to a particular data type. We propose a VAE architecture that predicts the next point in time and show that it mitigates the learning of spurious features. In addition, we introduce a model selection metric based on smoothness over time in the latent space. We show that together these two constraints on VAEs to be smooth over time produce robust latent representations and faithfully recover latent factors on synthetic datasets.

Title: AI Control: Improving Safety Despite Intentional Subversion. (arXiv:2312.06942v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06942
Code URL: null
Copy Paste: [[2312.06942]] AI Control: Improving Safety Despite Intentional Subversion(http://arxiv.org/abs/2312.06942)
Summary:
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion.

We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to human contractors who provide high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases.

We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.

Title: Toward Robustness in Multi-label Classification: A Data Augmentation Strategy against Imbalance and Noise. (arXiv:2312.07087v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07087
Code URL: https://github.com/disl-lab/balancemix
Copy Paste: [[2312.07087]] Toward Robustness in Multi-label Classification: A Data Augmentation Strategy against Imbalance and Noise(http://arxiv.org/abs/2312.07087)
Summary:
Multi-label classification poses challenges due to imbalanced and noisy labels in training data. We propose a unified data augmentation method, named BalanceMix, to address these challenges. Our approach includes two samplers for imbalanced labels, generating minority-augmented instances with high diversity. It also refines multi-labels at the label-wise granularity, categorizing noisy labels as clean, re-labeled, or ambiguous for robust optimization. Extensive experiments on three benchmark datasets demonstrate that BalanceMix outperforms existing state-of-the-art methods. We release the code at https://github.com/DISL-Lab/BalanceMix.

Title: Analyze the Robustness of Classifiers under Label Noise. (arXiv:2312.07271v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07271
Code URL: null
Copy Paste: [[2312.07271]] Analyze the Robustness of Classifiers under Label Noise(http://arxiv.org/abs/2312.07271)
Summary:
This study explores the robustness of label noise classifiers, aiming to enhance model resilience against noisy data in complex real-world scenarios. Label noise in supervised learning, characterized by erroneous or imprecise labels, significantly impairs model performance. This research focuses on the increasingly pertinent issue of label noise's impact on practical applications. Addressing the prevalent challenge of inaccurate training data labels, we integrate adversarial machine learning (AML) and importance reweighting techniques. Our approach involves employing convolutional neural networks (CNN) as the foundational model, with an emphasis on parameter adjustment for individual training samples. This strategy is designed to heighten the model's focus on samples critically influencing performance.

Title: Safe Multi-Task Bayesian Optimization. (arXiv:2312.07281v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07281
Code URL: https://github.com/tuhh-ics/2024-code-l4dc-safe-multi-task-bayesian-optimization
Copy Paste: [[2312.07281]] Safe Multi-Task Bayesian Optimization(http://arxiv.org/abs/2312.07281)
Summary:
Bayesian optimization has become a powerful tool for safe online optimization of systems, due to its high sample efficiency and noise robustness. For further speed-up reduced physical models of the system can be incorporated into the optimization to accelerate the process, since the models are able to offer an approximation of the actual system, and sampling from them is significantly cheaper. The similarity between model and reality is represented by additional hyperparameters and learned within the optimization process. Safety is an important criteria for online optimization methods like Bayesian optimization, which has been addressed by recent literature, which provide safety guarantees under the assumption of known hyperparameters. However, in practice this is not applicable. Therefore, we extend the robust Gaussian process uniform error bounds to meet the multi-task setting, which involves the calculation of a confidence region from the hyperparameter posterior distribution utilizing Markov chain Monte Carlo methods. Then, using the robust safety bounds, Bayesian optimization is applied to safely optimize the system while incorporating measurements of the models. Simulations show that the optimization can be significantly accelerated compared to other state-of-the-art safe Bayesian optimization methods depending on the fidelity of the models.

Title: ReRoGCRL: Representation-based Robustness in Goal-Conditioned Reinforcement Learning. (arXiv:2312.07392v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07392
Code URL: null
Copy Paste: [[2312.07392]] ReRoGCRL: Representation-based Robustness in Goal-Conditioned Reinforcement Learning(http://arxiv.org/abs/2312.07392)
Summary:
While Goal-Conditioned Reinforcement Learning (GCRL) has gained attention, its algorithmic robustness, particularly against adversarial perturbations, remains unexplored. Unfortunately, the attacks and robust representation training methods specifically designed for traditional RL are not so effective when applied to GCRL. To address this challenge, we propose the \textit{Semi-Contrastive Representation} attack, a novel approach inspired by the adversarial contrastive attack. Unlike existing attacks in RL, it only necessitates information from the policy function and can be seamlessly implemented during deployment. Furthermore, to mitigate the vulnerability of existing GCRL algorithms, we introduce \textit{Adversarial Representation Tactics}. This strategy combines \textit{Semi-Contrastive Adversarial Augmentation} with \textit{Sensitivity-Aware Regularizer}. It improves the adversarial robustness of the underlying agent against various types of perturbations. Extensive experiments validate the superior performance of our attack and defence mechanism across multiple state-of-the-art GCRL algorithms. Our tool {\bf ReRoGCRL} is available at \url{https://github.com/TrustAI/ReRoGCRL}.

Title: BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics. (arXiv:2312.07439v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07439
Code URL: null
Copy Paste: [[2312.07439]] BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics(http://arxiv.org/abs/2312.07439)
Summary:
The ability for a machine learning model to cope with differences in training and deployment conditions--e.g. in the presence of distribution shift or the generalization to new classes altogether--is crucial for real-world use cases. However, most empirical work in this area has focused on the image domain with artificial benchmarks constructed to measure individual aspects of generalization. We present BIRB, a complex benchmark centered on the retrieval of bird vocalizations from passively-recorded datasets given focal recordings from a large citizen science corpus available for training. We propose a baseline system for this collection of tasks using representation learning and a nearest-centroid search. Our thorough empirical evaluation and analysis surfaces open research directions, suggesting that BIRB fills the need for a more realistic and complex benchmark to drive progress on robustness to distribution shifts and generalization of ML models.

biometric

steal

extraction

Title: Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations. (arXiv:2312.06716v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06716
Code URL: null
Copy Paste: [[2312.06716]] Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations(http://arxiv.org/abs/2312.06716)
Summary:
We present an approach for analyzing grouping information contained within a neural network's activations, permitting extraction of spatial layout and semantic segmentation from the behavior of large pre-trained vision models. Unlike prior work, our method conducts a wholistic analysis of a network's activation state, leveraging features from all layers and obviating the need to guess which part of the model contains relevant information. Motivated by classic spectral clustering, we formulate this analysis in terms of an optimization objective involving a set of affinity matrices, each formed by comparing features within a different layer. Solving this optimization problem using gradient descent allows our technique to scale from single images to dataset-level analysis, including, in the latter, both intra- and inter-image relationships. Analyzing a pre-trained generative transformer provides insight into the computational strategy learned by such models. Equating affinity with key-query similarity across attention layers yields eigenvectors encoding scene spatial layout, whereas defining affinity by value vector similarity yields eigenvectors encoding object identity. This result suggests that key and query vectors coordinate attentional information flow according to spatial proximity (a `where' pathway), while value vectors refine a semantic category representation (a `what' pathway).

Title: Medical Image Classification Using Transfer Learning and Chaos Game Optimization on the Internet of Medical Things. (arXiv:2312.07437v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07437
Code URL: null
Copy Paste: [[2312.07437]] Medical Image Classification Using Transfer Learning and Chaos Game Optimization on the Internet of Medical Things(http://arxiv.org/abs/2312.07437)
Summary:
The Internet of Medical Things (IoMT) has dramatically benefited medical professionals that patients and physicians can access from all regions. Although the automatic detection and prediction of diseases such as melanoma and leukemia is still being researched and studied in IoMT, existing approaches are not able to achieve a high degree of efficiency. Thus, with a new approach that provides better results, patients would access the adequate treatments earlier and the death rate would be reduced. Therefore, this paper introduces an IoMT proposal for medical images classification that may be used anywhere, i.e. it is an ubiquitous approach. It was design in two stages: first, we employ a Transfer Learning (TL)-based method for feature extraction, which is carried out using MobileNetV3; second, we use the Chaos Game Optimization (CGO) for feature selection, with the aim of excluding unnecessary features and improving the performance, which is key in IoMT. Our methodology was evaluated using ISIC-2016, PH2, and Blood-Cell datasets. The experimental results indicated that the proposed approach obtained an accuracy of 88.39% on ISIC-2016, 97.52% on PH2, and 88.79% on Blood-cell. Moreover, our approach had successful performances for the metrics employed compared to other existing methods.

Title: BED: Bi-Encoder-Decoder Model for Canonical Relation Extraction. (arXiv:2312.07088v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07088
Code URL: null
Copy Paste: [[2312.07088]] BED: Bi-Encoder-Decoder Model for Canonical Relation Extraction(http://arxiv.org/abs/2312.07088)
Summary:
Canonical relation extraction aims to extract relational triples from sentences, where the triple elements (entity pairs and their relationship) are mapped to the knowledge base. Recently, methods based on the encoder-decoder architecture are proposed and achieve promising results. However, these methods cannot well utilize the entity information, which is merely used as augmented training data. Moreover, they are incapable of representing novel entities, since no embeddings have been learned for them. In this paper, we propose a novel framework, Bi-Encoder-Decoder (BED), to solve the above issues. Specifically, to fully utilize entity information, we employ an encoder to encode semantics of this information, leading to high-quality entity representations. For novel entities, given a trained entity encoder, their representations can be easily generated. Experimental results on two datasets show that, our method achieves a significant performance improvement over the previous state-of-the-art and handle novel entities well without retraining.

membership infer

federate

Title: Efficient Cross-Domain Federated Learning by MixStyle Approximation. (arXiv:2312.07064v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07064
Code URL: null
Copy Paste: [[2312.07064]] Efficient Cross-Domain Federated Learning by MixStyle Approximation(http://arxiv.org/abs/2312.07064)
Summary:
With the advent of interconnected and sensor-equipped edge devices, Federated Learning (FL) has gained significant attention, enabling decentralized learning while maintaining data privacy. However, FL faces two challenges in real-world tasks: expensive data labeling and domain shift between source and target samples. In this paper, we introduce a privacy-preserving, resource-efficient FL concept for client adaptation in hardware-constrained environments. Our approach includes server model pre-training on source data and subsequent fine-tuning on target data via low-end clients. The local client adaptation process is streamlined by probabilistic mixing of instance-level feature statistics approximated from source and target domain data. The adapted parameters are transferred back to the central server and globally aggregated. Preliminary results indicate that our method reduces computational and transmission costs while maintaining competitive performance on downstream tasks.

Title: Language-Guided Transformer for Federated Multi-Label Classification. (arXiv:2312.07165v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07165
Code URL: https://github.com/jack24658735/fedlgt
Copy Paste: [[2312.07165]] Language-Guided Transformer for Federated Multi-Label Classification(http://arxiv.org/abs/2312.07165)
Summary:
Federated Learning (FL) is an emerging paradigm that enables multiple users to collaboratively train a robust model in a privacy-preserving manner without sharing their private data. Most existing approaches of FL only consider traditional single-label image classification, ignoring the impact when transferring the task to multi-label image classification. Nevertheless, it is still challenging for FL to deal with user heterogeneity in their local data distribution in the real-world FL scenario, and this issue becomes even more severe in multi-label image classification. Inspired by the recent success of Transformers in centralized settings, we propose a novel FL framework for multi-label classification. Since partial label correlation may be observed by local clients during training, direct aggregation of locally updated models would not produce satisfactory performances. Thus, we propose a novel FL framework of Language-Guided Transformer (FedLGT) to tackle this challenging task, which aims to exploit and transfer knowledge across different clients for learning a robust global model. Through extensive experiments on various multi-label datasets (e.g., FLAIR, MS-COCO, etc.), we show that our FedLGT is able to achieve satisfactory performance and outperforms standard FL techniques under multi-label FL scenarios. Code is available at https://github.com/Jack24658735/FedLGT.

Title: Ensemble Federated Learning: an approach for collaborative pneumonia diagnosis. (arXiv:2312.07428v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07428
Code URL: null
Copy Paste: [[2312.07428]] Ensemble Federated Learning: an approach for collaborative pneumonia diagnosis(http://arxiv.org/abs/2312.07428)
Summary:
Federated learning is a very convenient approach for scenarios where (i) the exchange of data implies privacy concerns and/or (ii) a quick reaction is needed. In smart healthcare systems, both aspects are usually required. In this paper, we work on the first scenario, where preserving privacy is key and, consequently, building a unique and massive medical image data set by fusing different data sets from different medical institutions or research centers (computation nodes) is not an option. We propose an ensemble federated learning (EFL) approach that is based on the following characteristics: First, each computation node works with a different data set (but of the same type). They work locally and apply an ensemble approach combining eight well-known CNN models (densenet169, mobilenetv2, xception, inceptionv3, vgg16, resnet50, densenet121, and resnet152v2) on Chest X-ray images. Second, the best two local models are used to create a local ensemble model that is shared with a central node. Third, the ensemble models are aggregated to obtain a global model, which is shared with the computation nodes to continue with a new iteration. This procedure continues until there are no changes in the best local models. We have performed different experiments to compare our approach with centralized ones (with or without an ensemble approach)\color{black}. The results conclude that our proposal outperforms these ones in Chest X-ray images (achieving an accuracy of 96.63\%) and offers very competitive results compared to other proposals in the literature.

Title: Feature Norm Regularized Federated Learning: Transforming Skewed Distributions into Global Insights. (arXiv:2312.06951v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06951
Code URL: null
Copy Paste: [[2312.06951]] Feature Norm Regularized Federated Learning: Transforming Skewed Distributions into Global Insights(http://arxiv.org/abs/2312.06951)
Summary:
In the field of federated learning, addressing non-independent and identically distributed (non-i.i.d.) data remains a quintessential challenge for improving global model performance. This work introduces the Feature Norm Regularized Federated Learning (FNR-FL) algorithm, which uniquely incorporates class average feature norms to enhance model accuracy and convergence in non-i.i.d. scenarios. Our comprehensive analysis reveals that FNR-FL not only accelerates convergence but also significantly surpasses other contemporary federated learning algorithms in test accuracy, particularly under feature distribution skew scenarios. The novel modular design of FNR-FL facilitates seamless integration with existing federated learning frameworks, reinforcing its adaptability and potential for widespread application. We substantiate our claims through rigorous empirical evaluations, demonstrating FNR-FL's exceptional performance across various skewed data distributions. Relative to FedAvg, FNR-FL exhibits a substantial 66.24\% improvement in accuracy and a significant 11.40\% reduction in training time, underscoring its enhanced effectiveness and efficiency.

fair

Title: FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs. (arXiv:2312.07420v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07420
Code URL: null
Copy Paste: [[2312.07420]] FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs(http://arxiv.org/abs/2312.07420)
Summary:
Training large language models (LLMs) is a costly endeavour in terms of time and computational resources. The large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. Re-training from scratch is impractical and has led to the creation of the 'unlearning' discipline where models are modified to "unlearn" undesirable information without retraining. However, any modification can alter the behaviour of LLMs, especially on key dimensions such as fairness. This is the first work that examines this interplay between unlearning and fairness for LLMs. In particular, we focus on a popular unlearning framework known as SISA [Bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. We evaluate the performance-fairness trade-off for SISA, and empirically demsontrate that SISA can indeed reduce fairness in LLMs. To remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by SISA. We adapt the post-processing fairness improvement technique from [Hardt et al., 2016] to design three methods that can handle model ensembles, and prove that one of the methods is an optimal fair predictor for ensemble of models. Through experimental results, we demonstrate the efficacy of our post-processing framework called 'FairSISA'.

interpretability

Title: CLIP in Medical Imaging: A Comprehensive Survey. (arXiv:2312.07353v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07353
Code URL: https://github.com/zhaozh10/awesome-clip-in-medical-imaging
Copy Paste: [[2312.07353]] CLIP in Medical Imaging: A Comprehensive Survey(http://arxiv.org/abs/2312.07353)
Summary:
Contrastive Language-Image Pre-training (CLIP), a straightforward yet effective pre-training paradigm, successfully introduces semantic-rich text supervision to vision models and has demonstrated promising results in various tasks due to its generalizability and interpretability. It has recently gained increasing interest in the medical imaging domain, either as a powerful pre-training paradigm for medical vision language alignment or a pre-trained key component for various clinical tasks. With the aim of facilitating a deeper understanding of this promising direction, this survey offers an in-depth exploration of the CLIP paradigm within the domain of medical imaging, regarding both refined CLIP pre-training and CLIP-driven applications. Our survey (1) starts with a brief introduction to the fundamentals of CLIP methodology. (2) Then, we investigate the adaptation of CLIP pre-training in the medical domain, focusing on how to optimize CLIP given characteristics of medical images and reports. (3) Furthermore, we explore the practical utilization of CLIP pre-trained models in various tasks, including classification, dense prediction, and cross-modal tasks. (4) Finally, we discuss existing limitations of CLIP in the context of medical imaging and propose forward-looking directions to address the demands of medical imaging domain. We expect that this comprehensive survey will provide researchers in the field of medical image analysis with a holistic understanding of the CLIP paradigm and its potential implications. The project page is available at https://github.com/zhaozh10/Awesome-CLIP-in-Medical-Imaging, which will be regularly updated.

explainability

Title: Identifying Drivers of Predictive Uncertainty using Variance Feature Attribution. (arXiv:2312.07252v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07252
Code URL: null
Copy Paste: [[2312.07252]] Identifying Drivers of Predictive Uncertainty using Variance Feature Attribution(http://arxiv.org/abs/2312.07252)
Summary:
Explainability and uncertainty quantification are two pillars of trustable artificial intelligence. However, the reasoning behind uncertainty estimates is generally left unexplained. Identifying the drivers of uncertainty complements explanations of point predictions in recognizing potential model limitations. It facilitates the detection of oversimplification in the uncertainty estimation process. Explanations of uncertainty enhance communication and trust in decisions. They allow for verifying whether the main drivers of model uncertainty are relevant and may impact model usage. So far, the subject of explaining uncertainties has been rarely studied. The few exceptions in existing literature are tailored to Bayesian neural networks or rely heavily on technically intricate approaches, hindering their broad adoption. We propose variance feature attribution, a simple and scalable solution to explain predictive aleatoric uncertainties. First, we estimate uncertainty as predictive variance by equipping a neural network with a Gaussian output distribution by adding a variance output neuron. Thereby, we can rely on pre-trained point prediction models and fine-tune them for meaningful variance estimation. Second, we apply out-of-the-box explainers on the variance output of these models to explain the uncertainty estimation. We evaluate our approach in a synthetic setting where the data-generating process is known. We show that our method can explain uncertainty influences more reliably and faster than the established baseline CLUE. We fine-tune a state-of-the-art age regression model to estimate uncertainty and obtain attributions. Our explanations highlight potential sources of uncertainty, such as laugh lines. Variance feature attribution provides accurate explanations for uncertainty estimates with little modifications to the model architecture and low computational overhead.

watermark

diffusion

Title: Perceptual Similarity guidance and text guidance optimization for Editing Real Images using Guided Diffusion Models. (arXiv:2312.06680v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06680
Code URL: null
Copy Paste: [[2312.06680]] Perceptual Similarity guidance and text guidance optimization for Editing Real Images using Guided Diffusion Models(http://arxiv.org/abs/2312.06680)
Summary:
When using a diffusion model for image editing, there are times when the modified image can differ greatly from the source. To address this, we apply a dual-guidance approach to maintain high fidelity to the original in areas that are not altered. First, we employ text-guided optimization, using text embeddings to direct latent space and classifier-free guidance. Second, we use perceptual similarity guidance, optimizing latent vectors with posterior sampling via Tweedie formula during the reverse process. This method ensures the realistic rendering of both the edited elements and the preservation of the unedited parts of the original image.

Title: Neutral Editing Framework for Diffusion-based Video Editing. (arXiv:2312.06708v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06708
Code URL: null
Copy Paste: [[2312.06708]] Neutral Editing Framework for Diffusion-based Video Editing(http://arxiv.org/abs/2312.06708)
Summary:
Text-conditioned image editing has succeeded in various types of editing based on a diffusion framework. Unfortunately, this success did not carry over to a video, which continues to be challenging. Existing video editing systems are still limited to rigid-type editing such as style transfer and object overlay. To this end, this paper proposes Neutral Editing (NeuEdit) framework to enable complex non-rigid editing by changing the motion of a person/object in a video, which has never been attempted before. NeuEdit introduces a concept of `neutralization' that enhances a tuning-editing process of diffusion-based editing systems in a model-agnostic manner by leveraging input video and text without any other auxiliary aids (e.g., visual masks, video captions). Extensive experiments on numerous videos demonstrate adaptability and effectiveness of the NeuEdit framework. The website of our work is available here: https://neuedit.github.io

Title: Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models. (arXiv:2312.06712v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06712
Code URL: null
Copy Paste: [[2312.06712]] Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models(http://arxiv.org/abs/2312.06712)
Summary:
Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. While previous research efforts have individually tackled these issues, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacities and broader applicability. The project webpage is available at https://zpbao.github.io/projects/SepEn/.

Title: EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion. (arXiv:2312.06725v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06725
Code URL: null
Copy Paste: [[2312.06725]] EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion(http://arxiv.org/abs/2312.06725)
Summary:
Generating multiview images from a single view facilitates the rapid generation of a 3D mesh conditioned on a single image. Recent methods that introduce 3D global representation into diffusion models have shown the potential to generate consistent multiviews, but they have reduced generation speed and face challenges in maintaining generalizability and quality. To address this issue, we propose EpiDiff, a localized interactive multiview diffusion model. At the core of the proposed approach is to insert a lightweight epipolar attention block into the frozen diffusion model, leveraging epipolar constraints to enable cross-view interaction among feature maps of neighboring views. The newly initialized 3D modeling module preserves the original feature distribution of the diffusion model, exhibiting compatibility with a variety of base diffusion models. Experiments show that EpiDiff generates 16 multiview images in just 12 seconds, and it surpasses previous methods in quality evaluation metrics, including PSNR, SSIM and LPIPS. Additionally, EpiDiff can generate a more diverse distribution of views, improving the reconstruction quality from generated multiviews. Please see our project page at https://huanngzh.github.io/EpiDiff/.

Title: DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting. (arXiv:2312.06734v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06734
Code URL: null
Copy Paste: [[2312.06734]] DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting(http://arxiv.org/abs/2312.06734)
Summary:
Precipitation nowcasting is an important spatio-temporal prediction task to predict the radar echoes sequences based on current observations, which can serve both meteorological science and smart city applications. Due to the chaotic evolution nature of the precipitation systems, it is a very challenging problem. Previous studies address the problem either from the perspectives of deterministic modeling or probabilistic modeling. However, their predictions suffer from the blurry, high-value echoes fading away and position inaccurate issues. The root reason of these issues is that the chaotic evolutionary precipitation systems are not appropriately modeled. Inspired by the nature of the systems, we propose to decompose and model them from the perspective of global deterministic motion and local stochastic variations with residual mechanism. A unified and flexible framework that can equip any type of spatio-temporal models is proposed based on residual diffusion, which effectively tackles the shortcomings of previous methods. Extensive experimental results on four publicly available radar datasets demonstrate the effectiveness and superiority of the proposed framework, compared to state-of-the-art techniques. Our code will be made publicly available soon.

Title: InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following. (arXiv:2312.06738v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06738
Code URL: https://github.com/jacklishufan/instructany2pix
Copy Paste: [[2312.06738]] InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following(http://arxiv.org/abs/2312.06738)
Summary:
The ability to provide fine-grained control for generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions on the number and/or type of modality inputs used to express controllability. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks that facilitate this capability: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images, and a multi-modal LLM that can understand instructions involving multiple images and audio pieces and generate a conditional embedding of the desired output, which can be used by the diffusion decoder. Additionally, to facilitate training efficiency and improve generation quality, we include an additional refinement prior module that enhances the visual quality of LLM outputs. These designs are critical to the performance of our system. We demonstrate that our system can perform a series of novel instruction-guided editing tasks. The code is available at https://github.com/jacklishufan/InstructAny2Pix.git

Title: Relightful Harmonization: Lighting-aware Portrait Background Replacement. (arXiv:2312.06886v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06886
Code URL: null
Copy Paste: [[2312.06886]] Relightful Harmonization: Lighting-aware Portrait Background Replacement(http://arxiv.org/abs/2312.06886)
Summary:
Portrait harmonization aims to composite a subject into a new background, adjusting its lighting and color to ensure harmony with the background scene. Existing harmonization techniques often only focus on adjusting the global color and brightness of the foreground and ignore crucial illumination cues from the background such as apparent lighting direction, leading to unrealistic compositions. We introduce Relightful Harmonization, a lighting-aware diffusion model designed to seamlessly harmonize sophisticated lighting effect for the foreground portrait using any background image. Our approach unfolds in three stages. First, we introduce a lighting representation module that allows our diffusion model to encode lighting information from target image background. Second, we introduce an alignment network that aligns lighting features learned from image background with lighting features learned from panorama environment maps, which is a complete representation for scene illumination. Last, to further boost the photorealism of the proposed method, we introduce a novel data simulation pipeline that generates synthetic training pairs from a diverse range of natural images, which are used to refine the model. Our method outperforms existing benchmarks in visual fidelity and lighting coherence, showing superior generalization in real-world testing scenarios, highlighting its versatility and practicality.

Title: LoRA-Enhanced Distillation on Guided Diffusion Models. (arXiv:2312.06899v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06899
Code URL: null
Copy Paste: [[2312.06899]] LoRA-Enhanced Distillation on Guided Diffusion Models(http://arxiv.org/abs/2312.06899)
Summary:
Diffusion models, such as Stable Diffusion (SD), offer the ability to generate high-resolution images with diverse features, but they come at a significant computational and memory cost. In classifier-free guided diffusion models, prolonged inference times are attributed to the necessity of computing two separate diffusion models at each denoising step. Recent work has shown promise in improving inference time through distillation techniques, teaching the model to perform similar denoising steps with reduced computations. However, the application of distillation introduces additional memory overhead to these already resource-intensive diffusion models, making it less practical.

To address these challenges, our research explores a novel approach that combines Low-Rank Adaptation (LoRA) with model distillation to efficiently compress diffusion models. This approach not only reduces inference time but also mitigates memory overhead, and notably decreases memory consumption even before applying distillation. The results are remarkable, featuring a significant reduction in inference time due to the distillation process and a substantial 50% reduction in memory consumption. Our examination of the generated images underscores that the incorporation of LoRA-enhanced distillation maintains image quality and alignment with the provided prompts. In summary, while conventional distillation tends to increase memory consumption, LoRA-enhanced distillation offers optimization without any trade-offs or compromises in quality.

Title: CCM: Adding Conditional Controls to Text-to-Image Consistency Models. (arXiv:2312.06971v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06971
Code URL: null
Copy Paste: [[2312.06971]] CCM: Adding Conditional Controls to Text-to-Image Consistency Models(http://arxiv.org/abs/2312.06971)
Summary:
Consistency Models (CMs) have showed a promise in creating visual content efficiently and with high quality. However, the way to add new conditional controls to the pretrained CMs has not been explored. In this technical report, we consider alternative strategies for adding ControlNet-like conditional control to CMs and present three significant findings. 1) ControlNet trained for diffusion models (DMs) can be directly applied to CMs for high-level semantic controls but struggles with low-level detail and realism control. 2) CMs serve as an independent class of generative models, based on which ControlNet can be trained from scratch using Consistency Training proposed by Song et al. 3) A lightweight adapter can be jointly optimized under multiple conditions through Consistency Training, allowing for the swift transfer of DMs-based ControlNet to CMs. We study these three solutions across various conditional controls, including edge, depth, human pose, low-resolution image and masked image with text-to-image latent consistency models.

Title: Diff-OP3D: Bridging 2D Diffusion for Open Pose 3D Zero-Shot Classification. (arXiv:2312.07039v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07039
Code URL: null
Copy Paste: [[2312.07039]] Diff-OP3D: Bridging 2D Diffusion for Open Pose 3D Zero-Shot Classification(http://arxiv.org/abs/2312.07039)
Summary:
With the explosive 3D data growth, the urgency of utilizing zero-shot learning to facilitate data labeling becomes evident. Recently, the methods via transferring Contrastive Language-Image Pre-training (CLIP) to 3D vision have made great progress in the 3D zero-shot classification task. However, these methods primarily focus on aligned pose 3D objects (ap-3os), overlooking the recognition of 3D objects with open poses (op-3os) typically encountered in real-world scenarios, such as an overturned chair or a lying teddy bear. To this end, we propose a more challenging benchmark for 3D open-pose zero-shot classification. Echoing our benchmark, we design a concise angle-refinement mechanism that automatically optimizes one ideal pose as well as classifies these op-3os. Furthermore, we make a first attempt to bridge 2D pre-trained diffusion model as a classifer to 3D zero-shot classification without any additional training. Such 2D diffusion to 3D objects proves vital in improving zero-shot classification for both ap-3os and op-3os. Our model notably improves by 3.5% and 15.8% on ModelNet10$^{\ddag}$ and McGill$^{\ddag}$ open pose benchmarks, respectively, and surpasses the current state-of-the-art by 6.8% on the aligned pose ModelNet10, affirming diffusion's efficacy in 3D zero-shot tasks.

Title: Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation. (arXiv:2312.07063v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07063
Code URL: null
Copy Paste: [[2312.07063]] Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation(http://arxiv.org/abs/2312.07063)
Summary:
Reconstructing human-object interaction in 3D from a single RGB image is a challenging task and existing data driven methods do not generalize beyond the objects present in the carefully curated 3D interaction datasets. Capturing large-scale real data to learn strong interaction and 3D shape priors is very expensive due to the combinatorial nature of human-object interactions. In this paper, we propose ProciGen (Procedural interaction Generation), a method to procedurally generate datasets with both, plausible interaction and diverse object variation. We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), a novel method to reconstruct interacting human and unseen objects, without any templates. Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes. Experiments show that our HDM trained with ProciGen significantly outperforms prior methods that requires template meshes and that our dataset allows training methods with strong generalization ability to unseen object instances. Our code and data will be publicly released at: https://virtualhumans.mpi-inf.mpg.de/procigen-hdm.

Title: DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models. (arXiv:2312.07066v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07066
Code URL: null
Copy Paste: [[2312.07066]] DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models(http://arxiv.org/abs/2312.07066)
Summary:
Recent advances in image and video creation, especially AI-based image synthesis, have led to the production of numerous visual scenes that exhibit a high level of abstractness and diversity. Consequently, Visual Storytelling (VST), a task that involves generating meaningful and coherent narratives from a collection of images, has become even more challenging and is increasingly desired beyond real-world imagery. While existing VST techniques, which typically use autoregressive decoders, have made significant progress, they suffer from low inference speed and are not well-suited for synthetic scenes. To this end, we propose a novel diffusion-based system DiffuVST, which models the generation of a series of visual descriptions as a single conditional denoising process. The stochastic and non-autoregressive nature of DiffuVST at inference time allows it to generate highly diverse narratives more efficiently. In addition, DiffuVST features a unique design with bi-directional text history guidance and multimodal adapter modules, which effectively improve inter-sentence coherence and image-to-text fidelity. Extensive experiments on the story generation task covering four fictional visual-story datasets demonstrate the superiority of DiffuVST over traditional autoregressive models in terms of both text quality and inference speed.

Title: Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion. (arXiv:2312.07133v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07133
Code URL: null
Copy Paste: [[2312.07133]] Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion(http://arxiv.org/abs/2312.07133)
Summary:
We propose a zero-shot approach for consistent Text-to-Animated-Characters synthesis based on pre-trained Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos. We strive to bridge this gap, and we introduce a zero-shot approach that produces temporally consistent videos of animated characters and requires no training or fine-tuning. We leverage existing text-based motion diffusion models to generate diverse motions that we utilize to guide a T2I model. To achieve temporal consistency, we introduce the Spatial Latent Alignment module that exploits cross-frame dense correspondences that we compute to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies. Our proposed approach generates temporally consistent videos with diverse motions and styles, outperforming existing zero-shot T2V approaches in terms of pixel-wise consistency and user preference.

Title: Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation. (arXiv:2312.07231v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07231
Code URL: null
Copy Paste: [[2312.07231]] Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation(http://arxiv.org/abs/2312.07231)
Summary:
Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models for high-resolution 3D voxels remains prohibitively expensive due to the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs. Specifically, we draw inspiration from masked autoencoders to dynamically operate the denoising process on masked voxelized point clouds. We also propose a novel voxel-aware masking strategy to adaptively aggregate background/foreground information from voxelized point clouds. Our method achieves state-of-the-art performance with an extreme masking ratio of nearly 99%. Moreover, to improve multi-category 3D generation, we introduce Mixture-of-Expert (MoE) in 3D diffusion model. Each category can learn a distinct diffusion path with different experts, relieving gradient conflict. Experimental results on the ShapeNet dataset demonstrate that our method achieves state-of-the-art high-fidelity and diverse 3D point cloud generation performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage metrics when generating 128-resolution voxel point clouds, using only 6.5% of the original training cost.

Title: Scalable Motion Style Transfer with Constrained Diffusion Generation. (arXiv:2312.07311v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07311
Code URL: null
Copy Paste: [[2312.07311]] Scalable Motion Style Transfer with Constrained Diffusion Generation(http://arxiv.org/abs/2312.07311)
Summary:
Current training of motion style transfer systems relies on consistency losses across style domains to preserve contents, hindering its scalable application to a large number of domains and private data. Recent image transfer works show the potential of independent training on each domain by leveraging implicit bridging between diffusion models, with the content preservation, however, limited to simple data patterns. We address this by imposing biased sampling in backward diffusion while maintaining the domain independence in the training stage. We construct the bias from the source domain keyframes and apply them as the gradient of content constraints, yielding a framework with keyframe manifold constraint gradients (KMCGs). Our validation demonstrates the success of training separate models to transfer between as many as ten dance motion styles. Comprehensive experiments find a significant improvement in preserving motion contents in comparison to baseline and ablative diffusion-based style transfer models. In addition, we perform a human study for a subjective assessment of the quality of generated dance motions. The results validate the competitiveness of KMCGs.

Title: GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos. (arXiv:2312.07322v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07322
Code URL: https://github.com/soCzech/GenHowto
Copy Paste: [[2312.07322]] GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos(http://arxiv.org/abs/2312.07322)
Summary:
We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% on seen and unseen interaction categories, respectively, outperforming prior work by a large margin.

Title: Learned representation-guided diffusion models for large-image generation. (arXiv:2312.07330v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07330
Code URL: null
Copy Paste: [[2312.07330]] Learned representation-guided diffusion models for large-image generation(http://arxiv.org/abs/2312.07330)
Summary:
To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image, or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As proof of concept, we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions.

Title: Boosting Latent Diffusion with Flow Matching. (arXiv:2312.07360v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07360
Code URL: https://github.com/compvis/fm-boosting
Copy Paste: [[2312.07360]] Boosting Latent Diffusion with Flow Matching(http://arxiv.org/abs/2312.07360)
Summary:
Recently, there has been tremendous progress in visual synthesis and the underlying generative models. Here, diffusion models (DMs) stand out particularly, but lately, flow matching (FM) has also garnered considerable interest. While DMs excel in providing diverse images, they suffer from long training and slow generation. With latent diffusion, these issues are only partially alleviated. Conversely, FM offers faster training and inference but exhibits less diversity in synthesis. We demonstrate that introducing FM between the Diffusion model and the convolutional decoder offers high-resolution image synthesis with reduced computational cost and model size. Diffusion can then efficiently provide the necessary generation diversity. FM compensates for the lower resolution, mapping the small latent space to a high-dimensional one. Subsequently, the convolutional decoder of the LDM maps these latents to high-resolution images. By combining the diversity of DMs, the efficiency of FMs, and the effectiveness of convolutional decoders, we achieve state-of-the-art high-resolution image synthesis at $1024^2$ with minimal computational cost. Importantly, our approach is orthogonal to recent approximation and speed-up strategies for the underlying DMs, making it easily integrable into various DM frameworks.

Title: DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing. (arXiv:2312.07409v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07409
Code URL: null
Copy Paste: [[2312.07409]] DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing(http://arxiv.org/abs/2312.07409)
Summary:
Diffusion models have achieved remarkable image generation quality surpassing previous generative models. However, a notable limitation of diffusion models, in comparison to GANs, is their difficulty in smoothly interpolating between two image samples, due to their highly unstructured latent space. Such a smooth interpolation is intriguing as it naturally serves as a solution for the image morphing task with many applications. In this work, we present DiffMorpher, the first approach enabling smooth and natural image interpolation using diffusion models. Our key idea is to capture the semantics of the two images by fitting two LoRAs to them respectively, and interpolate between both the LoRA parameters and the latent noises to ensure a smooth semantic transition, where correspondence automatically emerges without the need for annotation. In addition, we propose an attention interpolation and injection technique and a new sampling schedule to further enhance the smoothness between consecutive images. Extensive experiments demonstrate that DiffMorpher achieves starkly better image morphing effects than previous methods across a variety of object categories, bridging a critical functional gap that distinguished diffusion models from GANs.

Title: MinD-3D: Reconstruct High-quality 3D objects in Human Brain. (arXiv:2312.07485v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07485
Code URL: null
Copy Paste: [[2312.07485]] MinD-3D: Reconstruct High-quality 3D objects in Human Brain(http://arxiv.org/abs/2312.07485)
Summary:
In this paper, we introduce Recon3DMind, a groundbreaking task focused on reconstructing 3D visuals from Functional Magnetic Resonance Imaging (fMRI) signals. This represents a major step forward in cognitive neuroscience and computer vision. To support this task, we present the fMRI-Shape dataset, utilizing 360-degree view videos of 3D objects for comprehensive fMRI signal capture. Containing 55 categories of common objects from daily life, this dataset will bolster future research endeavors. We also propose MinD-3D, a novel and effective three-stage framework that decodes and reconstructs the brain's 3D visual information from fMRI signals. This method starts by extracting and aggregating features from fMRI frames using a neuro-fusion encoder, then employs a feature bridge diffusion model to generate corresponding visual features, and ultimately recovers the 3D object through a generative transformer decoder. Our experiments demonstrate that this method effectively extracts features that are valid and highly correlated with visual regions of interest (ROIs) in fMRI signals. Notably, it not only reconstructs 3D objects with high semantic relevance and spatial similarity but also significantly deepens our understanding of the human brain's 3D visual processing capabilities. Project page at: https://jianxgao.github.io/MinD-3D.

Title: Class-Prototype Conditional Diffusion Model for Continual Learning with Generative Replay. (arXiv:2312.06710v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06710
Code URL: https://github.com/dnkhanh45/cpdm
Copy Paste: [[2312.06710]] Class-Prototype Conditional Diffusion Model for Continual Learning with Generative Replay(http://arxiv.org/abs/2312.06710)
Summary:
Mitigating catastrophic forgetting is a key hurdle in continual learning. Deep Generative Replay (GR) provides techniques focused on generating samples from prior tasks to enhance the model's memory capabilities. With the progression in generative AI, generative models have advanced from Generative Adversarial Networks (GANs) to the more recent Diffusion Models (DMs). A major issue is the deterioration in the quality of generated data compared to the original, as the generator continuously self-learns from its outputs. This degradation can lead to the potential risk of catastrophic forgetting occurring in the classifier. To address this, we propose the Class-Prototype Conditional Diffusion Model (CPDM), a GR-based approach for continual learning that enhances image quality in generators and thus reduces catastrophic forgetting in classifiers. The cornerstone of CPDM is a learnable class-prototype that captures the core characteristics of images in a given class. This prototype, integrated into the diffusion model's denoising process, ensures the generation of high-quality images. It maintains its effectiveness for old tasks even when new tasks are introduced, preserving image generation quality and reducing the risk of catastrophic forgetting in classifiers. Our empirical studies on diverse datasets demonstrate that our proposed method significantly outperforms existing state-of-the-art models, highlighting its exceptional ability to preserve image quality and enhance the model's memory retention.

Title: Generating High-Resolution Regional Precipitation Using Conditional Diffusion Model. (arXiv:2312.07112v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07112
Code URL: null
Copy Paste: [[2312.07112]] Generating High-Resolution Regional Precipitation Using Conditional Diffusion Model(http://arxiv.org/abs/2312.07112)
Summary:
Climate downscaling is a crucial technique within climate research, serving to project low-resolution (LR) climate data to higher resolutions (HR). Previous research has demonstrated the effectiveness of deep learning for downscaling tasks. However, most deep learning models for climate downscaling may not perform optimally for high scaling factors (i.e., 4x, 8x) due to their limited ability to capture the intricate details required for generating HR climate data. Furthermore, climate data behaves differently from image data, necessitating a nuanced approach when employing deep generative models. In response to these challenges, this paper presents a deep generative model for downscaling climate data, specifically precipitation on a regional scale. We employ a denoising diffusion probabilistic model (DDPM) conditioned on multiple LR climate variables. The proposed model is evaluated using precipitation data from the Community Earth System Model (CESM) v1.2.2 simulation. Our results demonstrate significant improvements over existing baselines, underscoring the effectiveness of the conditional diffusion model in downscaling climate data.

Title: Equivariant Flow Matching with Hybrid Probability Transport. (arXiv:2312.07168v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07168
Code URL: null
Copy Paste: [[2312.07168]] Equivariant Flow Matching with Hybrid Probability Transport(http://arxiv.org/abs/2312.07168)
Summary:
The generation of 3D molecules requires simultaneously deciding the categorical features~(atom types) and continuous features~(atom coordinates). Deep generative models, especially Diffusion Models (DMs), have demonstrated effectiveness in generating feature-rich geometries. However, existing DMs typically suffer from unstable probability dynamics with inefficient sampling speed. In this paper, we introduce geometric flow matching, which enjoys the advantages of both equivariant modeling and stabilized probability dynamics. More specifically, we propose a hybrid probability path where the coordinates probability path is regularized by an equivariant optimal transport, and the information between different modalities is aligned. Experimentally, the proposed method could consistently achieve better performance on multiple molecule generation benchmarks with 4.75$\times$ speed up of sampling on average.

Title: Momentum Particle Maximum Likelihood. (arXiv:2312.07335v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07335
Code URL: null
Copy Paste: [[2312.07335]] Momentum Particle Maximum Likelihood(http://arxiv.org/abs/2312.07335)
Summary:
Maximum likelihood estimation (MLE) of latent variable models is often recast as an optimization problem over the extended space of parameters and probability distributions. For example, the Expectation Maximization (EM) algorithm can be interpreted as coordinate descent applied to a suitable free energy functional over this space. Recently, this perspective has been combined with insights from optimal transport and Wasserstein gradient flows to develop particle-based algorithms applicable to wider classes of models than standard EM.

Drawing inspiration from prior works which interpret `momentum-enriched' optimisation algorithms as discretizations of ordinary differential equations, we propose an analogous dynamical systems-inspired approach to minimizing the free energy functional over the extended space of parameters and probability distributions. The result is a dynamic system that blends elements of Nesterov's Accelerated Gradient method, the underdamped Langevin diffusion, and particle methods.

Under suitable assumptions, we establish quantitative convergence of the proposed system to the unique minimiser of the functional in continuous time. We then propose a numerical discretization of this system which enables its application to parameter estimation in latent variable models. Through numerical experiments, we demonstrate that the resulting algorithm converges faster than existing methods and compares favourably with other (approximate) MLE algorithms.

noise learning

data-free

transformer

Title: TULIP: Transformer for Upsampling of LiDAR Point Cloud. (arXiv:2312.06733v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06733
Code URL: null
Copy Paste: [[2312.06733]] TULIP: Transformer for Upsampling of LiDAR Point Cloud(http://arxiv.org/abs/2312.06733)
Summary:
LiDAR Upsampling is a challenging task for the perception systems of robots and autonomous vehicles, due to the sparse and irregular structure of large-scale scene contexts. Recent works propose to solve this problem by converting LiDAR data from 3D Euclidean space into an image super-resolution problem in 2D image space. Although their methods can generate high-resolution range images with fine-grained details, the resulting 3D point clouds often blur out details and predict invalid points. In this paper, we propose TULIP, a new method to reconstruct high-resolution LiDAR point clouds from low-resolution LiDAR input. We also follow a range image-based approach but specifically modify the patch and window geometries of a Swin-Transformer-based network to better fit the characteristics of range images. We conducted several experiments on three different public real-world and simulated datasets. TULIP outperforms state-of-the-art methods in all relevant metrics and generates robust and more realistic point clouds than prior works.

Title: Benchmarking Deep Learning Classifiers for SAR Automatic Target Recognition. (arXiv:2312.06940v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06940
Code URL: null
Copy Paste: [[2312.06940]] Benchmarking Deep Learning Classifiers for SAR Automatic Target Recognition(http://arxiv.org/abs/2312.06940)
Summary:
Synthetic Aperture Radar SAR Automatic Target Recognition ATR is a key technique of remote-sensing image recognition which can be supported by deep neural networks The existing works of SAR ATR mostly focus on improving the accuracy of the target recognition while ignoring the systems performance in terms of speed and storage which is critical to real-world applications of SAR ATR For decision-makers aiming to identify a proper deep learning model to deploy in a SAR ATR system it is important to understand the performance of different candidate deep learning models and determine the best model accordingly This paper comprehensively benchmarks several advanced deep learning models for SAR ATR with multiple distinct SAR imagery datasets Specifically we train and test five SAR image classifiers based on Residual Neural Networks ResNet18 ResNet34 ResNet50 Graph Neural Network GNN and Vision Transformer for Small-Sized Datasets (SS-ViT) We select three datasets MSTAR GBSAR and SynthWakeSAR that offer heterogeneity We evaluate and compare the five classifiers concerning their classification accuracy runtime performance in terms of inference throughput and analytical performance in terms of number of parameters number of layers model size and number of operations Experimental results show that the GNN classifier outperforms with respect to throughput and latency However it is also shown that no clear model winner emerges from all of our chosen metrics and a one model rules all case is doubtful in the domain of SAR ATR

Title: READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling. (arXiv:2312.06950v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06950
Code URL: null
Copy Paste: [[2312.06950]] READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling(http://arxiv.org/abs/2312.06950)
Summary:
Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter's low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability. Second, we propose Partial Video-Language Alignment (PVLA) objective via the use of partial optimal transport to maintain task-related information flowing into our READ modules. We validate our READ-PVLA framework through extensive experiments where READ-PVLA significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks.

Title: IA2U: A Transfer Plugin with Multi-Prior for In-Air Model to Underwater. (arXiv:2312.06955v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06955
Code URL: null
Copy Paste: [[2312.06955]] IA2U: A Transfer Plugin with Multi-Prior for In-Air Model to Underwater(http://arxiv.org/abs/2312.06955)
Summary:
In underwater environments, variations in suspended particle concentration and turbidity cause severe image degradation, posing significant challenges to image enhancement (IE) and object detection (OD) tasks. Currently, in-air image enhancement and detection methods have made notable progress, but their application in underwater conditions is limited due to the complexity and variability of these environments. Fine-tuning in-air models saves high overhead and has more optional reference work than building an underwater model from scratch. To address these issues, we design a transfer plugin with multiple priors for converting in-air models to underwater applications, named IA2U. IA2U enables efficient application in underwater scenarios, thereby improving performance in Underwater IE and OD. IA2U integrates three types of underwater priors: the water type prior that characterizes the degree of image degradation, such as color and visibility; the degradation prior, focusing on differences in details and textures; and the sample prior, considering the environmental conditions at the time of capture and the characteristics of the photographed object. Utilizing a Transformer-like structure, IA2U employs these priors as query conditions and a joint task loss function to achieve hierarchical enhancement of task-level underwater image features, therefore considering the requirements of two different tasks, IE and OD. Experimental results show that IA2U combined with an in-air model can achieve superior performance in underwater image enhancement and object detection tasks. The code will be made publicly available.

Title: Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning. (arXiv:2312.06995v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06995
Code URL: null
Copy Paste: [[2312.06995]] Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning(http://arxiv.org/abs/2312.06995)
Summary:
Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Due to the powerful feature extraction ability, existing Convolution Neural Network (CNN) and Transformers based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic distortion datasets. To further improve NR-IQA performance, in this paper, a novel supervised contrastive learning (SCL) and Transformer-based NR-IQA model SaTQA is proposed. We first train a model on a large-scale synthetic dataset by SCL (no image subjective score is required) to extract degradation features of images with various distortion types and levels. To further extract distortion information from images, we propose a backbone network incorporating the Multi-Stream Block (MSB) by combining the CNN inductive bias and Transformer long-term dependence modeling capability. Finally, we propose the Patch Attention Block (PAB) to obtain the final distorted image quality score by fusing the degradation features learned from contrastive learning with the perceptual distortion information extracted by the backbone network. Experimental results on seven standard IQA datasets show that SaTQA outperforms the state-of-the-art methods for both synthetic and authentic datasets. Code is available at https://github.com/I2-Multimedia-Lab/SaTQA

Title: X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer. (arXiv:2312.07378v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07378
Code URL: null
Copy Paste: [[2312.07378]] X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer(http://arxiv.org/abs/2312.07378)
Summary:
The field of 4D point cloud understanding is rapidly developing with the goal of analyzing dynamic 3D point cloud sequences. However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point cloud poses a difficulty in aligning temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of an 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation and semantic segmentation. The results achieve 1st places, i.e., 85.3% (+7.9%) accuracy and 47.3% (+5.0%) mIoU for 4D action segmentation and semantic segmentation, on the HOI4D challenge\footnote{\url{this http URL}.}, outperforming previous state-of-the-art by a large margin. We release the code at https://github.com/jinglinglingling/X4D

Title: GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance. (arXiv:2312.07385v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07385
Code URL: null
Copy Paste: [[2312.07385]] GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance(http://arxiv.org/abs/2312.07385)
Summary:
Although existing speech-driven talking face generation methods achieve significant progress, they are far from real-world application due to the avatar-specific training demand and unstable lip movements. To address the above issues, we propose the GSmoothFace, a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model, which can synthesize smooth lip dynamics while preserving the speaker's identity. Our proposed GSmoothFace model mainly consists of the Audio to Expression Prediction (A2EP) module and the Target Adaptive Face Translation (TAFT) module. Specifically, we first develop the A2EP module to predict expression parameters synchronized with the driven speech. It uses a transformer to capture the long-term audio context and learns the parameters from the fine-grained 3D facial vertices, resulting in accurate and smooth lip-synchronization performance. Afterward, the well-designed TAFT module, empowered by Morphology Augmented Face Blending (MAFB), takes the predicted expression parameters and target video as inputs to modify the facial region of the target video without distorting the background content. The TAFT effectively exploits the identity appearance and background context in the target video, which makes it possible to generalize to different speakers without retraining. Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality. See the project page for code, data, and request pre-trained models: https://zhanghm1995.github.io/GSmoothFace.

Title: Dozerformer: Sequence Adaptive Sparse Transformer for Multivariate Time Series Forecasting. (arXiv:2312.06874v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06874
Code URL: null
Copy Paste: [[2312.06874]] Dozerformer: Sequence Adaptive Sparse Transformer for Multivariate Time Series Forecasting(http://arxiv.org/abs/2312.06874)
Summary:
Transformers have achieved remarkable performance in multivariate time series(MTS) forecasting due to their capability to capture long-term dependencies. However, the canonical attention mechanism has two key limitations: (1) its quadratic time complexity limits the sequence length, and (2) it generates future values from the entire historical sequence. To address this, we propose a Dozer Attention mechanism consisting of three sparse components: (1) Local, each query exclusively attends to keys within a localized window of neighboring time steps. (2) Stride, enables each query to attend to keys at predefined intervals. (3) Vary, allows queries to selectively attend to keys from a subset of the historical sequence. Notably, the size of this subset dynamically expands as forecasting horizons extend. Those three components are designed to capture essential attributes of MTS data, including locality, seasonality, and global temporal dependencies. Additionally, we present the Dozerformer Framework, incorporating the Dozer Attention mechanism for the MTS forecasting task. We evaluated the proposed Dozerformer framework with recent state-of-the-art methods on nine benchmark datasets and confirmed its superior performance. The code will be released after the manuscript is accepted.

Title: DYAD: A Descriptive Yet Abjuring Density efficient approximation to linear neural network layers. (arXiv:2312.06881v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06881
Code URL: https://github.com/asappresearch/dyad
Copy Paste: [[2312.06881]] DYAD: A Descriptive Yet Abjuring Density efficient approximation to linear neural network layers(http://arxiv.org/abs/2312.06881)
Summary:
We devise, implement and performance-asses DYAD, a layer which can serve as a faster and more memory-efficient approximate replacement for linear layers, (nn.Linear() in Pytorch). These layers appear in common subcomponents, such as in the ff module of Transformers. DYAD is based on a bespoke near-sparse matrix structure which approximates the dense "weight" matrix W that matrix-multiplies the input in the typical realization of such a layer, a.k.a DENSE. Our alternative near-sparse matrix structure is decomposable to a sum of 2 matrices permutable to a block-sparse counterpart. These can be represented as 3D tensors, which in unison allow a faster execution of matrix multiplication with the mini-batched input matrix X compared to DENSE (O(rows(W ) x cols(W )) --> O( rows(W ) x cols(W ) # of blocks )). As the crux of our experiments, we pretrain both DYAD and DENSE variants of 2 sizes of the OPT arch and 1 size of the Pythia arch, including at different token scales of the babyLM benchmark. We find DYAD to be competitive (>= 90%) of DENSE performance on zero-shot (e.g. BLIMP), few-shot (OPENLM) and finetuning (GLUE) benchmarks, while being >=7-15% faster to train on-GPU even at 125m scale, besides surfacing larger speedups at increasing scale and model width.

Title: Neural Machine Translation of Clinical Text: An Empirical Investigation into Multilingual Pre-Trained Language Models and Transfer-Learning. (arXiv:2312.07250v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07250
Code URL: null
Copy Paste: [[2312.07250]] Neural Machine Translation of Clinical Text: An Empirical Investigation into Multilingual Pre-Trained Language Models and Transfer-Learning(http://arxiv.org/abs/2312.07250)
Summary:
We conduct investigations on clinical text machine translation by examining multilingual neural network models using deep learning such as Transformer based structures. Furthermore, to address the language resource imbalance issue, we also carry out experiments using a transfer learning methodology based on massive multilingual pre-trained language models (MMPLMs). The experimental results on three subtasks including 1) clinical case (CC), 2) clinical terminology (CT), and 3) ontological concept (OC) show that our models achieved top-level performances in the ClinSpEn-2022 shared task on English-Spanish clinical domain data. Furthermore, our expert-based human evaluations demonstrate that the small-sized pre-trained language model (PLM) won over the other two extra-large language models by a large margin, in the clinical domain fine-tuning, which finding was never reported in the field. Finally, the transfer learning method works well in our experimental setting using the WMT21fb model to accommodate a new language space Spanish that was not seen at the pre-training stage within WMT21fb itself, which deserves more exploitation for clinical knowledge transformation, e.g. to investigate into more languages. These research findings can shed some light on domain-specific machine translation development, especially in clinical and healthcare fields. Further research projects can be carried out based on our work to improve healthcare text analytics and knowledge transformation.

Title: The GUA-Speech System Description for CNVSRC Challenge 2023. (arXiv:2312.07254v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07254
Code URL: null
Copy Paste: [[2312.07254]] The GUA-Speech System Description for CNVSRC Challenge 2023(http://arxiv.org/abs/2312.07254)
Summary:
This study describes our system for Task 1 Single-speaker Visual Speech Recognition (VSR) fixed track in the Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023. Specifically, we use intermediate connectionist temporal classification (Inter CTC) residual modules to relax the conditional independence assumption of CTC in our model. Then we use a bi-transformer decoder to enable the model to capture both past and future contextual information. In addition, we use Chinese characters as the modeling units to improve the recognition accuracy of our model. Finally, we use a recurrent neural network language model (RNNLM) for shallow fusion in the inference stage. Experiments show that our system achieves a character error rate (CER) of 38.09% on the Eval set which reaches a relative CER reduction of 21.63% over the official baseline, and obtains a second place in the challenge.

Title: Towards Equipping Transformer with the Ability of Systematic Compositionality. (arXiv:2312.07280v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07280
Code URL: null
Copy Paste: [[2312.07280]] Towards Equipping Transformer with the Ability of Systematic Compositionality(http://arxiv.org/abs/2312.07280)
Summary:
One of the key factors in language productivity and human cognition is the ability of systematic compositionality, which refers to understanding composed unseen examples of seen primitives. However, recent evidence reveals that the Transformers have difficulty generalizing the composed context based on the seen primitives. To this end, we take the first step to propose a compositionality-aware Transformer called CAT and two novel pre-training tasks to facilitate systematic compositionality. We tentatively provide a successful implementation of a multi-layer CAT on the basis of the especially popular BERT. The experimental results demonstrate that CAT outperforms baselines on compositionality-aware tasks with minimal impact on the effectiveness on standardized language understanding tasks.

Title: Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification. (arXiv:2312.07338v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07338
Code URL: null
Copy Paste: [[2312.07338]] Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification(http://arxiv.org/abs/2312.07338)
Summary:
Pre-trained Transformer-based speech models have shown striking performance when fine-tuned on various downstream tasks such as automatic speech recognition and spoken language identification (SLID). However, the problem of domain mismatch remains a challenge in this area, where the domain of the pre-training data might differ from that of the downstream labeled data used for fine-tuning. In multilingual tasks such as SLID, the pre-trained speech model may not support all the languages in the downstream task. To address this challenge, we propose self-supervised adaptive pre-training (SAPT) to adapt the pre-trained model to the target domain and languages of the downstream task. We apply SAPT to the XLSR-128 model and investigate the effectiveness of this approach for the SLID task. First, we demonstrate that SAPT improves XLSR performance on the FLEURS benchmark with substantial gains up to 40.1% for under-represented languages. Second, we apply SAPT on four different datasets in a few-shot learning setting, showing that our approach improves the sample efficiency of XLSR during fine-tuning. Our experiments provide strong empirical evidence that continual adaptation via self-supervision improves downstream performance for multilingual speech models.

Title: Can a Transformer Represent a Kalman Filter?. (arXiv:2312.06937v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06937
Code URL: null
Copy Paste: [[2312.06937]] Can a Transformer Represent a Kalman Filter?(http://arxiv.org/abs/2312.06937)
Summary:
Transformers are a class of autoregressive deep learning architectures which have recently achieved state-of-the-art performance in various vision, language, and robotics tasks. We revisit the problem of Kalman Filtering in linear dynamical systems and show that Transformers can approximate the Kalman Filter in a strong sense. Specifically, for any observable LTI system we construct an explicit causally-masked Transformer which implements the Kalman Filter, up to a small additive error which is bounded uniformly in time; we call our construction the Transformer Filter. Our construction is based on a two-step reduction. We first show that a softmax self-attention block can exactly represent a certain Gaussian kernel smoothing estimator. We then show that this estimator closely approximates the Kalman Filter. We also investigate how the Transformer Filter can be used for measurement-feedback control and prove that the resulting nonlinear controllers closely approximate the performance of standard optimal control policies such as the LQG controller.

Title: Multi-Granularity Framework for Unsupervised Representation Learning of Time Series. (arXiv:2312.07248v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07248
Code URL: null
Copy Paste: [[2312.07248]] Multi-Granularity Framework for Unsupervised Representation Learning of Time Series(http://arxiv.org/abs/2312.07248)
Summary:
Representation learning plays a critical role in the analysis of time series data and has high practical value across a wide range of applications. including trend analysis, time series data retrieval and forecasting. In practice, data confusion is a significant issue as it can considerably impact the effectiveness and accuracy of data analysis, machine learning models and decision-making processes. In general, previous studies did not consider the variability at various levels of granularity, thus resulting in inadequate information utilization, which further exacerbated the issue of data confusion. This paper proposes an unsupervised framework to realize multi-granularity representation learning for time series. Specifically, we employed a cross-granularity transformer to develop an association between fine- and coarse-grained representations. In addition, we introduced a retrieval task as an unsupervised training task to learn the multi-granularity representation of time series. Moreover, a novel loss function was designed to obtain the comprehensive multi-granularity representation of the time series via unsupervised learning. The experimental results revealed that the proposed framework demonstrates significant advantages over alternative representation learning models.

generative

Title: Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning. (arXiv:2312.06699v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06699
Code URL: null
Copy Paste: [[2312.06699]] Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning(http://arxiv.org/abs/2312.06699)
Summary:
A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks. However, recent works have shown that the current models do not achieve a comprehensive understanding of the textual data during the training for the target downstream tasks. Orthogonal to the previous approaches to this limitation, we postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models. Hence, we utilize the knowledge of a pre-trained large language model (LLM) to generate text samples from the original ones, targeting specific sentence components. We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks. Through rigorous quantitative analysis, our proposed method exhibits significant improvement across several video-language tasks. In particular, our approach notably enhances video-text retrieval by a relative improvement of 8.3\% in video-to-text and 1.4\% in text-to-video retrieval over the baselines, in terms of R@1. Additionally, in video moment retrieval, average mAP shows a relative improvement ranging from 2.0\% to 13.7 \% across different baselines.

Title: Image Content Generation with Causal Reasoning. (arXiv:2312.07132v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07132
Code URL: https://github.com/ieit-agi/mix-shannon
Copy Paste: [[2312.07132]] Image Content Generation with Causal Reasoning(http://arxiv.org/abs/2312.07132)
Summary:
The emergence of ChatGPT has once again sparked research in generative artificial intelligence (GAI). While people have been amazed by the generated results, they have also noticed the reasoning potential reflected in the generated textual content. However, this current ability for causal reasoning is primarily limited to the domain of language generation, such as in models like GPT-3. In visual modality, there is currently no equivalent research. Considering causal reasoning in visual content generation is significant. This is because visual information contains infinite granularity. Particularly, images can provide more intuitive and specific demonstrations for certain reasoning tasks, especially when compared to coarse-grained text. Hence, we propose a new image generation task called visual question answering with image (VQAI) and establish a dataset of the same name based on the classic \textit{Tom and Jerry} animated series. Additionally, we develop a new paradigm for image generation to tackle the challenges of this task. Finally, we perform extensive experiments and analyses, including visualizations of the generated content and discussions on the potentials and limitations. The code and data are publicly available under the license of CC BY-NC-SA 4.0 for academic and non-commercial usage. The code and dataset are publicly available at: https://github.com/IEIT-AGI/MIX-Shannon/blob/main/projects/VQAI/lgd_vqai.md.

Title: SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models. (arXiv:2312.07492v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07492
Code URL: null
Copy Paste: [[2312.07492]] SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models(http://arxiv.org/abs/2312.07492)
Summary:
Current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. In this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. We start with a comprehensive list of 93 stigmas documented in social science literature and curate a question-answering (QA) dataset which involves simple social situations. Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. We present results for SocialStigmaQA with two widely used open source generative language models and we demonstrate that the output generated by these models considerably amplifies existing social bias against stigmatized groups. Specifically, we find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. We discover that the deliberate design of the templates in our benchmark (e.g., by adding biasing text to the prompt or varying the answer that indicates bias) impact the model tendencies to generate socially biased output. Additionally, we report on patterns in the generated chain-of-thought output, finding a variety of problems from subtle bias to evidence of a lack of reasoning.

Warning: This paper contains examples of text which is toxic, biased, and harmful.

large language model

Title: Audio-Visual LLM for Video Understanding. (arXiv:2312.06720v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06720
Code URL: null
Copy Paste: [[2312.06720]] Audio-Visual LLM for Video Understanding(http://arxiv.org/abs/2312.06720)
Summary:
This paper presents Audio-Visual LLM, a Multimodal Large Language Model that takes both visual and auditory inputs for holistic video understanding. A key design is the modality-augmented training, which involves the integration of modality-specific tokens engineered to activate the appropriate visual and/or auditory encoder selectively. This mechanism is pivotal in enabling end-to-end joint training with video data at different modalities, including visual-only, audio-only, and audio-visual formats. Moreover, we introduce a high-quality video instruction dataset, derived from GPT-4. This dataset allows Audio-Visual LLM to adeptly process a variety of task-oriented video instructions, ranging from multi-turn conversations and audio-visual narratives to complex reasoning tasks. Extensive experiments demonstrate that Audio-Visual LLM impressively achieves strong zero-shot results across a range of video understanding tasks. For example, Audio-Visual LLM achieves an accuracy of 53.7% on MSRVTT-QA, outperforming non-LLM-based InterVideo by 6.6% and LLM-based Valley by 4.4%, respectively. Additionally, our Audio-Visual LLM also achieves competitive performance on audio tasks (e.g., AudioCaps).

Title: EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models. (arXiv:2312.06722v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06722
Code URL: https://github.com/chenyi99/egoplan
Copy Paste: [[2312.06722]] EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models(http://arxiv.org/abs/2312.06722)
Summary:
Multimodal Large Language Models (MLLMs), building upon the powerful Large Language Models (LLMs) with exceptional reasoning and generalization capability, have opened up new avenues for embodied task planning. MLLMs excel in their ability to integrate diverse environmental inputs, such as real-time task progress, visual observations, and open-form language instructions, which are crucial for executable task planning. In this work, we introduce a benchmark with human annotations, EgoPlan-Bench, to quantitatively investigate the potential of MLLMs as embodied task planners in real-world scenarios. Our benchmark is distinguished by realistic tasks derived from real-world videos, a diverse set of actions involving interactions with hundreds of different objects, and complex visual observations from varied environments. We evaluate various open-source MLLMs, revealing that these models have not yet evolved into embodied planning generalists (even GPT-4V). We further construct an instruction-tuning dataset EgoPlan-IT from videos of human-object interactions, to facilitate the learning of high-level task planning in intricate real-world situations. The experiment results demonstrate that the model tuned on EgoPlan-IT not only significantly improves performance on our benchmark, but also effectively acts as embodied planner in simulations.

Title: Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator. (arXiv:2312.06731v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06731
Code URL: null
Copy Paste: [[2312.06731]] Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator(http://arxiv.org/abs/2312.06731)
Summary:
Large Language Models (LLMs) excel in understanding human instructions, driving the development of Multimodal LLMs (MLLMs) with instruction tuning. However, acquiring high-quality multimodal instruction tuning data poses a significant challenge. Previous approaches relying on GPT-4 for data generation proved expensive and exhibited unsatisfactory performance for certain tasks. To solve this, we present Genixer, an innovative data generation pipeline producing high-quality multimodal instruction tuning data for various tasks. Genixer collects datasets for ten prevalent multimodal tasks and designs instruction templates to transform these datasets into instruction-tuning data. It then trains pretrained MLLMs to generate task-specific instruction data and proposes an effective data filtering strategy to ensure high quality. To evaluate Genixer, a base MLLM model, Kakapo, is built and achieves SoTA performance in image captioning and visual question answering (VQA) tasks across multiple datasets. Experimental results show that filtered data from Genixer continually improves Kakapo for image captioning and VQA tasks. For the SoTA Shikra MLLM model on the image-region-related tasks, e.g., region caption and detection, Genixer also successfully generates corresponding data and improves its performance. Genixer opens avenues for generating high-quality multimodal instruction data for diverse tasks, enabling innovative applications across domains. The code and models will be released soon.

Title: SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models. (arXiv:2312.06739v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06739
Code URL: https://github.com/TencentARC/SmartEdit
Copy Paste: [[2312.06739]] SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models(http://arxiv.org/abs/2312.06739)
Summary:
Current instruction-based editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit, a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance their understanding and reasoning capabilities. However, direct integration of these elements still faces challenges in situations requiring complex reasoning. To mitigate this, we propose a Bidirectional Interaction Module that enables comprehensive bidirectional information interactions between the input image and the MLLM output. During training, we initially incorporate perception data to boost the perception and understanding capabilities of diffusion models. Subsequently, we demonstrate that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions. We further construct a new evaluation dataset, Reason-Edit, specifically tailored for complex instruction-based image editing. Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.

Title: Hallucination Augmented Contrastive Learning for Multimodal Large Language Model. (arXiv:2312.06968v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06968
Code URL: null
Copy Paste: [[2312.06968]] Hallucination Augmented Contrastive Learning for Multimodal Large Language Model(http://arxiv.org/abs/2312.06968)
Summary:
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyzed the representation distribution of textual and visual tokens in MLLM, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them. These two observations inspire us with a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use text with hallucination as hard negative examples, naturally bringing representations of non-hallucinative text and visual samples closer while pushing way representations of non-hallucinating and hallucinative text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMhal-Bench benchmark, our method obtains a 34.66% /29.5% improvement over the baseline MiniGPT-4/LLaVA.

Title: ThinkBot: Embodied Instruction Following with Thought Chain Reasoning. (arXiv:2312.07062v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07062
Code URL: null
Copy Paste: [[2312.07062]] ThinkBot: Embodied Instruction Following with Thought Chain Reasoning(http://arxiv.org/abs/2312.07062)
Summary:
Embodied Instruction Following (EIF) requires agents to complete human instruction by interacting objects in complicated surrounding environments. Conventional methods directly consider the sparse human instruction to generate action plans for agents, which usually fail to achieve human goals because of the instruction incoherence in action descriptions. On the contrary, we propose ThinkBot that reasons the thought chain in human instruction to recover the missing action descriptions, so that the agent can successfully complete human goals by following the coherent instruction. Specifically, we first design an instruction completer based on large language models to recover the missing actions with interacted objects between consecutive human instruction, where the perceived surrounding environments and the completed sub-goals are considered for instruction completion. Based on the partially observed scene semantic maps, we present an object localizer to infer the position of interacted objects for agents to achieve complex human goals. Extensive experiments in the simulated environment show that our ThinkBot outperforms the state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency.

Title: Efficient Few-Shot Clinical Task Adaptation with Large Language Models. (arXiv:2312.07125v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07125
Code URL: null
Copy Paste: [[2312.07125]] Efficient Few-Shot Clinical Task Adaptation with Large Language Models(http://arxiv.org/abs/2312.07125)
Summary:
Few-shot learning has been studied to adapt models to tasks with very few samples. It holds profound significance, particularly in clinical tasks, due to the high annotation cost of medical images. Several works have explored few-shot learning on medical images, yet they still require a large number of medical images for pre-training models to gain domain-specific priors. Vision foundation models recently have achieved remarkable success in natural images. Hence, adapting rapidly advancing vision foundation models from natural images to few-shot clinical tasks holds great promise. MedFMC has recently organized a challenge to shed more light on this topic at NeurIPS 2023. In this work, we present our challenge solution. We observe that a simple variant of fine-tuning with partial freezing shows remarkable performance. Empirical evidence demonstrates that this approach could outperform various common fine-tuning methods under limited sample sizes. Additionally, we explore enhanced utilization of semantic supervision to boost performance. We propose a novel approach that contextualizes labels via large language models (LLMs). Our findings reveal that the context generated by LLMs significantly enhances the discrimination of semantic embeddings for similar categories, resulting in a notable performance improvement of 3%-5% in 1-shot settings compared to commonly employed one-hot labels and other semantic supervision methods. Our solution secures the 1st place in the MedFMC challenge.

Title: MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception. (arXiv:2312.07472v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07472
Code URL: null
Copy Paste: [[2312.07472]] MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception(http://arxiv.org/abs/2312.07472)
Summary:
It is a long-lasting goal to design an embodied system that can solve long-horizon open-world tasks in human-like ways. However, existing approaches usually struggle with compound difficulties caused by the logic-aware decomposition and context-aware execution of these tasks. To this end, we introduce MP5, an open-ended multimodal embodied system built upon the challenging Minecraft simulator, which can decompose feasible sub-objectives, design sophisticated situation-aware plans, and perform embodied action control, with frequent communication with a goal-conditioned active perception scheme. Specifically, MP5 is developed on top of recent advances in Multimodal Large Language Models (MLLMs), and the system is modulated into functional modules that can be scheduled and collaborated to ultimately solve pre-defined context- and process-dependent tasks. Extensive experiments prove that MP5 can achieve a 22% success rate on difficult process-dependent tasks and a 91% success rate on tasks that heavily depend on the context. Moreover, MP5 exhibits a remarkable ability to address many open-ended tasks that are entirely novel.

Title: LMDrive: Closed-Loop End-to-End Driving with Large Language Models. (arXiv:2312.07488v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07488
Code URL: null
Copy Paste: [[2312.07488]] LMDrive: Closed-Loop End-to-End Driving with Large Language Models(http://arxiv.org/abs/2312.07488)
Summary:
Despite significant recent progress in the field of autonomous driving, modern methods still struggle and can incur serious accidents when encountering long-tail unforeseen events and challenging urban scenarios. On the one hand, large language models (LLM) have shown impressive reasoning capabilities that approach "Artificial General Intelligence". On the other hand, previous autonomous driving methods tend to rely on limited-format inputs (e.g. sensor data and navigation waypoints), restricting the vehicle's ability to understand language information and interact with humans. To this end, this paper introduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous driving framework. LMDrive uniquely processes and integrates multi-modal sensor data with natural language instructions, enabling interaction with humans and navigation software in realistic instructional settings. To facilitate further research in language-based closed-loop autonomous driving, we also publicly release the corresponding dataset which includes approximately 64K instruction-following data clips, and the LangAuto benchmark that tests the system's ability to handle complex instructions and challenging driving scenarios. Extensive closed-loop experiments are conducted to demonstrate LMDrive's effectiveness. To the best of our knowledge, we're the very first work to leverage LLMs for closed-loop end-to-end autonomous driving. Codes can be found at https://github.com/opendilab/LMDrive

Title: Steering Llama 2 via Contrastive Activation Addition. (arXiv:2312.06681v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.06681
Code URL: https://github.com/wusche1/caa_hallucination
Copy Paste: [[2312.06681]] Steering Llama 2 via Contrastive Activation Addition(http://arxiv.org/abs/2312.06681)
Summary:
We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying activations during their forward passes. CAA computes ``steering vectors'' by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using both multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, outperforms traditional methods like finetuning and few-shot prompting, and minimally reduces capabilities. Moreover, by employing various activation space interpretation methods, we gain deeper insights into CAA's mechanisms. CAA both accurately steers model outputs and also sheds light on how high-level concepts are represented in Large Language Models (LLMs).

Title: Get an A in Math: Progressive Rectification Prompting. (arXiv:2312.06867v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.06867
Code URL: https://github.com/wzy6642/PRP
Copy Paste: [[2312.06867]] Get an A in Math: Progressive Rectification Prompting(http://arxiv.org/abs/2312.06867)
Summary:
Chain-of-Thought (CoT) prompting methods have enabled large language models (LLMs) to generate reasoning paths and solve math word problems (MWPs). However, they are sensitive to mistakes in the paths, as any mistake can result in an incorrect answer. We propose a novel method named Progressive Rectification Prompting (PRP) to improve average accuracy on eight MWP datasets from 77.3 to 90.5. Given an initial answer from CoT, PRP iterates a verify-then-rectify process to progressively identify incorrect answers and rectify the reasoning paths. With the most likely correct answer, the LLM predicts a masked numerical value in the question; if the prediction does not match the masked value, the answer is likely incorrect. Then the LLM is prompted to re-generate the reasoning path hinted with a set of incorrect answers to prevent itself from repeating previous mistakes. PRP achieves the best performance compared against the CoT methods. Our implementation is made publicly available at https://wzy6642.github.io/prp.github.io/.

Title: SM70: A Large Language Model for Medical Devices. (arXiv:2312.06974v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.06974
Code URL: null
Copy Paste: [[2312.06974]] SM70: A Large Language Model for Medical Devices(http://arxiv.org/abs/2312.06974)
Summary:
We are introducing SM70, a 70 billion-parameter Large Language Model that is specifically designed for SpassMed's medical devices under the brand name 'JEE1' (pronounced as G1 and means 'Life'). This large language model provides more accurate and safe responses to medical-domain questions. To fine-tune SM70, we used around 800K data entries from the publicly available dataset MedAlpaca. The Llama2 70B open-sourced model served as the foundation for SM70, and we employed the QLoRA technique for fine-tuning. The evaluation is conducted across three benchmark datasets - MEDQA - USMLE, PUBMEDQA, and USMLE - each representing a unique aspect of medical knowledge and reasoning. The performance of SM70 is contrasted with other notable LLMs, including Llama2 70B, Clinical Camel 70 (CC70), GPT 3.5, GPT 4, and Med-Palm, to provide a comparative understanding of its capabilities within the medical domain. Our results indicate that SM70 outperforms several established models in these datasets, showcasing its proficiency in handling a range of medical queries, from fact-based questions derived from PubMed abstracts to complex clinical decision-making scenarios. The robust performance of SM70, particularly in the USMLE and PUBMEDQA datasets, suggests its potential as an effective tool in clinical decision support and medical information retrieval. Despite its promising results, the paper also acknowledges the areas where SM70 lags behind the most advanced model, GPT 4, thereby highlighting the need for further development, especially in tasks demanding extensive medical knowledge and intricate reasoning.

Title: Alignment for Honesty. (arXiv:2312.07000v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07000
Code URL: https://github.com/gair-nlp/alignment-for-honesty
Copy Paste: [[2312.07000]] Alignment for Honesty(http://arxiv.org/abs/2312.07000)
Summary:
Recent research has made significant strides in applying alignment techniques to enhance the helpfulness and harmlessness of large language models (LLMs) in accordance with human intentions. In this paper, we argue for the importance of alignment for honesty, ensuring that LLMs proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. However, a pivotal aspect of alignment for honesty involves discerning the limits of an LLM's knowledge, which is far from straightforward. This challenge demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. In this paper, we address these challenges by first establishing a precise problem definition and defining ``honesty'' inspired by the Analects of Confucius. This serves as a cornerstone for developing metrics that effectively measure an LLM's honesty by quantifying its progress post-alignment. Furthermore, we introduce a flexible training framework which is further instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Our extensive experiments reveal that these aligned models show a marked increase in honesty, as indicated by our proposed metrics. We open-source a wealth of resources to facilitate future research at https://github.com/GAIR-NLP/alignment-for-honesty, including honesty-aligned models, training and evaluation datasets for honesty alignment, concept glossary, as well as all relevant source code.

Title: Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models. (arXiv:2312.07046v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07046
Code URL: https://github.com/transmuteai/trailmet
Copy Paste: [[2312.07046]] Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models(http://arxiv.org/abs/2312.07046)
Summary:
Due to the substantial scale of Large Language Models (LLMs), the direct application of conventional compression methodologies proves impractical. The computational demands associated with even minimal gradient updates present challenges, particularly on consumer-grade hardware. This paper introduces an innovative approach for the parametric and practical compression of LLMs based on reduced order modelling, which entails low-rank decomposition within the feature space and re-parameterization in the weight space. Notably, this compression technique operates in a layer-wise manner, obviating the need for a GPU device and enabling the compression of billion-scale models within stringent constraints of both memory and time. Our method represents a significant advancement in model compression by leveraging matrix decomposition, demonstrating superior efficacy compared to the prevailing state-of-the-art structured pruning method.

Title: Context Matter: Data-Efficient Augmentation of Large Language Models for Scientific Applications. (arXiv:2312.07069v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07069
Code URL: null
Copy Paste: [[2312.07069]] Context Matter: Data-Efficient Augmentation of Large Language Models for Scientific Applications(http://arxiv.org/abs/2312.07069)
Summary:
In this paper, we explore the challenges inherent to Large Language Models (LLMs) like GPT-4, particularly their propensity for hallucinations, logic mistakes, and incorrect conclusions when tasked with answering complex questions. The capacity of LLMs to present erroneous answers in a coherent and semantically rigorous manner further complicates the detection of factual inaccuracies. This issue is especially pronounced in fields that require specialized expertise. Our work delves into these challenges, aiming to enhance the understanding and mitigation of such errors, thereby contributing to the improvement of LLM accuracy and reliability in scientific and other specialized domains. Our findings reveal a non-linear relationship between the context's relevancy and the answers' measured quality. In addition, we demonstrate that with the correct calibration, it is possible to automate the grading procedure -- a finding suggesting that, at least to some degree, the LLMs can be used to self-examine the quality of their own performance. Finally, we describe an experimental platform that can be seen as a proof-of-concept of the techniques described in this work.

Title: Multilingual large language models leak human stereotypes across language boundaries. (arXiv:2312.07141v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07141
Code URL: https://github.com/annasou/stereotype_leakage
Copy Paste: [[2312.07141]] Multilingual large language models leak human stereotypes across language boundaries(http://arxiv.org/abs/2312.07141)
Summary:
Multilingual large language models have been increasingly popular for their proficiency in comprehending and generating text across various languages. Previous research has shown that the presence of stereotypes and biases in monolingual large language models can be attributed to the nature of their training data, which is collected from humans and reflects societal biases. Multilingual language models undergo the same training procedure as monolingual ones, albeit with training data sourced from various languages. This raises the question: do stereotypes present in one social context leak across languages within the model? In our work, we first define the term ``stereotype leakage'' and propose a framework for its measurement. With this framework, we investigate how stereotypical associations leak across four languages: English, Russian, Chinese, and Hindi. To quantify the stereotype leakage, we employ an approach from social psychology, measuring stereotypes via group-trait associations. We evaluate human stereotypes and stereotypical associations manifested in multilingual large language models such as mBERT, mT5, and ChatGPT. Our findings show a noticeable leakage of positive, negative, and non-polar associations across all languages. Notably, Hindi within multilingual models appears to be the most susceptible to influence from other languages, while Chinese is the least. Additionally, ChatGPT exhibits a better alignment with human scores than other models.

Title: Classifying complex documents: comparing bespoke solutions to large language models. (arXiv:2312.07182v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07182
Code URL: null
Copy Paste: [[2312.07182]] Classifying complex documents: comparing bespoke solutions to large language models(http://arxiv.org/abs/2312.07182)
Summary:
Here we search for the best automated classification approach for a set of complex legal documents. Our classification task is not trivial: our aim is to classify ca 30,000 public courthouse records from 12 states and 267 counties at two different levels using nine sub-categories. Specifically, we investigated whether a fine-tuned large language model (LLM) can achieve the accuracy of a bespoke custom-trained model, and what is the amount of fine-tuning necessary.

Title: SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion. (arXiv:2312.07305v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07305
Code URL: null
Copy Paste: [[2312.07305]] SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion(http://arxiv.org/abs/2312.07305)
Summary:
Sparse attention as a efficient method can significantly decrease the computation cost, but current sparse attention tend to rely on window self attention which block the global information flow. For this problem, we present Shifted Cross Chunk Attention (SCCA), using different KV shifting strategy to extend respective field in each attention layer. Except, we combine Dilated Attention(DA) and Dilated Neighborhood Attention(DNA) to present Shifted Dilated Attention(SDA). Both SCCA and SDA can accumulate attention results in multi head attention to obtain approximate respective field in full attention. In this paper, we conduct language modeling experiments using different pattern of SCCA and combination of SCCA and SDA. The proposed shifted cross chunk attention (SCCA) can effectively extend large language models (LLMs) to longer context combined with Positional interpolation(PI) and LoRA than current sparse attention. Notably, SCCA adopts LLaMA2 7B from 4k context to 8k in single V100. This attention pattern can provide a Plug-and-play fine-tuning method to extend model context while retaining their original architectures, and is compatible with most existing techniques.

Title: Large Language Models are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales. (arXiv:2312.07399v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.07399
Code URL: null
Copy Paste: [[2312.07399]] Large Language Models are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales(http://arxiv.org/abs/2312.07399)
Summary:
Machine reasoning has made great progress in recent years owing to large language models (LLMs). In the clinical domain, however, most NLP-driven projects mainly focus on clinical classification or reading comprehension, and under-explore clinical reasoning for disease diagnosis due to the expensive rationale annotation with clinicians. In this work, we present a ``reasoning-aware'' diagnosis framework that rationalizes the diagnostic process via prompt-based learning in a time- and labor-efficient manner, and learns to reason over the prompt-generated rationales. Specifically, we address the clinical reasoning for disease diagnosis, where the LLM generates diagnostic rationales providing its insight on presented patient data and the reasoning path towards the diagnosis, namely Clinical Chain-of-Thought (Clinical CoT). We empirically demonstrate LLMs/LMs' ability of clinical reasoning via extensive experiments and analyses on both rationale generation and disease diagnosis in various settings. We further propose a novel set of criteria for evaluating machine-generated rationales' potential for real-world clinical settings, facilitating and benefiting future research in this area.

Title: Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI. (arXiv:2312.06941v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.06941
Code URL: null
Copy Paste: [[2312.06941]] Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI(http://arxiv.org/abs/2312.06941)
Summary:
This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector, particularly during standard and promotional sales periods. Utilizing a controlled experimental setup with 123 human forecasters and five LLMs, including ChatGPT4, ChatGPT3.5, Bard, Bing, and Llama2, we evaluated forecasting precision through Mean Absolute Percentage Error. Our analysis centered on the effect of the following factors on forecasters performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact. The findings indicate that LLMs do not consistently outperform humans in forecasting accuracy and that advanced statistical forecasting models do not uniformly enhance the performance of either human forecasters or LLMs. Both human and LLM forecasters exhibited increased forecasting errors, particularly during promotional periods and under the influence of positive external impacts. Our findings call for careful consideration when integrating LLMs into practical forecasting processes.

Title: HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts. (arXiv:2312.07035v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.07035
Code URL: https://github.com/giangdip2410/hyperrouter
Copy Paste: [[2312.07035]] HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts(http://arxiv.org/abs/2312.07035)
Summary:
By routing input tokens to only a few split experts, Sparse Mixture-of-Experts has enabled efficient training of large language models. Recent findings suggest that fixing the routers can achieve competitive performance by alleviating the collapsing problem, where all experts eventually learn similar representations. However, this strategy has two key limitations: (i) the policy derived from random routers might be sub-optimal, and (ii) it requires extensive resources during training and evaluation, leading to limited efficiency gains. This work introduces \HyperRout, which dynamically generates the router's parameters through a fixed hypernetwork and trainable embeddings to achieve a balance between training the routers and freezing them to learn an improved routing policy. Extensive experiments across a wide range of tasks demonstrate the superior performance and efficiency gains of \HyperRouter compared to existing routing methods. Our implementation is publicly available at {\url{{https://github.com/giangdip2410/HyperRouter}}}.

segmentation

Title: OpenSD: Unified Open-Vocabulary Segmentation and Detection. (arXiv:2312.06703v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06703
Code URL: null
Copy Paste: [[2312.06703]] OpenSD: Unified Open-Vocabulary Segmentation and Detection(http://arxiv.org/abs/2312.06703)
Summary:
Recently, a few open-vocabulary methods have been proposed by employing a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind the task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited due to the inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which utilizes the same architecture and network parameters to handle open-vocabulary segmentation and detection tasks. First, we introduce a decoder decoupled learning strategy to alleviate the semantic conflict between thing and staff categories so that each individual task can be learned more effectively under the same framework. Second, to better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain, respectively. The text encoder is further trained to be region-aware for both thing and stuff categories through decoupled prompt learning, enabling them to filter out duplicated and low-quality predictions, which is important to end-to-end segmentation and detection. Extensive experiments are conducted on multiple datasets under various circumstances. The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings. Code is available at https://github.com/strongwolf/OpenSD

Title: AM-RADIO: Agglomerative Model -- Reduce All Domains Into One. (arXiv:2312.06709v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06709
Code URL: null
Copy Paste: [[2312.06709]] AM-RADIO: Agglomerative Model -- Reduce All Domains Into One(http://arxiv.org/abs/2312.06709)
Summary:
A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are trained with distinct objectives, exhibiting unique characteristics for various downstream tasks. We find that despite their conceptual differences, these models can be effectively merged into a unified model through multi-teacher distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One). This integrative approach not only surpasses the performance of individual teacher models but also amalgamates their distinctive features, such as zero-shot vision-language comprehension, detailed pixel-level understanding, and open vocabulary segmentation capabilities. In pursuit of the most hardware-efficient backbone, we evaluated numerous architectures in our multi-teacher distillation pipeline using the same training recipe. This led to the development of a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models. Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20k semantic segmentation, COCO object detection and LLaVa-1.5 framework.

Code: https://github.com/NVlabs/RADIO

Title: Counterfactual World Modeling for Physical Dynamics Understanding. (arXiv:2312.06721v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06721
Code URL: null
Copy Paste: [[2312.06721]] Counterfactual World Modeling for Physical Dynamics Understanding(http://arxiv.org/abs/2312.06721)
Summary:
The ability to understand physical dynamics is essential to learning agents acting in the world. This paper presents Counterfactual World Modeling (CWM), a candidate pure vision foundational model for physical dynamics understanding. CWM consists of three basic concepts. First, we propose a simple and powerful temporally-factored masking policy for masked prediction of video data, which encourages the model to learn disentangled representations of scene appearance and dynamics. Second, as a result of the factoring, CWM is capable of generating counterfactual next-frame predictions by manipulating a few patch embeddings to exert meaningful control over scene dynamics. Third, the counterfactual modeling capability enables the design of counterfactual queries to extract vision structures similar to keypoints, optical flows, and segmentations, which are useful for dynamics understanding. We show that zero-shot readouts of these structures extracted by the counterfactual queries attain competitive performance to prior methods on real-world datasets. Finally, we demonstrate that CWM achieves state-of-the-art performance on the challenging Physion benchmark for evaluating physical dynamics understanding.

Title: A Multimodal Dataset and Benchmark for Radio Galaxy and Infrared Host Detection. (arXiv:2312.06728v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06728
Code URL: null
Copy Paste: [[2312.06728]] A Multimodal Dataset and Benchmark for Radio Galaxy and Infrared Host Detection(http://arxiv.org/abs/2312.06728)
Summary:
We present a novel multimodal dataset developed by expert astronomers to automate the detection and localisation of multi-component extended radio galaxies and their corresponding infrared hosts. The dataset comprises 4,155 instances of galaxies in 2,800 images with both radio and infrared modalities. Each instance contains information on the extended radio galaxy class, its corresponding bounding box that encompasses all of its components, pixel-level segmentation mask, and the position of its corresponding infrared host galaxy. Our dataset is the first publicly accessible dataset that includes images from a highly sensitive radio telescope, infrared satellite, and instance-level annotations for their identification. We benchmark several object detection algorithms on the dataset and propose a novel multimodal approach to identify radio galaxies and the positions of infrared hosts simultaneously.

Title: SqueezeSAM: User friendly mobile interactive segmentation. (arXiv:2312.06736v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06736
Code URL: null
Copy Paste: [[2312.06736]] SqueezeSAM: User friendly mobile interactive segmentation(http://arxiv.org/abs/2312.06736)
Summary:
Segment Anything Model (SAM) is a foundation model for interactive segmentation, and it has catalyzed major advances in generative AI, computational photography, and medical imaging. This model takes in an arbitrary user input and provides segmentation masks of the corresponding objects. It is our goal to develop a version of SAM that is appropriate for use in a photography app. The original SAM model has a few challenges in this setting. First, original SAM a 600 million parameter based on ViT-H, and its high computational cost and large model size that are not suitable for todays mobile hardware. We address this by proposing the SqueezeSAM model architecture, which is 50x faster and 100x smaller than SAM. Next, when a user takes a photo on their phone, it might not occur to them to click on the image and get a mask. Our solution is to use salient object detection to generate the first few clicks. This produces an initial segmentation mask that the user can interactively edit. Finally, when a user clicks on an object, they typically expect all related pieces of the object to be segmented. For instance, if a user clicks on a person t-shirt in a photo, they expect the whole person to be segmented, but SAM typically segments just the t-shirt. We address this with a new data augmentation scheme, and the end result is that if the user clicks on a person holding a basketball, the person and the basketball are all segmented together.

Title: Densify Your Labels: Unsupervised Clustering with Bipartite Matching for Weakly Supervised Point Cloud Segmentation. (arXiv:2312.06799v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06799
Code URL: null
Copy Paste: [[2312.06799]] Densify Your Labels: Unsupervised Clustering with Bipartite Matching for Weakly Supervised Point Cloud Segmentation(http://arxiv.org/abs/2312.06799)
Summary:
We propose a weakly supervised semantic segmentation method for point clouds that predicts "per-point" labels from just "whole-scene" annotations while achieving the performance of recent fully supervised approaches. Our core idea is to propagate the scene-level labels to each point in the point cloud by creating pseudo labels in a conservative way. Specifically, we over-segment point cloud features via unsupervised clustering and associate scene-level labels with clusters through bipartite matching, thus propagating scene labels only to the most relevant clusters, leaving the rest to be guided solely via unsupervised clustering. We empirically demonstrate that over-segmentation and bipartite assignment plays a crucial role. We evaluate our method on ScanNet and S3DIS datasets, outperforming state of the art, and demonstrate that we can achieve results comparable to fully supervised methods.

Title: Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment. (arXiv:2312.06960v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06960
Code URL: null
Copy Paste: [[2312.06960]] Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment(http://arxiv.org/abs/2312.06960)
Summary:
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision language model (VLM) for remote sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.

Title: MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving. (arXiv:2312.06988v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.06988
Code URL: https://github.com/jiangxb98/mwsis-plugin
Copy Paste: [[2312.06988]] MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving(http://arxiv.org/abs/2312.06988)
Summary:
Instance segmentation is a fundamental research in computer vision, especially in autonomous driving. However, manual mask annotation for instance segmentation is quite time-consuming and costly. To address this problem, some prior works attempt to apply weakly supervised manner by exploring 2D or 3D boxes. However, no one has ever successfully segmented 2D and 3D instances simultaneously by only using 2D box annotations, which could further reduce the annotation cost by an order of magnitude. Thus, we propose a novel framework called Multimodal Weakly Supervised Instance Segmentation (MWSIS), which incorporates various fine-grained label generation and correction modules for both 2D and 3D modalities to improve the quality of pseudo labels, along with a new multimodal cross-supervision approach, named Consistency Sparse Cross-modal Supervision (CSCS), to reduce the inconsistency of multimodal predictions by response distillation. Particularly, transferring the 3D backbone to downstream tasks not only improves the performance of the 3D detectors, but also outperforms fully supervised instance segmentation with only 5% fully supervised annotations. On the Waymo dataset, the proposed framework demonstrates significant improvements over the baseline, especially achieving 2.59% mAP and 12.75% mAP increases for 2D and 3D instance segmentation tasks, respectively. The code is available at https://github.com/jiangxb98/mwsis-plugin.

Title: Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation. (arXiv:2312.07051v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07051
Code URL: null
Copy Paste: [[2312.07051]] Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation(http://arxiv.org/abs/2312.07051)
Summary:
Automatic estimation of 3D human pose from monocular RGB images is a challenging and unsolved problem in computer vision. In a supervised manner, approaches heavily rely on laborious annotations and present hampered generalization ability due to the limited diversity of 3D pose datasets. To address these challenges, we propose a unified framework that leverages mask as supervision for unsupervised 3D pose estimation. With general unsupervised segmentation algorithms, the proposed model employs skeleton and physique representations that exploit accurate pose information from coarse to fine. Compared with previous unsupervised approaches, we organize the human skeleton in a fully unsupervised way which enables the processing of annotation-free data and provides ready-to-use estimation results. Comprehensive experiments demonstrate our state-of-the-art pose estimation performance on Human3.6M and MPI-INF-3DHP datasets. Further experiments on in-the-wild datasets also illustrate the capability to access more data to boost our model. Code will be available at https://github.com/Charrrrrlie/Mask-as-Supervision.

Title: MaxQ: Multi-Axis Query for N:M Sparsity Network. (arXiv:2312.07061v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07061
Code URL: https://github.com/jingyangxiang/maxq
Copy Paste: [[2312.07061]] MaxQ: Multi-Axis Query for N:M Sparsity Network(http://arxiv.org/abs/2312.07061)
Summary:
N:M sparsity has received increasing attention due to its remarkable performance and latency trade-off compared with structured and unstructured sparsity. However, existing N:M sparsity methods do not differentiate the relative importance of weights among blocks and leave important weights underappreciated. Besides, they directly apply N:M sparsity to the whole network, which will cause severe information loss. Thus, they are still sub-optimal. In this paper, we propose an efficient and effective Multi-Axis Query methodology, dubbed as MaxQ, to rectify these problems. During the training, MaxQ employs a dynamic approach to generate soft N:M masks, considering the weight importance across multiple axes. This method enhances the weights with more importance and ensures more effective updates. Meanwhile, a sparsity strategy that gradually increases the percentage of N:M weight blocks is applied, which allows the network to heal from the pruning-induced damage progressively. During the runtime, the N:M soft masks can be precomputed as constants and folded into weights without causing any distortion to the sparse pattern and incurring additional computational overhead. Comprehensive experiments demonstrate that MaxQ achieves consistent improvements across diverse CNN architectures in various computer vision tasks, including image classification, object detection and instance segmentation. For ResNet50 with 1:16 sparse pattern, MaxQ can achieve 74.6\% top-1 accuracy on ImageNet and improve by over 2.8\% over the state-of-the-art.

Title: Semi-supervised Active Learning for Video Action Detection. (arXiv:2312.07169v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07169
Code URL: null
Copy Paste: [[2312.07169]] Semi-supervised Active Learning for Video Action Detection(http://arxiv.org/abs/2312.07169)
Summary:
In this work, we focus on label efficient learning for video action detection. We develop a novel semi-supervised active learning approach which utilizes both labeled as well as unlabeled data along with informative sample selection for action detection. Video action detection requires spatio-temporal localization along with classification, which poses several challenges for both active learning informative sample selection as well as semi-supervised learning pseudo label generation. First, we propose NoiseAug, a simple augmentation strategy which effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering which enables effective utilization of pseudo label for SSL in video action detection by emphasizing on relevant activity region within a video. We evaluate the proposed approach on three different benchmark datasets, UCF-101-24, JHMDB-21, and Youtube-VOS. First, we demonstrate its effectiveness on video action detection where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning along with several baseline approaches in both UCF101-24 and JHMDB-21. Next, we also show its effectiveness on Youtube-VOS for video object segmentation demonstrating its generalization capability for other dense prediction tasks in videos.

Title: MCFNet: Multi-scale Covariance Feature Fusion Network for Real-time Semantic Segmentation. (arXiv:2312.07207v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07207
Code URL: null
Copy Paste: [[2312.07207]] MCFNet: Multi-scale Covariance Feature Fusion Network for Real-time Semantic Segmentation(http://arxiv.org/abs/2312.07207)
Summary:
The low-level spatial detail information and high-level semantic abstract information are both essential to the semantic segmentation task. The features extracted by the deep network can obtain rich semantic information, while a lot of spatial information is lost. However, how to recover spatial detail information effectively and fuse it with high-level semantics has not been well addressed so far. In this paper, we propose a new architecture based on Bilateral Segmentation Network (BiseNet) called Multi-scale Covariance Feature Fusion Network (MCFNet). Specifically, this network introduces a new feature refinement module and a new feature fusion module. Furthermore, a gating unit named L-Gate is proposed to filter out invalid information and fuse multi-scale features. We evaluate our proposed model on Cityscapes, CamVid datasets and compare it with the state-of-the-art methods. Extensive experiments show that our method achieves competitive success. On Cityscapes, we achieve 75.5% mIOU with a speed of 151.3 FPS.

Title: Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation. (arXiv:2312.07221v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07221
Code URL: null
Copy Paste: [[2312.07221]] Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation(http://arxiv.org/abs/2312.07221)
Summary:
Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set, which limits their application in real-world scenarios due to the lack of generalization ability. Large-scale visual-language pre-trained models, such as CLIP, have shown their generalization ability in the zero-shot 2D vision tasks, but are still unable to be applied to 3D semantic segmentation directly. In this work, we focus on zero-shot point cloud semantic segmentation and propose a simple yet effective baseline to transfer the visual-linguistic knowledge implied in CLIP to point cloud encoder at both feature and output levels. Both feature-level and output-level alignments are conducted between 2D and 3D encoders for effective knowledge transfer. Concretely, a Multi-granularity Cross-modal Feature Alignment (MCFA) module is proposed to align 2D and 3D features from global semantic and local position perspectives for feature-level alignment. For the output level, per-pixel pseudo labels of unseen classes are extracted using the pre-trained CLIP model as supervision for the 3D segmentation model to mimic the behavior of the CLIP image encoder. Extensive experiments are conducted on two popular benchmarks of point cloud segmentation. Our method outperforms significantly previous state-of-the-art methods under zero-shot setting (+29.2% mIoU on SemanticKITTI and 31.8% mIoU on nuScenes), and further achieves promising results in the annotation-free point cloud semantic segmentation setting, showing its great potential for label-efficient learning.

Title: Dual Structure-Preserving Image Filterings for Semi-supervised Medical Image Segmentation. (arXiv:2312.07264v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07264
Code URL: null
Copy Paste: [[2312.07264]] Dual Structure-Preserving Image Filterings for Semi-supervised Medical Image Segmentation(http://arxiv.org/abs/2312.07264)
Summary:
Semi-supervised image segmentation has attracted great attention recently. The key is how to leverage unlabeled images in the training process. Most methods maintain consistent predictions of the unlabeled images under variations (e.g., adding noise/perturbations, or creating alternative versions) in the image and/or model level. In most image-level variation, medical images often have prior structure information, which has not been well explored. In this paper, we propose novel dual structure-preserving image filterings (DSPIF) as the image-level variations for semi-supervised medical image segmentation. Motivated by connected filtering that simplifies image via filtering in structure-aware tree-based image representation, we resort to the dual contrast invariant Max-tree and Min-tree representation. Specifically, we propose a novel connected filtering that removes topologically equivalent nodes (i.e. connected components) having no siblings in the Max/Min-tree. This results in two filtered images preserving topologically critical structure. Applying such dual structure-preserving image filterings in mutual supervision is beneficial for semi-supervised medical image segmentation. Extensive experimental results on three benchmark datasets demonstrate that the proposed method significantly/consistently outperforms some state-of-the-art methods. The source codes will be publicly available.

Title: Benchmarking Pretrained Vision Embeddings for Near- and Duplicate Detection in Medical Images. (arXiv:2312.07273v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07273
Code URL: null
Copy Paste: [[2312.07273]] Benchmarking Pretrained Vision Embeddings for Near- and Duplicate Detection in Medical Images(http://arxiv.org/abs/2312.07273)
Summary:
Near- and duplicate image detection is a critical concern in the field of medical imaging. Medical datasets often contain similar or duplicate images from various sources, which can lead to significant performance issues and evaluation biases, especially in machine learning tasks due to data leakage between training and testing subsets. In this paper, we present an approach for identifying near- and duplicate 3D medical images leveraging publicly available 2D computer vision embeddings. We assessed our approach by comparing embeddings extracted from two state-of-the-art self-supervised pretrained models and two different vector index structures for similarity retrieval. We generate an experimental benchmark based on the publicly available Medical Segmentation Decathlon dataset. The proposed method yields promising results for near- and duplicate image detection achieving a mean sensitivity and specificity of 0.9645 and 0.8559, respectively.

Title: Expand-and-Quantize: Unsupervised Semantic Segmentation Using High-Dimensional Space and Product Quantization. (arXiv:2312.07342v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07342
Code URL: null
Copy Paste: [[2312.07342]] Expand-and-Quantize: Unsupervised Semantic Segmentation Using High-Dimensional Space and Product Quantization(http://arxiv.org/abs/2312.07342)
Summary:
Unsupervised semantic segmentation (USS) aims to discover and recognize meaningful categories without any labels. For a successful USS, two key abilities are required: 1) information compression and 2) clustering capability. Previous methods have relied on feature dimension reduction for information compression, however, this approach may hinder the process of clustering. In this paper, we propose a novel USS framework called Expand-and-Quantize Unsupervised Semantic Segmentation (EQUSS), which combines the benefits of high-dimensional spaces for better clustering and product quantization for effective information compression. Our extensive experiments demonstrate that EQUSS achieves state-of-the-art results on three standard benchmarks. In addition, we analyze the entropy of USS features, which is the first step towards understanding USS from the perspective of information theory.

Title: Adversarial Semi-Supervised Domain Adaptation for Semantic Segmentation: A New Role for Labeled Target Samples. (arXiv:2312.07370v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07370
Code URL: null
Copy Paste: [[2312.07370]] Adversarial Semi-Supervised Domain Adaptation for Semantic Segmentation: A New Role for Labeled Target Samples(http://arxiv.org/abs/2312.07370)
Summary:
Adversarial learning baselines for domain adaptation (DA) approaches in the context of semantic segmentation are under explored in semi-supervised framework. These baselines involve solely the available labeled target samples in the supervision loss. In this work, we propose to enhance their usefulness on both semantic segmentation and the single domain classifier neural networks. We design new training objective losses for cases when labeled target data behave as source samples or as real target samples. The underlying rationale is that considering the set of labeled target samples as part of source domain helps reducing the domain discrepancy and, hence, improves the contribution of the adversarial loss. To support our approach, we consider a complementary method that mixes source and labeled target data, then applies the same adaptation process. We further propose an unsupervised selection procedure using entropy to optimize the choice of labeled target samples for adaptation. We illustrate our findings through extensive experiments on the benchmarks GTA5, SYNTHIA, and Cityscapes. The empirical evaluation highlights competitive performance of our proposed approach.

Title: Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects. (arXiv:2312.07374v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07374
Code URL: https://github.com/jyLin8100/GenSAM
Copy Paste: [[2312.07374]] Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects(http://arxiv.org/abs/2312.07374)
Summary:
Camouflaged object detection (COD) approaches heavily rely on pixel-level annotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse annotations like scribbles or points to reduce annotation effort, but this can lead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable segmentation ability with sparse prompts like points. However, manual prompt is not always feasible, as it may not be accessible in real-world application. Additionally, it only provides localization information instead of semantic one, which can intrinsically cause ambiguity in interpreting the targets. In this work, we aim to eliminate the need for manual prompt. The key idea is to employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts using the semantic information given by a generic text prompt.To that end, we introduce a test-time adaptation per-instance mechanism called Generalizable SAM (GenSAM) to automatically enerate and optimize visual prompts the generic task prompt for WSCOD. In particular, CCTP maps a single generic text prompt onto image-specific consensus foreground and background heatmaps using vision-language models, acquiring reliable visual prompts. Moreover, to test-time adapt the visual prompts, we further propose Progressive Mask Generation (PMG) to iteratively reweight the input image, guiding the model to focus on the targets in a coarse-to-fine manner. Crucially, all network parameters are fixed, avoiding the need for additional training. Experiments demonstrate the superiority of GenSAM. Experiments on three benchmarks demonstrate that GenSAM outperforms point supervision approaches and achieves comparable results to scribble supervision ones, solely relying on general task descriptions as prompts. our codes is in: https://lwpyh.github.io/GenSAM/.

Title: ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Medical Image. (arXiv:2312.07381v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.07381
Code URL: https://github.com/halleewong/ScribblePrompt
Copy Paste: [[2312.07381]] ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Medical Image(http://arxiv.org/abs/2312.07381)
Summary:
Semantic medical image segmentation is a crucial part of both scientific research and clinical care. With enough labelled data, deep learning models can be trained to accurately automate specific medical image segmentation tasks. However, manually segmenting images to create training data is highly labor intensive. In this paper, we present ScribblePrompt, an interactive segmentation framework for medical imaging that enables human annotators to segment unseen structures using scribbles, clicks, and bounding boxes. Scribbles are an intuitive and effective form of user interaction for complex tasks, however most existing methods focus on click-based interactions. We introduce algorithms for simulating realistic scribbles that enable training models that are amenable to multiple types of interaction. To achieve generalization to new tasks, we train on a diverse collection of 65 open-access biomedical datasets -- using both real and synthetic labels. We test ScribblePrompt on multiple network architectures and unseen datasets, and demonstrate that it can be used in real-time on a single CPU. We evaluate ScribblePrompt using manually-collected scribbles, simulated interactions, and a user study. ScribblePrompt outperforms existing methods in all our evaluations. In the user study, ScribblePrompt reduced annotation time by 28% while improving Dice by 15% compared to existing methods. We showcase ScribblePrompt in an online demo and provide code at https://scribbleprompt.csail.mit.edu