secure

security

Title: Assessing and Exploiting Domain Name Misinformation. (arXiv:2307.07610v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.07610
Code URL: null
Copy Paste: [[2307.07610] Assessing and Exploiting Domain Name Misinformation](http://arxiv.org/abs/2307.07610) #security
Summary:
Cloud providers' support for network evasion techniques that misrepresent the server's domain name is more prevalent than previously believed, which has serious implications for security and privacy due to the reliance on domain names in common security architectures. Domain fronting is one such evasive technique used by privacy enhancing technologies and malware to hide the domains they visit, and it uses shared hosting and HTTPS to present a benign domain to observers while signaling the target domain in the encrypted HTTP request. In this paper, we construct an ontology of domain name misinformation and detail a novel measurement methodology to identify support among cloud infrastructure providers. Despite several of the largest cloud providers having publicly stated that they no longer support domain fronting, our findings demonstrate a more complex environment with many exceptions.

We also present a novel and straightforward attack that allows an adversary to man-in-the-middle all the victim's encrypted traffic bound to a content delivery network that supports domain fronting, breaking the authenticity, confidentiality, and integrity guarantees expected by the victim when using HTTPS. By using dynamic linker hijacking to rewrite the HTTP Host field, our attack does not generate any artifacts that are visible to the victim or passive network monitoring solutions, and the attacker does not need a separate channel to exfiltrate data or perform command-and-control, which can be achieved by rewriting HTTP headers.

Title: Saudi Arabian Perspective of Security, Privacy, and Attitude of Using Facial Recognition Technology. (arXiv:2307.07671v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.07671
Code URL: null
Copy Paste: [[2307.07671] Saudi Arabian Perspective of Security, Privacy, and Attitude of Using Facial Recognition Technology](http://arxiv.org/abs/2307.07671) #security
Summary:
Facial Recognition Technology (FRT) is a pioneering field of mass surveillance that sparks privacy concerns and is considered a growing threat in the modern world. FRT has been widely adopted in the Kingdom of Saudi Arabia to improve public services and surveillance. Accordingly, the following study aims to understand the privacy and security concerns, trust, and acceptance of FRT in Saudi Arabia. Validated Privacy Concerns (IUIPC-8), Security Attitudes (SA-6), and Security Behavior (SeBIS) scales are used along with replicate studies from Pew Research Center trust questions and government trust questions. In addition, we examine potential differences between Saudis and Americans. To gain insights into these concerns, we conducted an online survey involving 53 Saudi Arabia citizens who are residing in the USA. We have collected data in the US instead of Saudi Arabia to avoid the regulatory challenges of the Saudi Data & Artificial Intelligence Authority (SDAIA). Responses from closed-ended questions revealed that Saudis score much lower than Americans when it comes to security attitudes, whereas they score lower when it comes to privacy concerns. We found no significant difference between Saudis' and Americans' acceptance of the use of FRT in different scenarios, but we found that Saudis trust advertisers more than Americans. Additionally, Saudis are more likely than Americans to agree that the government should strictly limit the use of FRT.

Title: Understanding Cyber Threats Against the Universities, Colleges, and Schools. (arXiv:2307.07755v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.07755
Code URL: null
Copy Paste: [[2307.07755] Understanding Cyber Threats Against the Universities, Colleges, and Schools](http://arxiv.org/abs/2307.07755) #security
Summary:
Universities hold and process a vast amount of valuable user and research data. This makes them a prime target for cyber criminals. Additionally, universities and other educational settings, such as schools and college IT systems, have become a prime target for some of their own students -- often motivated by an opportunity to cause damage to networks and websites, and/or improve their grades.

This paper provides a focused assessment of the current cyber security threat to universities, colleges, and schools (`the education sector') worldwide, providing chronological sequencing of attacks and highlighting the insider threat posed by students. Fifty-eight attacks were identified, with ransomware being the most common type of external attack, and hacking motivated by personal gain showing as the most common form of internal attack. Students, who have become a significant internal threat by either aiding or carrying out attacks are not a homogeneous group, as students may be motivated by different factors, therefore calling for targeted responses. Furthermore, the education sector is increasingly reliant on third party IT service providers meaning attacks on third parties can impact the university and its users. There is very little research analysing this problem, even less research analysing the problem in the context of schools. Hence this paper provides one of the first known assessment of the cyber attacks against the education sector, focusing on insider threat posed by students and offering recommendations for mitigating wider cyber threats.

privacy

Title: Smooth Lower Bounds for Differentially Private Algorithms via Padding-and-Permuting Fingerprinting Codes. (arXiv:2307.07604v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.07604
Code URL: null
Copy Paste: [[2307.07604] Smooth Lower Bounds for Differentially Private Algorithms via Padding-and-Permuting Fingerprinting Codes](http://arxiv.org/abs/2307.07604) #privacy
Summary:
Fingerprinting arguments, first introduced by Bun, Ullman, and Vadhan (STOC 2014), are the most widely used method for establishing lower bounds on the sample complexity or error of approximately differentially private (DP) algorithms. Still, there are many problems in differential privacy for which we don't know suitable lower bounds, and even for problems that we do, the lower bounds are not smooth, and usually become vacuous when the error is larger than some threshold.

In this work, we present a simple method to generate hard instances by applying a padding-and-permuting transformation to a fingerprinting code. We illustrate the applicability of this method by providing new lower bounds in various settings:

1. A tight lower bound for DP averaging in the low-accuracy regime, which in particular implies a new lower bound for the private 1-cluster problem introduced by Nissim, Stemmer, and Vadhan (PODS 2016).

2. A lower bound on the additive error of DP algorithms for approximate k-means clustering, as a function of the multiplicative error, which is tight for a constant multiplication error.

3. A lower bound for estimating the top singular vector of a matrix under DP in low-accuracy regimes, which is a special case of DP subspace estimation studied by Singhal and Steinke (NeurIPS 2021).

Our main technique is to apply a padding-and-permuting transformation to a fingerprinting code. However, rather than proving our results using a black-box access to an existing fingerprinting code (e.g., Tardos' code), we develop a new fingerprinting lemma that is stronger than those of Dwork et al. (FOCS 2015) and Bun et al. (SODA 2017), and prove our lower bounds directly from the lemma. Our lemma, in particular, gives a simpler fingerprinting code construction with optimal rate (up to polylogarithmic factors) that is of independent interest.

Title: On the Utility Gain of Iterative Bayesian Update for Locally Differentially Private Mechanisms. (arXiv:2307.07744v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2307.07744
Code URL: null
Copy Paste: [[2307.07744] On the Utility Gain of Iterative Bayesian Update for Locally Differentially Private Mechanisms](http://arxiv.org/abs/2307.07744) #privacy
Summary:
This paper investigates the utility gain of using Iterative Bayesian Update (IBU) for private discrete distribution estimation using data obfuscated with Locally Differentially Private (LDP) mechanisms. We compare the performance of IBU to Matrix Inversion (MI), a standard estimation technique, for seven LDP mechanisms designed for one-time data collection and for other seven LDP mechanisms designed for multiple data collections (e.g., RAPPOR). To broaden the scope of our study, we also varied the utility metric, the number of users n, the domain size k, and the privacy parameter {\epsilon}, using both synthetic and real-world data. Our results suggest that IBU can be a useful post-processing tool for improving the utility of LDP mechanisms in different scenarios without any additional privacy cost. For instance, our experiments show that IBU can provide better utility than MI, especially in high privacy regimes (i.e., when {\epsilon} is small). Our paper provides insights for practitioners to use IBU in conjunction with existing LDP mechanisms for more accurate and privacy-preserving data analysis. Finally, we implemented IBU for all fourteen LDP mechanisms into the state-of-the-art multi-freq-ldpy Python package (https://pypi.org/project/multi-freq-ldpy/) and open-sourced all our code used for the experiments as tutorials.

protect

defense

attack

Title: RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World. (arXiv:2307.07653v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07653
Code URL: null
Copy Paste: [[2307.07653] RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World](http://arxiv.org/abs/2307.07653) #attack
Summary:
Physical adversarial attacks against deep neural networks (DNNs) have recently gained increasing attention. The current mainstream physical attacks use printed adversarial patches or camouflage to alter the appearance of the target object. However, these approaches generate conspicuous adversarial patterns that show poor stealthiness. Another physical deployable attack is the optical attack, featuring stealthiness while exhibiting weakly in the daytime with sunlight. In this paper, we propose a novel Reflected Light Attack (RFLA), featuring effective and stealthy in both the digital and physical world, which is implemented by placing the color transparent plastic sheet and a paper cut of a specific shape in front of the mirror to create different colored geometries on the target object. To achieve these goals, we devise a general framework based on the circle to model the reflected light on the target object. Specifically, we optimize a circle (composed of a coordinate and radius) to carry various geometrical shapes determined by the optimized angle. The fill color of the geometry shape and its corresponding transparency are also optimized. We extensively evaluate the effectiveness of RFLA on different datasets and models. Experiment results suggest that the proposed method achieves over 99% success rate on different datasets and models in the digital world. Additionally, we verify the effectiveness of the proposed method in different physical environments by using sunlight or a flashlight.

Title: Unified Adversarial Patch for Cross-modal Attacks in the Physical World. (arXiv:2307.07859v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07859
Code URL: null
Copy Paste: [[2307.07859] Unified Adversarial Patch for Cross-modal Attacks in the Physical World](http://arxiv.org/abs/2307.07859) #attack
Summary:
Recently, physical adversarial attacks have been presented to evade DNNs-based object detectors. To ensure the security, many scenarios are simultaneously deployed with visible sensors and infrared sensors, leading to the failures of these single-modal physical attacks. To show the potential risks under such scenes, we propose a unified adversarial patch to perform cross-modal physical attacks, i.e., fooling visible and infrared object detectors at the same time via a single patch. Considering different imaging mechanisms of visible and infrared sensors, our work focuses on modeling the shapes of adversarial patches, which can be captured in different modalities when they change. To this end, we design a novel boundary-limited shape optimization to achieve the compact and smooth shapes, and thus they can be easily implemented in the physical world. In addition, to balance the fooling degree between visible detector and infrared detector during the optimization process, we propose a score-aware iterative evaluation, which can guide the adversarial patch to iteratively reduce the predicted scores of the multi-modal sensors. We finally test our method against the one-stage detector: YOLOv3 and the two-stage detector: Faster RCNN. Results show that our unified patch achieves an Attack Success Rate (ASR) of 73.33% and 69.17%, respectively. More importantly, we verify the effective attacks in the physical world when visible and infrared sensors shoot the objects under various settings like different angles, distances, postures, and scenes.

Title: Towards Understanding Adversarial Transferability From Surrogate Training. (arXiv:2307.07873v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07873
Code URL: null
Copy Paste: [[2307.07873] Towards Understanding Adversarial Transferability From Surrogate Training](http://arxiv.org/abs/2307.07873) #attack
Summary:
Adversarial examples (AEs) for DNNs have been shown to be transferable: AEs that successfully fool white-box surrogate models can also deceive other black-box models with different architectures. Although a bunch of empirical studies have provided guidance on generating highly transferable AEs, many of these findings lack explanations and even lead to inconsistent advice. In this paper, we take a further step towards understanding adversarial transferability, with a particular focus on surrogate aspects. Starting from the intriguing little robustness phenomenon, where models adversarially trained with mildly perturbed adversarial samples can serve as better surrogates, we attribute it to a trade-off between two predominant factors: model smoothness and gradient similarity. Our investigations focus on their joint effects, rather than their separate correlations with transferability. Through a series of theoretical and empirical analyses, we conjecture that the data distribution shift in adversarial training explains the degradation of gradient similarity. Building on these insights, we explore the impacts of data augmentation and gradient regularization on transferability and identify that the trade-off generally exists in the various training mechanisms, thus building a comprehensive blueprint for the regulation mechanism behind transferability. Finally, we provide a general route for constructing better surrogates to boost transferability which optimizes both model smoothness and gradient similarity simultaneously, e.g., the combination of input gradient regularization and sharpness-aware minimization (SAM), validated by extensive experiments. In summary, we call for attention to the united impacts of these two factors for launching effective transfer attacks, rather than optimizing one while ignoring the other, and emphasize the crucial role of manipulating surrogate models.

Title: On the Robustness of Split Learning against Adversarial Attacks. (arXiv:2307.07916v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07916
Code URL: null
Copy Paste: [[2307.07916] On the Robustness of Split Learning against Adversarial Attacks](http://arxiv.org/abs/2307.07916) #attack
Summary:
Split learning enables collaborative deep learning model training while preserving data privacy and model security by avoiding direct sharing of raw data and model details (i.e., sever and clients only hold partial sub-networks and exchange intermediate computations). However, existing research has mainly focused on examining its reliability for privacy protection, with little investigation into model security. Specifically, by exploring full models, attackers can launch adversarial attacks, and split learning can mitigate this severe threat by only disclosing part of models to untrusted servers.This paper aims to evaluate the robustness of split learning against adversarial attacks, particularly in the most challenging setting where untrusted servers only have access to the intermediate layers of the model.Existing adversarial attacks mostly focus on the centralized setting instead of the collaborative setting, thus, to better evaluate the robustness of split learning, we develop a tailored attack called SPADV, which comprises two stages: 1) shadow model training that addresses the issue of lacking part of the model and 2) local adversarial attack that produces adversarial examples to evaluate.The first stage only requires a few unlabeled non-IID data, and, in the second stage, SPADV perturbs the intermediate output of natural samples to craft the adversarial ones. The overall cost of the proposed attack process is relatively low, yet the empirical attack effectiveness is significantly high, demonstrating the surprising vulnerability of split learning to adversarial attacks.

Title: Efficient Adversarial Attacks on Online Multi-agent Reinforcement Learning. (arXiv:2307.07670v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07670
Code URL: null
Copy Paste: [[2307.07670] Efficient Adversarial Attacks on Online Multi-agent Reinforcement Learning](http://arxiv.org/abs/2307.07670) #attack
Summary:
Due to the broad range of applications of multi-agent reinforcement learning (MARL), understanding the effects of adversarial attacks against MARL model is essential for the safe applications of this model. Motivated by this, we investigate the impact of adversarial attacks on MARL. In the considered setup, there is an exogenous attacker who is able to modify the rewards before the agents receive them or manipulate the actions before the environment receives them. The attacker aims to guide each agent into a target policy or maximize the cumulative rewards under some specific reward function chosen by the attacker, while minimizing the amount of manipulation on feedback and action. We first show the limitations of the action poisoning only attacks and the reward poisoning only attacks. We then introduce a mixed attack strategy with both the action poisoning and the reward poisoning. We show that the mixed attack strategy can efficiently attack MARL agents even if the attacker has no prior information about the underlying environment and the agents' algorithms.

robust

Title: Flow-Guided Controllable Line Drawing Generation. (arXiv:2307.07540v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07540
Code URL: null
Copy Paste: [[2307.07540] Flow-Guided Controllable Line Drawing Generation](http://arxiv.org/abs/2307.07540) #robust
Summary:
In this paper, we investigate the problem of automatically controllable artistic character line drawing generation from photographs by proposing a Vector Flow Aware and Line Controllable Image-to-Image Translation architecture, which can be viewed as an appealing intersection between Artificial Intelligence and Arts. Specifically, we first present an Image-to-Flow network (I2FNet) to efficiently and robustly create the vector flow field in a learning-based manner, which can provide a direction guide for drawing lines. Then, we introduce our well-designed Double Flow Generator (DFG) framework to fuse features from learned vector flow and input image flow guaranteeing the spatial coherence of lines. Meanwhile, in order to allow for controllable character line drawing generation, we integrate a Line Control Matrix (LCM) into DFG and train a Line Control Regressor (LCR) to synthesize drawings with different styles by elaborately controlling the level of details, such as thickness, smoothness, and continuity, of lines. Finally, we design a Fourier Transformation Loss to further constrain the character line generation from the frequency domain view of the point. Quantitative and qualitative experiments demonstrate that our approach can obtain superior performance in producing high-resolution character line-drawing images with perceptually realistic characteristics.

Title: Gastrointestinal Disease Classification through Explainable and Cost-Sensitive Deep Neural Networks with Supervised Contrastive Learning. (arXiv:2307.07603v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07603
Code URL: https://github.com/dibya404/gastrointestinal-disease-classification-through-explainable-and-cost-sensitive-dnn-with-scl
Copy Paste: [[2307.07603] Gastrointestinal Disease Classification through Explainable and Cost-Sensitive Deep Neural Networks with Supervised Contrastive Learning](http://arxiv.org/abs/2307.07603) #robust
Summary:
Gastrointestinal diseases pose significant healthcare chall-enges as they manifest in diverse ways and can lead to potential complications. Ensuring precise and timely classification of these diseases is pivotal in guiding treatment choices and enhancing patient outcomes. This paper introduces a novel approach on classifying gastrointestinal diseases by leveraging cost-sensitive pre-trained deep convolutional neural network (CNN) architectures with supervised contrastive learning. Our approach enables the network to learn representations that capture vital disease-related features, while also considering the relationships of similarity between samples. To tackle the challenges posed by imbalanced datasets and the cost-sensitive nature of misclassification errors in healthcare, we incorporate cost-sensitive learning. By assigning distinct costs to misclassifications based on the disease class, we prioritize accurate classification of critical conditions. Furthermore, we enhance the interpretability of our model by integrating gradient-based techniques from explainable artificial intelligence (AI). This inclusion provides valuable insights into the decision-making process of the network, aiding in understanding the features that contribute to disease classification. To assess the effectiveness of our proposed approach, we perform extensive experiments on a comprehensive gastrointestinal disease dataset, such as the Hyper-Kvasir dataset. Through thorough comparisons with existing works, we demonstrate the strong classification accuracy, robustness and interpretability of our model. We have made the implementation of our proposed approach publicly available at https://github.com/dibya404/Gastrointestinal-Disease-Classification-through-Explainable-and-Cost-Sensitive-DNN-with-SCL

Title: Prawn Morphometrics and Weight Estimation from Images using Deep Learning for Landmark Localization. (arXiv:2307.07732v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07732
Code URL: null
Copy Paste: [[2307.07732] Prawn Morphometrics and Weight Estimation from Images using Deep Learning for Landmark Localization](http://arxiv.org/abs/2307.07732) #robust
Summary:
Accurate weight estimation and morphometric analyses are useful in aquaculture for optimizing feeding, predicting harvest yields, identifying desirable traits for selective breeding, grading processes, and monitoring the health status of production animals. However, the collection of phenotypic data through traditional manual approaches at industrial scales and in real-time is time-consuming, labour-intensive, and prone to errors. Digital imaging of individuals and subsequent training of prediction models using Deep Learning (DL) has the potential to rapidly and accurately acquire phenotypic data from aquaculture species. In this study, we applied a novel DL approach to automate weight estimation and morphometric analysis using the black tiger prawn (Penaeus monodon) as a model crustacean. The DL approach comprises two main components: a feature extraction module that efficiently combines low-level and high-level features using the Kronecker product operation; followed by a landmark localization module that then uses these features to predict the coordinates of key morphological points (landmarks) on the prawn body. Once these landmarks were extracted, weight was estimated using a weight regression module based on the extracted landmarks using a fully connected network. For morphometric analyses, we utilized the detected landmarks to derive five important prawn traits. Principal Component Analysis (PCA) was also used to identify landmark-derived distances, which were found to be highly correlated with shape features such as body length, and width. We evaluated our approach on a large dataset of 8164 images of the Black tiger prawn (Penaeus monodon) collected from Australian farms. Our experimental results demonstrate that the novel DL approach outperforms existing DL methods in terms of accuracy, robustness, and efficiency.

Title: Integrating Human Parsing and Pose Network for Human Action Recognition. (arXiv:2307.07977v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07977
Code URL: https://github.com/liujf69/ipp-net-parsing
Copy Paste: [[2307.07977] Integrating Human Parsing and Pose Network for Human Action Recognition](http://arxiv.org/abs/2307.07977) #robust
Summary:
Human skeletons and RGB sequences are both widely-adopted input modalities for human action recognition. However, skeletons lack appearance features and color data suffer large amount of irrelevant depiction. To address this, we introduce human parsing feature map as a novel modality, since it can selectively retain spatiotemporal features of the body parts, while filtering out noises regarding outfits, backgrounds, etc. We propose an Integrating Human Parsing and Pose Network (IPP-Net) for action recognition, which is the first to leverage both skeletons and human parsing feature maps in dual-branch approach. The human pose branch feeds compact skeletal representations of different modalities in graph convolutional network to model pose features. In human parsing branch, multi-frame body-part parsing features are extracted with human detector and parser, which is later learnt using a convolutional backbone. A late ensemble of two branches is adopted to get final predictions, considering both robust keypoints and rich semantic body-part features. Extensive experiments on NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed IPP-Net, which outperforms the existing action recognition methods. Our code is publicly available at https://github.com/liujf69/IPP-Net-Parsing .

Title: Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text. (arXiv:2307.07696v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.07696
Code URL: https://github.com/azreasoners/llm-asp
Copy Paste: [[2307.07696] Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text](http://arxiv.org/abs/2307.07696) #robust
Summary:
While large language models (LLMs), such as GPT-3, appear to be robust and general, their reasoning ability is not at a level to compete with the best models trained for specific natural language reasoning problems. In this study, we observe that a large language model can serve as a highly effective few-shot semantic parser. It can convert natural language sentences into a logical form that serves as input for answer set programs, a logic-based declarative knowledge representation formalism. The combination results in a robust and general system that can handle multiple question-answering tasks without requiring retraining for each new task. It only needs a few examples to guide the LLM's adaptation to a specific task, along with reusable ASP knowledge modules that can be applied to multiple tasks. We demonstrate that this method achieves state-of-the-art performance on several NLP benchmarks, including bAbI, StepGame, CLUTRR, and gSCAN. Additionally, it successfully tackles robot planning tasks that an LLM alone fails to solve.

Title: Zero-shot NLG evaluation through Pairware Comparisons with LLMs. (arXiv:2307.07889v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.07889
Code URL: null
Copy Paste: [[2307.07889] Zero-shot NLG evaluation through Pairware Comparisons with LLMs](http://arxiv.org/abs/2307.07889) #robust
Summary:
Evaluating Natural Language Generation (NLG) outputs is crucial but laborious and expensive. While various automatic NLG assessment methods have been proposed, they often are quite task-specific and have to be engineered with a particular domain and attribute in mind. In this work, we propose a robust zero-shot approach to NLG evaluation using pairwise comparative judgment with open-source Large Language Models (LLMs). The motivation for this approach is that even as humans, it is easier to determine which of two options are better, than it is to independently objectively score each option. We use this insight and leverage the emergent abilities of LLMs, where we probe FlanT5 to determine which of two candidate responses is better, rather than assigning absolute scores. Our results demonstrate that comparative assessment is a more effective approach than absolute scoring, enabling smaller open-source LLMs to achieve comparable performance to larger public access APIs. We evaluate systems on both summary evaluation and dialogue response generation, and show that opensource LLMs can lead to good correlations with human scores for a range of different attributes.

Title: Efficient Action Robust Reinforcement Learning with Probabilistic Policy Execution Uncertainty. (arXiv:2307.07666v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07666
Code URL: null
Copy Paste: [[2307.07666] Efficient Action Robust Reinforcement Learning with Probabilistic Policy Execution Uncertainty](http://arxiv.org/abs/2307.07666) #robust
Summary:
Robust reinforcement learning (RL) aims to find a policy that optimizes the worst-case performance in the face of uncertainties. In this paper, we focus on action robust RL with the probabilistic policy execution uncertainty, in which, instead of always carrying out the action specified by the policy, the agent will take the action specified by the policy with probability $1-\rho$ and an alternative adversarial action with probability $\rho$. We establish the existence of an optimal policy on the action robust MDPs with probabilistic policy execution uncertainty and provide the action robust Bellman optimality equation for its solution. Furthermore, we develop Action Robust Reinforcement Learning with Certificates (ARRLC) algorithm that achieves minimax optimal regret and sample complexity. Furthermore, we conduct numerical experiments to validate our approach's robustness, demonstrating that ARRLC outperforms non-robust RL algorithms and converges faster than the robust TD algorithm in the presence of action perturbations.

Title: On the Robustness of Epoch-Greedy in Multi-Agent Contextual Bandit Mechanisms. (arXiv:2307.07675v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07675
Code URL: null
Copy Paste: [[2307.07675] On the Robustness of Epoch-Greedy in Multi-Agent Contextual Bandit Mechanisms](http://arxiv.org/abs/2307.07675) #robust
Summary:
Efficient learning in multi-armed bandit mechanisms such as pay-per-click (PPC) auctions typically involves three challenges: 1) inducing truthful bidding behavior (incentives), 2) using personalization in the users (context), and 3) circumventing manipulations in click patterns (corruptions). Each of these challenges has been studied orthogonally in the literature; incentives have been addressed by a line of work on truthful multi-armed bandit mechanisms, context has been extensively tackled by contextual bandit algorithms, while corruptions have been discussed via a recent line of work on bandits with adversarial corruptions. Since these challenges co-exist, it is important to understand the robustness of each of these approaches in addressing the other challenges, provide algorithms that can handle all simultaneously, and highlight inherent limitations in this combination. In this work, we show that the most prominent contextual bandit algorithm, $\epsilon$-greedy can be extended to handle the challenges introduced by strategic arms in the contextual multi-arm bandit mechanism setting. We further show that $\epsilon$-greedy is inherently robust to adversarial data corruption attacks and achieves performance that degrades linearly with the amount of corruption.

Title: Minimal Random Code Learning with Mean-KL Parameterization. (arXiv:2307.07816v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07816
Code URL: null
Copy Paste: [[2307.07816] Minimal Random Code Learning with Mean-KL Parameterization](http://arxiv.org/abs/2307.07816) #robust
Summary:
This paper studies the qualitative behavior and robustness of two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks. MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses relative entropy coding to compress a weight sample from the posterior using a Gaussian coding distribution $P_{\mathbf{w}}$. To achieve the desired compression rate, $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ must be constrained, which requires a computationally expensive annealing procedure under the conventional mean-variance (Mean-Var) parameterization for $Q_{\mathbf{w}}$. Instead, we parameterize $Q_{\mathbf{w}}$ by its mean and KL divergence from $P_{\mathbf{w}}$ to constrain the compression cost to the desired value by construction. We demonstrate that variational training with Mean-KL parameterization converges twice as fast and maintains predictive performance after compression. Furthermore, we show that Mean-KL leads to more meaningful variational distributions with heavier tails and compressed weight samples which are more robust to pruning.

Title: Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation. (arXiv:2307.07907v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07907
Code URL: null
Copy Paste: [[2307.07907] Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation](http://arxiv.org/abs/2307.07907) #robust
Summary:
Robustness has been extensively studied in reinforcement learning (RL) to handle various forms of uncertainty such as random perturbations, rare events, and malicious attacks. In this work, we consider one critical type of robustness against spurious correlation, where different portions of the state do not have causality but have correlations induced by unobserved confounders. These spurious correlations are ubiquitous in real-world tasks, for instance, a self-driving car usually observes heavy traffic in the daytime and light traffic at night due to unobservable human activity. A model that learns such useless or even harmful correlation could catastrophically fail when the confounder in the test case deviates from the training one. Although motivated, enabling robustness against spurious correlation poses significant challenges since the uncertainty set, shaped by the unobserved confounder and sequential structure of RL, is difficult to characterize and identify. Existing robust algorithms that assume simple and unstructured uncertainty sets are therefore inadequate to address this challenge. To solve this issue, we propose Robust State-Confounded Markov Decision Processes (RSC-MDPs) and theoretically demonstrate its superiority in breaking spurious correlations compared with other robust RL counterparts. We also design an empirical algorithm to learn the robust optimal policy for RSC-MDPs, which outperforms all baselines in eight realistic self-driving and manipulation tasks.

Title: Enhancing Energy Efficiency and Reliability in Autonomous Systems Estimation using Neuromorphic Approach. (arXiv:2307.07963v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07963
Code URL: null
Copy Paste: [[2307.07963] Enhancing Energy Efficiency and Reliability in Autonomous Systems Estimation using Neuromorphic Approach](http://arxiv.org/abs/2307.07963) #robust
Summary:
Energy efficiency and reliability have long been crucial factors for ensuring cost-effective and safe missions in autonomous systems computers. With the rapid evolution of industries such as space robotics and advanced air mobility, the demand for these low size, weight, and power (SWaP) computers has grown significantly. This study focuses on introducing an estimation framework based on spike coding theories and spiking neural networks (SNN), leveraging the efficiency and scalability of neuromorphic computers. Therefore, we propose an SNN-based Kalman filter (KF), a fundamental and widely adopted optimal strategy for well-defined linear systems. Furthermore, based on the modified sliding innovation filter (MSIF) we present a robust strategy called SNN-MSIF. Notably, the weight matrices of the networks are designed according to the system model, eliminating the need for learning. To evaluate the effectiveness of the proposed strategies, we compare them to their algorithmic counterparts, namely the KF and the MSIF, using Monte Carlo simulations. Additionally, we assess the robustness of SNN-MSIF by comparing it to SNN-KF in the presence of modeling uncertainties and neuron loss. Our results demonstrate the applicability of the proposed methods and highlight the superior performance of SNN-MSIF in terms of accuracy and robustness. Furthermore, the spiking pattern observed from the networks serves as evidence of the energy efficiency achieved by the proposed methods, as they exhibited an impressive reduction of approximately 97 percent in emitted spikes compared to possible spikes.

Title: Byzantine-Robust Distributed Online Learning: Taming Adversarial Participants in An Adversarial Environment. (arXiv:2307.07980v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07980
Code URL: null
Copy Paste: [[2307.07980] Byzantine-Robust Distributed Online Learning: Taming Adversarial Participants in An Adversarial Environment](http://arxiv.org/abs/2307.07980) #robust
Summary:
This paper studies distributed online learning under Byzantine attacks. The performance of an online learning algorithm is often characterized by (adversarial) regret, which evaluates the quality of one-step-ahead decision-making when an environment provides adversarial losses, and a sublinear bound is preferred. But we prove that, even with a class of state-of-the-art robust aggregation rules, in an adversarial environment and in the presence of Byzantine participants, distributed online gradient descent can only achieve a linear adversarial regret bound, which is tight. This is the inevitable consequence of Byzantine attacks, even though we can control the constant of the linear adversarial regret to a reasonable level. Interestingly, when the environment is not fully adversarial so that the losses of the honest participants are i.i.d. (independent and identically distributed), we show that sublinear stochastic regret, in contrast to the aforementioned adversarial regret, is possible. We develop a Byzantine-robust distributed online momentum algorithm to attain such a sublinear stochastic regret bound. Extensive numerical experiments corroborate our theoretical analysis.

biometric

steal

extraction

Title: Spatial-Spectral Hyperspectral Classification based on Learnable 3D Group Convolution. (arXiv:2307.07720v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07720
Code URL: null
Copy Paste: [[2307.07720] Spatial-Spectral Hyperspectral Classification based on Learnable 3D Group Convolution](http://arxiv.org/abs/2307.07720) #extraction
Summary:
Deep neural networks have faced many problems in hyperspectral image classification, including the ineffective utilization of spectral-spatial joint information and the problems of gradient vanishing and overfitting that arise with increasing depth. In order to accelerate the deployment of models on edge devices with strict latency requirements and limited computing power, this paper proposes a learnable group convolution network (LGCNet) based on an improved 3D-DenseNet model and a lightweight model design. The LGCNet module improves the shortcomings of group convolution by introducing a dynamic learning method for the input channels and convolution kernel grouping, enabling flexible grouping structures and generating better representation ability. Through the overall loss and gradient of the backpropagation network, the 3D group convolution is dynamically determined and updated in an end-to-end manner. The learnable number of channels and corresponding grouping can capture different complementary visual features of input images, allowing the CNN to learn richer feature representations. When extracting high-dimensional and redundant hyperspectral data, the 3D convolution kernels also contain a large amount of redundant information. The LGC module allows the 3D-DenseNet to choose channel information with more semantic features, and is very efficient, making it suitable for embedding in any deep neural network for acceleration and efficiency improvements. LGC enables the 3D-CNN to achieve sufficient feature extraction while also meeting speed and computing requirements. Furthermore, LGCNet has achieved progress in inference speed and accuracy, and outperforms mainstream hyperspectral image classification methods on the Indian Pines, Pavia University, and KSC datasets.

Title: Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments. (arXiv:2307.07757v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07757
Code URL: null
Copy Paste: [[2307.07757] Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments](http://arxiv.org/abs/2307.07757) #extraction
Summary:
Grounded Situation Recognition (GSR) is capable of recognizing and interpreting visual scenes in a contextually intuitive way, yielding salient activities (verbs) and the involved entities (roles) depicted in images. In this work, we focus on the application of GSR in assisting people with visual impairments (PVI). However, precise localization information of detected objects is often required to navigate their surroundings confidently and make informed decisions. For the first time, we propose an Open Scene Understanding (OpenSU) system that aims to generate pixel-wise dense segmentation masks of involved entities instead of bounding boxes. Specifically, we build our OpenSU system on top of GSR by additionally adopting an efficient Segment Anything Model (SAM). Furthermore, to enhance the feature extraction and interaction between the encoder-decoder structure, we construct our OpenSU system using a solid pure transformer backbone to improve the performance of GSR. In order to accelerate the convergence, we replace all the activation functions within the GSR decoders with GELU, thereby reducing the training duration. In quantitative analysis, our model achieves state-of-the-art performance on the SWiG dataset. Moreover, through field testing on dedicated assistive technology datasets and application demonstrations, the proposed OpenSU system can be used to enhance scene understanding and facilitate the independent mobility of people with visual impairments. Our code will be available at https://github.com/RuipingL/OpenSU.

Title: DocTr: Document Transformer for Structured Information Extraction in Documents. (arXiv:2307.07929v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07929
Code URL: null
Copy Paste: [[2307.07929] DocTr: Document Transformer for Structured Information Extraction in Documents](http://arxiv.org/abs/2307.07929) #extraction
Summary:
We present a new formulation for structured information extraction (SIE) from visually rich documents. It aims to address the limitations of existing IOB tagging or graph-based formulations, which are either overly reliant on the correct ordering of input text or struggle with decoding a complex graph. Instead, motivated by anchor-based object detectors in vision, we represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words. This is more robust to text ordering, and maintains a compact graph for entity linking. The formulation motivates us to introduce 1) a DOCument TRansformer (DocTr) that aims at detecting and associating entity bounding boxes in visually rich documents, and 2) a simple pre-training strategy that helps learn entity detection in the context of language. Evaluations on three SIE benchmarks show the effectiveness of the proposed formulation, and the overall approach outperforms existing solutions.

Title: QontSum: On Contrasting Salient Content for Query-focused Summarization. (arXiv:2307.07586v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.07586
Code URL: null
Copy Paste: [[2307.07586] QontSum: On Contrasting Salient Content for Query-focused Summarization](http://arxiv.org/abs/2307.07586) #extraction
Summary:
Query-focused summarization (QFS) is a challenging task in natural language processing that generates summaries to address specific queries. The broader field of Generative Information Retrieval (Gen-IR) aims to revolutionize information extraction from vast document corpora through generative approaches, encompassing Generative Document Retrieval (GDR) and Grounded Answer Retrieval (GAR). This paper highlights the role of QFS in Grounded Answer Generation (GAR), a key subdomain of Gen-IR that produces human-readable answers in direct correspondence with queries, grounded in relevant documents. In this study, we propose QontSum, a novel approach for QFS that leverages contrastive learning to help the model attend to the most relevant regions of the input document. We evaluate our approach on a couple of benchmark datasets for QFS and demonstrate that it either outperforms existing state-of-the-art or exhibits a comparable performance with considerably reduced computational cost through enhancements in the fine-tuning stage, rather than relying on large-scale pre-training experiments, which is the focus of current SOTA. Moreover, we conducted a human study and identified improvements in the relevance of generated summaries to the posed queries without compromising fluency. We further conduct an error analysis study to understand our model's limitations and propose avenues for future research.

membership infer

federate

fair

Title: Revisiting Implicit Models: Sparsity Trade-offs Capability in Weight-tied Model for Vision Tasks. (arXiv:2307.08013v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08013
Code URL: null
Copy Paste: [[2307.08013] Revisiting Implicit Models: Sparsity Trade-offs Capability in Weight-tied Model for Vision Tasks](http://arxiv.org/abs/2307.08013) #fair
Summary:
Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite layer models with elegant solution-finding procedures and constant memory footprint. However, despite several attempts, these methods are heavily constrained by model inefficiency and optimization instability. Furthermore, fair benchmarking across relevant methods for vision tasks is missing. In this work, we revisit the line of implicit models and trace them back to the original weight-tied models. Surprisingly, we observe that weight-tied models are more effective, stable, as well as efficient on vision tasks, compared to the DEQ variants. Through the lens of these simple-yet-clean weight-tied models, we further study the fundamental limits in the model capacity of such models and propose the use of distinct sparse masks to improve the model capacity. Finally, for practitioners, we offer design guidelines regarding the depth, width, and sparsity selection for weight-tied models, and demonstrate the generalizability of our insights to other learning paradigms.

Title: Learning Subjective Time-Series Data via Utopia Label Distribution Approximation. (arXiv:2307.07682v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07682
Code URL: null
Copy Paste: [[2307.07682] Learning Subjective Time-Series Data via Utopia Label Distribution Approximation](http://arxiv.org/abs/2307.07682) #fair
Summary:
Subjective time-series regression (STR) tasks have gained increasing attention recently. However, most existing methods overlook the label distribution bias in STR data, which results in biased models. Emerging studies on imbalanced regression tasks, such as age estimation and depth estimation, hypothesize that the prior label distribution of the dataset is uniform. However, we observe that the label distributions of training and test sets in STR tasks are likely to be neither uniform nor identical. This distinct feature calls for new approaches that estimate more reasonable distributions to train a fair model. In this work, we propose Utopia Label Distribution Approximation (ULDA) for time-series data, which makes the training label distribution closer to real-world but unknown (utopia) label distribution. This would enhance the model's fairness. Specifically, ULDA first convolves the training label distribution by a Gaussian kernel. After convolution, the required sample quantity at each regression label may change. We further devise the Time-slice Normal Sampling (TNS) to generate new samples when the required sample quantity is greater than the initial sample quantity, and the Convolutional Weighted Loss (CWL) to lower the sample weight when the required sample quantity is less than the initial quantity. These two modules not only assist the model training on the approximated utopia label distribution, but also maintain the sample continuity in temporal context space. To the best of our knowledge, ULDA is the first method to address the label distribution bias in time-series data. Extensive experiments demonstrate that ULDA lifts the state-of-the-art performance on two STR tasks and three benchmark datasets.

interpretability

Title: LUCYD: A Feature-Driven Richardson-Lucy Deconvolution Network. (arXiv:2307.07998v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07998
Code URL: https://github.com/ctom2/lucyd-deconvolution
Copy Paste: [[2307.07998] LUCYD: A Feature-Driven Richardson-Lucy Deconvolution Network](http://arxiv.org/abs/2307.07998) #interpretability
Summary:
The process of acquiring microscopic images in life sciences often results in image degradation and corruption, characterised by the presence of noise and blur, which poses significant challenges in accurately analysing and interpreting the obtained data. This paper proposes LUCYD, a novel method for the restoration of volumetric microscopy images that combines the Richardson-Lucy deconvolution formula and the fusion of deep features obtained by a fully convolutional network. By integrating the image formation process into a feature-driven restoration model, the proposed approach aims to enhance the quality of the restored images whilst reducing computational costs and maintaining a high degree of interpretability. Our results demonstrate that LUCYD outperforms the state-of-the-art methods in both synthetic and real microscopy images, achieving superior performance in terms of image quality and generalisability. We show that the model can handle various microscopy modalities and different imaging conditions by evaluating it on two different microscopy datasets, including volumetric widefield and light-sheet microscopy. Our experiments indicate that LUCYD can significantly improve resolution, contrast, and overall quality of microscopy images. Therefore, it can be a valuable tool for microscopy image restoration and can facilitate further research in various microscopy applications. We made the source code for the model accessible under https://github.com/ctom2/lucyd-deconvolution.

Title: Efficiently Factorizing Boolean Matrices using Proximal Gradient Descent. (arXiv:2307.07615v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07615
Code URL: null
Copy Paste: [[2307.07615] Efficiently Factorizing Boolean Matrices using Proximal Gradient Descent](http://arxiv.org/abs/2307.07615) #interpretability
Summary:
Addressing the interpretability problem of NMF on Boolean data, Boolean Matrix Factorization (BMF) uses Boolean algebra to decompose the input into low-rank Boolean factor matrices. These matrices are highly interpretable and very useful in practice, but they come at the high computational cost of solving an NP-hard combinatorial optimization problem. To reduce the computational burden, we propose to relax BMF continuously using a novel elastic-binary regularizer, from which we derive a proximal gradient algorithm. Through an extensive set of experiments, we demonstrate that our method works well in practice: On synthetic data, we show that it converges quickly, recovers the ground truth precisely, and estimates the simulated rank exactly. On real-world data, we improve upon the state of the art in recall, loss, and runtime, and a case study from the medical domain confirms that our results are easily interpretable and semantically meaningful.

Title: Towards Flexible Time-to-event Modeling: Optimizing Neural Networks via Rank Regression. (arXiv:2307.08044v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.08044
Code URL: https://github.com/teboozas/dart_ecai23
Copy Paste: [[2307.08044] Towards Flexible Time-to-event Modeling: Optimizing Neural Networks via Rank Regression](http://arxiv.org/abs/2307.08044) #interpretability
Summary:
Time-to-event analysis, also known as survival analysis, aims to predict the time of occurrence of an event, given a set of features. One of the major challenges in this area is dealing with censored data, which can make learning algorithms more complex. Traditional methods such as Cox's proportional hazards model and the accelerated failure time (AFT) model have been popular in this field, but they often require assumptions such as proportional hazards and linearity. In particular, the AFT models often require pre-specified parametric distributional assumptions. To improve predictive performance and alleviate strict assumptions, there have been many deep learning approaches for hazard-based models in recent years. However, representation learning for AFT has not been widely explored in the neural network literature, despite its simplicity and interpretability in comparison to hazard-focused methods. In this work, we introduce the Deep AFT Rank-regression model for Time-to-event prediction (DART). This model uses an objective function based on Gehan's rank statistic, which is efficient and reliable for representation learning. On top of eliminating the requirement to establish a baseline event time distribution, DART retains the advantages of directly predicting event time in standard AFT models. The proposed method is a semiparametric approach to AFT modeling that does not impose any distributional assumptions on the survival time distribution. This also eliminates the need for additional hyperparameters or complex model architectures, unlike existing neural network-based AFT models. Through quantitative analysis on various benchmark datasets, we have shown that DART has significant potential for modeling high-throughput censored time-to-event data.

explainability

watermark

diffusion

Title: ExposureDiffusion: Learning to Expose for Low-light Image Enhancement. (arXiv:2307.07710v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07710
Code URL: null
Copy Paste: [[2307.07710] ExposureDiffusion: Learning to Expose for Low-light Image Enhancement](http://arxiv.org/abs/2307.07710) #diffusion
Summary:
Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Different from a vanilla diffusion model that has to perform Gaussian denoising, with the injected physics-based exposure model, our restoration process can directly start from a noisy image instead of pure noise. As such, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out the side-effect in the iterative refinement when the intermediate results have been already well-exposed. The proposed framework can work with both real-paired datasets, SOTA noise models, and different backbone networks. Note that, the proposed framework is compatible with real-paired datasets, real/synthetic noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements using different exposure models and backbones. Besides, the proposed method achieves better generalization capacity for unseen amplifying ratios and better performance than a larger feedforward neural model when few parameters are adopted.

Title: Analysing Gender Bias in Text-to-Image Models using Object Detection. (arXiv:2307.08025v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08025
Code URL: https://github.com/harveymannering/text-to-image-bias
Copy Paste: [[2307.08025] Analysing Gender Bias in Text-to-Image Models using Object Detection](http://arxiv.org/abs/2307.08025) #diffusion
Summary:
This work presents a novel strategy to measure bias in text-to-image models. Using paired prompts that specify gender and vaguely reference an object (e.g. "a man/woman holding an item") we can examine whether certain objects are associated with a certain gender. In analysing results from Stable Diffusion, we observed that male prompts generated objects such as ties, knives, trucks, baseball bats, and bicycles more frequently. On the other hand, female prompts were more likely to generate objects such as handbags, umbrellas, bowls, bottles, and cups. We hope that the method outlined here will be a useful tool for examining bias in text-to-image models.

Title: LafitE: Latent Diffusion Model with Feature Editing for Unsupervised Multi-class Anomaly Detection. (arXiv:2307.08059v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08059
Code URL: null
Copy Paste: [[2307.08059] LafitE: Latent Diffusion Model with Feature Editing for Unsupervised Multi-class Anomaly Detection](http://arxiv.org/abs/2307.08059) #diffusion
Summary:
In the context of flexible manufacturing systems that are required to produce different types and quantities of products with minimal reconfiguration, this paper addresses the problem of unsupervised multi-class anomaly detection: develop a unified model to detect anomalies from objects belonging to multiple classes when only normal data is accessible. We first explore the generative-based approach and investigate latent diffusion models for reconstruction to mitigate the notorious identity shortcut'' issue in auto-encoder based methods. We then introduce a feature editing strategy that modifies the input feature space of the diffusion model to further alleviateidentity shortcuts'' and meanwhile improve the reconstruction quality of normal regions, leading to fewer false positive predictions. Moreover, we are the first who pose the problem of hyperparameter selection in unsupervised anomaly detection, and propose a solution of synthesizing anomaly data for a pseudo validation set to address this problem. Extensive experiments on benchmark datasets MVTec-AD and MPDD show that the proposed LafitE, \ie, Latent Diffusion Model with Feature Editing, outperforms state-of-art methods by a significant margin in terms of average AUROC. The hyperparamters selected via our pseudo validation set are well-matched to the real test set.

noise learning

data-free

transformer

Title: ConTrack: Contextual Transformer for Device Tracking in X-ray. (arXiv:2307.07541v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07541
Code URL: null
Copy Paste: [[2307.07541] ConTrack: Contextual Transformer for Device Tracking in X-ray](http://arxiv.org/abs/2307.07541) #transformer
Summary:
Device tracking is an important prerequisite for guidance during endovascular procedures. Especially during cardiac interventions, detection and tracking of guiding the catheter tip in 2D fluoroscopic images is important for applications such as mapping vessels from angiography (high dose with contrast) to fluoroscopy (low dose without contrast). Tracking the catheter tip poses different challenges: the tip can be occluded by contrast during angiography or interventional devices; and it is always in continuous movement due to the cardiac and respiratory motions. To overcome these challenges, we propose ConTrack, a transformer-based network that uses both spatial and temporal contextual information for accurate device detection and tracking in both X-ray fluoroscopy and angiography. The spatial information comes from the template frames and the segmentation module: the template frames define the surroundings of the device, whereas the segmentation module detects the entire device to bring more context for the tip prediction. Using multiple templates makes the model more robust to the change in appearance of the device when it is occluded by the contrast agent. The flow information computed on the segmented catheter mask between the current and the previous frame helps in further refining the prediction by compensating for the respiratory and cardiac motions. The experiments show that our method achieves 45% or higher accuracy in detection and tracking when compared to state-of-the-art tracking models.

Title: CoTracker: It is Better to Track Together. (arXiv:2307.07635v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07635
Code URL: https://github.com/facebookresearch/co-tracker
Copy Paste: [[2307.07635] CoTracker: It is Better to Track Together](http://arxiv.org/abs/2307.07635) #transformer
Summary:
Methods for video motion prediction either estimate jointly the instantaneous motion of all points in a given video frame using optical flow or independently track the motion of individual points throughout the video. The latter is true even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance. In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video. This architecture combines several ideas from the optical flow and tracking literature in a new, flexible and powerful design. It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories. It can be applied in a sliding-window manner to very long videos, for which we engineer an unrolled training loop. It can track from one to several points jointly and supports adding new points to track at any time. The result is a flexible and powerful tracking algorithm that outperforms state-of-the-art methods in almost all benchmarks.

Title: Semantic Contrastive Bootstrapping for Single-positive Multi-label Recognition. (arXiv:2307.07680v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07680
Code URL: https://github.com/iCVTEAM/Scob
Copy Paste: [[2307.07680] Semantic Contrastive Bootstrapping for Single-positive Multi-label Recognition](http://arxiv.org/abs/2307.07680) #transformer
Summary:
Learning multi-label image recognition with incomplete annotation is gaining popularity due to its superior performance and significant labor savings when compared to training with fully labeled datasets. Existing literature mainly focuses on label completion and co-occurrence learning while facing difficulties with the most common single-positive label manner. To tackle this problem, we present a semantic contrastive bootstrapping (Scob) approach to gradually recover the cross-object relationships by introducing class activation as semantic guidance. With this learning guidance, we then propose a recurrent semantic masked transformer to extract iconic object-level representations and delve into the contrastive learning problems on multi-label classification tasks. We further propose a bootstrapping framework in an Expectation-Maximization fashion that iteratively optimizes the network parameters and refines semantic guidance to alleviate possible disturbance caused by wrong semantic guidance. Extensive experimental results demonstrate that the proposed joint learning framework surpasses the state-of-the-art models by a large margin on four public multi-label image recognition benchmarks. Codes can be found at https://github.com/iCVTEAM/Scob.

Title: SINC: Self-Supervised In-Context Learning for Vision-Language Tasks. (arXiv:2307.07742v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07742
Code URL: null
Copy Paste: [[2307.07742] SINC: Self-Supervised In-Context Learning for Vision-Language Tasks](http://arxiv.org/abs/2307.07742) #transformer
Summary:
Large Pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods could inherit issues in the language domain, such as template sensitivity and hallucination. Also, the scale of these language models raises a significant demand for computations, making learning and operating these models resource-intensive. To this end, we raise a question: ``How can we enable in-context learning for general models without being constrained on large language models?". To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), that introduces a meta-model to learn on self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks for making in-context predictions on-the-fly. Extensive experiments show that SINC outperforms gradient-based methods in various vision-language tasks under few-shot settings. Furthermore, the designs of SINC help us investigate the benefits of in-context learning across different tasks, and the analysis further reveals the essential components for the emergence of in-context learning in the vision-language domain.

Title: Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation. (arXiv:2307.07812v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07812
Code URL: null
Copy Paste: [[2307.07812] Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation](http://arxiv.org/abs/2307.07812) #transformer
Summary:
Few-shot video segmentation is the task of delineating a specific novel class in a query video using few labelled support images. Typical approaches compare support and query features while limiting comparisons to a single feature layer and thereby ignore potentially valuable information. We present a meta-learned Multiscale Memory Comparator (MMC) for few-shot video segmentation that combines information across scales within a transformer decoder. Typical multiscale transformer decoders for segmentation tasks learn a compressed representation, their queries, through information exchange across scales. Unlike previous work, we instead preserve the detailed feature maps during across scale information exchange via a multiscale memory transformer decoding to reduce confusion between the background and novel class. Integral to the approach, we investigate multiple forms of information exchange across scales in different tasks and provide insights with empirical evidence on which to use in each task. The overall comparisons among query and support features benefit from both rich semantics and precise localization. We demonstrate our approach primarily on few-shot video object segmentation and an adapted version on the fully supervised counterpart. In all cases, our approach outperforms the baseline and yields state-of-the-art performance. Our code is publicly available at https://github.com/MSiam/MMC-MultiscaleMemory.

Title: S2R-ViT for Multi-Agent Cooperative Perception: Bridging the Gap from Simulation to Reality. (arXiv:2307.07935v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07935
Code URL: null
Copy Paste: [[2307.07935] S2R-ViT for Multi-Agent Cooperative Perception: Bridging the Gap from Simulation to Reality](http://arxiv.org/abs/2307.07935) #transformer
Summary:
Due to the lack of real multi-agent data and time-consuming of labeling, existing multi-agent cooperative perception algorithms usually select the simulated sensor data for training and validating. However, the perception performance is degraded when these simulation-trained models are deployed to the real world, due to the significant domain gap between the simulated and real data. In this paper, we propose the first Simulation-to-Reality transfer learning framework for multi-agent cooperative perception using a novel Vision Transformer, named as S2R-ViT, which considers both the Implementation Gap and Feature Gap between simulated and real data. We investigate the effects of these two types of domain gaps and propose a novel uncertainty-aware vision transformer to effectively relief the Implementation Gap and an agent-based feature adaptation module with inter-agent and ego-agent discriminators to reduce the Feature Gap. Our intensive experiments on the public multi-agent cooperative perception datasets OPV2V and V2V4Real demonstrate that the proposed S2R-ViT can effectively bridge the gap from simulation to reality and outperform other methods significantly for point cloud-based 3D object detection.

Title: CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion. (arXiv:2307.07938v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07938
Code URL: null
Copy Paste: [[2307.07938] CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion](http://arxiv.org/abs/2307.07938) #transformer
Summary:
Semantic scene completion (SSC) requires an accurate understanding of the geometric and semantic relationships between the objects in the 3D scene for reasoning the occluded objects. The popular SSC methods voxelize the 3D objects, allowing the deep 3D convolutional network (3D CNN) to learn the object relationships from the complex scenes. However, the current networks lack the controllable kernels to model the object relationship across multiple views, where appropriate views provide the relevant information for suggesting the existence of the occluded objects. In this paper, we propose Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and Cross-View Transformer for learning cross-view object relationships. In the multi-view feature synthesis, we use a set of 3D convolutional kernels rotated differently to compute the multi-view features for each voxel. In the cross-view transformer, we employ the cross-view fusion to comprehensively learn the cross-view relationships, which form useful information for enhancing the features of individual views. We use the enhanced features to predict the geometric occupancies and semantic labels of all voxels. We evaluate CVSformer on public datasets, where CVSformer yields state-of-the-art results.

Title: Language Conditioned Traffic Generation. (arXiv:2307.07947v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07947
Code URL: null
Copy Paste: [[2307.07947] Language Conditioned Traffic Generation](http://arxiv.org/abs/2307.07947) #transformer
Summary:
Simulation forms the backbone of modern self-driving development. Simulators help develop, test, and improve driving systems without putting humans, vehicles, or their environment at risk. However, simulators face a major challenge: They rely on realistic, scalable, yet interesting content. While recent advances in rendering and scene reconstruction make great strides in creating static scene assets, modeling their layout, dynamics, and behaviors remains challenging. In this work, we turn to language as a source of supervision for dynamic traffic scene generation. Our model, LCTGen, combines a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps, and produces an initial traffic distribution, as well as the dynamics of each vehicle. LCTGen outperforms prior work in both unconditional and conditional traffic scene generation in terms of realism and fidelity. Code and video will be available at https://ariostgx.github.io/lctgen.

Title: A Survey of Techniques for Optimizing Transformer Inference. (arXiv:2307.07982v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07982
Code URL: null
Copy Paste: [[2307.07982] A Survey of Techniques for Optimizing Transformer Inference](http://arxiv.org/abs/2307.07982) #transformer
Summary:
Recent years have seen a phenomenal rise in performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformer (BERT), Generative Pretrained Transformer (GPT) and Vision Transformer (ViT), have shown their effectiveness across Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based networks such as ChatGPT have impacted the lives of common men. However, the quest for high predictive performance has led to an exponential increase in transformers' memory and compute footprint. Researchers have proposed techniques to optimize transformer inference at all levels of abstraction. This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. We survey techniques such as knowledge distillation, pruning, quantization, neural architecture search and lightweight network design at the algorithmic level. We further review hardware-level optimization techniques and the design of novel hardware accelerators for transformers. We summarize the quantitative results on the number of parameters/FLOPs and accuracy of several models/techniques to showcase the tradeoff exercised by them. We also outline future directions in this rapidly evolving field of research. We believe that this survey will educate both novice and seasoned researchers and also spark a plethora of research efforts in this field.

Title: Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer. (arXiv:2307.08015v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08015
Code URL: null
Copy Paste: [[2307.08015] Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer](http://arxiv.org/abs/2307.08015) #transformer
Summary:
Image retrieval-based cross-view localization methods often lead to very coarse camera pose estimation, due to the limited sampling density of the database satellite images. In this paper, we propose a method to increase the accuracy of a ground camera's location and orientation by estimating the relative rotation and translation between the ground-level image and its matched/retrieved satellite image. Our approach designs a geometry-guided cross-view transformer that combines the benefits of conventional geometry and learnable cross-view transformers to map the ground-view observations to an overhead view. Given the synthesized overhead view and observed satellite feature maps, we construct a neural pose optimizer with strong global information embedding ability to estimate the relative rotation between them. After aligning their rotations, we develop an uncertainty-guided spatial correlation to generate a probability map of the vehicle locations, from which the relative translation can be determined. Experimental results demonstrate that our method significantly outperforms the state-of-the-art. Notably, the likelihood of restricting the vehicle lateral pose to be within 1m of its Ground Truth (GT) value on the cross-view KITTI dataset has been improved from $35.54\%$ to $76.44\%$, and the likelihood of restricting the vehicle orientation to be within $1^{\circ}$ of its GT value has been improved from $19.64\%$ to $99.10\%$.

Title: Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making. (arXiv:2307.08016v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08016
Code URL: null
Copy Paste: [[2307.08016] Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making](http://arxiv.org/abs/2307.08016) #transformer
Summary:
Vision language decision making (VLDM) is a challenging multimodal task. The agent have to understand complex human instructions and complete compositional tasks involving environment navigation and object manipulation. However, the long action sequences involved in VLDM make the task difficult to learn. From an environment perspective, we find that task episodes can be divided into fine-grained \textit{units}, each containing a navigation phase and an interaction phase. Since the environment within a unit stays unchanged, we propose a novel hybrid-training framework that enables active exploration in the environment and reduces the exposure bias. Such framework leverages the unit-grained configurations and is model-agnostic. Specifically, we design a Unit-Transformer (UT) with an intrinsic recurrent state that maintains a unit-scale cross-modal memory. Through extensive experiments on the TEACH benchmark, we demonstrate that our proposed framework outperforms existing state-of-the-art methods in terms of all evaluation metrics. Overall, our work introduces a novel approach to tackling the VLDM task by breaking it down into smaller, manageable units and utilizing a hybrid-training framework. By doing so, we provide a more flexible and effective solution for multimodal decision making.

Title: Transformers are Universal Predictors. (arXiv:2307.07843v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2307.07843
Code URL: null
Copy Paste: [[2307.07843] Transformers are Universal Predictors](http://arxiv.org/abs/2307.07843) #transformer
Summary:
We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We further analyze performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.

Title: GeoGPT: Understanding and Processing Geospatial Tasks through An Autonomous GPT. (arXiv:2307.07930v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.07930
Code URL: null
Copy Paste: [[2307.07930] GeoGPT: Understanding and Processing Geospatial Tasks through An Autonomous GPT](http://arxiv.org/abs/2307.07930) #transformer
Summary:
Decision-makers in GIS need to combine a series of spatial algorithms and operations to solve geospatial tasks. For example, in the task of facility siting, the Buffer tool is usually first used to locate areas close or away from some specific entities; then, the Intersect or Erase tool is used to select candidate areas satisfied multiple requirements. Though professionals can easily understand and solve these geospatial tasks by sequentially utilizing relevant tools, it is difficult for non-professionals to handle these problems. Recently, Generative Pre-trained Transformer (e.g., ChatGPT) presents strong performance in semantic understanding and reasoning. Especially, AutoGPT can further extend the capabilities of large language models (LLMs) by automatically reasoning and calling externally defined tools. Inspired by these studies, we attempt to lower the threshold of non-professional users to solve geospatial tasks by integrating the semantic understanding ability inherent in LLMs with mature tools within the GIS community. Specifically, we develop a new framework called GeoGPT that can conduct geospatial data collection, processing, and analysis in an autonomous manner with the instruction of only natural language. In other words, GeoGPT is used to understand the demands of non-professional users merely based on input natural language descriptions, and then think, plan, and execute defined GIS tools to output final effective results. Several cases including geospatial data crawling, spatial query, facility siting, and mapping validate the effectiveness of our framework. Though limited cases are presented in this paper, GeoGPT can be further extended to various tasks by equipping with more GIS tools, and we think the paradigm of "foundational plus professional" implied in GeoGPT provides an effective way to develop next-generation GIS in this era of large foundation models.

generative

Title: Both Spatial and Frequency Cues Contribute to High-Fidelity Image Inpainting. (arXiv:2307.07678v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07678
Code URL: null
Copy Paste: [[2307.07678] Both Spatial and Frequency Cues Contribute to High-Fidelity Image Inpainting](http://arxiv.org/abs/2307.07678) #generative
Summary:
Deep generative approaches have obtained great success in image inpainting recently. However, most generative inpainting networks suffer from either over-smooth results or aliasing artifacts. The former lacks high-frequency details, while the latter lacks semantic structure. To address this issue, we propose an effective Frequency-Spatial Complementary Network (FSCN) by exploiting rich semantic information in both spatial and frequency domains. Specifically, we introduce an extra Frequency Branch and Frequency Loss on the spatial-based network to impose direct supervision on the frequency information, and propose a Frequency-Spatial Cross-Attention Block (FSCAB) to fuse multi-domain features and combine the corresponding characteristics. With our FSCAB, the inpainting network is capable of capturing frequency information and preserving visual consistency simultaneously. Extensive quantitative and qualitative experiments demonstrate that our inpainting network can effectively achieve superior results, outperforming previous state-of-the-art approaches with significantly fewer parameters and less computation cost. The code will be released soon.

Title: Householder Projector for Unsupervised Latent Semantics Discovery. (arXiv:2307.08012v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08012
Code URL: https://github.com/kingjamessong/householdergan
Copy Paste: [[2307.08012] Householder Projector for Unsupervised Latent Semantics Discovery](http://arxiv.org/abs/2307.08012) #generative
Summary:
Generative Adversarial Networks (GANs), especially the recent style-based generators (StyleGANs), have versatile semantics in the structured latent space. Latent semantics discovery methods emerge to move around the latent code such that only one factor varies during the traversal. Recently, an unsupervised method proposed a promising direction to directly use the eigenvectors of the projection matrix that maps latent codes to features as the interpretable directions. However, one overlooked fact is that the projection matrix is non-orthogonal and the number of eigenvectors is too large. The non-orthogonality would entangle semantic attributes in the top few eigenvectors, and the large dimensionality might result in meaningless variations among the directions even if the matrix is orthogonal. To avoid these issues, we propose Householder Projector, a flexible and general low-rank orthogonal matrix representation based on Householder transformations, to parameterize the projection matrix. The orthogonality guarantees that the eigenvectors correspond to disentangled interpretable semantics, while the low-rank property encourages that each identified direction has meaningful variations. We integrate our projector into pre-trained StyleGAN2/StyleGAN3 and evaluate the models on several benchmarks. Within only $1\%$ of the original training steps for fine-tuning, our projector helps StyleGANs to discover more disentangled and precise semantic attributes without sacrificing image fidelity.

large language model

Title: Planting a SEED of Vision in Large Language Model. (arXiv:2307.08041v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08041
Code URL: https://github.com/ailab-cvc/seed
Copy Paste: [[2307.08041] Planting a SEED of Vision in Large Language Model](http://arxiv.org/abs/2307.08041) #large language model
Summary:
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to SEE and Draw at the same time. Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.). Despite the limitations, we remain confident in its natural capacity to unify visual and textual representations, facilitating scalable multimodal training with LLM's original recipe. In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs. (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. As a result, the off-the-shelf LLM is able to perform both image-to-text and text-to-image generation by incorporating our SEED through efficient LoRA tuning. Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs. Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.

Title: Othering and low prestige framing of immigrant cuisines in US restaurant reviews and large language models. (arXiv:2307.07645v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.07645
Code URL: null
Copy Paste: [[2307.07645] Othering and low prestige framing of immigrant cuisines in US restaurant reviews and large language models](http://arxiv.org/abs/2307.07645) #large language model
Summary:
Identifying and understanding implicit attitudes toward food can help efforts to mitigate social prejudice due to food's pervasive role as a marker of cultural and ethnic identity. Stereotypes about food are a form of microaggression that contribute to harmful public discourse that may in turn perpetuate prejudice toward ethnic groups and negatively impact economic outcomes for restaurants. Through careful linguistic analyses, we evaluate social theories about attitudes toward immigrant cuisine in a large-scale study of framing differences in 2.1M English language Yelp reviews of restaurants in 14 US states. Controlling for factors such as restaurant price and neighborhood racial diversity, we find that immigrant cuisines are more likely to be framed in objectifying and othering terms of authenticity (e.g., authentic, traditional), exoticism (e.g., exotic, different), and prototypicality (e.g., typical, usual), but that non-Western immigrant cuisines (e.g., Indian, Mexican) receive more othering than European cuisines (e.g., French, Italian). We further find that non-Western immigrant cuisines are framed less positively and as lower status, being evaluated in terms of affordability and hygiene. Finally, we show that reviews generated by large language models (LLMs) reproduce many of the same framing tendencies. Our results empirically corroborate social theories of taste and gastronomic stereotyping, and reveal linguistic processes by which such attitudes are reified.

Title: Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph. (arXiv:2307.07697v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.07697
Code URL: null
Copy Paste: [[2307.07697] Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph](http://arxiv.org/abs/2307.07697) #large language model
Summary:
Large language models (LLMs) have made significant strides in various tasks, yet they often struggle with complex reasoning and exhibit poor performance in scenarios where knowledge traceability, timeliness, and accuracy are crucial. To address these limitations, we present Think-on-Graph (ToG), a novel framework that leverages knowledge graphs to enhance LLMs' ability for deep and responsible reasoning. By employing ToG, we can identify entities relevant to a given question and conduct exploration and reasoning to retrieve related triples from an external knowledge database. This iterative procedure generates multiple reasoning pathways consisting of sequentially connected triplets until sufficient information is gathered to answer the question or the maximum depth is reached. Through experiments on complex multi-hop reasoning question-answering tasks, we demonstrate that ToG outperforms existing methods, effectively addressing the aforementioned limitations of LLMs without incurring additional training costs.

Title: CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models. (arXiv:2307.07705v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.07705
Code URL: null
Copy Paste: [[2307.07705] CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models](http://arxiv.org/abs/2307.07705) #large language model
Summary:
Parameter-efficient tuning (PET) has been widely explored in recent years because it tunes much fewer parameters (PET modules) than full-parameter fine-tuning (FT) while still stimulating sufficient knowledge from large language models (LLMs) for downstream tasks. Moreover, when PET is employed to serve multiple tasks, different task-specific PET modules can be built on a frozen LLM, avoiding redundant LLM deployments. Although PET significantly reduces the cost of tuning and deploying LLMs, its inference still suffers from the computational bottleneck of LLMs. To address the above issue, we propose an effective PET framework based on compressed LLMs, named "CPET". In CPET, we evaluate the impact of mainstream LLM compression techniques on PET performance and then introduce knowledge inheritance and recovery strategies to restore the knowledge loss caused by these compression techniques. Our experimental results demonstrate that, owing to the restoring strategies of CPET, collaborating task-specific PET modules with a compressed LLM can achieve comparable performance to collaborating PET modules with the original version of the compressed LLM and outperform directly applying vanilla PET methods to the compressed LLM.

Title: Large Language Models as Superpositions of Cultural Perspectives. (arXiv:2307.07870v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2307.07870
Code URL: null
Copy Paste: [[2307.07870] Large Language Models as Superpositions of Cultural Perspectives](http://arxiv.org/abs/2307.07870) #large language model
Summary:
Large Language Models (LLMs) are often misleadingly recognized as having a personality or a set of values. We argue that an LLM can be seen as a superposition of perspectives with different values and personality traits. LLMs exhibit context-dependent values and personality traits that change based on the induced perspective (as opposed to humans, who tend to have more coherent values and personality traits across contexts). We introduce the concept of perspective controllability, which refers to a model's affordance to adopt various perspectives with differing values and personality traits. In our experiments, we use questionnaires from psychology (PVQ, VSM, IPIP) to study how exhibited values and personality traits change based on different perspectives. Through qualitative experiments, we show that LLMs express different values when those are (implicitly or explicitly) implied in the prompt, and that LLMs express different values even when those are not obviously implied (demonstrating their context-dependent nature). We then conduct quantitative experiments to study the controllability of different models (GPT-4, GPT-3.5, OpenAssistant, StableVicuna, StableLM), the effectiveness of various methods for inducing perspectives, and the smoothness of the models' drivability. We conclude by examining the broader implications of our work and outline a variety of associated scientific questions. The project website is available at https://sites.google.com/view/llm-superpositions .

segmentation

Title: ACF-Net: An Attention-enhanced Co-interactive Fusion Network for Automated Structural Condition Assessment in Visual Inspection. (arXiv:2307.07643v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07643
Code URL: null
Copy Paste: [[2307.07643] ACF-Net: An Attention-enhanced Co-interactive Fusion Network for Automated Structural Condition Assessment in Visual Inspection](http://arxiv.org/abs/2307.07643) #segmentation
Summary:
Efficiently monitoring the condition of civil infrastructures necessitates automating the structural condition assessment in visual inspection. This paper proposes an Attention-enhanced Co-interactive Fusion Network (ACF-Net) for automatic structural condition assessment in visual bridge inspection. The ACF-Net can simultaneously parse structural elements and segment surface defects on the elements in inspection images. It integrates two task-specific relearning subnets to extract task-specific features from an overall feature embedding and a co-interactive feature fusion module to capture the spatial correlation and facilitate information sharing between tasks. Experimental results demonstrate that the proposed ACF-Net outperforms the current state-of-the-art approaches, achieving promising performance with 92.11% mIoU for element parsing and 87.16% mIoU for corrosion segmentation on the new benchmark dataset Steel Bridge Condition Inspection Visual (SBCIV) testing set. An ablation study reveals the strengths of ACF-Net, and a case study showcases its capability to automate structural condition assessment. The code will be open-source after acceptance.

Title: MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. (arXiv:2307.07662v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07662
Code URL: null
Copy Paste: [[2307.07662] MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression](http://arxiv.org/abs/2307.07662) #segmentation
Summary:
Bounding box regression (BBR) has been widely used in object detection and instance segmentation, which is an important step in object localization. However, most of the existing loss functions for bounding box regression cannot be optimized when the predicted box has the same aspect ratio as the groundtruth box, but the width and height values are exactly different. In order to tackle the issues mentioned above, we fully explore the geometric features of horizontal rectangle and propose a novel bounding box similarity comparison metric MPDIoU based on minimum point distance, which contains all of the relevant factors considered in the existing loss functions, namely overlapping or non-overlapping area, central points distance, and deviation of width and height, while simplifying the calculation process. On this basis, we propose a bounding box regression loss function based on MPDIoU, called LMPDIoU . Experimental results show that the MPDIoU loss function is applied to state-of-the-art instance segmentation (e.g., YOLACT) and object detection (e.g., YOLOv7) model trained on PASCAL VOC, MS COCO, and IIIT5k outperforms existing loss functions.

Title: Learning from Pseudo-labeled Segmentation for Multi-Class Object Counting. (arXiv:2307.07677v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07677
Code URL: null
Copy Paste: [[2307.07677] Learning from Pseudo-labeled Segmentation for Multi-Class Object Counting](http://arxiv.org/abs/2307.07677) #segmentation
Summary:
Class-agnostic counting (CAC) has numerous potential applications across various domains. The goal is to count objects of an arbitrary category during testing, based on only a few annotated exemplars. In this paper, we point out that the task of counting objects of interest when there are multiple object classes in the image (namely, multi-class object counting) is particularly challenging for current object counting models. They often greedily count every object regardless of the exemplars. To address this issue, we propose localizing the area containing the objects of interest via an exemplar-based segmentation model before counting them. The key challenge here is the lack of segmentation supervision to train this model. To this end, we propose a method to obtain pseudo segmentation masks using only box exemplars and dot annotations. We show that the segmentation model trained on these pseudo-labeled masks can effectively localize objects of interest for an arbitrary multi-class image based on the exemplars. To evaluate the performance of different methods on multi-class counting, we introduce two new benchmarks, a synthetic multi-class dataset and a new test set of real images in which objects from multiple classes are present. Our proposed method shows a significant advantage over the previous CAC methods on these two benchmarks.

Title: PSGformer: Enhancing 3D Point Cloud Instance Segmentation via Precise Semantic Guidance. (arXiv:2307.07708v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07708
Code URL: null
Copy Paste: [[2307.07708] PSGformer: Enhancing 3D Point Cloud Instance Segmentation via Precise Semantic Guidance](http://arxiv.org/abs/2307.07708) #segmentation
Summary:
Most existing 3D instance segmentation methods are derived from 3D semantic segmentation models. However, these indirect approaches suffer from certain limitations. They fail to fully leverage global and local semantic information for accurate prediction, which hampers the overall performance of the 3D instance segmentation framework. To address these issues, this paper presents PSGformer, a novel 3D instance segmentation network. PSGformer incorporates two key advancements to enhance the performance of 3D instance segmentation. Firstly, we propose a Multi-Level Semantic Aggregation Module, which effectively captures scene features by employing foreground point filtering and multi-radius aggregation. This module enables the acquisition of more detailed semantic information from global and local perspectives. Secondly, PSGformer introduces a Parallel Feature Fusion Transformer Module that independently processes super-point features and aggregated features using transformers. The model achieves a more comprehensive feature representation by the features which connect global and local features. We conducted extensive experiments on the ScanNetv2 dataset. Notably, PSGformer exceeds compared state-of-the-art methods by 2.2% on ScanNetv2 hidden test set in terms of mAP. Our code and models will be publicly released.

Title: Improving Translation Invariance in Convolutional Neural Networks with Peripheral Prediction Padding. (arXiv:2307.07725v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07725
Code URL: null
Copy Paste: [[2307.07725] Improving Translation Invariance in Convolutional Neural Networks with Peripheral Prediction Padding](http://arxiv.org/abs/2307.07725) #segmentation
Summary:
Zero padding is often used in convolutional neural networks to prevent the feature map size from decreasing with each layer. However, recent studies have shown that zero padding promotes encoding of absolute positional information, which may adversely affect the performance of some tasks. In this work, a novel padding method called Peripheral Prediction Padding (PP-Pad) method is proposed, which enables end-to-end training of padding values suitable for each task instead of zero padding. Moreover, novel metrics to quantitatively evaluate the translation invariance of the model are presented. By evaluating with these metrics, it was confirmed that the proposed method achieved higher accuracy and translation invariance than the previous methods in a semantic segmentation task.

Title: Handwritten and Printed Text Segmentation: A Signature Case Study. (arXiv:2307.07887v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07887
Code URL: null
Copy Paste: [[2307.07887] Handwritten and Printed Text Segmentation: A Signature Case Study](http://arxiv.org/abs/2307.07887) #segmentation
Summary:
While analyzing scanned documents, handwritten text can overlay printed text. This causes difficulties during the optical character recognition (OCR) and digitization process of documents, and subsequently, hurts downstream NLP tasks. Prior research either focuses only on the binary classification of handwritten text, or performs a three-class segmentation of the document, i.e., recognition of handwritten, printed, and background pixels. This results in the assignment of the handwritten and printed overlapping pixels to only one of the classes, and thus, they are not accounted for in the other class. Thus, in this research, we develop novel approaches for addressing the challenges of handwritten and printed text segmentation with the goal of recovering text in different classes in whole, especially improving the segmentation performance on the overlapping parts. As such, to facilitate with this task, we introduce a new dataset, SignaTR6K, collected from real legal documents, as well as a new model architecture for handwritten and printed text segmentation task. Our best configuration outperforms the prior work on two different datasets by 17.9% and 7.3% on IoU scores.

Title: Holistic Prototype Attention Network for Few-Shot VOS. (arXiv:2307.07933v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07933
Code URL: null
Copy Paste: [[2307.07933] Holistic Prototype Attention Network for Few-Shot VOS](http://arxiv.org/abs/2307.07933) #segmentation
Summary:
Few-shot video object segmentation (FSVOS) aims to segment dynamic objects of unseen classes by resorting to a small set of support images that contain pixel-level object annotations. Existing methods have demonstrated that the domain agent-based attention mechanism is effective in FSVOS by learning the correlation between support images and query frames. However, the agent frame contains redundant pixel information and background noise, resulting in inferior segmentation performance. Moreover, existing methods tend to ignore inter-frame correlations in query videos. To alleviate the above dilemma, we propose a holistic prototype attention network (HPAN) for advancing FSVOS. Specifically, HPAN introduces a prototype graph attention module (PGAM) and a bidirectional prototype attention module (BPAM), transferring informative knowledge from seen to unseen classes. PGAM generates local prototypes from all foreground features and then utilizes their internal correlations to enhance the representation of the holistic prototypes. BPAM exploits the holistic information from support images and video frames by fusing co-attention and self-attention to achieve support-query semantic consistency and inner-frame temporal consistency. Extensive experiments on YouTube-FSVOS have been provided to demonstrate the effectiveness and superiority of our proposed HPAN method.

Title: Dual-level Interaction for Domain Adaptive Semantic Segmentation. (arXiv:2307.07972v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07972
Code URL: https://github.com/rainjamesy/dida
Copy Paste: [[2307.07972] Dual-level Interaction for Domain Adaptive Semantic Segmentation](http://arxiv.org/abs/2307.07972) #segmentation
Summary:
To circumvent the costly pixel-wise annotations of real-world images in the semantic segmentation task, the Unsupervised Domain Adaptation (UDA) is explored to firstly train a model with the labeled source data (synthetic images) and then adapt it to the unlabeled target data (real images). Among all the techniques being studied, the self-training approach recently secures its position in domain adaptive semantic segmentation, where a model is trained with target domain pseudo-labels. Current advances have mitigated noisy pseudo-labels resulting from the domain gap. However, they still struggle with erroneous pseudo-labels near the decision boundaries of the semantic classifier. In this paper, we tackle this issue by proposing a dual-level interaction for domain adaptation (DIDA) in semantic segmentation. Explicitly, we encourage the different augmented views of the same pixel to have not only similar class prediction (semantic-level) but also akin similarity relationship respected to other pixels (instance-level). As it is impossible to keep features of all pixel instances for a dataset, we novelly design and maintain a labeled instance bank with dynamic updating strategies to selectively store the informative features of instances. Further, DIDA performs cross-level interaction with scattering and gathering techniques to regenerate more reliable pseudolabels. Our method outperforms the state-of-the-art by a notable margin, especially on confusing and long-tailed classes. Code is available at https://github.com/RainJamesY/DIDA.

Title: HRHD-HK: A benchmark dataset of high-rise and high-density urban scenes for 3D semantic segmentation of photogrammetric point clouds. (arXiv:2307.07976v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.07976
Code URL: https://github.com/luzaijiaoxial/hrhd-hk
Copy Paste: [[2307.07976] HRHD-HK: A benchmark dataset of high-rise and high-density urban scenes for 3D semantic segmentation of photogrammetric point clouds](http://arxiv.org/abs/2307.07976) #segmentation
Summary:
Many existing 3D semantic segmentation methods, deep learning in computer vision notably, claimed to achieve desired results on urban point clouds, in which the city objects are too many and diverse for people to judge qualitatively. Thus, it is significant to assess these methods quantitatively in diversified real-world urban scenes, encompassing high-rise, low-rise, high-density, and low-density urban areas. However, existing public benchmark datasets primarily represent low-rise scenes from European cities and cannot assess the methods comprehensively. This paper presents a benchmark dataset of high-rise urban point clouds, namely High-Rise, High-Density urban scenes of Hong Kong (HRHD-HK), which has been vacant for a long time. HRHD-HK arranged in 150 tiles contains 273 million colorful photogrammetric 3D points from diverse urban settings. The semantic labels of HRHD-HK include building, vegetation, road, waterbody, facility, terrain, and vehicle. To the best of our knowledge, HRHD-HK is the first photogrammetric dataset that focuses on HRHD urban areas. This paper also comprehensively evaluates eight popular semantic segmentation methods on the HRHD-HK dataset. Experimental results confirmed plenty of room for enhancing the current 3D semantic segmentation of point clouds, especially for city objects with small volumes. Our dataset is publicly available at: https://github.com/LuZaiJiaoXiaL/HRHD-HK.

Title: Multi-Object Discovery by Low-Dimensional Object Motion. (arXiv:2307.08027v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2307.08027
Code URL: null
Copy Paste: [[2307.08027] Multi-Object Discovery by Low-Dimensional Object Motion](http://arxiv.org/abs/2307.08027) #segmentation
Summary:
Recent work in unsupervised multi-object segmentation shows impressive results by predicting motion from a single image despite the inherent ambiguity in predicting motion without the next image. On the other hand, the set of possible motions for an image can be constrained to a low-dimensional space by considering the scene structure and moving objects in it. We propose to model pixel-wise geometry and object motion to remove ambiguity in reconstructing flow from a single image. Specifically, we divide the image into coherently moving regions and use depth to construct flow bases that best explain the observed flow in each region. We achieve state-of-the-art results in unsupervised multi-object segmentation on synthetic and real-world datasets by modeling the scene structure and object motion. Our evaluation of the predicted depth maps shows reliable performance in monocular depth estimation.