secure

Title: A Novel Approach for Invoice Management using Blockchain. (arXiv:2309.03303v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03303
Code URL: null
Copy Paste: [[2309.03303]] A Novel Approach for Invoice Management using Blockchain(http://arxiv.org/abs/2309.03303)
Summary:
Electronic invoicing is another area where blockchain technology is being applied. It has the potential to change how payments are made, invoices are issued, and transactions are validated. A blockchain-based invoicing system enables smooth payments from a customer's digital wallet to a business's digital wallet. Transactions are simple to track and monitor, and the blockchain can be used to retrieve the full history of an exchange. Shopkeepers sometimes create fake bills and submit them to the tax authorities. To bring transparency to billing among customers, shopkeepers, and tax authorities, a blockchain-based billing system is to be implemented so that billing in our country works smoothly. Blockchain technology can revolutionize the invoicing and payment process by providing a secure, transparent, and tamper-proof system. A blockchain-based billing system can facilitate smooth payments, allow for easy tracking and monitoring of transactions, and provide a tamper-proof history of all exchanges. The use of blockchain can prevent fraud and increase transparency among customers, shopkeepers, and tax authorities. Furthermore, it can streamline the process by using digital wallets for both customers and businesses, reducing the time and resources that traditional invoicing methods require. Overall, blockchain technology can bring greater efficiency and trust to the billing system, creating a more secure process that benefits all parties involved.
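The "tamper-proof history" idea in this summary can be illustrated with a hash chain. The sketch below is not the paper's system: the function names, the invoice fields, and the use of SHA-256 over a JSON payload are all assumptions chosen for illustration, and a real blockchain would add consensus, signatures, and distribution on top.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first block


def block_hash(index, prev_hash, invoice):
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "prev": prev_hash, "invoice": invoice}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_invoice(chain, invoice):
    """Append an invoice record that commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    index = len(chain)
    chain.append(
        {
            "index": index,
            "prev": prev_hash,
            "invoice": invoice,
            "hash": block_hash(index, prev_hash, invoice),
        }
    )


def verify(chain):
    """Recompute every hash; any edited invoice breaks the chain."""
    prev_hash = GENESIS
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(i, prev_hash, block["invoice"]):
            return False
        prev_hash = block["hash"]
    return True
```

Because each block hash covers the previous hash, changing an old invoice invalidates every later block, which is what lets a tax authority detect a bill that was altered after it was issued.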

Title: Exploring Post-Quantum Cryptographic Schemes for TLS in 5G Nb-IoT: Feasibility and Recommendations. (arXiv:2309.03338v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03338
Code URL: null
Copy Paste: [[2309.03338]] Exploring Post-Quantum Cryptographic Schemes for TLS in 5G Nb-IoT: Feasibility and Recommendations(http://arxiv.org/abs/2309.03338)
Summary:
Narrowband Internet of Things (NB-IoT) is a wireless communication technology that enables a wide range of applications, from smart cities to industrial automation. As a part of the 5G extension, NB-IoT promises to connect billions of devices with low-power and low-cost requirements. However, the incoming NB-IoT era is already under threat from quantum computers, which might break the conventional cryptographic algorithms that can be adapted to secure NB-IoT devices at large scale. In this context, we investigate the feasibility of using post-quantum key exchange and signature algorithms for securing NB-IoT applications. We develop a realistic ns-3 environment to represent the characteristics of NB-IoT networks and analyze the usage of post-quantum algorithms to secure communication. Our findings suggest that using the NIST-selected post-quantum key-exchange protocol Kyber does not introduce significant overhead, but post-quantum signature schemes can result in impractical latency and lower throughput.

Title: An Adaptive and Modular Blockchain Enabled Architecture for a Decentralized Metaverse. (arXiv:2309.03502v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03502
Code URL: null
Copy Paste: [[2309.03502]] An Adaptive and Modular Blockchain Enabled Architecture for a Decentralized Metaverse(http://arxiv.org/abs/2309.03502)
Summary:
A metaverse breaks the boundaries of time and space between people, realizing a more realistic virtual experience, improving work efficiency, and creating a new business model. Blockchain, as one of the key supporting technologies for a metaverse design, provides a trusted interactive environment. However, the rich and varied scenes of a metaverse have led to excessive consumption of on-chain resources, raising the threshold for ordinary users to join, thereby losing the human-centered design. Therefore, we propose an adaptive and modular blockchain-enabled architecture for a decentralized metaverse to address these issues. The solution includes an adaptive consensus/ledger protocol based on a modular blockchain, which can effectively adapt to the ever-changing scenarios of the metaverse, reduce resource consumption, and provide a secure and reliable interactive environment. In addition, we propose the concept of Non-Fungible Resource (NFR) to virtualize idle resources. Users can establish a temporary trusted environment and rent others' NFR to meet their computing needs. Finally, we simulate and test our solution based on XuperChain, and the experimental results prove the feasibility of our design.

Title: Caveat (IoT) Emptor: Towards Transparency of IoT Device Presence. (arXiv:2309.03574v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03574
Code URL: null
Copy Paste: [[2309.03574]] Caveat (IoT) Emptor: Towards Transparency of IoT Device Presence(http://arxiv.org/abs/2309.03574)
Summary:
As many types of IoT devices worm their way into numerous settings and many aspects of our daily lives, awareness of their presence and functionality becomes a source of major concern. Hidden IoT devices can snoop (via sensing) on nearby unsuspecting users, and impact the environment where unaware users are present, via actuation. This prompts, respectively, privacy and security/safety issues. The dangers of hidden IoT devices have been recognized and prior research suggested some means of mitigation, mostly based on traffic analysis or using specialized hardware to uncover devices. While such approaches are partially effective, there is currently no comprehensive approach to IoT device transparency. Prompted in part by recent privacy regulations (GDPR and CCPA), this paper motivates and constructs a privacy-agile Root-of-Trust architecture for IoT devices, called PAISA: Privacy-Agile IoT Sensing and Actuation. It guarantees timely and secure announcements about IoT devices' presence and their capabilities. PAISA has two components: one on the IoT device that guarantees periodic announcements of its presence even if all device software is compromised, and the other that runs on the user device, which captures and processes announcements. Notably, PAISA requires no hardware modifications; it uses a popular off-the-shelf Trusted Execution Environment (TEE) -- ARM TrustZone. This work also comprises a fully functional (open-sourced) prototype implementation of PAISA, which includes: an IoT device that makes announcements via IEEE 802.11 WiFi beacons and an Android smartphone-based app that captures and processes announcements. Both security and performance of PAISA design and prototype are discussed.

security

Title: MALITE: Lightweight Malware Detection and Classification for Constrained Devices. (arXiv:2309.03294v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03294
Code URL: null
Copy Paste: [[2309.03294]] MALITE: Lightweight Malware Detection and Classification for Constrained Devices(http://arxiv.org/abs/2309.03294)
Summary:
Today, malware is one of the primary cyberthreats to organizations. Malware has pervaded almost every type of computing device including the ones having limited memory, battery and computation power such as mobile phones, tablets and embedded devices like Internet-of-Things (IoT) devices. Consequently, the privacy and security of the malware infected systems and devices have been heavily jeopardized. In recent years, researchers have leveraged machine learning based strategies for malware detection and classification. Malware analysis approaches can only be employed in resource constrained environments if the methods are lightweight in nature. In this paper, we present MALITE, a lightweight malware analysis system, that can classify various malware families and distinguish between benign and malicious binaries. MALITE converts a binary into a gray scale or an RGB image and employs low memory and battery power consuming as well as computationally inexpensive malware analysis strategies. We have designed MALITE-MN, a lightweight neural network based architecture and MALITE-HRF, an ultra lightweight random forest based method that uses histogram features extracted by a sliding window. We evaluate the performance of both on six publicly available datasets (Malimg, Microsoft BIG, Dumpware10, MOTIF, Drebin and CICAndMal2017), and compare them to four state-of-the-art malware classification techniques. The results show that MALITE-MN and MALITE-HRF not only accurately identify and classify malware but also respectively consume several orders of magnitude lower resources (in terms of both memory as well as computation capabilities), making them much more suitable for resource constrained environments.
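The feature pipeline described for MALITE-HRF (binary to grayscale image, then histogram features from a sliding window) can be sketched roughly as follows. The image width, window size, stride, and bin count are hypothetical values chosen for illustration; the abstract does not state the ones the authors use, and the random forest classifier that would consume these features is omitted.

```python
from typing import List


def bytes_to_rows(data: bytes, width: int = 16) -> List[List[int]]:
    """Lay raw bytes out as a grayscale image, one byte per pixel."""
    padded = data + bytes((-len(data)) % width)  # zero-pad the last row
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def window_histograms(
    rows: List[List[int]], window_rows: int = 4, stride: int = 4, bins: int = 8
) -> List[float]:
    """Slide a window down the image and emit one normalised
    intensity histogram per window position."""
    features: List[float] = []
    last_top = max(len(rows) - window_rows, 0)
    for top in range(0, last_top + 1, stride):
        pixels = [p for row in rows[top:top + window_rows] for p in row]
        if not pixels:
            continue  # empty input produces no features
        hist = [0] * bins
        for p in pixels:
            hist[p * bins // 256] += 1  # map 0..255 onto the bins
        features.extend(h / len(pixels) for h in hist)
    return features
```

The resulting fixed-length vector is small and cheap to compute, which is the property that makes histogram features attractive for constrained devices.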

Title: This is How You Lose the Transient Execution War. (arXiv:2309.03376v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03376
Code URL: null
Copy Paste: [[2309.03376]] This is How You Lose the Transient Execution War(http://arxiv.org/abs/2309.03376)
Summary:
A new class of vulnerabilities related to speculative and out-of-order execution, fault-injection, and microarchitectural side channels rose to attention in 2018. The techniques behind the transient execution vulnerabilities were not new, but the combined application of the techniques was more sophisticated, and the security impact more severe, than previously considered possible. Numerous mitigations have been proposed and implemented for variants of the transient execution vulnerabilities. While Meltdown-type exception-based transient execution vulnerabilities have proven to be tractable, Spectre-type vulnerabilities and other speculation-based transient execution vulnerabilities have been far more resistant to countermeasures. A few proposed mitigations have been widely adopted by hardware vendors and software developers, but combining those commonly deployed mitigations does not produce an effective and comprehensive solution; it only protects against a small subset of the variants. Over the years, newly proposed mitigations have been trending towards more effective and comprehensive approaches with better performance, and yet, older mitigations remain the most popular despite limited security benefits and prohibitive performance penalties. If we continue this way, we can look forward to many generations of hardware debilitated by performance penalties from increasing layers of mitigations as new variants are discovered, and yet still vulnerable to both known and future variants.

Title: Measuring Website Password Creation Policies At Scale. (arXiv:2309.03384v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03384
Code URL: null
Copy Paste: [[2309.03384]] Measuring Website Password Creation Policies At Scale(http://arxiv.org/abs/2309.03384)
Summary:
Researchers have extensively explored how password creation policies influence the security and usability of user-chosen passwords, producing evidence-based policy guidelines. However, for web authentication to improve in practice, websites must actually implement these recommendations. To date, there has been limited investigation into what password creation policies are actually deployed by sites. Existing works are mostly dated and all studies relied on manual evaluations, assessing a small set of sites (at most 150, skewed towards top sites). Thus, we lack a broad understanding of the password policies used today. In this paper, we develop an automated technique for inferring a website's password creation policy, and apply it at scale to measure the policies of over 20K sites, over two orders of magnitude (135x) more sites than prior work. Our findings identify the common policies deployed, potential causes of weak policies, and directions for improving authentication in practice. Ultimately, our study provides the first large-scale understanding of password creation policies on the web.

Title: Assume but Verify: Deductive Verification of Leaked Information in Concurrent Applications (Extended Version). (arXiv:2309.03442v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03442
Code URL: null
Copy Paste: [[2309.03442]] Assume but Verify: Deductive Verification of Leaked Information in Concurrent Applications (Extended Version)(http://arxiv.org/abs/2309.03442)
Summary:
We consider the problem of specifying and proving the security of non-trivial, concurrent programs that intentionally leak information. We present a method that decomposes the problem into (a) proving that the program only leaks information it has declassified via assume annotations already widely used in deductive program verification; and (b) auditing the declassifications against a declarative security policy. We show how condition (a) can be enforced by an extension of the existing program logic SecCSL, and how (b) can be checked by proving a set of simple entailments. Part of the challenge is to define respective semantic soundness criteria and to formally connect these to the logic rules and policy audit. We support our methodology in an auto-active program verifier, which we apply to verify the implementations of various case study programs against a range of declassification policies.

Title: An Anonymous yet Accountable Contract Wallet System using Account Abstraction. (arXiv:2309.03480v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03480
Code URL: null
Copy Paste: [[2309.03480]] An Anonymous yet Accountable Contract Wallet System using Account Abstraction(http://arxiv.org/abs/2309.03480)
Summary:
Account abstraction allows a contract wallet to initiate transaction execution. Thus, account abstraction is useful for preserving the privacy of externally owned accounts (EOAs) because it can remove a transaction issued from an EOA to the contract wallet and hides who issued the transaction by additionally employing anonymous authentication procedures such as ring signatures. However, unconditional anonymity is undesirable in practice because it prevents revealing who is accountable for a problem when it arises. Thus, maintaining a balance between anonymity and accountability is important.

In this paper, we propose an anonymous yet accountable contract wallet system. In addition to account abstraction, the proposed system also utilizes accountable ring signatures (Bootle et al., ESORICS 2015). The proposed system provides (1) anonymity of a transaction issuer that hides who agreed with running the contract wallet, and (2) accountability of the issuer, which allows the issuer to prove they agreed with running the contract wallet. Moreover, due to a security requirement of accountable ring signatures, the transaction issuer cannot claim that someone else issued the transaction. This functionality allows us to clarify the accountability involved in issuing a transaction. In addition, the proposed system allows an issuer to employ a typical signature scheme, e.g., ECDSA, together with the ring signature scheme. This functionality can be considered an extension of the common multi-signatures that require a certain number of ECDSA signatures to run a contract wallet. The proposed system was implemented using zkSync (Solidity). We discuss several potential applications of the proposed system, i.e., medical information sharing and asset management.

Title: A New Model for Testing IPv6 Fragment Handling. (arXiv:2309.03525v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03525
Code URL: null
Copy Paste: [[2309.03525]] A New Model for Testing IPv6 Fragment Handling(http://arxiv.org/abs/2309.03525)
Summary:
Since the origins of the Internet, various vulnerabilities exploiting the IP fragmentation process have plagued the IPv4 protocol, many leading to a wide range of attacks. IPv6 modified the handling of fragmentation and introduced a specific extension header, but this did not solve the related problems, as extensive literature has shown. One of the primary sources of problems has been overlapping fragments, which result in unexpected or malicious packets when reassembled. To overcome this problem, the authors of RFC 5722 decided that IPv6 hosts MUST silently drop overlapping fragments.

Since then, several studies have proposed methodologies to check if IPv6 hosts accept overlapping fragments and are still vulnerable to related attacks. However, some of the above methodologies have not been proven complete or need to be more accurate. In this paper we propose a novel model to check IPv6 fragmentation handling specifically suited for the reassembling strategies of modern operating systems. Previous models, indeed, considered OS reassembly policy as byte-based. However, nowadays, reassembly policies are fragment-based, making previous models inadequate. Our model leverages the commutative property of the checksum, simplifying the whole assessing process. Starting with this new model, we were able to better evaluate the RFC-5722 and RFC-9099 compliance of modern operating systems against fragmentation handling. Our results suggest that IPv6 fragmentation can still be considered a threat and that more effort is needed to solve related security issues.
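The commutative property of the checksum mentioned above can be shown with the standard 16-bit ones' complement Internet checksum. The abstract does not spell out how the model exploits this property, so the sketch below only demonstrates the property itself: reordering 16-bit-aligned blocks of a payload leaves the checksum unchanged. The function name is an assumption, and a real ICMPv6 or UDP checksum would also cover a pseudo-header, which is omitted here.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Two payloads made of the same 8-byte blocks in a different order.
block_a = b"AAAAAAAA"
block_b = b"BBBBBBBB"
same = internet_checksum(block_a + block_b) == internet_checksum(block_b + block_a)
```

Since IPv6 fragment payloads (other than the last) are multiples of 8 bytes, swapping whole fragments keeps the 16-bit word alignment, so a reassembled packet can carry a valid checksum regardless of the order in which such blocks end up.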

Title: Security assessment of common open source MQTT brokers and clients. (arXiv:2309.03547v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03547
Code URL: null
Copy Paste: [[2309.03547]] Security assessment of common open source MQTT brokers and clients(http://arxiv.org/abs/2309.03547)
Summary:
Security and dependability of devices are paramount for the IoT ecosystem. Message Queuing Telemetry Transport protocol (MQTT) is the de facto standard and the most common alternative for those limited devices that cannot leverage HTTP. However, the MQTT protocol was designed without security in mind, since it was initially intended for private networks of the oil and gas industry. Since MQTT is widely used in real applications, it is under the lens of the security community, also considering the widespread attacks targeting IoT devices. Following this research direction, in this paper we present an empirical security evaluation of several widespread implementations of MQTT system components, namely five broker libraries and three client libraries. While the results of our research do not capture very critical flaws, there are several scenarios where some libraries do not fully adhere to the standard and leave some margins that could be maliciously exploited and potentially cause system inconsistencies.

Title: Zero Trust: Applications, Challenges, and Opportunities. (arXiv:2309.03582v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03582
Code URL: null
Copy Paste: [[2309.03582]] Zero Trust: Applications, Challenges, and Opportunities(http://arxiv.org/abs/2309.03582)
Summary:
The escalating complexity of cybersecurity threats necessitates innovative approaches to safeguard digital assets and sensitive information. The Zero Trust paradigm offers a transformative solution by challenging conventional security models and emphasizing continuous verification and least privilege access. This survey comprehensively explores the theoretical foundations, practical implementations, applications, challenges, and future trends of Zero Trust. Through meticulous analysis, we highlight the relevance of Zero Trust in securing cloud environments, facilitating remote work, and protecting the Internet of Things (IoT) ecosystem. While cultural barriers and technical complexities present challenges, their mitigation unlocks Zero Trust's potential. Integrating Zero Trust with emerging technologies like AI and machine learning augments its efficacy, promising a dynamic and responsive security landscape. Embracing Zero Trust empowers organizations to navigate the ever-evolving cybersecurity realm with resilience and adaptability, redefining trust in the digital age.

Title: ProvG-Searcher: A Graph Representation Learning Approach for Efficient Provenance Graph Search. (arXiv:2309.03647v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03647
Code URL: null
Copy Paste: [[2309.03647]] ProvG-Searcher: A Graph Representation Learning Approach for Efficient Provenance Graph Search(http://arxiv.org/abs/2309.03647)
Summary:
We present ProvG-Searcher, a novel approach for detecting known APT behaviors within system security logs. Our approach leverages provenance graphs, a comprehensive graph representation of event logs, to capture and depict data provenance relations by mapping system entities as nodes and their interactions as edges. We formulate the task of searching provenance graphs as a subgraph matching problem and employ a graph representation learning method. The central component of our search methodology involves embedding of subgraphs in a vector space where subgraph relationships can be directly evaluated. We achieve this through the use of order embeddings that simplify subgraph matching to straightforward comparisons between a query and precomputed subgraph representations. To address challenges posed by the size and complexity of provenance graphs, we propose a graph partitioning scheme and a behavior-preserving graph reduction method. Overall, our technique offers significant computational efficiency, allowing most of the search computation to be performed offline while incorporating a lightweight comparison step during query execution. Experimental results on standard datasets demonstrate that ProvG-Searcher achieves superior performance, with an accuracy exceeding 99% in detecting query behaviors and a false positive rate of approximately 0.02%, outperforming other approaches.
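The idea of reducing subgraph matching to a comparison between embeddings can be sketched as below. This assumes the common order-embedding convention in which a query is treated as a subgraph of a target when every coordinate of the query embedding is less than or equal to the corresponding target coordinate. The abstract does not state the exact convention, threshold, or encoder, so all names and values here are hypothetical, and the embeddings are taken as given rather than learned.

```python
from typing import Dict, List, Sequence


def order_violation(query_emb: Sequence[float], target_emb: Sequence[float]) -> float:
    """Squared amount by which the query exceeds the target, per coordinate."""
    if len(query_emb) != len(target_emb):
        raise ValueError("embeddings must have the same dimension")
    return sum(max(0.0, q - t) ** 2 for q, t in zip(query_emb, target_emb))


def search(
    query_emb: Sequence[float],
    index: Dict[str, Sequence[float]],
    threshold: float = 1e-6,
) -> List[str]:
    """Return names of precomputed subgraphs whose embedding dominates the query."""
    return [
        name
        for name, emb in index.items()
        if order_violation(query_emb, emb) <= threshold
    ]
```

The index of subgraph embeddings can be built offline, so the work at query time is one cheap vector comparison per stored subgraph, which matches the offline/online split the summary describes.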

Title: The complexity of solving a random polynomial system. (arXiv:2309.03855v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03855
Code URL: null
Copy Paste: [[2309.03855]] The complexity of solving a random polynomial system(http://arxiv.org/abs/2309.03855)
Summary:
In practice, a multivariate cryptographic instance is a multivariate polynomial system, so the security of a protocol relies on the complexity of solving such a system. This paper gives an overview of a general algorithm used to solve a multivariate system and of the quantity on which the complexity of this algorithm depends: the solving degree. Unfortunately, the solving degree is hard to compute. For this reason, an invariant is introduced: the degree of regularity. Under certain conditions, this invariant gives an upper bound on the solving degree. We then discuss random polynomial systems and, in particular, what "random" means to us. Finally, we give an upper bound on both the degree of regularity and the solving degree of such random systems.

Title: A Natural Gas Consumption Forecasting System for Continual Learning Scenarios based on Hoeffding Trees with Change Point Detection Mechanism. (arXiv:2309.03720v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03720
Code URL: null
Copy Paste: [[2309.03720]] A Natural Gas Consumption Forecasting System for Continual Learning Scenarios based on Hoeffding Trees with Change Point Detection Mechanism(http://arxiv.org/abs/2309.03720)
Summary:
Forecasting natural gas consumption, considering seasonality and trends, is crucial in planning its supply and consumption and optimizing the cost of obtaining it, mainly by industrial entities. However, in times of threats to its supply, it is also a critical element that guarantees the supply of this raw material to meet individual consumers' needs, ensuring society's energy security. This article introduces a novel multistep ahead forecasting of natural gas consumption with change point detection integration for model collection selection with continual learning capabilities using data stream processing. The performance of the forecasting models based on the proposed approach is evaluated in a complex real-world use case of natural gas consumption forecasting. We employed Hoeffding tree predictors as forecasting models and the Pruned Exact Linear Time (PELT) algorithm for the change point detection procedure. The change point detection integration enables selecting a different model collection for successive time frames. Thus, three model collection selection procedures (with and without an error feedback loop) are defined and evaluated for forecasting scenarios with various densities of detected change points. These models were compared with change point agnostic baseline approaches. Our experiments show that fewer change points result in a lower forecasting error regardless of the model collection selection procedure employed. Also, simpler model collection selection procedures omitting forecasting error feedback lead to more robust forecasting models suitable for continual learning tasks.
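For readers unfamiliar with PELT, a textbook version of the change point search can be sketched as follows. The squared-error (shift in mean) segment cost and the penalty value are assumptions made for illustration; the abstract does not say which cost function or penalty the authors use, and the Hoeffding tree forecasters are not shown.

```python
from typing import List


def pelt(signal: List[float], penalty: float) -> List[int]:
    """Return change point indices (start of each new segment) that minimise
    total within-segment squared error plus a penalty per change point."""
    n = len(signal)
    # Prefix sums give O(1) segment costs.
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(a: int, b: int) -> float:
        """Squared deviation from the mean over signal[a:b]."""
        seg_sum = s1[b] - s1[a]
        return (s2[b] - s2[a]) - seg_sum * seg_sum / (b - a)

    best = [0.0] * (n + 1)  # best[t]: optimal penalised cost of signal[:t]
    best[0] = -penalty
    last = [0] * (n + 1)    # last[t]: start of the final segment ending at t
    candidates = [0]
    for t in range(1, n + 1):
        values = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(values)
        # Pruning step: drop starts that can never be optimal again.
        candidates = [s for v, s in values if v - penalty <= best[t]]
        candidates.append(t)

    change_points = []
    t = n
    while last[t] > 0:
        t = last[t]
        change_points.append(t)
    return sorted(change_points)
```

A larger penalty yields fewer detected change points, which is the knob that would control the "density of detected change points" the summary refers to.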

privacy

Title: FisheyePP4AV: A privacy-preserving method for autonomous vehicles on fisheye camera images. (arXiv:2309.03799v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03799
Code URL: https://github.com/khaclinh/FisheyePP4AV
Copy Paste: [[2309.03799]] FisheyePP4AV: A privacy-preserving method for autonomous vehicles on fisheye camera images(http://arxiv.org/abs/2309.03799)
Summary:
In many parts of the world, the use of vast amounts of data collected on public roadways for autonomous driving has increased. In order to detect and anonymize pedestrian faces and nearby car license plates in actual road-driving scenarios, there is an urgent need for effective solutions. As more data is collected, privacy concerns regarding it increase, including but not limited to pedestrian faces and surrounding vehicle license plates. Normal and fisheye cameras are the two common camera types that are typically mounted on collection vehicles. Because of their complex camera distortion models, fisheye camera images are deformed in contrast to regular images, which causes many deep learning models to perform poorly on computer vision tasks. In this work, we pay particular attention to protecting privacy while still adhering to several laws for fisheye camera photos taken by driverless vehicles. First, we suggest a framework for extracting face and plate identification knowledge from several teacher models. Second, we transform both the image and the label from a regular image to fisheye-like data using a varied and realistic fisheye transformation. Finally, we run a test using the open-source PP4AV dataset. The experimental findings demonstrate that our model outperformed baseline methods when trained on data from autonomous vehicles, even when the data were softly labeled. The implementation code is available on GitHub: https://github.com/khaclinh/FisheyePP4AV.

Title: Byzantine-Robust Federated Learning with Variance Reduction and Differential Privacy. (arXiv:2309.03437v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03437
Code URL: null
Copy Paste: [[2309.03437]] Byzantine-Robust Federated Learning with Variance Reduction and Differential Privacy(http://arxiv.org/abs/2309.03437)
Summary:
Federated learning (FL) is designed to preserve data privacy during model training, where the data remains on the client side (i.e., IoT devices), and only model updates of clients are shared iteratively for collaborative learning. However, this process is vulnerable to privacy attacks and Byzantine attacks: the local model updates shared throughout the FL network will leak private information about the local training data, and they can also be maliciously crafted by Byzantine attackers to disturb the learning. In this paper, we propose a new FL scheme that guarantees rigorous privacy and simultaneously enhances system robustness against Byzantine attacks. Our approach introduces sparsification- and momentum-driven variance reduction into the client-level differential privacy (DP) mechanism, to defend against Byzantine attackers. The security design does not violate the privacy guarantee of the client-level DP mechanism; hence, our approach achieves the same client-level DP guarantee as the state-of-the-art. We conduct extensive experiments on both IID and non-IID datasets and different tasks and evaluate the performance of our approach against different Byzantine attacks by comparing it with state-of-the-art defense methods. The results of our experiments show the efficacy of our framework and demonstrate its ability to improve system robustness against Byzantine attacks while achieving a strong privacy guarantee.
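The ingredients named in this summary (momentum, sparsification, and a client-level DP mechanism) can be sketched generically as below. This is not the paper's algorithm: the order of operations, the top-k rule, the momentum coefficient, and the Gaussian mechanism with clipping are standard building blocks assumed here for illustration, and all function and parameter names are hypothetical.

```python
import math
import random
from typing import List


def momentum_step(momentum: List[float], grad: List[float], beta: float = 0.9) -> List[float]:
    """Exponential moving average of gradients, which damps their variance."""
    return [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grad)]


def top_k_sparsify(vec: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude coordinates and zero the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def clip_l2(vec: List[float], max_norm: float) -> List[float]:
    """Scale the vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def private_aggregate(
    updates: List[List[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each client update, sum, add Gaussian noise, then average."""
    total = [0.0] * len(updates[0])
    for update in updates:
        for i, v in enumerate(clip_l2(update, max_norm)):
            total[i] += v
    sigma = noise_multiplier * max_norm  # noise scaled to the clipping bound
    return [(t + rng.gauss(0.0, sigma)) / len(updates) for t in total]
```

Clipping bounds how much any single client, honest or Byzantine, can move the aggregate, while the added noise is what provides the client-level privacy guarantee; the actual privacy accounting is outside the scope of this sketch.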

Title: Privacy-preserving Continual Federated Clustering via Adaptive Resonance Theory. (arXiv:2309.03487v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03487
Code URL: https://github.com/Masuyama-lab/FCAC
Copy Paste: [[2309.03487]] Privacy-preserving Continual Federated Clustering via Adaptive Resonance Theory(http://arxiv.org/abs/2309.03487)
Summary:
With the increasing importance of data privacy protection, various privacy-preserving machine learning methods have been proposed. In the clustering domain, various algorithms with a federated learning framework (i.e., federated clustering) have been actively studied and showed high clustering performance while preserving data privacy. However, most of the base clusterers (i.e., clustering algorithms) used in existing federated clustering algorithms need to specify the number of clusters in advance. These algorithms, therefore, are unable to deal with data whose distributions are unknown or continually changing. To tackle this problem, this paper proposes a privacy-preserving continual federated clustering algorithm. In the proposed algorithm, an adaptive resonance theory-based clustering algorithm capable of continual learning is used as a base clusterer. Therefore, the proposed algorithm inherits the ability of continual learning. Experimental results with synthetic and real-world datasets show that the proposed algorithm has superior clustering performance to state-of-the-art federated clustering algorithms while realizing data privacy protection and continual learning ability. The source code is available at https://github.com/Masuyama-lab/FCAC.

Title: TSGBench: Time Series Generation Benchmark. (arXiv:2309.03755v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03755
Code URL: null
Copy Paste: [[2309.03755]] TSGBench: Time Series Generation Benchmark(http://arxiv.org/abs/2309.03755)
Summary:
Synthetic Time Series Generation (TSG) is crucial in a range of applications, including data augmentation, anomaly detection, and privacy preservation. Although significant strides have been made in this field, existing methods exhibit three key limitations: (1) They often benchmark against similar model types, constraining a holistic view of performance capabilities. (2) The use of specialized synthetic and private datasets introduces biases and hampers generalizability. (3) Ambiguous evaluation measures, often tied to custom networks or downstream tasks, hinder consistent and fair comparison.

To overcome these limitations, we introduce TSGBench, the inaugural TSG Benchmark, designed for a unified and comprehensive assessment of TSG methods. It comprises three modules: (1) a curated collection of publicly available, real-world datasets tailored for TSG, together with a standardized preprocessing pipeline; (2) a comprehensive evaluation measures suite including vanilla measures, new distance-based assessments, and visualization tools; (3) a pioneering generalization test rooted in Domain Adaptation (DA), compatible with all methods. We have conducted extensive experiments across ten real-world datasets from diverse domains, utilizing ten advanced TSG methods and twelve evaluation measures, all gauged through TSGBench. The results highlight its remarkable efficacy and consistency. More importantly, TSGBench delivers a statistical breakdown of method rankings, illuminating performance variations across different datasets and measures, and offering nuanced insights into the effectiveness of each method.

protect

defense

Title: DiffDefense: Defending against Adversarial Attacks via Diffusion Models. (arXiv:2309.03702v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03702
Code URL: null
Copy Paste: [[2309.03702]] DiffDefense: Defending against Adversarial Attacks via Diffusion Models(http://arxiv.org/abs/2309.03702)
Summary:
This paper presents a novel reconstruction method that leverages Diffusion Models to protect machine learning classifiers against adversarial attacks, all without requiring any modifications to the classifiers themselves. The susceptibility of machine learning models to minor input perturbations renders them vulnerable to adversarial attacks. While diffusion-based methods are typically disregarded for adversarial defense due to their slow reverse process, this paper demonstrates that our proposed method offers robustness against adversarial threats while preserving clean accuracy, speed, and plug-and-play compatibility. Code at: https://github.com/HondamunigePrasannaSilva/DiffDefence.

attack

Title: MIRA: Cracking Black-box Watermarking on Deep Neural Networks via Model Inversion-based Removal Attacks. (arXiv:2309.03466v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03466
Code URL: null
Copy Paste: [[2309.03466]] MIRA: Cracking Black-box Watermarking on Deep Neural Networks via Model Inversion-based Removal Attacks(http://arxiv.org/abs/2309.03466)
Summary:
To protect the intellectual property of well-trained deep neural networks (DNNs), black-box DNN watermarks, which are embedded into the prediction behavior of DNN models on a set of specially-crafted samples, have gained increasing popularity in both academia and industry. Watermark robustness is usually implemented against attackers who steal the protected model and obfuscate its parameters for watermark removal. Recent studies empirically prove the robustness of most black-box watermarking schemes against known removal attempts.

In this paper, we propose a novel Model Inversion-based Removal Attack (MIRA), which is watermark-agnostic and effective against most mainstream black-box DNN watermarking schemes. In general, our attack pipeline exploits the internals of the protected model to recover and unlearn the watermark message. We further design target class detection and recovered sample splitting algorithms to reduce the utility loss caused by MIRA and achieve data-free watermark removal on half of the watermarking schemes. We conduct a comprehensive evaluation of MIRA against ten mainstream black-box watermarks on three benchmark datasets and DNN architectures. Compared with six baseline removal attacks, MIRA achieves strong watermark removal effects on the covered watermarks, preserving at least 90% of the stolen model utility, under more relaxed or even no assumptions on the dataset availability.

Title: Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains. (arXiv:2309.03660v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03660
Code URL: null
Copy Paste: [[2309.03660]] Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains(http://arxiv.org/abs/2309.03660)
Summary:
Recently, unsupervised machine-learning-based systems have been developed to detect zero-day Web attacks, which can effectively enhance existing Web Application Firewalls (WAFs). However, prior work only considers detecting attacks on specific domains by training particular detection models for those domains. These systems require a large amount of training data, which lengthens model training and deployment. In this paper, we propose RETSINA, a novel meta-learning based framework that enables zero-day Web attack detection across different domains in an organization with limited training data. Specifically, it utilizes meta-learning to share knowledge across these domains, e.g., the relationship between HTTP requests in heterogeneous domains, to efficiently train detection models. Moreover, we develop an adaptive preprocessing module to facilitate semantic analysis of Web requests across different domains and design a multi-domain representation method to capture semantic correlations between different domains for cross-domain model training. We conduct experiments using four real-world datasets on different domains with a total of 293M Web requests. The experimental results demonstrate that RETSINA outperforms the existing unsupervised Web attack detection methods with limited training data, e.g., RETSINA needs only 5-minute training data to achieve comparable detection performance to the existing methods that train separate models for different domains using 1-day training data. We also conduct real-world deployment in an Internet company. RETSINA captures on average 126 and 218 zero-day attack requests per day in two domains, respectively, in one month.

Title: How adversarial attacks can disrupt seemingly stable accurate classifiers. (arXiv:2309.03665v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03665
Code URL: null
Copy Paste: [[2309.03665]] How adversarial attacks can disrupt seemingly stable accurate classifiers(http://arxiv.org/abs/2309.03665)
Summary:
Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data. We introduce a simple generic and generalisable framework for which key behaviours observed in practical systems arise with high probability -- notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counterintuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
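
The contrast described above (small directed perturbations flip the output, much larger random ones do not) can be illustrated with a toy linear classifier. This is a minimal sketch and not the paper's framework: the weights, the input point, the dimension of 1000, and the noise radius are all made-up assumptions chosen only to show the geometric effect.

```python
import math
import random


def score(w, b, x):
    """Signed distance-like score w.x + b of a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the decision surface."""
    return 1 if score(w, b, x) >= 0 else -1


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def adversarial_step(w, b, x, overshoot=1.01):
    """Perturbation along -w that just crosses the decision surface."""
    scale = -overshoot * score(w, b, x) / sum(wi * wi for wi in w)
    return [scale * wi for wi in w]


def random_perturbation(rng, n, radius):
    """Uniformly random direction (normalised Gaussian) with the given norm."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    g_norm = norm(g)
    return [radius * gi / g_norm for gi in g]


def flip_rate_under_noise(w, b, x, radius, trials, rng):
    """Fraction of random perturbations of a fixed norm that change the label."""
    base = classify(w, b, x)
    flips = 0
    for _ in range(trials):
        d = random_perturbation(rng, len(x), radius)
        if classify(w, b, [xi + di for xi, di in zip(x, d)]) != base:
            flips += 1
    return flips / trials


rng = random.Random(0)
n = 1000                          # assumed input dimension
w = [1.0] + [0.0] * (n - 1)       # hypothetical weights (axis-aligned for clarity)
b = 0.0
x = [0.5] + [0.0] * (n - 1)       # input with margin 0.5 from the decision surface

delta = adversarial_step(w, b, x)
adv_norm = norm(delta)
adv_flips = classify(w, b, [xi + di for xi, di in zip(x, delta)]) != classify(w, b, x)

# Random noise five times larger than the adversarial perturbation.
noise_flip_rate = flip_rate_under_noise(w, b, x, 5 * adv_norm, trials=200, rng=rng)
print(adv_flips, adv_norm, noise_flip_rate)
```

In this toy setting a random direction in 1000 dimensions has only a small component along `w`, so noise of several times the adversarial norm rarely reaches the decision surface. The paper's analysis covers a more general class of models than this linear example.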

robust

Title: Robust Visual Tracking by Motion Analyzing. (arXiv:2309.03247v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03247
Code URL: null
Copy Paste: [[2309.03247]] Robust Visual Tracking by Motion Analyzing(http://arxiv.org/abs/2309.03247)
Summary:
In recent years, Video Object Segmentation (VOS) has emerged as a complementary method to Video Object Tracking (VOT). VOS focuses on classifying all the pixels around the target, allowing for precise shape labeling, while VOT primarily focuses on the approximate region where the target might be. However, traditional segmentation modules usually classify pixels frame by frame, disregarding information between adjacent frames.

In this paper, we propose a new algorithm that addresses this limitation by analyzing the motion pattern using the inherent tensor structure. The tensor structure, obtained through Tucker2 tensor decomposition, proves to be effective in describing the target's motion. By incorporating this information, we achieved results competitive with the state of the art (SOTA) on four benchmarks: LaSOT, AVisT, OTB100, and GOT-10k. Furthermore, the proposed tracker is capable of real-time operation, adding value to its practical application.

Title: ViewMix: Augmentation for Robust Representation in Self-Supervised Learning. (arXiv:2309.03360v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03360
Code URL: null
Copy Paste: [[2309.03360]] ViewMix: Augmentation for Robust Representation in Self-Supervised Learning(http://arxiv.org/abs/2309.03360)
Summary:
Joint Embedding Architecture-based self-supervised learning methods credit the composition of data augmentations as a crucial factor in their strong representation learning capabilities. While regional dropout strategies have proven to guide models to focus on less indicative parts of the objects in supervised methods, they have not been adopted by self-supervised methods for generating positive pairs. This is because the regional dropout methods are not suitable for the input sampling process of the self-supervised methodology. Whereas dropping informative pixels from the positive pairs can result in inefficient training, replacing patches of a specific object with a different one can steer the model from maximizing the agreement between different positive pairs. Moreover, joint embedding representation learning methods have not made robustness their primary training outcome. To this end, we propose the ViewMix augmentation policy, specially designed for self-supervised learning: upon generating different views of the same image, patches are cut and pasted from one view to another. By leveraging the different views created by this augmentation strategy, multiple joint embedding-based self-supervised methodologies obtained better localization capability and consistently outperformed their corresponding baseline methods. It is also demonstrated that incorporating the ViewMix augmentation policy promotes robustness of the representations in the state-of-the-art methods. Furthermore, our experimentation and analysis of compute times suggest that ViewMix augmentation doesn't introduce any additional overhead compared to other counterparts.
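
The cut-and-paste step between two views might look roughly like the sketch below. The abstract does not specify patch sizes, sampling, or whether the patch keeps its location, so the function names, the same-location paste, and the patch dimensions here are all illustrative assumptions rather than the ViewMix policy itself. Images are plain nested lists to keep the example dependency-free.

```python
import random


def paste_patch(src_view, dst_view, top, left, height, width):
    """Return a copy of dst_view with one rectangular patch copied from src_view."""
    mixed = [row[:] for row in dst_view]          # leave the original view untouched
    for r in range(top, top + height):
        mixed[r][left:left + width] = src_view[r][left:left + width]
    return mixed


def random_patch_mix(src_view, dst_view, patch_h, patch_w, rng):
    """Pick a random patch location (an assumption) and paste it across views."""
    rows, cols = len(dst_view), len(dst_view[0])
    top = rng.randrange(rows - patch_h + 1)
    left = rng.randrange(cols - patch_w + 1)
    return paste_patch(src_view, dst_view, top, left, patch_h, patch_w)


# Two toy 6x6 "views" of the same image: all zeros and all ones.
view_a = [[0] * 6 for _ in range(6)]
view_b = [[1] * 6 for _ in range(6)]

mixed = random_patch_mix(view_b, view_a, patch_h=2, patch_w=3, rng=random.Random(0))
for row in mixed:
    print(row)
```

Because both views come from the same image, the pasted patch still shows the same object, which is the property the abstract contrasts with replacing patches by a different object.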

Title: Active shooter detection and robust tracking utilizing supplemental synthetic data. (arXiv:2309.03381v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03381
Code URL: null
Copy Paste: [[2309.03381]] Active shooter detection and robust tracking utilizing supplemental synthetic data(http://arxiv.org/abs/2309.03381)
Summary:
The increasing concern surrounding gun violence in the United States has led to a focus on developing systems to improve public safety. One approach to developing such a system is to detect and track shooters, which would help prevent or mitigate the impact of violent incidents. In this paper, we proposed detecting shooters as a whole, rather than just guns, which would allow for improved tracking robustness, as obscuring the gun would no longer cause the system to lose sight of the threat. However, publicly available data on shooters is much more limited and challenging to create than a gun dataset alone. Therefore, we explore the use of domain randomization and transfer learning to improve the effectiveness of training with synthetic data obtained from Unreal Engine environments. This enables the model to be trained on a wider range of data, increasing its ability to generalize to different situations. Using these techniques with YOLOv8 and Deep OC-SORT, we implemented an initial version of a shooter tracking system capable of running on edge hardware, including both a Raspberry Pi and a Jetson Nano.

Title: Towards Robust Natural-Looking Mammography Lesion Synthesis on Ipsilateral Dual-Views Breast Cancer Analysis. (arXiv:2309.03506v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03506
Code URL: null
Copy Paste: [[2309.03506]] Towards Robust Natural-Looking Mammography Lesion Synthesis on Ipsilateral Dual-Views Breast Cancer Analysis(http://arxiv.org/abs/2309.03506)
Summary:
In recent years, many mammographic image analysis methods have been introduced for improving cancer classification tasks. Two major issues of mammogram classification tasks are leveraging multi-view mammographic information and class-imbalance handling. For the first issue, many multi-view methods have been released for concatenating features of two or more views for the training and inference stage. Having said that, most existing multi-view methods are not explainable in terms of feature fusion, and treat the views equally for diagnosis. Our work proposes a simple but novel method for enhancing the examined view (main view) by leveraging low-level feature information from the auxiliary view (ipsilateral view) before learning the high-level feature that contains the cancerous features. For the second issue, we also propose a simple but novel malignant mammogram synthesis framework for upsampling minority-class samples. Our easy-to-implement, training-free framework eliminates current limitations of the CutMix algorithm, namely unreliable synthesized images with randomly pasted patches, hard-contour problems, and domain shift problems. Our results on VinDr-Mammo and CMMD datasets show the effectiveness of our two new frameworks for both multi-view training and synthesizing mammographic images, outperforming the previous conventional methods in our experimental settings.

Title: A Robust Negative Learning Approach to Partial Domain Adaptation Using Source Prototypes. (arXiv:2309.03531v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03531
Code URL: null
Copy Paste: [[2309.03531]] A Robust Negative Learning Approach to Partial Domain Adaptation Using Source Prototypes(http://arxiv.org/abs/2309.03531)
Summary:
This work proposes a robust Partial Domain Adaptation (PDA) framework that mitigates the negative transfer problem by incorporating a robust target-supervision strategy. It leverages ensemble learning and includes diverse, complementary label feedback, alleviating the effect of incorrect feedback and promoting pseudo-label refinement. Rather than relying exclusively on first-order moments for distribution alignment, our approach offers explicit objectives to optimize intra-class compactness and inter-class separation with the inferred source prototypes and highly-confident target samples in a domain-invariant fashion. Notably, we ensure source data privacy by eliminating the need to access the source data during the adaptation phase through a priori inference of source prototypes. We conducted a series of comprehensive experiments, including an ablation analysis, covering a range of partial domain adaptation tasks. Comprehensive evaluations on benchmark datasets corroborate our framework's enhanced robustness and generalization, demonstrating its superiority over existing state-of-the-art PDA approaches.

Title: Chasing Consistency in Text-to-3D Generation from a Single Image. (arXiv:2309.03599v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03599
Code URL: null
Copy Paste: [[2309.03599]] Chasing Consistency in Text-to-3D Generation from a Single Image(http://arxiv.org/abs/2309.03599)
Summary:
Text-to-3D generation from a single-view image is a popular but challenging task in 3D vision. Although numerous methods have been proposed, existing works still suffer from the inconsistency issues, including 1) semantic inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency, resulting in distorted, overfitted, and over-saturated generations. In light of the above issues, we present Consist3D, a three-stage framework Chasing for semantic-, geometric-, and saturation-Consistent Text-to-3D generation from a single image, in which the first two stages aim to learn parameterized consistency tokens, and the last stage is for optimization. Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness. Meanwhile, the geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency. Finally, the optimization stage benefits from the semantic and geometric tokens, allowing a low classifier-free guidance scale and therefore preventing oversaturation. Experimental results demonstrate that Consist3D produces more consistent, faithful, and photo-realistic 3D assets compared to previous state-of-the-art methods. Furthermore, Consist3D also allows background and object editing through text prompts.

Title: CenTime: Event-Conditional Modelling of Censoring in Survival Analysis. (arXiv:2309.03851v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03851
Code URL: https://github.com/ahmedhshahin/CenTime
Copy Paste: [[2309.03851]] CenTime: Event-Conditional Modelling of Censoring in Survival Analysis(http://arxiv.org/abs/2309.03851)
Summary:
Survival analysis is a valuable tool for estimating the time until specific events, such as death or cancer recurrence, based on baseline observations. This is particularly useful in healthcare to prognostically predict clinically important events based on patient data. However, existing approaches often have limitations; some focus only on ranking patients by survivability, neglecting to estimate the actual event time, while others treat the problem as a classification task, ignoring the inherent time-ordered structure of the events. Furthermore, the effective utilization of censored samples - training data points where the exact event time is unknown - is essential for improving the predictive accuracy of the model. In this paper, we introduce CenTime, a novel approach to survival analysis that directly estimates the time to event. Our method features an innovative event-conditional censoring mechanism that performs robustly even when uncensored data is scarce. We demonstrate that our approach forms a consistent estimator for the event model parameters, even in the absence of uncensored data. Furthermore, CenTime is easily integrated with deep learning models with no restrictions on batch size or the number of uncensored samples. We compare our approach with standard survival analysis methods, including the Cox proportional-hazard model and DeepHit. Our results indicate that CenTime offers state-of-the-art performance in predicting time-to-death while maintaining comparable ranking performance. Our implementation is publicly available at https://github.com/ahmedhshahin/CenTime.

Title: Implicit Design Choices and Their Impact on Emotion Recognition Model Development and Evaluation. (arXiv:2309.03238v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03238
Code URL: null
Copy Paste: [[2309.03238]] Implicit Design Choices and Their Impact on Emotion Recognition Model Development and Evaluation(http://arxiv.org/abs/2309.03238)
Summary:
Emotion recognition is a complex task due to the inherent subjectivity in both the perception and production of emotions. The subjectivity of emotions poses significant challenges in developing accurate and robust computational models. This thesis examines critical facets of emotion recognition, beginning with the collection of diverse datasets that account for psychological factors in emotion production.

To handle the challenge of non-representative training data, this work collects the Multimodal Stressed Emotion dataset, which introduces controlled stressors during data collection to better represent real-world influences on emotion production. To address issues with label subjectivity, this research comprehensively analyzes how data augmentation techniques and annotation schemes impact emotion perception and annotator labels. It further handles natural confounding variables and variations by employing adversarial networks to isolate key factors like stress from learned emotion representations during model training. For tackling concerns about leakage of sensitive demographic variables, this work leverages adversarial learning to strip sensitive demographic information from multimodal encodings. Additionally, it proposes optimized sociological evaluation metrics aligned with cost-effective, real-world needs for model testing.

This research advances robust, practical emotion recognition through multifaceted studies of challenges in datasets, labels, modeling, demographic and membership variable encoding in representations, and evaluation. The groundwork has been laid for cost-effective, generalizable emotion recognition models that are less likely to encode sensitive demographic information.

Title: Community-Based Hierarchical Positive-Unlabeled (PU) Model Fusion for Chronic Disease Prediction. (arXiv:2309.03386v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03386
Code URL: null
Copy Paste: [[2309.03386]] Community-Based Hierarchical Positive-Unlabeled (PU) Model Fusion for Chronic Disease Prediction(http://arxiv.org/abs/2309.03386)
Summary:
Positive-Unlabeled (PU) Learning is a challenge presented by binary classification problems where there is an abundance of unlabeled data along with a small number of positive data instances, and it can be used to address the chronic disease screening problem. State-of-the-art PU learning methods have resulted in the development of various risk estimators, yet they neglect the differences among distinct populations. To address this issue, we present a novel Positive-Unlabeled Learning Tree (PUtree) algorithm. PUtree is designed to take into account communities such as different age or income brackets, in tasks of chronic disease prediction. We propose a novel approach for binary decision-making, which hierarchically builds community-based PU models and then aggregates their deliverables. Our method can explicate each PU model on the tree for the optimized non-leaf PU node splitting. Furthermore, a mask-recovery data augmentation strategy enables sufficient training of the model in individual communities. Additionally, the proposed approach includes an adversarial PU risk estimator to capture hierarchical PU-relationships, and a model fusion network that integrates data from each tree path, resulting in robust binary classification results. We demonstrate the superior performance of PUtree as well as its variants on two benchmarks and a new diabetes-prediction dataset.

Title: Short-Term Load Forecasting Using A Particle-Swarm Optimized Multi-Head Attention-Augmented CNN-LSTM Network. (arXiv:2309.03694v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03694
Code URL: null
Copy Paste: [[2309.03694]] Short-Term Load Forecasting Using A Particle-Swarm Optimized Multi-Head Attention-Augmented CNN-LSTM Network(http://arxiv.org/abs/2309.03694)
Summary:
Short-term load forecasting is of paramount importance in the efficient operation and planning of power systems, given its inherent non-linear and dynamic nature. Recent strides in deep learning have shown promise in addressing this challenge. However, these methods often grapple with hyperparameter sensitivity, opaqueness in interpretability, and high computational overhead for real-time deployment. In this paper, we propose a novel solution that surmounts these obstacles. Our approach harnesses the power of the Particle-Swarm Optimization algorithm to autonomously explore and optimize hyperparameters, a Multi-Head Attention mechanism to discern the salient features crucial for accurate forecasting, and a streamlined framework for computational efficiency. Our method undergoes rigorous evaluation using a genuine electricity demand dataset. The results underscore its superiority in terms of accuracy, robustness, and computational efficiency. Notably, our Mean Absolute Percentage Error of 1.9376 marks a significant advancement over existing state-of-the-art approaches, heralding a new era in short-term load forecasting.
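
For readers unfamiliar with the reported metric, Mean Absolute Percentage Error (MAPE) is the average of |actual - forecast| / |actual|, usually expressed in percent. The sketch below uses that standard definition; the load values are made up for illustration, and the abstract does not state whether its figure of 1.9376 is in percent, so that reading is an assumption.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Uses the standard definition; undefined when any actual value is zero.
    """
    if not actual or len(actual) != len(forecast):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical hourly loads and forecasts in MW (not data from the paper).
load_mw = [100.0, 200.0, 400.0]
pred_mw = [98.0, 204.0, 392.0]

error_pct = mape(load_mw, pred_mw)
print(f"MAPE = {error_pct:.4f}%")   # each forecast here is off by 2% of the actual load
```

A particle-swarm search over hyperparameters would typically call a function like `mape` on a validation set as its fitness value, although the abstract does not say which objective the author optimises.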

Title: Adversarially Robust Deep Learning with Optimal-Transport-Regularized Divergences. (arXiv:2309.03791v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03791
Code URL: null
Copy Paste: [[2309.03791]] Adversarially Robust Deep Learning with Optimal-Transport-Regularized Divergences(http://arxiv.org/abs/2309.03791)
Summary:
We introduce the $ARMOR_D$ methods as novel approaches to enhancing the adversarial robustness of deep learning models. These methods are based on a new class of optimal-transport-regularized divergences, constructed via an infimal convolution between an information divergence and an optimal-transport (OT) cost. We use these as tools to enhance adversarial robustness by maximizing the expected loss over a neighborhood of distributions, a technique known as distributionally robust optimization. Viewed as a tool for constructing adversarial samples, our method allows samples to be both transported, according to the OT cost, and re-weighted, according to the information divergence. We demonstrate the effectiveness of our method on malware detection and image recognition applications and find that, to our knowledge, it outperforms existing methods at enhancing the robustness against adversarial attacks. $ARMOR_D$ yields the robustified accuracy of $98.29\%$ against $FGSM$ and $98.18\%$ against $PGD^{40}$ on the MNIST dataset, reducing the error rate by more than $19.7\%$ and $37.2\%$ respectively compared to prior methods. Similarly, in malware detection, a discrete (binary) data domain, $ARMOR_D$ improves the robustified accuracy under $rFGSM^{50}$ attack compared to the previous best-performing adversarial training methods by $37.0\%$ while lowering false negative and false positive rates by $51.1\%$ and $57.53\%$, respectively.

biometric

steal

Title: Password-Stealing without Hacking: Wi-Fi Enabled Practical Keystroke Eavesdropping. (arXiv:2309.03492v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2309.03492
Code URL: null
Copy Paste: [[2309.03492]] Password-Stealing without Hacking: Wi-Fi Enabled Practical Keystroke Eavesdropping(http://arxiv.org/abs/2309.03492)
Summary:
The contact-free sensing nature of Wi-Fi has been leveraged to achieve privacy breaches, yet existing attacks relying on Wi-Fi CSI (channel state information) demand hacking Wi-Fi hardware to obtain desired CSIs. Since such hacking has proven prohibitively hard due to compact hardware, its feasibility in keeping up with fast-developing Wi-Fi technology becomes very questionable. To this end, we propose WiKI-Eve to eavesdrop keystrokes on smartphones without the need for hacking. WiKI-Eve exploits a new feature, BFI (beamforming feedback information), offered by the latest Wi-Fi hardware: since BFI is transmitted from a smartphone to an AP in clear-text, it can be overheard (hence eavesdropped) by any other Wi-Fi device switching to monitor mode. As existing keystroke inference methods offer very limited generalizability, WiKI-Eve further introduces an adversarial learning scheme so that its inference generalizes to unseen scenarios. We implement WiKI-Eve and conduct extensive evaluation on it; the results demonstrate that WiKI-Eve achieves 88.9% inference accuracy for individual keystrokes and up to 65.8% top-10 accuracy for stealing passwords of mobile applications (e.g., WeChat).

extraction

Title: Source Camera Identification and Detection in Digital Videos through Blind Forensics. (arXiv:2309.03353v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03353
Code URL: null
Copy Paste: [[2309.03353]] Source Camera Identification and Detection in Digital Videos through Blind Forensics(http://arxiv.org/abs/2309.03353)
Summary:
Source camera identification in digital videos is the problem of associating an unknown digital video with its source device, within a closed set of possible devices. The existing techniques in source detection of digital videos try to find a fingerprint of the actual source in the video in the form of PRNU (Photo Response Non-Uniformity), and match it against the SPN (Sensor Pattern Noise) of each possible device. The highest correlation indicates the correct source. We investigate the problem of identifying a video source through a feature-based approach using machine learning. In this paper, we present a blind forensic technique of video source authentication and identification, based on feature extraction, feature selection and subsequent source classification. The main aim is to determine whether a claimed source for a video is actually its original source. If not, we identify its original source. Our experimental results prove the efficiency of the proposed method compared to the traditional fingerprint-based technique.
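
The traditional fingerprint baseline mentioned above (correlate the video's noise fingerprint with each candidate device's pattern and pick the highest) can be sketched as follows. This is not the paper's feature-based method. The device names and six-element patterns are invented for illustration, Pearson correlation is an assumed choice of similarity measure, and real PRNU pipelines first estimate a noise residual from many frames, which is omitted here.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cu = [a - mean_u for a in u]
    cv = [b - mean_v for b in v]
    num = sum(a * b for a, b in zip(cu, cv))
    den = math.sqrt(sum(a * a for a in cu) * sum(b * b for b in cv))
    return num / den if den else 0.0


def best_matching_device(video_fingerprint, device_patterns):
    """Return the device whose pattern correlates most with the fingerprint."""
    scores = {
        name: pearson(video_fingerprint, pattern)
        for name, pattern in device_patterns.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical flattened sensor patterns for a closed set of two devices.
patterns = {
    "camera_a": [0.2, -0.1, 0.4, -0.3, 0.1, 0.0],
    "camera_b": [-0.3, 0.2, -0.1, 0.4, -0.2, 0.1],
}
# Fingerprint extracted from the query video: camera_a's pattern plus small noise.
fingerprint = [0.25, -0.05, 0.35, -0.35, 0.15, -0.05]

best, scores = best_matching_device(fingerprint, patterns)
print(best, scores)
```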

Title: A novel method for iris recognition using BP neural network and parallel computing by the aid of GPUs (Graphics Processing Units). (arXiv:2309.03390v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03390
Code URL: null
Copy Paste: [[2309.03390]] A novel method for iris recognition using BP neural network and parallel computing by the aid of GPUs (Graphics Processing Units)(http://arxiv.org/abs/2309.03390)
Summary:
In this paper, we seek a new method for designing an iris recognition system. In this method, the Haar wavelet features are first extracted from iris images. The advantages of these features are their high-speed extraction and their uniqueness to each iris. Then the back propagation neural network (BPNN) is used as a classifier. In this system, parallel BPNN algorithms are implemented on GPUs with the aid of CUDA in order to speed up the learning process. Finally, the system performance and the speedup relative to running the algorithm serially are presented.
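
As a rough illustration of why Haar wavelet features are fast to extract, the sketch below applies a one-dimensional Haar transform to a single row of pixel intensities using only pairwise averages and differences. It uses the unnormalised averaging variant, and the row values and number of levels are assumptions; the paper's actual feature layout (for example a 2D transform over a normalised iris image) is not described in the abstract.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (approximation) and half-differences (detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_features(signal, levels):
    """Multi-level transform: coarsest approximation first, then details coarse to fine."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details = detail + details      # keep coarser details in front
    return approx + details


# Hypothetical row of eight pixel intensities from an unwrapped iris image.
row = [4, 6, 10, 12, 8, 6, 5, 5]
feats = haar_features(row, levels=2)
print(feats)   # [8.0, 6.0, -3.0, 1.0, -1.0, -1.0, 1.0, 0.0]
```

Each level touches every sample once, so the whole transform is linear in the signal length. A feature vector like `feats` would then be the input to the BPNN classifier.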

Title: ClusterFusion: Leveraging Radar Spatial Features for Radar-Camera 3D Object Detection in Autonomous Vehicles. (arXiv:2309.03734v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03734
Code URL: null
Copy Paste: [[2309.03734]] ClusterFusion: Leveraging Radar Spatial Features for Radar-Camera 3D Object Detection in Autonomous Vehicles(http://arxiv.org/abs/2309.03734)
Summary:
Thanks to the complementary nature of millimeter wave radar and camera, deep learning-based radar-camera 3D object detection methods may reliably produce accurate detections even in low-visibility conditions. This makes them preferable to use in autonomous vehicles' perception systems, especially as the combined cost of both sensors is cheaper than the cost of a lidar. Recent radar-camera methods commonly perform feature-level fusion which often involves projecting the radar points onto the same plane as the image features and fusing the extracted features from both modalities. While performing fusion on the image plane is generally simpler and faster, projecting radar points onto the image plane flattens the depth dimension of the point cloud which might lead to information loss and makes extracting the spatial features of the point cloud harder. We proposed ClusterFusion, an architecture that leverages the local spatial features of the radar point cloud by clustering the point cloud and performing feature extraction directly on the point cloud clusters before projecting the features onto the image plane. ClusterFusion achieved the state-of-the-art performance among all radar-monocular camera methods on the test slice of the nuScenes dataset with 48.7% nuScenes detection score (NDS). We also investigated the performance of different radar feature extraction strategies on point cloud clusters: a handcrafted strategy, a learning-based strategy, and a combination of both, and found that the handcrafted strategy yielded the best performance. The main goal of this work is to explore the use of radar's local spatial and point-wise features by extracting them directly from radar point cloud clusters for a radar-monocular camera 3D object detection method that performs cross-modal feature fusion on the image plane.

Title: Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty. (arXiv:2309.03433v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03433
Code URL: null
Copy Paste: [[2309.03433]] Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty(http://arxiv.org/abs/2309.03433)
Summary:
The Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text, typically in the form of (subject, relation, object) triples. Despite the potential of large language models (LLMs) like ChatGPT as general task solvers, they lag behind state-of-the-art (supervised) methods in OIE tasks due to two key issues. First, LLMs struggle to distinguish irrelevant context from relevant relations and to generate structured output due to the restrictions on fine-tuning the model. Second, LLMs generate responses autoregressively based on probability, which makes the predicted relations lack confidence. In this paper, we assess the capabilities of LLMs in improving the OIE task. In particular, we propose various in-context learning strategies to enhance LLMs' instruction-following ability and a demonstration uncertainty quantification module to enhance the confidence of the generated relations. Our experiments on three OIE benchmark datasets show that our approach holds its own against established supervised methods, both quantitatively and qualitatively.

Title: Introducing "Forecast Utterance" for Conversational Data Science. (arXiv:2309.03877v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03877
Code URL: null
Copy Paste: [[2309.03877]] Introducing "Forecast Utterance" for Conversational Data Science(http://arxiv.org/abs/2309.03877)
Summary:
Envision an intelligent agent capable of assisting users in conducting forecasting tasks through intuitive, natural conversations, without requiring in-depth knowledge of the underlying machine learning (ML) processes. A significant challenge for the agent in this endeavor is to accurately comprehend the user's prediction goals and, consequently, formulate precise ML tasks. In this paper, we take a pioneering step towards this ambitious goal by introducing a new concept called Forecast Utterance and then focus on the automatic and accurate interpretation of users' prediction goals from these utterances. Specifically, we frame the task as a slot-filling problem, where each slot corresponds to a specific aspect of the goal prediction task. We then employ two zero-shot methods for solving the slot-filling task, namely: 1) Entity Extraction (EE), and 2) Question-Answering (QA) techniques. Our experiments, conducted with three meticulously crafted data sets, validate the viability of our ambitious goal and demonstrate the effectiveness of both EE and QA techniques in interpreting Forecast Utterances.

membership infer

federate

Title: Fast FixMatch: Faster Semi-Supervised Learning with Curriculum Batch Size. (arXiv:2309.03469v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03469
Code URL: null
Copy Paste: [[2309.03469]] Fast FixMatch: Faster Semi-Supervised Learning with Curriculum Batch Size(http://arxiv.org/abs/2309.03469)
Summary:
Advances in Semi-Supervised Learning (SSL) have almost entirely closed the gap between SSL and Supervised Learning at a fraction of the number of labels. However, recent performance improvements have often come at the cost of significantly increased training computation. To address this, we propose Curriculum Batch Size (CBS), an unlabeled batch size curriculum which exploits the natural training dynamics of deep neural networks. A small unlabeled batch size is used at the beginning of training and is gradually increased toward the end of training. A fixed curriculum is used regardless of dataset, model or number of epochs, and reduced training computation is demonstrated in all settings. We apply CBS, strong labeled augmentation, and Curriculum Pseudo Labeling (CPL) \citep{FlexMatch} to FixMatch \citep{FixMatch} and term the new SSL algorithm Fast FixMatch. We perform an ablation study to show that strong labeled augmentation and/or CPL do not significantly reduce training computation on their own, but, in synergy with CBS, they achieve optimal performance. Fast FixMatch also achieves substantially higher data utilization compared to the previous state of the art. Fast FixMatch achieves between $2.1\times$ and $3.4\times$ reduced training computation on CIFAR-10 with all but 40, 250 and 4000 labels removed, compared to vanilla FixMatch, while attaining the same cited state-of-the-art error rate \citep{FixMatch}. Similar results are achieved for CIFAR-100, SVHN and STL-10. Finally, Fast FixMatch achieves between $2.6\times$ and $3.3\times$ reduced training computation in federated SSL tasks and online/streaming learning SSL tasks, which further demonstrates the generalizability of Fast FixMatch to different scenarios and tasks.
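
The idea of an unlabeled batch size curriculum can be illustrated with a minimal sketch. The summary only says that the unlabeled batch size starts small and grows over training; the linear ramp, the function name `curriculum_unlabeled_batch_size`, and the default sizes below are assumptions for illustration, not the schedule used in the paper.

```python
def curriculum_unlabeled_batch_size(step: int, total_steps: int,
                                    min_size: int = 64, max_size: int = 448) -> int:
    """Return the unlabeled batch size to use at a given training step.

    Assumption: a linear ramp from min_size (first step) to max_size
    (last step). The paper's actual curriculum may have a different shape.
    """
    if total_steps <= 1:
        return max_size
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return int(round(min_size + frac * (max_size - min_size)))


if __name__ == "__main__":
    # Print the schedule at a few points of a hypothetical 1000-step run.
    for s in (0, 250, 500, 750, 999):
        print(s, curriculum_unlabeled_batch_size(s, 1000))
```

Because early steps process fewer unlabeled samples, the total number of forward and backward passes over unlabeled data is lower than with a fixed large batch, which is where the claimed savings would come from under this reading.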

Title: Sparse Federated Training of Object Detection in the Internet of Vehicles. (arXiv:2309.03569v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03569
Code URL: null
Copy Paste: [[2309.03569]] Sparse Federated Training of Object Detection in the Internet of Vehicles(http://arxiv.org/abs/2309.03569)
Summary:
As an essential component of the Intelligent Transportation System (ITS), the Internet of Vehicles (IoV) plays a vital role in alleviating traffic issues. Object detection is one of the key technologies in the IoV and has been widely used to provide traffic management services by analyzing timely and sensitive vehicle-related information. However, current object detection methods are mostly based on centralized deep training; that is, the sensitive data obtained by edge devices need to be uploaded to the server, which raises privacy concerns. To mitigate such privacy leakage, we first propose a federated learning-based framework, where well-trained local models are shared in the central server. However, since edge devices usually have limited computing power, and IoVs impose a strict low-latency requirement, we further propose a sparse training process on edge devices, which effectively lightens the model and ensures its training efficiency on edge devices, thereby reducing communication overheads. In addition, due to the diverse computing capabilities and dynamic environment, different sparsity rates are applied to edge devices. To further guarantee the performance, we propose FedWeg, an improved aggregation scheme based on FedAvg, which is designed by the inverse ratio of sparsity rates. Experiments on a real-life dataset using YOLO show that the proposed scheme can achieve the required object detection rate while saving considerable communication costs.
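
The summary does not spell out the FedWeg formula, so the following is only one plausible reading of "designed by the inverse ratio of sparsity rates": each client's parameters are weighted in proportion to the inverse of its sparsity rate, so that denser (less pruned) models contribute more. The function names and the requirement that every sparsity rate lies in (0, 1] are assumptions made for this sketch.

```python
from typing import List, Sequence


def inverse_sparsity_weights(sparsity_rates: Sequence[float]) -> List[float]:
    """Normalized aggregation weights proportional to 1 / sparsity rate.

    Assumption: every rate is in (0, 1]; a rate of 0 would need special handling.
    """
    if any(not (0.0 < s <= 1.0) for s in sparsity_rates):
        raise ValueError("sparsity rates must be in (0, 1]")
    inverse = [1.0 / s for s in sparsity_rates]
    total = sum(inverse)
    return [v / total for v in inverse]


def aggregate(client_params: Sequence[Sequence[float]],
              sparsity_rates: Sequence[float]) -> List[float]:
    """Weighted average of flattened client parameter vectors."""
    weights = inverse_sparsity_weights(sparsity_rates)
    dim = len(client_params[0])
    return [sum(w * params[j] for w, params in zip(weights, client_params))
            for j in range(dim)]


if __name__ == "__main__":
    # Two hypothetical clients with sparsity rates 0.5 and 0.25.
    print(aggregate([[3.0, 0.0], [0.0, 3.0]], [0.5, 0.25]))
```

Plain FedAvg would instead weight clients by their local sample counts; the sketch replaces that weighting with the inverse-sparsity weights and leaves everything else unchanged.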

Title: Federated Learning Over Images: Vertical Decompositions and Pre-Trained Backbones Are Difficult to Beat. (arXiv:2309.03237v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03237
Code URL: null
Copy Paste: [[2309.03237]] Federated Learning Over Images: Vertical Decompositions and Pre-Trained Backbones Are Difficult to Beat(http://arxiv.org/abs/2309.03237)
Summary:
We carefully evaluate a number of algorithms for learning in a federated environment, and test their utility for a variety of image classification tasks. We consider many issues that have not been adequately considered before: whether learning over data sets that do not have diverse sets of images affects the results; whether to use a pre-trained feature extraction "backbone"; how to evaluate learner performance (we argue that classification accuracy is not enough), among others. Overall, across a wide variety of settings, we find that vertically decomposing a neural network seems to give the best results, and outperforms more standard reconciliation-based methods.

fair

Title: FLM-101B: An Open LLM and How to Train It with $100K Budget. (arXiv:2309.03852v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03852
Code URL: null
Copy Paste: [[2309.03852]] FLM-101B: An Open LLM and How to Train It with $100K Budget(http://arxiv.org/abs/2309.03852)
Summary:
Large language models (LLMs) have achieved remarkable success in NLP and multimodal tasks. Despite these successes, their development faces two main challenges: (i) high computational cost; and (ii) difficulty in conducting fair and objective evaluations. LLMs are prohibitively expensive, making it feasible for only a few major players to undertake their training, thereby constraining both research and application opportunities. This underscores the importance of cost-effective LLM training. In this paper, we utilize a growth strategy to significantly reduce LLM training cost. We demonstrate that an LLM with 101B parameters and 0.31TB tokens can be trained on a $100K budget. We also adopt a systematic evaluation paradigm for the IQ evaluation of LLMs, complementing existing evaluations that focus more on knowledge-oriented abilities. Our benchmark evaluates important aspects of intelligence, including symbolic mapping, rule understanding, pattern mining, and anti-interference. Such evaluations minimize the potential impact of memorization. Experimental results show that our model FLM-101B, trained with a budget of $100K, achieves comparable performance to powerful and well-known models, e.g., GPT-3 and GLM-130B, especially in the IQ benchmark evaluations with contexts unseen in training data. The checkpoint of FLM-101B will be open-sourced at https://huggingface.co/CofeAI/FLM-101B.

Title: Equal Long-term Benefit Rate: Adapting Static Fairness Notions to Sequential Decision Making. (arXiv:2309.03426v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03426
Code URL: null
Copy Paste: [[2309.03426]] Equal Long-term Benefit Rate: Adapting Static Fairness Notions to Sequential Decision Making(http://arxiv.org/abs/2309.03426)
Summary:
Decisions made by machine learning models may have lasting impacts over time, making long-term fairness a crucial consideration. It has been shown that, when the long-term effect is ignored, naively imposing a fairness criterion in static settings can actually exacerbate bias over time. To explicitly address biases in sequential decision-making, recent works formulate long-term fairness notions in the Markov Decision Process (MDP) framework. They define the long-term bias to be the sum of the static bias over each time step. However, we demonstrate that naively summing up the step-wise bias can cause a false sense of fairness, since it fails to consider the difference in importance of different time steps during transition. In this work, we introduce a long-term fairness notion called Equal Long-term Benefit Rate (ELBERT), which explicitly considers varying temporal importance and adapts static fairness principles to the sequential setting. Moreover, we show that the policy gradient of Long-term Benefit Rate can be analytically reduced to the standard policy gradient. This makes standard policy optimization methods applicable for reducing the bias, leading to our proposed bias mitigation method ELBERT-PO. Experiments on three sequential decision-making environments show that ELBERT-PO significantly reduces bias and maintains high utility. Code is available at https://github.com/Yuancheng-Xu/ELBERT.

Title: Characterizing Lipschitz Stability of GNN for Fairness. (arXiv:2309.03648v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03648
Code URL: null
Copy Paste: [[2309.03648]] Characterizing Lipschitz Stability of GNN for Fairness(http://arxiv.org/abs/2309.03648)
Summary:
The Lipschitz bound, a technique from robust statistics, can limit the maximum changes in the output with respect to the input, taking into account associated irrelevant biased factors. It is an efficient and provable method for examining the output stability of machine learning models without incurring additional computation costs. Recently, Graph Neural Networks (GNNs), which operate on non-Euclidean data, have gained significant attention. However, no previous research has investigated GNN Lipschitz bounds to shed light on stabilizing model outputs, especially when working on non-Euclidean data with inherent biases. Given the inherent biases in common graph data used for GNN training, constraining the GNN output perturbations induced by input biases, and thereby safeguarding fairness during training, poses a serious challenge. Although the Lipschitz constant has been used to control the stability of Euclidean neural networks, the calculation of the precise Lipschitz constant remains elusive for non-Euclidean neural networks like GNNs, especially within fairness contexts. To narrow this gap, we begin with general GNNs operating on an attributed graph, and formulate a Lipschitz bound to limit the changes in the output with respect to biases associated with the input. Additionally, we theoretically analyze how the Lipschitz constant of a GNN model could constrain the output perturbations induced by biases learned from data for fairness training. We experimentally validate the Lipschitz bound's effectiveness in limiting biases of the model output. Finally, from a training dynamics perspective, we demonstrate why the theoretical Lipschitz bound can effectively guide GNN training toward a better trade-off between accuracy and fairness.
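
For reference, the standard definition behind the term "Lipschitz bound" is given below. The symbols $f$, $x$, $x'$ and $L$, and the choice of norm, are generic notation assumed here; the summary does not give the specific bound the paper derives for GNNs.

```latex
% A function f is L-Lipschitz (with respect to a chosen norm) if, for all inputs x and x',
\| f(x) - f(x') \| \le L \, \| x - x' \|
% so a perturbation of size \| x - x' \| in the input (for example, one induced by
% a biased attribute) can change the output by at most L times that amount.
```

A smaller constant $L$ therefore means the output is less sensitive to input changes, which is the property the abstract relies on when it links the bound to output stability and fairness.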

interpretability

Title: A Function Interpretation Benchmark for Evaluating Interpretability Methods. (arXiv:2309.03886v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03886
Code URL: null
Copy Paste: [[2309.03886]] A Function Interpretation Benchmark for Evaluating Interpretability Methods(http://arxiv.org/abs/2309.03886)
Summary:
Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate new and existing methods that use language models (LMs) to produce code-based and language descriptions of function behavior. We find that an off-the-shelf LM augmented with only black-box access to functions can sometimes infer their structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, LM-based descriptions tend to capture global function behavior and miss local corruptions. These results show that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.

explainability

Title: Expert Uncertainty and Severity Aware Chest X-Ray Classification by Multi-Relationship Graph Learning. (arXiv:2309.03331v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03331
Code URL: null
Copy Paste: [[2309.03331]] Expert Uncertainty and Severity Aware Chest X-Ray Classification by Multi-Relationship Graph Learning(http://arxiv.org/abs/2309.03331)
Summary:
Patients undergoing chest X-rays (CXR) often suffer from multiple lung diseases. When evaluating a patient's condition, due to the complex pathologies, subtle texture changes of different lung lesions in images, and patient condition differences, radiologists may make uncertain judgments even when they have undergone long-term clinical training and professional guidance, which introduces considerable noise when extracting disease labels from CXR reports. In this paper, we re-extract disease labels from CXR reports to make them more realistic by considering disease severity and uncertainty in classification. Our contributions are as follows: 1. We re-extracted the disease labels with severity and uncertainty by a rule-based approach with keywords discussed with clinical experts. 2. To further improve the explainability of chest X-ray diagnosis, we designed a multi-relationship graph learning method with an expert uncertainty-aware loss function. 3. Our multi-relationship graph learning method can also interpret the disease classification results. Our experimental results show that models considering disease severity and uncertainty outperform previous state-of-the-art methods.

watermark

Title: T2IW: Joint Text to Image & Watermark Generation. (arXiv:2309.03815v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03815
Code URL: null
Copy Paste: [[2309.03815]] T2IW: Joint Text to Image & Watermark Generation(http://arxiv.org/abs/2309.03815)
Summary:
Recent developments in text-conditioned image generative models have revolutionized the production of realistic results. Unfortunately, this has also led to an increase in privacy violations and the spread of false information, which creates a need for traceability, privacy protection, and other security measures. However, existing text-to-image paradigms lack the technical capabilities to link traceable messages with image generation. In this study, we introduce a novel task for the joint generation of text to image and watermark (T2IW). This T2IW scheme ensures minimal damage to image quality when generating a compound image by forcing the semantic feature and the watermark signal to be compatible in pixels. Additionally, by utilizing principles from Shannon information theory and non-cooperative game theory, we are able to separate the revealed image and the revealed watermark from the compound image. Furthermore, we strengthen the watermark robustness of our approach by subjecting the compound image to various post-processing attacks, with minimal pixel distortion observed in the revealed watermark. Extensive experiments have demonstrated remarkable achievements in image quality, watermark invisibility, and watermark robustness, supported by our proposed set of evaluation metrics.

diffusion

Title: SADIR: Shape-Aware Diffusion Models for 3D Image Reconstruction. (arXiv:2309.03335v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03335
Code URL: null
Copy Paste: [[2309.03335]] SADIR: Shape-Aware Diffusion Models for 3D Image Reconstruction(http://arxiv.org/abs/2309.03335)
Summary:
3D image reconstruction from a limited number of 2D images has been a long-standing challenge in computer vision and image analysis. While deep learning-based approaches have achieved impressive performance in this area, existing deep networks often fail to effectively utilize the shape structures of objects presented in images. As a result, the topology of reconstructed objects may not be well preserved, leading to the presence of artifacts such as discontinuities, holes, or mismatched connections between different parts. In this paper, we propose a shape-aware network based on diffusion models for 3D image reconstruction, named SADIR, to address these issues. In contrast to previous methods that primarily rely on spatial correlations of image intensities for 3D reconstruction, our model leverages shape priors learned from the training data to guide the reconstruction process. To achieve this, we develop a joint learning network that simultaneously learns a mean shape under deformation models. Each reconstructed image is then considered as a deformed variant of the mean shape. We validate our model, SADIR, on both brain and cardiac magnetic resonance images (MRIs). Experimental results show that our method outperforms the baselines with lower reconstruction error and better preservation of the shape structure of objects within the images.

Title: Relay Diffusion: Unifying diffusion process across resolutions for image synthesis. (arXiv:2309.03350v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03350
Code URL: null
Copy Paste: [[2309.03350]] Relay Diffusion: Unifying diffusion process across resolutions for image synthesis(http://arxiv.org/abs/2309.03350)
Summary:
Diffusion models have achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one for diffusion model via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning. RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM and DiT by a large margin. All the codes and checkpoints are open-sourced at https://github.com/THUDM/RelayDiffusion.

Title: Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy. (arXiv:2309.03445v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03445
Code URL: null
Copy Paste: [[2309.03445]] Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy(http://arxiv.org/abs/2309.03445)
Summary:
In this paper, we present an approach to image enhancement with a diffusion model in underwater scenes. Our method adapts conditional denoising diffusion probabilistic models to generate the corresponding enhanced images by using the underwater images and Gaussian noise as the inputs. Additionally, in order to improve the efficiency of the reverse process in the diffusion model, we adopt two approaches. First, we propose a lightweight transformer-based denoising network, which effectively reduces the forward time of the network per iteration. Second, we introduce a skip sampling strategy to reduce the number of iterations. Building on the skip sampling strategy, we propose two different non-uniform sampling methods for the sequence of time steps, namely piecewise sampling and searching with an evolutionary algorithm. Both are effective and can further improve performance over the previous uniform sampling while using the same number of steps. Finally, we conduct a comparative evaluation on widely used underwater enhancement datasets between recent state-of-the-art methods and the proposed approach. The experimental results show that our approach can achieve both competitive performance and high efficiency. Our code is available at https://github.com/piggy2009/DM_underwater.
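
The difference between uniform and piecewise skip sampling can be sketched as two ways of picking a short subsequence of reverse-process time steps. This is an illustrative sketch only: the function names, the split point, and the choice to sample the low-noise range more densely are assumptions, not the schedules used in the paper.

```python
from typing import List


def uniform_timesteps(total_steps: int, num_samples: int) -> List[int]:
    """Evenly spaced time steps over [0, total_steps - 1], in descending order."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    stride = (total_steps - 1) / (num_samples - 1)
    return sorted({round(i * stride) for i in range(num_samples)}, reverse=True)


def piecewise_timesteps(total_steps: int, split: int,
                        n_low: int, n_high: int) -> List[int]:
    """Non-uniform schedule: n_low steps in [0, split - 1], n_high in [split, total_steps - 1].

    Assumption: each segment is sampled uniformly, with its own density.
    """
    if n_low < 2 or n_high < 2:
        raise ValueError("each segment needs at least 2 samples")
    low = [round(i * (split - 1) / (n_low - 1)) for i in range(n_low)]
    high = [split + round(i * (total_steps - 1 - split) / (n_high - 1))
            for i in range(n_high)]
    return sorted(set(low + high), reverse=True)


if __name__ == "__main__":
    # Both schedules use 10 steps out of a hypothetical 1000-step process.
    print(uniform_timesteps(1000, 10))
    print(piecewise_timesteps(1000, split=200, n_low=6, n_high=4))
```

Both calls return the same number of steps, so the cost of the reverse process is identical; only the placement of the steps differs, which is the degree of freedom that a search procedure such as an evolutionary algorithm could tune.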

Title: SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. (arXiv:2309.03453v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03453
Code URL: null
Copy Paste: [[2309.03453]] SyncDreamer: Generating Multiview-consistent Images from a Single-view Image(http://arxiv.org/abs/2309.03453)
Summary:
In this paper, we present a novel diffusion model called SyncDreamer that generates multiview-consistent images from a single-view image. Using pretrained large-scale 2D diffusion models, the recent work Zero123 demonstrates the ability to generate plausible novel views from a single-view image of an object. However, maintaining consistency in geometry and colors for the generated images remains a challenge. To address this issue, we propose a synchronized multiview diffusion model that models the joint probability distribution of multiview images, enabling the generation of multiview-consistent images in a single reverse process. SyncDreamer synchronizes the intermediate states of all the generated images at every step of the reverse process through a 3D-aware feature attention mechanism that correlates the corresponding features across different views. Experiments show that SyncDreamer generates images with high consistency across different views, thus making it well-suited for various 3D generation tasks such as novel-view-synthesis, text-to-3D, and image-to-3D.

Title: Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation. (arXiv:2309.03549v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03549
Code URL: null
Copy Paste: [[2309.03549]] Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation(http://arxiv.org/abs/2309.03549)
Summary:
Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDMs for text-to-video generation, which is a formidable challenge due to the computational and memory constraints during both model training and inference. A single LDM is usually only capable of generating a very limited number of video frames. Some existing works focus on separate prediction models for generating more video frames; these, however, suffer from additional training cost and frame-level jittering. In this paper, we propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process. In addition, for the autoencoder used for translation between pixel space and latent space, we inject temporal layers into its decoder and fine-tune these layers for higher temporal consistency. We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets, including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves good results in both quantitative and qualitative evaluations. Our project page is available at https://anonymous0x233.github.io/ReuseAndDiffuse/.

Title: Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model. (arXiv:2309.03550v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03550
Code URL: null
Copy Paste: [[2309.03550]] Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model(http://arxiv.org/abs/2309.03550)
Summary:
Recent advances in diffusion models such as ControlNet have enabled geometrically controllable, high-fidelity text-to-image generation. However, none of them addresses the question of adding such controllability to text-to-3D generation. In response, we propose Text2Control3D, a controllable text-to-3D avatar generation method whose facial expression is controllable given a monocular video casually captured with a hand-held camera. Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF) optimized with a set of controlled viewpoint-aware images that we generate from ControlNet, whose condition input is the depth map extracted from the input video. When generating the viewpoint-aware images, we utilize cross-reference attention to inject well-controlled, referential facial expression and appearance via cross attention. We also conduct low-pass filtering of the Gaussian latent of the diffusion model in order to ameliorate the viewpoint-agnostic texture problem we observed in our empirical analysis, where the viewpoint-aware images contain identical textures on identical pixel positions that are incomprehensible in 3D. Finally, to train NeRF with images that are viewpoint-aware yet not strictly consistent in geometry, our approach considers per-image geometric variation as a view of deformation from a shared 3D canonical space. Consequently, we construct the 3D avatar in a canonical space of deformable NeRF by learning a set of per-image deformations via a deformation field table. We demonstrate the empirical results and discuss the effectiveness of our method.

Title: Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption. (arXiv:2309.03729v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03729
Code URL: null
Copy Paste: [[2309.03729]] Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption(http://arxiv.org/abs/2309.03729)
Summary:
Training a generative model with a limited number of samples is a challenging task. Current methods primarily rely on few-shot model adaption to train the network. However, in scenarios where data is extremely limited (fewer than 10 samples), the generative network tends to overfit and suffers from content degradation. To address these problems, we propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss, which targets different learning objectives at distinct training stages of the diffusion model. Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when t is large, and learn local details of the target domain when t is small, leading to an improvement in the capture of content, style and local details. Furthermore, we introduce a novel directional distribution consistency loss that ensures the consistency between the generated and source distributions more efficiently and stably than prior methods, preventing our model from overfitting. Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation. Theoretical analysis and qualitative and quantitative experiments demonstrate the superiority of our approach in few-shot generative model adaption tasks compared to state-of-the-art methods. The source code is available at: https://github.com/sjtuplayer/few-shot-diffusion.

Title: Text-to-feature diffusion for audio-visual few-shot learning. (arXiv:2309.03869v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03869
Code URL: null
Copy Paste: [[2309.03869]] Text-to-feature diffusion for audio-visual few-shot learning(http://arxiv.org/abs/2309.03869)
Summary:
Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e. the VGGSound-FSL, UCF-FSL, ActivityNet-FSL datasets, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.

Title: DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection. (arXiv:2309.03893v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03893
Code URL: null
Copy Paste: [[2309.03893]] DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection(http://arxiv.org/abs/2309.03893)
Summary:
Data is the cornerstone of deep learning. This paper reveals that the recently developed Diffusion Model is a scalable data engine for object detection. Existing methods for scaling up detection-oriented data often require manual collection or generative models to obtain target images, followed by data augmentation and labeling to produce training pairs, which are costly, complex, or lacking diversity. To address these issues, we present DiffusionEngine (DE), a data scaling-up engine that provides high-quality detection-oriented training pairs in a single stage. DE consists of a pre-trained diffusion model and an effective Detection-Adapter, which together generate scalable, diverse and generalizable detection data in a plug-and-play manner. Detection-Adapter is learned to align the implicit semantic and location knowledge in off-the-shelf diffusion models with detection-aware signals to make better bounding-box predictions. Additionally, we contribute two datasets, i.e., COCO-DE and VOC-DE, to scale up existing detection benchmarks for facilitating follow-up research. Extensive experiments demonstrate that data scaling-up via DE can achieve significant improvements in diverse scenarios, such as various detection algorithms, self-supervised pre-training, data-sparse, label-scarce, cross-domain, and semi-supervised learning. For example, when using DE with a DINO-based adapter to scale up data, mAP is improved by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart.

Title: InstructDiffusion: A Generalist Modeling Interface for Vision Tasks. (arXiv:2309.03895v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03895
Code URL: null
Copy Paste: [[2309.03895]] InstructDiffusion: A Generalist Modeling Interface for Vision Tasks(http://arxiv.org/abs/2309.03895)
Summary:
We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built upon the diffusion process and is trained to predict pixels according to user instructions, such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion could handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement). It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in the field of computer vision.

noise learning

data-free

transformer

Title: Self-Supervised Masked Digital Elevation Models Encoding for Low-Resource Downstream Tasks. (arXiv:2309.03367v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03367
Code URL: null
Copy Paste: [[2309.03367]] Self-Supervised Masked Digital Elevation Models Encoding for Low-Resource Downstream Tasks(http://arxiv.org/abs/2309.03367)
Summary:
The lack of quality labeled data is one of the main bottlenecks for training Deep Learning models. As the task increases in complexity, there is a higher penalty for overfitting and unstable learning. The typical paradigm employed today is Self-Supervised learning, where the model attempts to learn from a large corpus of unstructured and unlabeled data and then transfer that knowledge to the required task. Some notable examples of self-supervision in other modalities are BERT for Large Language Models, Wav2Vec for Speech Recognition, and the Masked AutoEncoder for Vision, which all utilize Transformers to solve a masked prediction task. GeoAI is uniquely poised to take advantage of the self-supervised methodology due to the decades of data collected, little of which is precisely and dependably annotated. Our goal is to extract building and road segmentations from Digital Elevation Models (DEM) that provide a detailed topography of the Earth's surface. The proposed architecture is the Masked Autoencoder pre-trained on ImageNet (with the limitation that there is a large domain discrepancy between ImageNet and DEM) with an UperNet Head for decoding segmentations. We tested this model with 450 and 50 training images only, utilizing roughly 5% and 0.5% of the original data respectively. On the building segmentation task, this model obtains an 82.1% Intersection over Union (IoU) with 450 images and 69.1% IoU with only 50 images. On the more challenging road detection task the model obtains an 82.7% IoU with 450 images and 73.2% IoU with only 50 images. Any hand-labeled dataset made today about the Earth's surface will be immediately obsolete due to the constantly changing nature of the landscape. This motivates the clear necessity for data-efficient learners that can be used for a wide variety of downstream tasks.
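
For reference, the IoU figures quoted in this summary measure the overlap between a predicted and a ground-truth segmentation mask. The following is a minimal sketch, assuming binary masks stored as nested lists of 0/1; the function name `iou` and the toy masks are illustrative and do not come from the paper.

```python
def iou(pred, target):
    """Intersection over Union for two binary masks of equal shape."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(iou(pred, target))  # 2 shared pixels / 4 pixels in the union = 0.5
```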

Title: Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation. (arXiv:2309.03467v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03467
Code URL: null
Copy Paste: [[2309.03467]] Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation(http://arxiv.org/abs/2309.03467)
Summary:
A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been an increasing interest in synthesising 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, for providing immersive experiences in various scenarios such as virtual reality. Yet, existing methods typically fall short in synthesizing intricate visual details or in ensuring that the generated images align consistently with user-provided prompts. In this study, an autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by out-painting an incomplete 360-degree image progressively with NFoV and text guidances jointly or individually. This autoregressive scheme not only allows for deriving finer-grained and text-consistent patterns by dynamically generating and adjusting the process but also offers users greater flexibility to edit their conditions throughout the generation process. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance in each autoregressive step. Text guidances, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention based transformers into a global stream and a local stream, which feed a conditioned generative backbone model. As AOG-Net can leverage large-scale models for the conditional encoder and the generative prior, it enables the generation to use extensive open-vocabulary text guidances. Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method. Our code will be made publicly available.

Title: DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions. (arXiv:2309.03576v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03576
Code URL: null
Copy Paste: [[2309.03576]] DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions(http://arxiv.org/abs/2309.03576)
Summary:
As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs is becoming evident. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings and then the model classifies the actual position for each non-overlapping patch among all possible positions solely based on their visual appearance. To avoid trivial solutions, we increase the difficulty of this task by keeping only a subset of patches visible. Additionally, considering there may be different patches with similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem, since it is not necessary to reconstruct their exact positions in these cases. Empirical evaluations of DropPos show strong capabilities. DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at https://github.com/Haochen-Wang409/DropPos.

Title: Interpretable Visual Question Answering via Reasoning Supervision. (arXiv:2309.03726v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03726
Code URL: null
Copy Paste: [[2309.03726]] Interpretable Visual Question Answering via Reasoning Supervision(http://arxiv.org/abs/2309.03726)
Summary:
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer, with such annotations being already available on large-scale Visual Common Sense Reasoning (VCR) datasets. The model's visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the question and the correct reasoning. We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to performance increase, without requiring training on explicit grounding annotations.

Title: ProPainter: Improving Propagation and Transformer for Video Inpainting. (arXiv:2309.03897v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03897
Code URL: null
Copy Paste: [[2309.03897]] ProPainter: Improving Propagation and Transformer for Video Inpainting(http://arxiv.org/abs/2309.03897)
Summary:
Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency.
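
The 1.46 dB margin above is reported in PSNR, which is computed from the mean squared error between a reference frame and the restored frame. The sketch below uses the standard definition, assuming 8-bit pixel values (peak 255) and frames given as nested lists; the helper name `psnr` and the toy frames are illustrative only.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    squared_error = 0.0
    count = 0
    for ref_row, out_row in zip(reference, restored):
        for r, o in zip(ref_row, out_row):
            squared_error += (r - o) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[100, 100]]
restored = [[90, 110]]
print(round(psnr(reference, restored), 2))  # MSE = 100, about 28.13 dB
```

Because PSNR is logarithmic, a fixed gain in dB corresponds to a fixed multiplicative reduction in mean squared error, whatever the starting quality.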

Title: The Making and Breaking of Camouflage. (arXiv:2309.03899v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03899
Code URL: null
Copy Paste: [[2309.03899]] The Making and Breaking of Camouflage(http://arxiv.org/abs/2309.03899)
Summary:
Not all camouflages are equally effective, as even a partially visible contour or a slight color difference can make the animal stand out and break its camouflage. In this paper, we address the question of what makes a camouflage successful, by proposing three scores for automatically assessing its effectiveness. In particular, we show that camouflage can be measured by the similarity between background and foreground features and boundary visibility. We use these camouflage scores to assess and compare all available camouflage datasets. We also incorporate the proposed camouflage score into a generative model as an auxiliary loss and show that effective camouflage images or videos can be synthesised in a scalable manner. The generated synthetic dataset is used to train a transformer-based model for segmenting camouflaged animals in videos. Experimentally, we demonstrate state-of-the-art camouflage breaking performance on the public MoCA-Mask benchmark.

Title: Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation. (arXiv:2309.03340v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03340
Code URL: null
Copy Paste: [[2309.03340]] Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation(http://arxiv.org/abs/2309.03340)
Summary:
There has been significant research on developing pretrained transformer architectures for multimodal-to-text generation tasks. Despite performance improvements, such models are frequently overparameterized and hence suffer from hallucination and a large memory footprint, making them challenging to deploy on edge devices. In this paper, we address both these issues for the application of automated audio captioning. First, we propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination. Then, we propose a parameter-efficient, inference-time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data. During the beam decoding step, the smaller model utilizes an audio-text shared latent representation to semantically align the generated text with the corresponding input audio. Faithful guidance is introduced into the beam probability by incorporating the cosine similarity between latent representation projections of greedily rolled-out intermediate beams and the audio clip. We show the efficacy of our algorithm on benchmark datasets and evaluate the proposed scheme against baselines using conventional audio captioning and semantic similarity metrics while illustrating tradeoffs between performance and complexity.
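
The guidance step described above can be pictured as re-ranking beams with a similarity bonus. This is a rough sketch only: the additive combination, the `weight` parameter, the helper names and the two-dimensional toy embeddings are all assumptions made for illustration, since the summary does not specify how the cosine similarity enters the beam probability.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rescore(beams, audio_embedding, weight=1.0):
    """Rank beams by log-probability plus a weighted audio-text similarity.

    Each beam is (text, log_prob, text_embedding). The additive form and
    the weight are assumptions for this sketch, not the paper's formula.
    """
    scored = []
    for text, log_prob, text_embedding in beams:
        guided = log_prob + weight * cosine(text_embedding, audio_embedding)
        scored.append((guided, text))
    return sorted(scored, reverse=True)


# Toy shared-latent-space embeddings (hypothetical values).
audio_embedding = [1.0, 0.0]
beams = [
    ("a dog barks", -1.2, [1.0, 0.0]),   # less likely, matches the audio
    ("a car passes", -1.0, [0.0, 1.0]),  # more likely, unrelated to the audio
]
ranked = rescore(beams, audio_embedding)
print(ranked[0][1])  # the audio-consistent caption is ranked first
```

With `weight=0.0` the ranking falls back to plain beam log-probabilities, which makes the effect of the guidance term easy to isolate.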

Title: Exploring an LM to generate Prolog Predicates from Mathematics Questions. (arXiv:2309.03667v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03667
Code URL: null
Copy Paste: [[2309.03667]] Exploring an LM to generate Prolog Predicates from Mathematics Questions(http://arxiv.org/abs/2309.03667)
Summary:
Recently, there has been a surge in interest in NLP driven by ChatGPT. ChatGPT, a transformer-based generative language model of substantial scale, exhibits versatility in performing various tasks based on natural language. Nevertheless, large language models often exhibit poor performance in solving mathematics questions that require reasoning. Prior research has demonstrated the effectiveness of chain-of-thought prompting in enhancing reasoning capabilities. Now, we aim to investigate whether fine-tuning a model for the generation of Prolog codes, a logic language, and subsequently passing these codes to a compiler can further improve accuracy. Consequently, we employ chain-of-thought to fine-tune LLaMA7B as a baseline model and develop other fine-tuned LLaMA7B models for the generation of Prolog code, Prolog code + chain-of-thought, and chain-of-thought + Prolog code, respectively. The results reveal that the Prolog generation model surpasses the baseline in performance, while the combination generation models do not yield significant improvements. The Prolog corpus based on GSM8K and the correspondingly finetuned Prolog generation model based on LLaMA7B are released to the research community.

Title: Insights Into the Inner Workings of Transformer Models for Protein Function Prediction. (arXiv:2309.03631v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03631
Code URL: null
Copy Paste: [[2309.03631]] Insights Into the Inner Workings of Transformer Models for Protein Function Prediction(http://arxiv.org/abs/2309.03631)
Summary:
Motivation: We explored how explainable AI (XAI) can help to shed light into the inner workings of neural networks for protein function prediction, by extending the widely used XAI method of integrated gradients such that latent representations inside of transformer models, which were finetuned to Gene Ontology term and Enzyme Commission number prediction, can be inspected too. Results: The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside of the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g., transmembrane regions, active sites) across many proteins. Availability and Implementation: Source code can be accessed at https://github.com/markuswenzel/xai-proteins .

Title: Training Acceleration of Low-Rank Decomposed Networks using Sequential Freezing and Rank Quantization. (arXiv:2309.03824v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03824
Code URL: null
Copy Paste: [[2309.03824]] Training Acceleration of Low-Rank Decomposed Networks using Sequential Freezing and Rank Quantization(http://arxiv.org/abs/2309.03824)
Summary:
Low Rank Decomposition (LRD) is a model compression technique applied to the weight tensors of deep learning models in order to reduce the number of trainable parameters and computational complexity. However, due to the high number of new layers added to the architecture after applying LRD, it may not lead to a high training/inference acceleration if the decomposition ranks are not small enough. The issue is that using small ranks increases the risk of a significant accuracy drop after decomposition. In this paper, we propose two techniques for accelerating low rank decomposed models without requiring small ranks for decomposition. These methods include rank optimization and sequential freezing of decomposed layers. We perform experiments on both convolutional and transformer-based models. Experiments show that, when combined, these techniques can improve the model throughput by up to 60% during training and 37% during inference while preserving accuracy close to that of the original models.
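
To make the rank trade-off concrete, consider the common two-factor decomposition of a fully connected layer, where an m x n weight matrix is replaced by an m x r and an r x n factor. This is a minimal sketch under that assumption; the paper may use a different decomposition, and the helper names and layer sizes are illustrative.

```python
def dense_params(m, n):
    """Parameter count of a dense m x n weight matrix."""
    return m * n


def low_rank_params(m, n, rank):
    """Parameter count after factoring into (m x rank) and (rank x n)."""
    return rank * (m + n)


def break_even_rank(m, n):
    """Rank below which the factored layer has fewer parameters."""
    return (m * n) / (m + n)


m, n = 1024, 1024
for rank in (64, 256, 512, 768):
    ratio = low_rank_params(m, n, rank) / dense_params(m, n)
    print(rank, ratio)  # ratio < 1 means the factored layer is smaller
```

For a square 1024 x 1024 layer the break-even rank is 512, so any larger rank adds parameters and an extra layer instead of removing work.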

generative

Title: Perceptual Quality Assessment of 360$^\circ$ Images Based on Generative Scanpath Representation. (arXiv:2309.03472v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03472
Code URL: null
Copy Paste: [[2309.03472]] Perceptual Quality Assessment of 360$^\circ$ Images Based on Generative Scanpath Representation(http://arxiv.org/abs/2309.03472)
Summary:
Despite substantial efforts dedicated to the design of heuristic models for omnidirectional (i.e., 360$^\circ$) image quality assessment (OIQA), a conspicuous gap remains due to the lack of consideration for the diversity of viewing behaviors that leads to the varying perceptual quality of 360$^\circ$ images. Two critical aspects underline this oversight: the neglect of viewing conditions that significantly sway user gaze patterns and the overreliance on a single viewport sequence from the 360$^\circ$ image for quality inference. To address these issues, we introduce a unique generative scanpath representation (GSR) for effective quality inference of 360$^\circ$ images, which aggregates varied perceptual experiences of multi-hypothesis users under a predefined viewing condition. More specifically, given a viewing condition characterized by the starting point of viewing and exploration time, a set of scanpaths consisting of dynamic visual fixations can be produced using an apt scanpath generator. Following this vein, we use the scanpaths to convert the 360$^\circ$ image into the unique GSR, which provides a global overview of gazed-focused contents derived from scanpaths. As such, the quality inference of the 360$^\circ$ image is swiftly transformed to that of GSR. We then propose an efficient OIQA computational framework by learning the quality maps of GSR. Comprehensive experimental results validate that the predictions of the proposed framework are highly consistent with human perception in the spatiotemporal domain, especially in the challenging context of locally distorted 360$^\circ$ images under varied viewing conditions. The code will be released at https://github.com/xiangjieSui/GSR

Title: AnthroNet: Conditional Generation of Humans via Anthropometrics. (arXiv:2309.03812v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03812
Code URL: https://github.com/Unity-Technologies/AnthroNet
Copy Paste: [[2309.03812]] AnthroNet: Conditional Generation of Humans via Anthropometrics(http://arxiv.org/abs/2309.03812)
Summary:
We present a novel human body model formulated by an extensive set of anthropocentric measurements, which is capable of generating a wide range of human body shapes and poses. The proposed model enables direct modeling of specific human identities through a deep generative architecture, which can produce humans in any arbitrary pose. It is the first of its kind to have been trained end-to-end using only synthetically generated data, which not only provides highly accurate human mesh representations but also allows for precise anthropometry of the body. Moreover, using a highly diverse animation library, we articulated our synthetic humans' body and hands to maximize the diversity of the learnable priors for model training. Our model was trained on a dataset of $100k$ procedurally-generated posed human meshes and their corresponding anthropometric measurements. Our synthetic data generator can be used to generate millions of unique human identities and poses for non-commercial academic research purposes.

Title: Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis. (arXiv:2309.03904v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03904
Code URL: null
Copy Paste: [[2309.03904]] Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis(http://arxiv.org/abs/2309.03904)
Summary:
Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling from grace on the task of text-conditioned image synthesis. Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited computational resources. Inspired by such a philosophy, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to help select the most suitable expert for each feature point. To faithfully decode the sampling stochasticity and the text condition to the final synthesis, our router adaptively makes its decision by taking into account the text-integrated global latent code. At 64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves 6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate the community for further development.

large language model

Title: GPT Can Solve Mathematical Problems Without a Calculator. (arXiv:2309.03241v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03241
Code URL: null
Copy Paste: [[2309.03241]] GPT Can Solve Mathematical Problems Without a Calculator(http://arxiv.org/abs/2309.03241)
Summary:
Previous studies have typically assumed that large language models are unable to accurately perform arithmetic operations, particularly multiplication of >8 digits, and operations involving decimals and fractions, without the use of calculator tools. This paper aims to challenge this misconception. With sufficient training data, a 2 billion-parameter language model can accurately perform multi-digit arithmetic operations with almost 100% accuracy without data leakage, significantly surpassing GPT-4 (whose multi-digit multiplication accuracy is only 4.3%). We also demonstrate that our MathGLM, fine-tuned from GLM-10B on a dataset with additional multi-step arithmetic operations and math problems described in text, achieves similar performance to GPT-4 on a 5,000-sample Chinese math problem test set.

Title: Large Language Models as Optimizers. (arXiv:2309.03409v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.03409
Code URL: null
Copy Paste: [[2309.03409]] Large Language Models as Optimizers(http://arxiv.org/abs/2309.03409)
Summary:
Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks.
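
The optimization step described above (build a prompt from scored past solutions, ask for new ones, evaluate, repeat) can be sketched as a short loop. Everything concrete here is an assumption for illustration: `stub_llm` is a hypothetical stand-in that perturbs the best known solution instead of calling a language model, the objective is a toy function, and the prompt wording and ordering are invented.

```python
import random


def score(x):
    # Toy objective standing in for task accuracy; maximised at x = 3.
    return -(x - 3.0) ** 2


def build_meta_prompt(history, top_k=5):
    # Keep the best-scoring solutions so far, worst first (ordering assumed).
    best = sorted(history, key=lambda pair: pair[1])[-top_k:]
    lines = [f"solution: {x:.3f}, score: {s:.3f}" for x, s in best]
    return ("Previous solutions and scores:\n" + "\n".join(lines)
            + "\nPropose a better solution.")


def stub_llm(prompt, history, rng, n_candidates=4):
    # HYPOTHETICAL stand-in for an LLM call. A real system would send
    # `prompt` to a model and parse candidate solutions from its reply.
    best_x, _ = max(history, key=lambda pair: pair[1])
    return [best_x + rng.uniform(-1.0, 1.0) for _ in range(n_candidates)]


def opro_loop(steps=20, seed=0):
    rng = random.Random(seed)
    history = [(0.0, score(0.0))]  # initial solution and its value
    for _ in range(steps):
        prompt = build_meta_prompt(history)
        for x in stub_llm(prompt, history, rng):
            history.append((x, score(x)))  # evaluate and feed back
    return max(history, key=lambda pair: pair[1])


best_x, best_score = opro_loop()
print(best_x, best_score)
```

The loop never uses gradients of `score`; it only needs values, which is the property the summary highlights for derivative-free settings.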

Title: From Base to Conversational: Japanese Instruction Dataset and Tuning Large Language Models. (arXiv:2309.03412v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03412
Code URL: null
Copy Paste: [[2309.03412]] From Base to Conversational: Japanese Instruction Dataset and Tuning Large Language Models(http://arxiv.org/abs/2309.03412)
Summary:
Instruction tuning is essential for large language models (LLMs) to become interactive. While many instruction tuning datasets exist in English, there is a noticeable lack in other languages. Also, their effectiveness has not been well verified in non-English languages. We construct a Japanese instruction dataset by expanding and filtering existing datasets and apply the dataset to a Japanese pre-trained base model. We performed Low-Rank Adaptation (LoRA) tuning on both Japanese and English existing models using our instruction dataset. We evaluated these models from both quantitative and qualitative perspectives. As a result, the effectiveness of Japanese instruction datasets is confirmed. The results also indicate that even with relatively small LLMs, performances in downstream tasks would be improved through instruction tuning. Our instruction dataset, tuned models, and implementation are publicly available online.

Title: XGen-7B Technical Report. (arXiv:2309.03450v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03450
Code URL: null
Copy Paste: [[2309.03450]] XGen-7B Technical Report(http://arxiv.org/abs/2309.03450)
Summary:
Large Language Models (LLMs) have become ubiquitous across various domains, transforming the way we interact with information and conduct research. However, most high-performing LLMs remain confined behind proprietary walls, hindering scientific progress. Most open-source LLMs, on the other hand, are limited in their ability to support longer sequence lengths, which is a key requirement for many tasks that require inference over an input context. To address this, we have trained XGen, a series of 7B parameter models on up to 8K sequence length for up to 1.5T tokens. We have also finetuned the XGen models on public-domain instructional data, creating their instruction-tuned counterparts (XGen-Inst). We open-source our models for both research advancements and commercial applications. Our evaluation on standard benchmarks shows that XGen models achieve comparable or better results when compared with state-of-the-art open-source LLMs. Our targeted evaluation on long sequence modeling tasks shows the benefits of our 8K-sequence models over 2K-sequence open-source LLMs.

Title: Evaluating the Efficacy of Supervised Learning vs Large Language Models for Identifying Cognitive Distortions and Suicidal Risks in Chinese Social Media. (arXiv:2309.03564v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03564
Code URL: null
Copy Paste: [[2309.03564]] Evaluating the Efficacy of Supervised Learning vs Large Language Models for Identifying Cognitive Distortions and Suicidal Risks in Chinese Social Media(http://arxiv.org/abs/2309.03564)
Summary:
Large language models, particularly those akin to the rapidly progressing GPT series, are gaining traction for their expansive influence. While there is keen interest in their applicability within medical domains such as psychology, tangible explorations on real-world data remain scant. Concurrently, users on social media platforms are increasingly vocalizing personal sentiments; under specific thematic umbrellas, these sentiments often manifest as negative emotions, sometimes escalating to suicidal inclinations. Timely discernment of such cognitive distortions and suicidal risks is crucial to effectively intervene and potentially avert dire circumstances. Our study ventured into this realm by experimenting on two pivotal tasks: suicidal risk and cognitive distortion identification on Chinese social media platforms. Using supervised learning as a baseline, we examined and contrasted the efficacy of large language models via three distinct strategies: zero-shot, few-shot, and fine-tuning. Our findings revealed a discernible performance gap between the large language models and traditional supervised learning approaches, primarily attributed to the models' inability to fully grasp subtle categories. Notably, while GPT-4 outperforms its counterparts in multiple scenarios, GPT-3.5 shows significant enhancement in suicide risk classification after fine-tuning. To our knowledge, this investigation stands as the maiden attempt at gauging large language models on Chinese social media tasks. This study underscores the forward-looking and transformative implications of using large language models in the field of psychology. It lays the groundwork for future applications in psychological research and practice.

Title: Enhancing Pipeline-Based Conversational Agents with Large Language Models. (arXiv:2309.03748v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03748
Code URL: null
Copy Paste: [[2309.03748]] Enhancing Pipeline-Based Conversational Agents with Large Language Models(http://arxiv.org/abs/2309.03748)
Summary:
The latest advancements in AI and deep learning have led to a breakthrough in large language model (LLM)-based agents such as GPT-4. However, many commercial conversational agent development tools are pipeline-based and have limitations in holding a human-like conversation. This paper investigates the capabilities of LLMs to enhance pipeline-based conversational agents during two phases: 1) in the design and development phase and 2) during operations. In 1) LLMs can aid in generating training data, extracting entities and synonyms, localization, and persona design. In 2) LLMs can assist in contextualization, intent classification to prevent conversational breakdown and handle out-of-scope questions, auto-correcting utterances, rephrasing responses, formulating disambiguation questions, summarization, and enabling closed question-answering capabilities. We conducted informal experiments with GPT-4 in the private banking domain to demonstrate the scenarios above with a practical example. Companies may be hesitant to replace their pipeline-based agents with LLMs entirely due to privacy concerns and the need for deep integration within their existing ecosystems. A hybrid approach in which LLMs are integrated into the pipeline-based agents allows them to save time and costs of building and running agents by capitalizing on the capabilities of LLMs while retaining the integration and privacy safeguards of their existing systems.

Title: USA: Universal Sentiment Analysis Model & Construction of Japanese Sentiment Text Classification and Part of Speech Dataset. (arXiv:2309.03787v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03787
Code URL: null
Copy Paste: [[2309.03787]] USA: Universal Sentiment Analysis Model & Construction of Japanese Sentiment Text Classification and Part of Speech Dataset(http://arxiv.org/abs/2309.03787)
Summary:
Sentiment analysis is a pivotal task in the domain of natural language processing. It encompasses both text-level sentiment polarity classification and word-level Part of Speech (POS) sentiment polarity determination. Such analysis challenges models to understand text holistically while also extracting nuanced information. With the rise of Large Language Models (LLMs), new avenues for sentiment analysis have opened. This paper proposes enhancing performance by leveraging the Mutual Reinforcement Effect (MRE) between individual words and the overall text. It delves into how word polarity influences the overarching sentiment of a passage. To support our research, we annotated four novel Sentiment Text Classification and Part of Speech (SCPOS) datasets, building upon existing sentiment classification datasets. Furthermore, we developed a Universal Sentiment Analysis (USA) model, with a 7-billion parameter size. Experimental results revealed that our model surpassed the performance of gpt-3.5-turbo across all four datasets, underscoring the significance of MRE in sentiment analysis.

Title: OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs. (arXiv:2309.03876v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03876
Code URL: null
Copy Paste: [[2309.03876]] OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs(http://arxiv.org/abs/2309.03876)
Summary:
Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable ability to generate fitting responses to natural language instructions. However, an open research question concerns the inherent biases of trained models and their responses. For instance, if the data used to tune an LLM is dominantly written by persons with a specific political bias, we might expect generated answers to share this bias. Current research work seeks to de-bias such models, or suppress potentially biased answers. With this demonstration, we take a different view on biases in instruction-tuning: Rather than aiming to suppress them, we aim to make them explicit and transparent. To this end, we present OpinionGPT, a web demo in which users can ask questions and select all biases they wish to investigate. The demo will answer this question using a model fine-tuned on text representing each of the selected biases, allowing side-by-side comparison. To train the underlying model, we identified 11 different biases (political, geographic, gender, age) and derived an instruction-tuning corpus in which each answer was written by members of one of these demographics. This paper presents OpinionGPT, illustrates how we trained the bias-aware model and showcases the web application (available at https://opiniongpt.informatik.hu-berlin.de).

Title: On Large Language Models' Selection Bias in Multi-Choice Questions. (arXiv:2309.03882v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03882
Code URL: null
Copy Paste: [[2309.03882]] On Large Language Models' Selection Bias in Multi-Choice Questions(http://arxiv.org/abs/2309.03882)
Summary:
Multi-choice questions (MCQs) serve as a common yet important task format in the research of large language models (LLMs). Our work shows that LLMs exhibit an inherent "selection bias" in MCQs, which refers to LLMs' preference for selecting options located at specific positions (like "Option C"). This bias is prevalent across various LLMs, making their performance vulnerable to option position changes in MCQs. We identify that one primary cause of selection bias is option numbering, i.e., the ID symbols A/B/C/D associated with the options. To mitigate selection bias, we propose a new method called PriDe. PriDe first decomposes the observed model prediction distribution into an intrinsic prediction over option contents and a prior distribution over option IDs. It then estimates the prior by permuting option contents on a small number of test samples, and uses this prior to debias the subsequent test samples. We demonstrate that, as a label-free, inference-time method, PriDe achieves more effective and computation-efficient debiasing than strong baselines. We further show that the priors estimated by PriDe generalize well across different domains, highlighting its practical potential in broader scenarios.
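
The decomposition described in this summary can be illustrated with a small sketch. It assumes that the observed distribution over option IDs factorizes as the product of an ID prior and a content-dependent term, and that the prior is estimated from all cyclic permutations of the option contents. Both are simplifying assumptions made here for illustration; the function names `estimate_prior` and `debias` and the synthetic numbers are hypothetical and not taken from the paper.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def estimate_prior(observed_per_shift):
    """Estimate the ID prior from observed distributions, one per cyclic shift.

    Across all cyclic shifts every ID sees every content exactly once, so the
    mean log-probability per ID differs between IDs only through the prior.
    """
    n_ids = len(observed_per_shift[0])
    n_shifts = len(observed_per_shift)
    mean_log = [
        sum(math.log(dist[i]) for dist in observed_per_shift) / n_shifts
        for i in range(n_ids)
    ]
    return normalize([math.exp(v) for v in mean_log])

def debias(observed, prior):
    """Divide out the ID prior and renormalize."""
    return normalize([p / q for p, q in zip(observed, prior)])

# Synthetic data (assumed values): a model that favours ID "A" regardless of content.
true_prior = [0.4, 0.3, 0.2, 0.1]    # bias over IDs A/B/C/D
content_pref = [0.1, 0.6, 0.2, 0.1]  # intrinsic preference over option contents
n = len(true_prior)

# Shift k places content (i + k) % n under ID i.
observed_per_shift = [
    normalize([true_prior[i] * content_pref[(i + k) % n] for i in range(n)])
    for k in range(n)
]

estimated_prior = estimate_prior(observed_per_shift)
debiased = debias(observed_per_shift[0], estimated_prior)
```

Under these assumptions the estimated prior matches the synthetic bias, and dividing it out of the unshifted prediction recovers the content preference. With a real model the factorization holds only approximately, so the recovery would be approximate as well.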

Title: DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. (arXiv:2309.03883v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03883
Code URL: null
Copy Paste: [[2309.03883]] DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models(http://arxiv.org/abs/2309.03883)
Summary:
Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that requires neither conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLM has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves truthfulness across multiple-choice tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17% absolute points, demonstrating its potential in making LLMs reliably generate truthful facts.
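
A minimal sketch of the layer-contrast idea follows. It assumes that both a later and an earlier layer have already been projected to vocabulary logits, and that the contrast is the difference of their log-probabilities followed by renormalization. This is one plausible reading of the summary, not the authors' implementation; layer selection and any filtering of candidate tokens are omitted, and `contrast_layers` is a hypothetical name.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def contrast_layers(late_logits, early_logits):
    """Return a next-token distribution that rewards tokens whose
    log-probability grows between the early and the late layer."""
    late = log_softmax(late_logits)
    early = log_softmax(early_logits)
    scores = [l - e for l, e in zip(late, early)]
    return [math.exp(s) for s in log_softmax(scores)]

# Toy three-token vocabulary (assumed values): token 1 gains support in the late layer.
late_logits = [2.0, 1.0, 0.0]
early_logits = [2.0, 0.0, 0.0]
contrasted = contrast_layers(late_logits, early_logits)
```

In this toy case token 0 has the highest late-layer logit, but token 1 is the only one whose score increases between layers, so the contrasted distribution favours token 1.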

segmentation

Title: MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary Polyp Segmentation. (arXiv:2309.03329v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03329
Code URL: https://github.com/dinhhieuhoang/meganet
Copy Paste: [[2309.03329]] MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary Polyp Segmentation(http://arxiv.org/abs/2309.03329)
Summary:
Efficient polyp segmentation in healthcare plays a critical role in enabling early diagnosis of colorectal cancer. However, the segmentation of polyps presents numerous challenges, including the intricate distribution of backgrounds, variations in polyp sizes and shapes, and indistinct boundaries. Defining the boundary between the foreground (i.e., the polyp itself) and the background (surrounding tissue) is difficult. To mitigate these challenges, we propose the Multi-Scale Edge-Guided Attention Network (MEGANet), tailored specifically for polyp segmentation within colonoscopy images. This network draws inspiration from the fusion of a classical edge detection technique with an attention mechanism. By combining these techniques, MEGANet effectively preserves high-frequency information, notably edges and boundaries, which tend to erode as neural networks deepen. MEGANet is designed as an end-to-end framework encompassing three key modules: an encoder, which is responsible for capturing and abstracting the features from the input image; a decoder, which focuses on salient features; and the Edge-Guided Attention module (EGA), which employs the Laplacian operator to accentuate polyp boundaries. Extensive experiments, both qualitative and quantitative, on five benchmark datasets demonstrate that our MEGANet outperforms other existing SOTA methods under six evaluation metrics. Our code is available at https://github.com/DinhHieuHoang/MEGANet
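
To illustrate why a Laplacian operator highlights boundaries, the sketch below applies the standard 4-neighbour discrete Laplacian to a tiny grayscale image. It shows only the classical operator, not the EGA module itself; the function name `laplacian` and the choice to leave border pixels at zero are assumptions made for this example.

```python
def laplacian(image):
    """Apply the 4-neighbour discrete Laplacian to a 2D list of numbers.

    Border pixels are left at 0.0 for simplicity (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = (
                image[y - 1][x] + image[y + 1][x]
                + image[y][x - 1] + image[y][x + 1]
                - 4 * image[y][x]
            )
    return out

# A flat region gives zero response; an isolated bright pixel gives a strong one.
flat = [[1.0] * 5 for _ in range(5)]
spot = [[0.0] * 5 for _ in range(5)]
spot[2][2] = 1.0

flat_response = laplacian(flat)
spot_response = laplacian(spot)
```

The response is zero wherever intensity is locally constant and non-zero only where it changes, which is the high-frequency information the summary says tends to erode in deep networks.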

Title: Using Neural Networks for Fast SAR Roughness Estimation of High Resolution Images. (arXiv:2309.03351v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03351
Code URL: null
Copy Paste: [[2309.03351]] Using Neural Networks for Fast SAR Roughness Estimation of High Resolution Images(http://arxiv.org/abs/2309.03351)
Summary:
The analysis of Synthetic Aperture Radar (SAR) imagery is an important step in remote sensing applications, and it is a challenging problem due to its inherent speckle noise. One typical solution is to model the data using the $G_I^0$ distribution and extract its roughness information, which in turn can be used in posterior imaging tasks, such as segmentation, classification and interpretation. This leads to the need for quick and reliable estimation of the roughness parameter from SAR data, especially with high resolution images. Unfortunately, traditional parameter estimation procedures are slow and prone to estimation failures. In this work, we propose a neural network-based estimation framework that first learns how to predict underlying parameters of $G_I^0$ samples and then can be used to estimate the roughness of unseen data. We show that this approach leads to an estimator that is quicker, yields less estimation error and is less prone to failures than the traditional estimation procedures for this problem, even when we use a simple network. More importantly, we show that this same methodology can be generalized to handle image inputs and, even if trained on purely synthetic data for a few seconds, is able to perform real-time pixel-wise roughness estimation for high resolution real SAR imagery.

Title: Temporal Collection and Distribution for Referring Video Object Segmentation. (arXiv:2309.03473v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03473
Code URL: null
Copy Paste: [[2309.03473]] Temporal Collection and Distribution for Referring Video Object Segmentation(http://arxiv.org/abs/2309.03473)
Summary:
Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression. It requires aligning the natural language expression with the objects' motions and their dynamic associations at the global video level but segmenting objects at the frame level. To achieve this goal, we propose to simultaneously maintain a global referent token and a sequence of object queries, where the former is responsible for capturing video-level referent according to the language expression, while the latter serves to better locate and segment objects with each frame. Furthermore, to explicitly capture object motions and spatial-temporal cross-modal reasoning over objects, we propose a novel temporal collection-distribution mechanism for interacting between the global referent token and object queries. Specifically, the temporal collection mechanism collects global information for the referent token from object queries to the temporal motions to the language expression. In turn, the temporal distribution first distributes the referent token to the referent sequence across all frames and then performs efficient cross-frame reasoning between the referent sequence and object queries in every frame. Experimental results show that our method outperforms state-of-the-art methods on all benchmarks consistently and significantly.

Title: Instance Segmentation of Dislocations in TEM Images. (arXiv:2309.03499v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03499
Code URL: null
Copy Paste: [[2309.03499]] Instance Segmentation of Dislocations in TEM Images(http://arxiv.org/abs/2309.03499)
Summary:
Quantitative Transmission Electron Microscopy (TEM) during in-situ straining experiments is able to reveal the motion of dislocations -- linear defects in the crystal lattice of metals. In the domain of materials science, knowledge about the location and movement of dislocations is important for creating novel materials with superior properties. A long-standing problem, however, is to identify the position and extract the shape of dislocations, which would ultimately help to create a digital twin of such materials. In this work, we quantitatively compare state-of-the-art instance segmentation methods, including Mask R-CNN and YOLOv8. The dislocation masks resulting from the instance segmentation are converted to mathematical lines, enabling quantitative analysis of dislocation length and geometry -- important information for the domain scientist, which we then propose to include as a novel length-aware quality metric for estimating the network performance. Our segmentation pipeline shows a high accuracy suitable for all further domain-specific post-processing. Additionally, our physics-based metric turns out to perform much more consistently than typically used pixel-wise metrics.

Title: BroadCAM: Outcome-agnostic Class Activation Mapping for Small-scale Weakly Supervised Applications. (arXiv:2309.03509v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03509
Code URL: null
Copy Paste: [[2309.03509]] BroadCAM: Outcome-agnostic Class Activation Mapping for Small-scale Weakly Supervised Applications(http://arxiv.org/abs/2309.03509)
Summary:
Class activation mapping (CAM), a visualization technique for interpreting deep learning models, is now commonly used for weakly supervised semantic segmentation (WSSS) and object localization (WSOL). It computes a weighted aggregation of the feature maps, activating those with high class relevance. Current CAM methods derive the weights from training outcomes, such as predicted scores (forward information) and gradients (backward information). However, with small-scale data, unstable training may lead to less effective model outcomes and generate unreliable weights, finally resulting in incorrect activation and noisy CAM seeds. In this paper, we propose an outcome-agnostic CAM approach, called BroadCAM, for small-scale weakly supervised applications. Because the broad learning system (BLS) is independent of the model learning, BroadCAM can prevent the weights from being affected by unreliable model outcomes with small-scale data. Evaluated on VOC2012 (natural images) and BCSS-WSSS (medical images) for WSSS and on OpenImages30k for WSOL, BroadCAM demonstrates better performance than existing CAM methods with small-scale data (less than 5%) across different CNN architectures. It also achieves SOTA performance with large-scale training data. Extensive qualitative comparisons are conducted to demonstrate how BroadCAM activates the high class-relevance feature maps and generates reliable CAMs with small-scale training data.

Title: Towards Comparable Knowledge Distillation in Semantic Image Segmentation. (arXiv:2309.03659v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03659
Code URL: null
Copy Paste: [[2309.03659]] Towards Comparable Knowledge Distillation in Semantic Image Segmentation(http://arxiv.org/abs/2309.03659)
Summary:
Knowledge Distillation (KD) is one proposed solution to large model sizes and slow inference speed in semantic segmentation. In our research we identify 25 proposed distillation loss terms from 14 publications in the last 4 years. Unfortunately, a comparison of terms based on published results is often impossible, because of differences in training configurations. A good illustration of this problem is the comparison of two publications from 2022. Using the same models and dataset, Structural and Statistical Texture Distillation (SSTKD) reports an increase of student mIoU of 4.54 and a final performance of 29.19, while Adaptive Perspective Distillation (APD) only improves student performance by 2.06 percentage points, but achieves a final performance of 39.25. The reason for such extreme differences is often a suboptimal choice of hyperparameters and a resulting underperformance of the student model used as reference point. In our work, we reveal problems of insufficient hyperparameter tuning by showing that distillation improvements of two widely accepted frameworks, SKD and IFVD, vanish when hyperparameters are optimized sufficiently. To improve comparability of future research in the field, we establish a solid baseline for three datasets and two student models and provide extensive information on hyperparameter tuning. We find that only two out of eight techniques can compete with our simple baseline on the ADE20K dataset.
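
The comparability problem in the SSTKD and APD example becomes clearer once the implied reference baselines are worked out. The short calculation below uses only the four numbers quoted in the summary and assumes that both reported gains are absolute mIoU differences, so that baseline = final - gain.

```python
# Figures quoted in the summary; "gain" is assumed to be an absolute mIoU difference.
reported = {
    "SSTKD": {"gain": 4.54, "final": 29.19},
    "APD": {"gain": 2.06, "final": 39.25},
}

# Implied student baseline (no distillation) for each publication.
baselines = {
    name: round(result["final"] - result["gain"], 2)
    for name, result in reported.items()
}

# Difference between the two reference points, in mIoU.
baseline_gap = round(baselines["APD"] - baselines["SSTKD"], 2)
```

Under that assumption the implied baselines are 24.65 (SSTKD) and 37.19 (APD), a gap of 12.54 mIoU between two students that nominally use the same model and dataset. The gap between reference points is several times larger than either reported distillation gain, which is why the gains cannot be compared directly.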

Title: A boundary-aware point clustering approach in Euclidean and embedding spaces for roof plane segmentation. (arXiv:2309.03722v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03722
Code URL: null
Copy Paste: [[2309.03722]] A boundary-aware point clustering approach in Euclidean and embedding spaces for roof plane segmentation(http://arxiv.org/abs/2309.03722)
Summary:
Roof plane segmentation from airborne LiDAR point clouds is an important technology for 3D building model reconstruction. One of the key issues of plane segmentation is how to design powerful features that can exactly distinguish adjacent planar patches. The quality of the point features directly determines the accuracy of roof plane segmentation. Most existing approaches use handcrafted features to extract roof planes. However, the discriminative ability of these features is relatively low, especially in boundary areas. To solve this problem, we propose a boundary-aware point clustering approach in Euclidean and embedding spaces constructed by a multi-task deep network for roof plane segmentation. We design a three-branch network to predict semantic labels and point offsets and to extract deep embedding features. In the first branch, we classify the input data as non-roof, boundary and plane points. In the second branch, we predict point offsets for shifting each point toward its respective instance center. In the third branch, we constrain points of the same plane instance to have similar embeddings. We aim to ensure that points of the same plane instance are as close as possible in both Euclidean and embedding spaces. However, although the deep network has strong feature representation ability, it is still hard to accurately distinguish points near plane instance boundaries. Therefore, we first group plane points into many clusters in the two spaces, and then we assign the remaining boundary points to their closest clusters to generate the final complete roof planes. In this way, we can effectively reduce the influence of unreliable boundary points. In addition, we construct a synthetic dataset and a real dataset to train and evaluate our approach. The experimental results show that the proposed approach significantly outperforms the existing state-of-the-art approaches.
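
The final step, assigning the remaining boundary points to their closest clusters, can be sketched as follows. The summary does not say how "closest" is measured, so this example assumes Euclidean distance to each cluster's centroid; the function names and the toy coordinates are hypothetical.

```python
import math

def centroid(points):
    """Mean position of a non-empty list of equal-length coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def assign_boundary_points(clusters, boundary_points):
    """Attach each boundary point to the cluster with the nearest centroid.

    Centroids are computed once from the reliable plane points and are not
    updated as boundary points are added (an assumption of this sketch).
    """
    centroids = {label: centroid(points) for label, points in clusters.items()}
    completed = {label: list(points) for label, points in clusters.items()}
    for point in boundary_points:
        nearest = min(centroids, key=lambda label: math.dist(point, centroids[label]))
        completed[nearest].append(point)
    return completed

# Toy data: two already-clustered planes and two unassigned boundary points.
clusters = {
    "plane_a": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "plane_b": [(10.0, 0.0, 5.0), (11.0, 0.0, 5.0)],
}
boundary_points = [(2.0, 0.0, 0.5), (9.0, 0.0, 4.5)]
roof_planes = assign_boundary_points(clusters, boundary_points)
```

Clustering only the confident plane points first and attaching boundary points afterwards keeps the unreliable points from distorting the clusters, which is the motivation the summary gives for the two-stage design.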

Title: Cross-Task Attention Network: Improving Multi-Task Learning for Medical Imaging Applications. (arXiv:2309.03837v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03837
Code URL: null
Copy Paste: [[2309.03837]] Cross-Task Attention Network: Improving Multi-Task Learning for Medical Imaging Applications(http://arxiv.org/abs/2309.03837)
Summary:
Multi-task learning (MTL) is a powerful approach in deep learning that leverages the information from multiple tasks during training to improve model performance. In medical imaging, MTL has shown great potential to solve various tasks. However, existing MTL architectures in medical imaging are limited in sharing information across tasks, reducing the potential performance improvements of MTL. In this study, we introduce a novel attention-based MTL framework to better leverage inter-task interactions for various tasks from pixel-level to image-level predictions. Specifically, we propose a Cross-Task Attention Network (CTAN) which utilizes cross-task attention mechanisms to incorporate information by interacting across tasks. We validated CTAN on four medical imaging datasets that span different domains and tasks including: radiation treatment planning prediction using planning CT images of two different target cancers (Prostate, OpenKBP); pigmented skin lesion segmentation and diagnosis using dermatoscopic images (HAM10000); and COVID-19 diagnosis and severity prediction using chest CT scans (STOIC). Our study demonstrates the effectiveness of CTAN in improving the accuracy of medical imaging tasks. Compared to standard single-task learning (STL), CTAN demonstrated a 4.67% improvement in performance and outperformed both widely used MTL baselines: hard parameter sharing (HPS) with an average performance improvement of 3.22%; and multi-task attention network (MTAN) with a relative decrease of 5.38%. These findings highlight the significance of our proposed MTL framework in solving medical imaging tasks and its potential to improve their accuracy across domains.

Title: Tracking Anything with Decoupled Video Segmentation. (arXiv:2309.03903v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.03903
Code URL: null
Copy Paste: [[2309.03903]] Tracking Anything with Decoupled Video Segmentation(http://arxiv.org/abs/2309.03903)
Summary:
Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA

Title: Word segmentation granularity in Korean. (arXiv:2309.03713v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.03713
Code URL: null
Copy Paste: [[2309.03713]] Word segmentation granularity in Korean(http://arxiv.org/abs/2309.03713)
Summary:
This paper describes word segmentation granularity in Korean language processing. From a word separated by blank space, which is termed an eojeol, to a sequence of morphemes, there are multiple possible levels of word segmentation granularity in Korean. For specific language processing and corpus annotation tasks, several different granularity levels have been proposed and utilized, because agglutinative languages, including Korean, have a one-to-one mapping between functional morpheme and syntactic category. Thus, we analyze these different granularity levels, presenting examples of Korean language processing systems for future reference. Interestingly, the granularity that separates only functional morphemes, including case markers and verbal endings, and keeps other suffixes for morphological derivation results in the optimal performance for phrase structure parsing. This contradicts previous best practices for Korean language processing, which have been the de facto standard for various applications that require separating all morphemes.