2024-07-02

Title: Visual Language Model based Cross-modal Semantic Communication Systems

Authors: Feibo Jiang, Chuanguo Tang, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan
Subjects: cs.CV, cs.AI, cs.CL, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Visual Language Model based Cross-modal Semantic Communication Systems(https://arxiv.org/abs/)
Keywords: robust
Abstract: Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, extant Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) Cross-modal Knowledge Base (CKB) is used to extract high-density textual semantics from the semantically sparse image at the transmitter and reconstruct the original image based on textual semantics at the receiver. The transmission of high-density semantics contributes to alleviating bandwidth pressure. (2) Memory-assisted Encoder and Decoder (MED) employ a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when there is a drift in the distribution of semantic features. (3) Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.

Title: LMVD: A Large-Scale Multimodal Vlog Dataset for Depression Detection in the Wild

Authors: Lang He, Kai Chen, Junnan Zhao, Yimeng Wang, Ercheng Pei, Haifeng Chen, Jiewei Jiang, Shiqing Zhang, Jie Zhang, Zhongmin Wang, Tao He, Prayag Tiwari
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] LMVD: A Large-Scale Multimodal Vlog Dataset for Depression Detection in the Wild(https://arxiv.org/abs/)
Keywords: privacy, protect
Abstract: Depression can significantly impact many aspects of an individual's life, including their personal and social functioning, academic and work performance, and overall quality of life. Many researchers within the field of affective computing are adopting deep learning technology to explore potential patterns related to the detection of depression. However, because of subjects' privacy protection concerns, data in this area is still scarce, presenting a challenge for the deep discriminative models used in detecting depression. To navigate these obstacles, a large-scale multimodal vlog dataset (LMVD) for depression recognition in the wild is built. LMVD has 1823 samples totaling 214 hours from 1475 participants, captured from four multimedia platforms (Sina Weibo, Bilibili, Tiktok, and YouTube). A novel architecture termed MDDformer is proposed to learn the non-verbal behaviors of individuals. Extensive validations are performed on the LMVD dataset, demonstrating superior performance for depression detection. We anticipate that the LMVD will contribute a valuable function to the depression detection community. The data and code will be released at the link: this https URL.

Title: Provably Secure Non-interactive Key Exchange Protocol for Group-Oriented Applications in Scenarios with Low-Quality Networks

Authors: Rui Zhang, Lei Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Provably Secure Non-interactive Key Exchange Protocol for Group-Oriented Applications in Scenarios with Low-Quality Networks(https://arxiv.org/abs/)
Keywords: secure
Abstract: Non-interactive key exchange (NIKE) enables two or multiple parties (just knowing the public system parameters and each other's public key) to derive a (group) session key without the need for interaction. Recently, NIKE in multi-party settings has attracted increasing attention. However, we note that most existing multi-party NIKE protocols rely on costly cryptographic techniques (i.e., multilinear maps and indistinguishability obfuscation), which leads to high computational costs in practice. Therefore, it is a challenging task to achieve multi-party NIKE protocols by using more practical cryptographic primitives. In this paper, we propose a secure and efficient NIKE protocol for secure communications in dynamic groups, whose construction is based only on bilinear maps. This protocol allows multiple parties to negotiate asymmetric group keys (a public group encryption key and each party's decryption key) without any interaction among one another. Additionally, the protocol supports updating of group keys in an efficient and non-interactive way once a party outside the group joins or a group member leaves the group. Further, any party called a sender (even outside a group) intending to connect with some or all of the group members, called receivers, just needs to generate a ciphertext of constant size under the public group encryption key, and only a group member who is an intended receiver can decrypt the ciphertext to obtain the session key. We prove that our protocol achieves correctness and indistinguishability of the session key under the k-Bilinear Diffie-Hellman exponent (k-BDHE) assumption. An evaluation demonstrates the efficiency of our protocol.

Title: Curriculum Learning with Quality-Driven Data Selection

Authors: Biao Wu, Fang Meng, Ling Chen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Curriculum Learning with Quality-Driven Data Selection(https://arxiv.org/abs/)
Keywords: large language model
Abstract: The impressive multimodal capabilities demonstrated by OpenAI's GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has been shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the instruction data. Current methodologies for data selection in MLLMs often rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to potential overfitting on the chosen evaluation datasets. To mitigate these limitations, we propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality. This approach leverages the distinct distribution of these two attributes, mapping data quality into a two-dimensional space that allows for the selection of data based on their location within this distribution. By utilizing this space, we can analyze the impact of task type settings, used as prompts, on data quality. Additionally, this space can be used to construct multi-stage subsets of varying quality to facilitate curriculum learning. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in five commonly assessed capabilities compared to using the complete dataset. Our code, data, and models are publicly available at: https://anonymous.4open.science/r/EHIT-31B4
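The idea of selecting data by its location in a two-dimensional quality space can be sketched roughly as below. This is only an illustration: the score fields, the rectangular thresholds, and the helper names (`Sample`, `select_region`, `curriculum_stages`) are assumptions made for this example, not the method described in the paper.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    sample_id: str
    correlation: float  # hypothetical image-text correlation score (higher = better aligned)
    perplexity: float   # hypothetical model perplexity (lower = easier for the model)


def select_region(samples, min_correlation, max_perplexity):
    """Keep the samples that fall inside one rectangle of the 2-D score space."""
    return [
        s for s in samples
        if s.correlation >= min_correlation and s.perplexity <= max_perplexity
    ]


def curriculum_stages(samples, thresholds):
    """Build one subset per (min_correlation, max_perplexity) pair, in the given order."""
    return [select_region(samples, c, p) for c, p in thresholds]


data = [
    Sample("a", 0.90, 5.0),
    Sample("b", 0.40, 30.0),
    Sample("c", 0.75, 12.0),
]

# Stage 0 is permissive, stage 1 keeps only the stricter region.
stages = curriculum_stages(data, [(0.3, 50.0), (0.7, 15.0)])
```

Training on `stages[0]` first and `stages[1]` afterwards would be one way to order a curriculum; the ordering itself is also an assumption here.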

Title: AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability

Authors: Iván Matas, Carmen Serrano, Francisca Silva, Amalia Serrano, Tomás Toledo-Pastrana, Begoña Acha
Subjects: cs.LG, cs.AI, cs.CV, cs.IR, eess.IV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability(https://arxiv.org/abs/)
Keywords: interpretability
Abstract: An AI tool has been developed to provide interpretable support for the diagnosis of BCC via teledermatology, thus speeding up referrals and optimizing resource utilization. The interpretability is provided in two ways. First, the main BCC dermoscopic patterns are found in the image to justify the BCC/non-BCC classification. Second, based on the common visual XAI method Grad-CAM, a clinically inspired visual explanation is developed where the relevant features for diagnosis are located. Since there is no established ground truth for BCC dermoscopic features, a standard reference is inferred from the diagnosis of four dermatologists using an Expectation Maximization (EM) based algorithm. The results demonstrate significant improvements in classification accuracy and interpretability, positioning this approach as a valuable tool for early BCC detection and referral to dermatologists. The BCC/non-BCC classification achieved an accuracy rate of 90%. For the clinically inspired XAI results, the detection of BCC patterns useful to clinicians reaches 99% accuracy. As for the clinically inspired visual XAI results, the mean of the Grad-CAM normalized value within the manually segmented clinical features is 0.57, while outside this region it is 0.16. This indicates that the Grad-CAM activation is concentrated within the regions of the BCC patterns. These results prove the ability of the AI tool to provide a useful explanation.
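The inside/outside comparison of normalized Grad-CAM values reported in this abstract can be illustrated with a small sketch. It assumes a heatmap already normalized to [0, 1] and a binary mask of the manually segmented features, both given as nested lists; the function name and the toy values are hypothetical and are not taken from the paper.

```python
def mean_inside_outside(heatmap, mask):
    """Return (mean inside mask, mean outside mask) for a normalized heatmap.

    heatmap: 2-D nested list of floats in [0, 1]
    mask:    2-D nested list of 0/1 flags with the same shape
    """
    inside, outside = [], []
    for heat_row, mask_row in zip(heatmap, mask):
        for value, in_region in zip(heat_row, mask_row):
            (inside if in_region else outside).append(value)

    def mean(values):
        # An empty region has no defined mean.
        return sum(values) / len(values) if values else float("nan")

    return mean(inside), mean(outside)


# Toy 2x2 example: the top row is the annotated region.
heatmap = [[0.8, 0.6],
           [0.2, 0.0]]
mask = [[1, 1],
        [0, 0]]
inside_mean, outside_mean = mean_inside_outside(heatmap, mask)
```

A higher inside mean than outside mean means the explanation puts more weight on the annotated region than on the rest of the image.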

Title: Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction

Authors: Yuqing Qian, Ziyu Zheng, Prayag Tiwari, Yijie Ding, Quan Zou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction(https://arxiv.org/abs/)
Keywords: robust
Abstract: Drug-side effect prediction has become an essential area of research in the field of pharmacology. As the use of medications continues to rise, so does the importance of understanding and mitigating the potential risks associated with them. At present, researchers have turned to data-driven methods to predict drug-side effects. Drug-side effect prediction is a link prediction problem, and the related data can be described from various perspectives. To process these kinds of data, a multi-view method, called Multiple Kronecker RLS fusion-based link propagation (MKronRLSF-LP), is proposed. MKronRLSF-LP extends the Kron-RLS by finding the consensus partitions and multiple graph Laplacian constraints in the multi-view setting. Both of these multi-view settings contribute to a higher quality result. Extensive experiments have been conducted on drug-side effect datasets, and our empirical results provide evidence that our approach is effective and robust.

Title: UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

Authors: Ilia Shumailov, Jamie Hayes, Eleni Triantafillou, Guillermo Ortiz-Jimenez, Nicolas Papernot, Matthew Jagielski, Itay Yona, Heidi Howard, Eugene Bagdasaryan
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI(https://arxiv.org/abs/)
Keywords: privacy, generative, large language model
Abstract: Exact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently, unlearning is often discussed as an approach for removal of impermissible knowledge, i.e., knowledge that the model should not possess, such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper we revisit the paradigm in which unlearning is used in Large Language Models (LLMs) and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation. We discuss feasibility of ununlearning for modern LLMs and examine broader implications.

Title: Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models

Authors: Ben Fauber
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models(https://arxiv.org/abs/)
Keywords: generative
Abstract: We describe the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), with instruction fine-tuned pretrained generative small language models (SLMs). We achieved accurate predictions for a range of affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand and the amino acid sequence of the protein were used as the model inputs. Our results demonstrate a clear improvement over machine learning (ML) and free-energy perturbation (FEP+) based methods in accurately predicting a range of ligand-protein interaction affinities, which can be leveraged to further accelerate drug discovery campaigns against challenging therapeutic targets.

Title: Personalized Federated Continual Learning via Multi-granularity Prompt

Authors: Hao Yu, Xin Yang, Xin Gao, Yan Kang, Hao Wang, Junbo Zhang, Tianrui Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Personalized Federated Continual Learning via Multi-granularity Prompt(https://arxiv.org/abs/)
Keywords: federate
Abstract: Personalized Federated Continual Learning (PFCL) is a new practical scenario that poses greater challenges in sharing and personalizing knowledge. PFCL not only relies on knowledge fusion for server aggregation at the global spatial-temporal perspective but also needs model improvement for each client according to the local requirements. Existing methods, whether in Personalized Federated Learning (PFL) or Federated Continual Learning (FCL), have overlooked the multi-granularity representation of knowledge, which can be utilized to overcome Spatial-Temporal Catastrophic Forgetting (STCF) and to adapt generalized knowledge to each client through coarse-to-fine human cognitive mechanisms. Moreover, it allows shared knowledge to be personalized more effectively, thus serving each client's own purpose. To this end, we propose a novel concept called multi-granularity prompt, i.e., a coarse-grained global prompt acquired through the common model learning process, and a fine-grained local prompt used to personalize the generalized representation. The former focuses on efficiently transferring shared global knowledge without spatial forgetting, and the latter emphasizes specific learning of personalized local knowledge to overcome temporal forgetting. In addition, we design a selective prompt fusion mechanism for aggregating knowledge of global prompts distilled from different clients. By the exclusive fusion of coarse-grained knowledge, we achieve the transmission and refinement of common knowledge among clients, further enhancing the performance of personalization. Extensive experiments demonstrate the effectiveness of the proposed method in addressing STCF as well as improving personalized performance. Our code is now available at this https URL.

Title: OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents(https://arxiv.org/abs/)
Keywords: transformer
Abstract: We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be added to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performance on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potential.

Title: Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges

Authors: Mahmoud Ibrahim, Yasmina Al Khalil, Sina Amirrajab, Chang Sun, Marcel Breeuwer, Josien Pluim, Bart Elen, Gokhan Ertaylan, Michel Dumontier
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges(https://arxiv.org/abs/)
Keywords: generative, segmentation
Abstract: This paper presents a comprehensive systematic review of generative models (GANs, VAEs, DMs, and LLMs) used to synthesize various medical data types, including imaging (dermoscopic, mammographic, ultrasound, CT, MRI, and X-ray), text, time-series, and tabular data (EHR). Unlike previous narrowly focused reviews, our study encompasses a broad array of medical data modalities and explores various generative models. Our search strategy queries databases such as Scopus, PubMed, and ArXiv, focusing on recent works from January 2021 to November 2023, excluding reviews and perspectives. This period emphasizes recent advancements beyond GANs, which have been extensively covered previously. The survey reveals insights from three key aspects: (1) Synthesis applications and purpose of synthesis, (2) generation techniques, and (3) evaluation methods. It highlights clinically valid synthesis applications, demonstrating the potential of synthetic data to tackle diverse clinical requirements. While conditional models incorporating class labels, segmentation masks and image translations are prevalent, there is a gap in utilizing prior clinical knowledge and patient-specific context, suggesting a need for more personalized synthesis approaches and emphasizing the importance of tailoring generative approaches to the unique characteristics of medical data. Additionally, there is a significant gap in using synthetic data beyond augmentation, such as for validation and evaluation of downstream medical AI models. The survey uncovers that the lack of standardized evaluation methodologies tailored to medical images is a barrier to clinical application, underscoring the need for in-depth evaluation approaches, benchmarking, and comparative studies to promote openness and collaboration.

Title: From Efficient Multimodal Models to World Models: A Survey

Authors: Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] From Efficient Multimodal Models to World Models: A Survey(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Multimodal Large Models (MLMs) are becoming a significant research focus, combining powerful large language models with multimodal learning to perform complex tasks across different data modalities. This review explores the latest developments and challenges in MLMs, emphasizing their potential in achieving artificial general intelligence and as a pathway to world models. We provide an overview of key techniques such as Multimodal Chain of Thought (M-COT), Multimodal Instruction Tuning (M-IT), and Multimodal In-Context Learning (M-ICL). Additionally, we discuss both the fundamental and specific technologies of multimodal models, highlighting their applications, input/output modalities, and design characteristics. Despite significant advancements, the development of a unified multimodal model remains elusive. We discuss the integration of 3D generation and embodied intelligence to enhance world simulation capabilities and propose incorporating external rule systems for improved reasoning and decision-making. Finally, we outline future research directions to address these challenges and advance the field.

Title: Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks

Authors: Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, GP Bhargav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachin Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Cox, Salim Roukos, Luis Lastras, Pavan Kapanipathi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (APIs) to complete complex tasks. These tasks together are termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling, those being Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets comparing GRANITE-20B-FUNCTIONCALLING to more than 15 other best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and fourth overall. As a result of the diverse tasks and datasets used for training our model, we show that GRANITE-20B-FUNCTIONCALLING has better generalizability on multiple tasks in seven different evaluation datasets.
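To make the notion of function calling concrete, here is a minimal sketch of how a host program might execute a call that a model emits as JSON. The tool registry, the `{"name": ..., "arguments": ...}` message shape, and the `dispatch` helper are assumptions for illustration only; they are not the format used by the model in this abstract.

```python
import json

# Hypothetical tool registry: function name -> Python callable.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def dispatch(call_json):
    """Parse a JSON-encoded call and run the matching tool.

    Expected (assumed) shape: {"name": "<tool>", "arguments": {...}}
    """
    call = json.loads(call_json)
    name = call["name"]
    arguments = call["arguments"]
    if name not in TOOLS:
        raise KeyError(f"unknown function: {name}")
    return TOOLS[name](**arguments)


# A model response that requests a tool call (hand-written for this example).
result = dispatch('{"name": "multiply", "arguments": {"a": 6, "b": 7}}')
```

In this sketch, function name detection corresponds to the `name` lookup, and parameter-value pair detection corresponds to the `arguments` mapping.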

Title: RepAct: The Re-parameterizable Adaptive Activation Function

Authors: Xian Wu, Qingchuan Tao, Shuang Wang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] RepAct: The Re-parameterizable Adaptive Activation Function(https://arxiv.org/abs/)
Keywords: interpretability
Abstract: Addressing the imperative need for efficient artificial intelligence in IoT and edge computing, this study presents RepAct, a re-parameterizable adaptive activation function tailored for optimizing lightweight neural networks within the computational limitations of edge devices. By employing a multi-branch structure with learnable adaptive weights, RepAct enriches feature processing and enhances cross-layer interpretability. When evaluated on tasks such as image classification and object detection, RepAct notably surpassed conventional activation functions in lightweight networks, delivering up to a 7.92% accuracy boost on MobileNetV3-Small for the ImageNet100 dataset, while maintaining computational complexity on par with HardSwish. This innovative approach not only maximizes model parameter efficiency but also significantly improves the performance and understanding capabilities of lightweight neural networks, demonstrating its potential for real-time edge computing applications.

Title: Towards Secure and Efficient Data Scheduling for Vehicular Social Networks

Authors: Youhua Xia, Tiehua Zhang, Jiong Jin, Ying He, Fei Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Towards Secure and Efficient Data Scheduling for Vehicular Social Networks(https://arxiv.org/abs/)
Keywords: secure, security, privacy
Abstract: Efficient data transmission scheduling within vehicular environments poses a significant challenge due to the high mobility of such networks. Contemporary research predominantly centers on crafting cooperative scheduling algorithms tailored for vehicular networks. Nevertheless, orchestrating scheduling in vehicular social networks both effectively and efficiently remains a formidable problem. This paper introduces an innovative learning-based algorithm for scheduling data transmission that prioritizes efficiency and security within vehicular social networks. The algorithm first uses a specifically constructed neural network to enhance data processing capabilities. After this, it incorporates a Q-learning paradigm during the data transmission phase to optimize the information exchange, the privacy of which is safeguarded by differential privacy throughout the communication process. Comparative experiments demonstrate the superior performance of the proposed Q-learning enhanced scheduling algorithm relative to existing state-of-the-art scheduling algorithms in the context of vehicular social networks.
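For readers unfamiliar with Q-learning, the standard tabular update rule is sketched below. The state and action names, the reward, and the hyperparameters are made up for illustration; the paper's neural network and differential privacy components are not modeled here.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning step and return the updated Q-value.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical scheduling decision: transmit now or defer.
actions = ["send_now", "defer"]
q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0

first = q_update(q_table, "congested", "defer", reward=1.0,
                 next_state="clear", actions=actions)
second = q_update(q_table, "congested", "defer", reward=1.0,
                  next_state="clear", actions=actions)
```

Repeating the same rewarded transition moves the estimate toward the target, which is why `second` is larger than `first`.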

Title: Localizing Anomalies via Multiscale Score Matching Analysis

Authors: Ahsan Mahmood, Junier Oliva, Martin Styner
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Localizing Anomalies via Multiscale Score Matching Analysis(https://arxiv.org/abs/)
Keywords: generative, segmentation
Abstract: Anomaly detection and localization in medical imaging remain critical challenges in healthcare. This paper introduces Spatial-MSMA (Multiscale Score Matching Analysis), a novel unsupervised method for anomaly localization in volumetric brain MRIs. Building upon the MSMA framework, our approach incorporates spatial information and conditional likelihoods to enhance anomaly detection capabilities. We employ a flexible normalizing flow model conditioned on patch positions and global image features to estimate patch-wise anomaly scores. The method is evaluated on a dataset of 1,650 T1- and T2-weighted brain MRIs from typically developing children, with simulated lesions added to the test set. Spatial-MSMA significantly outperforms existing methods, including reconstruction-based, generative-based, and interpretation-based approaches, in lesion detection and segmentation tasks. Our model achieves superior performance in both distance-based metrics (99th percentile Hausdorff Distance: $7.05 \pm 0.61$, Mean Surface Distance: $2.10 \pm 0.43$) and component-wise metrics (True Positive Rate: $0.83 \pm 0.01$, Positive Predictive Value: $0.96 \pm 0.01$). These results demonstrate Spatial-MSMA's potential for accurate and interpretable anomaly localization in medical imaging, with implications for improved diagnosis and treatment planning in clinical settings. Our code is available at this https URL.

Title: Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach

Authors: Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang
Subjects: cs.CL, cs.AI, cs.ET, cs.HC, cs.SI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach(https://arxiv.org/abs/)
Keywords: large language model
Abstract: In recent years, the United States has witnessed a significant surge in the popularity of vaping or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI) that caused hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. Due to the ubiquity of social media platforms, over 4.7 billion users worldwide use them for connectivity, communications, news, and entertainment with a significant portion of the discourse related to health, thereby establishing social media data as an invaluable organic data resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit vaping intention detection, this study compares the outcomes of this model against layman and clinical expert annotations. Using different prompting strategies such as zero-shot, one-shot, few-shot and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and also evaluated the performance of the strategies against each other. These preliminary findings emphasize the potential of GPT-4 in social media data analysis, especially in identifying users' subtle intentions that may elude human detection.

Title: Dataset Representativeness and Downstream Task Fairness

Authors: Victor Borza, Andrew Estornell, Chien-Ju Ho, Bradley Malin, Yevgeniy Vorobeychik
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Dataset Representativeness and Downstream Task Fairness(https://arxiv.org/abs/)
Keywords: fair
Abstract: Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are biased and not representative, i.e., the collected dataset does not accurately reflect the distribution of demographics of the true population. This is a concern because subgroups within the population can be under- or over-represented in a dataset, which may harm generalizability and lead to an unequal distribution of benefits and harms from downstream tasks that use such datasets (e.g., algorithmic bias in medical decision-making algorithms). In this paper, we assess the relationship between dataset representativeness and group-fairness of classifiers trained on that dataset. We demonstrate that there is a natural tension between dataset representativeness and classifier fairness; empirically we observe that training datasets with better representativeness can frequently result in classifiers with higher rates of unfairness. We provide some intuition as to why this occurs via a set of theoretical results in the case of univariate classifiers. We also find that over-sampling underrepresented groups can result in classifiers which exhibit greater bias to those groups. Lastly, we observe that fairness-aware sampling strategies (i.e., those which are specifically designed to select data with high downstream fairness) will often over-sample members of majority groups. These results demonstrate that the relationship between dataset representativeness and downstream classifier fairness is complex; balancing these two quantities requires special care from both model- and dataset-designers.
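The two quantities this abstract relates, dataset representativeness and group fairness of a classifier, can each be measured in several ways. The sketch below uses two simple choices as assumptions: the difference between sample share and population share per group, and the demographic parity gap of binary predictions. The paper may use different definitions.

```python
def representation_gap(groups, population_share):
    """Per-group difference between dataset share and true population share."""
    n = len(groups)
    return {g: groups.count(g) / n - share for g, share in population_share.items()}


def positive_rate(predictions, groups, group):
    """Fraction of positive (1) predictions among members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive prediction rate between any two groups."""
    rates = [positive_rate(predictions, groups, g) for g in sorted(set(groups))]
    return max(rates) - min(rates)


# Toy dataset: group A is over-represented relative to a 50/50 population.
groups = ["A", "A", "A", "B"]
predictions = [1, 1, 0, 0]

rep = representation_gap(groups, {"A": 0.5, "B": 0.5})
gap = demographic_parity_gap(predictions, groups)
```

A representation gap of zero for every group means the sample matches the population shares; a parity gap of zero means every group receives positive predictions at the same rate.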

Title: MetaKP: On-Demand Keyphrase Generation

Authors: Di Wu, Xiaoxian Shen, Kai-Wei Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MetaKP: On-Demand Keyphrase Generation(https://arxiv.org/abs/)
Keywords: robust, large language model
Abstract: Traditional keyphrase prediction methods predict a single set of keyphrases per document, failing to cater to the diverse needs of users and downstream applications. To bridge the gap, we introduce on-demand keyphrase generation, a novel paradigm that requires keyphrases that conform to specific high-level goals or intents. For this task, we present MetaKP, a large-scale benchmark comprising four datasets, 7500 documents, and 3760 goals across news and biomedical domains with human-annotated keyphrases. Leveraging MetaKP, we design both supervised and unsupervised methods, including a multi-task fine-tuning approach and a self-consistency prompting method with large language models. The results highlight the challenges of supervised fine-tuning, whose performance is not robust to distribution shifts. By contrast, the proposed self-consistency prompting approach greatly improves the performance of large language models, enabling GPT-4o to achieve 0.548 SemF1, surpassing the performance of a fully fine-tuned BART-base model. Finally, we demonstrate the potential of our method to serve as a general NLP infrastructure, exemplified by its application in epidemic event detection from social media.

Title: PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration

Authors: Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Zhongyi Shui, Kai Zhang, Jingxiong Li, Xingheng Lyu, Tao Lin, Lin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Vision Language Models (VLMs) like CLIP have attracted substantial attention in pathology, serving as backbones for applications such as zero-shot image classification and Whole Slide Image (WSI) analysis. Additionally, they can function as vision encoders when combined with large language models (LLMs) to support broader capabilities. Current efforts to train pathology VLMs rely on pathology image-text pairs from platforms like PubMed, YouTube, and Twitter, which provide limited, unscalable data with generally suboptimal image quality. In this work, we leverage large-scale WSI datasets like TCGA to extract numerous high-quality image patches. We then train a large multimodal model to generate captions for these images, creating PathGen-1.6M, a dataset containing 1.6 million high-quality image-caption pairs. Our approach involves multiple agent models collaborating to extract representative WSI patches, generating and refining captions to obtain high-quality image-text pairs. Extensive experiments show that integrating these generated pairs with existing datasets to train a pathology-specific CLIP model, PathGen-CLIP, significantly enhances its ability to analyze pathological images, with substantial improvements across nine pathology-related zero-shot image classification tasks and three whole-slide image tasks. Furthermore, we construct 200K instruction-tuning data based on PathGen-1.6M and integrate PathGen-CLIP with the Vicuna LLM to create more powerful multimodal models through instruction tuning. Overall, we provide a scalable pathway for high-quality data generation in pathology, paving the way for next-generation general pathology models.

Title: Evaluating Human Alignment and Model Faithfulness of LLM Rationale

Authors: Mohsen Fayyaz, Fan Yin, Jiao Sun, Nanyun Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Evaluating Human Alignment and Model Faithfulness of LLM Rationale(https://arxiv.org/abs/)
Keywords: fair, large language model
Abstract: We study how well large language models (LLMs) explain their generations with rationales -- a set of tokens extracted from the input texts that reflect the decision process of LLMs. We examine LLM rationales extracted with two methods: 1) attribution-based methods that use attention or gradients to locate important tokens, and 2) prompting-based methods that guide LLMs to extract rationales using prompts. Through extensive experiments, we show that prompting-based rationales align better with human-annotated rationales than attribution-based rationales, and demonstrate reasonable alignment with humans even when model performance is poor. We additionally find that the faithfulness limitations of prompting-based methods, which are identified in previous work, may be linked to their collapsed predictions. By fine-tuning these models on the corresponding datasets, both prompting and attribution methods demonstrate improved faithfulness. Our study sheds light on more rigorous and fair evaluations of LLM rationales, especially for prompting-based ones.

Title: Multimodal Prototyping for cancer survival prediction

Authors: Andrew H. Song, Richard J. Chen, Guillaume Jaume, Anurag J. Vaidya, Alexander S. Baras, Faisal Mahmood
Subjects: cs.CV, stat.AP
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Multimodal Prototyping for cancer survival prediction(https://arxiv.org/abs/)
Keywords: interpretability, transformer
Abstract: Multimodal survival methods combining gigapixel histology whole-slide images (WSIs) and transcriptomic profiles are particularly promising for patient prognostication and stratification. Current approaches involve tokenizing the WSIs into smaller patches (>10,000 patches) and transcriptomics into gene groups, which are then integrated using a Transformer for predicting outcomes. However, this process generates many tokens, which leads to high memory requirements for computing attention and complicates post-hoc interpretability analyses. Instead, we hypothesize that we can: (1) effectively summarize the morphological content of a WSI by condensing its constituting tokens using morphological prototypes, achieving more than 300x compression; and (2) accurately characterize cellular functions by encoding the transcriptomic profile with biological pathway prototypes, all in an unsupervised fashion. The resulting multimodal tokens are then processed by a fusion network, either with a Transformer or an optimal transport cross-alignment, which now operates with a small and fixed number of tokens without approximations. Extensive evaluation on six cancer types shows that our framework outperforms state-of-the-art methods with much less computation while unlocking new interpretability analyses.

Title: Transformer-based Image and Video Inpainting: Current Challenges and Future Directions

Authors: Omar Elharrouss, Rafat Damseh, Abdelkader Nasreddine Belkacem, Elarbi Badidi, Abderrahmane Lakas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Transformer-based Image and Video Inpainting: Current Challenges and Future Directions(https://arxiv.org/abs/)
Keywords: transformer, generative
Abstract: Image inpainting is currently a hot topic within the field of computer vision. It offers a viable solution for various applications, including photographic restoration, video editing, and medical imaging. Deep learning advancements, notably convolutional neural networks (CNNs) and generative adversarial networks (GANs), have significantly enhanced the inpainting task with an improved capability to fill missing or damaged regions in an image or video through the incorporation of contextually appropriate details. These advancements have improved other aspects, including efficiency, information preservation, and achieving both realistic textures and structures. Recently, visual transformers have been exploited and offer some improvements to image or video inpainting. The advent of transformer-based architectures, which were initially designed for natural language processing, has also been integrated into computer vision tasks. These methods utilize self-attention mechanisms that excel in capturing long-range dependencies within data; therefore, they are particularly effective for tasks requiring a comprehensive understanding of the global context of an image or video. In this paper, we provide a comprehensive review of the current image or video inpainting approaches, with a specific focus on transformer-based techniques, with the goal to highlight the significant improvements and provide a guideline for new researchers in the field of image or video inpainting using visual transformers. We categorized the transformer-based techniques by their architectural configurations, types of damage, and performance metrics. Furthermore, we present an organized synthesis of the current challenges, and suggest directions for future research in the field of image or video inpainting.

Title: EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models

Authors: João Matos, Jack Gallifant, Jian Pei, A. Ian Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models(https://arxiv.org/abs/)
Keywords: extraction, large language model
Abstract: Electronic health records (EHRs) contain vast amounts of complex data, but harmonizing and processing this information remains a challenging and costly task requiring significant clinical expertise. While large language models (LLMs) have shown promise in various healthcare applications, their potential for abstracting medical concepts from EHRs remains largely unexplored. We introduce EHRmonize, a framework leveraging LLMs to abstract medical concepts from EHR data. Our study uses medication data from two real-world EHR databases to evaluate five LLMs on two free-text extraction and six binary classification tasks across various prompting strategies. GPT-4o with 10-shot prompting achieved the highest performance in all tasks, accompanied by Claude-3.5-Sonnet in a subset of tasks. GPT-4o achieved an accuracy of 97% in identifying generic route names, 82% for generic drug names, and 100% in performing binary classification of antibiotics. While EHRmonize significantly enhances efficiency, reducing annotation time by an estimated 60%, we emphasize that clinician oversight remains essential. Our framework, available as a Python package, offers a promising tool to assist clinicians in EHR data abstraction, potentially accelerating healthcare research and improving data harmonization processes.

Title: SBOM.EXE: Countering Dynamic Code Injection based on Software Bill of Materials in Java

Authors: Aman Sharma, Martin Wittlinger, Benoit Baudry, Martin Monperrus
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] SBOM.EXE: Countering Dynamic Code Injection based on Software Bill of Materials in Java(https://arxiv.org/abs/)
Keywords: attack
Abstract: Software supply chain attacks have become a significant threat as software development increasingly relies on contributions from multiple, often unverified sources. The code from unverified sources does not pose a threat until it is executed. Log4Shell is a recent example of a supply chain attack that processed a malicious input at runtime, leading to remote code execution. It exploited the dynamic class loading facilities of Java to compromise the runtime integrity of the application. Traditional safeguards can mitigate supply chain attacks at build time, but they have limitations in mitigating runtime threats posed by dynamically loaded malicious classes. This calls for a system that can detect these malicious classes and prevent their execution at runtime. This paper introduces SBOM.EXE, a proactive system designed to safeguard Java applications against such threats. SBOM.EXE constructs a comprehensive allowlist of permissible classes based on the complete software supply chain of the application. This allowlist is enforced at runtime, blocking any unrecognized or tampered classes from executing. We assess SBOM.EXE's effectiveness by mitigating 3 critical CVEs based on the above threat. We run our tool with 3 open-source Java applications and report that our tool is compatible with real-world applications with minimal performance overhead. Our findings demonstrate that SBOM.EXE can effectively maintain runtime integrity with minimal performance impact, offering a novel approach to fortifying Java applications against dynamic classloading attacks.

Title: DiffuseDef: Improved Robustness to Adversarial Attacks

Authors: Zhenhao Li, Marek Rei, Lucia Specia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] DiffuseDef: Improved Robustness to Adversarial Attacks(https://arxiv.org/abs/)
Keywords: defense, attack, robust, diffusion
Abstract: Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over different existing adversarial defense methods and achieves state-of-the-art performance against common adversarial attacks.

Title: Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription

Authors: Jaydeep Borkar, David A. Smith
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription(https://arxiv.org/abs/)
Keywords: transformer
Abstract: Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest it is not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.

Title: Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review

Authors: Moseli Mots'oehli
Subjects: cs.CV, cs.ET
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review(https://arxiv.org/abs/)
Keywords: segmentation
Abstract: While supervised learning has achieved significant success in computer vision tasks, acquiring high-quality annotated data remains a bottleneck. This paper explores both scholarly and non-scholarly works in AI-assistive deep learning image annotation systems that provide textual suggestions, captions, or descriptions of the input image to the annotator. This potentially results in higher annotation efficiency and quality. Our exploration covers annotation for a range of computer vision tasks including image classification, object detection, regression, instance and semantic segmentation, and pose estimation. We review various datasets and how they contribute to the training and evaluation of AI-assistive annotation systems. We also examine methods leveraging neuro-symbolic learning, deep active learning, and self-supervised learning algorithms that enable semantic image understanding and generate free-text output. These include image captioning, visual question answering, and multi-modal reasoning. Despite the promising potential, there is limited publicly available work on AI-assistive image annotation with textual output capabilities. We conclude by suggesting future research directions to advance this field, emphasizing the need for more publicly accessible datasets and collaborative efforts between academia and industry.

Title: A deep neural network framework for dynamic multi-valued mapping estimation and its applications

Authors: Geng Li, Di Qiu, Lok Ming Lui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A deep neural network framework for dynamic multi-valued mapping estimation and its applications(https://arxiv.org/abs/)
Keywords: generative
Abstract: This paper addresses the problem of modeling and estimating dynamic multi-valued mappings. While most mathematical models provide a unique solution for a given input, real-world applications often lack deterministic solutions. In such scenarios, estimating dynamic multi-valued mappings is necessary to suggest different reasonable solutions for each input. This paper introduces a deep neural network framework incorporating a generative network and a classification component. The objective is to model the dynamic multi-valued mapping between the input and output by providing a reliable uncertainty measurement. Generating multiple solutions for a given input involves utilizing a discrete codebook comprising finite variables. These variables are fed into a generative network along with the input, producing various output possibilities. The discreteness of the codebook enables efficient estimation of the output's conditional probability distribution for any given input using a classifier. By jointly optimizing the discrete codebook and its uncertainty estimation during training using a specially designed loss function, a highly accurate approximation is achieved. The effectiveness of our proposed framework is demonstrated through its application to various imaging problems, using both synthetic and real imaging data. Experimental results show that our framework accurately estimates the dynamic multi-valued mapping with uncertainty estimation.

Title: SolarSAM: Building-scale Photovoltaic Potential Assessment Based on Segment Anything Model (SAM) and Remote Sensing for Emerging City

Authors: Guohao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] SolarSAM: Building-scale Photovoltaic Potential Assessment Based on Segment Anything Model (SAM) and Remote Sensing for Emerging City(https://arxiv.org/abs/)
Keywords: segmentation
Abstract: Driven by advancements in photovoltaic (PV) technology, solar energy has emerged as a promising renewable energy source, due to its ease of integration onto building rooftops, facades, and windows. For emerging cities, the lack of detailed street-level data presents a challenge for effectively assessing the potential of building-integrated photovoltaic (BIPV). To address this, this study introduces SolarSAM, a novel BIPV evaluation method that leverages remote sensing imagery and deep learning techniques, and an emerging city in northern China is utilized to validate the model performance. During the process, SolarSAM segmented various building rooftops using text prompt guided semantic segmentation. Separate PV models were then developed for Rooftop PV, Facade-integrated PV, and PV windows systems, using this segmented data and local climate information. The potential for BIPV installation, solar power generation, and city-wide power self-sufficiency were assessed, revealing that the annual BIPV power generation potential surpassed the city's total electricity consumption by a factor of 2.5. Economic and environmental analyses were also conducted, including levelized cost of electricity and carbon reduction calculations, comparing different BIPV systems across various building categories. These findings demonstrated the model's performance and revealed the potential of BIPV power generation in the future.

Title: Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck

Authors: Yangzhou Jiang, Yinxin Lin, Yaoming Wang, Teng Li, Bilian Ke, Bingbing Ni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck(https://arxiv.org/abs/)
Keywords: robust
Abstract: Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the intensive human annotation requirement inhibits current methods from achieving industrial-level accuracy and robustness. Although current unsupervised pre-training frameworks have achieved success in many image recognition tasks, due to the deep coupling between facial and eye features, such frameworks are still deficient in extracting useful gaze features from full-face images. To alleviate the above limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework, which forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations, through collaborative feature contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection bottleneck design that successfully encourages the model to pay more attention to gaze direction rather than facial textures only, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related information contrastive loss has been designed to further boost the learned representation by forcing the model to focus on eye-centered regions. Extensive experimental results on several gaze benchmarks demonstrate that the proposed scheme achieves superior performance over the unsupervised state of the art.

Title: OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

Authors: Adam Sun, Tiange Xiang, Scott Delp, Li Fei-Fei, Ehsan Adeli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] OccFusion: Rendering Occluded Humans with Generative Diffusion Priors(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes efficient 3D Gaussian splatting supervised by pretrained 2D diffusion models for efficient and high-fidelity human rendering. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, 3D human Gaussians are optimized with additional supervision by Score-Distillation Sampling (SDS) to create a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting is designed to further improve rendering quality on the less observed human body parts. We evaluate OccFusion on ZJU-MoCap and challenging OcMotion sequences and find that it achieves state-of-the-art performance in the rendering of occluded humans.

Title: LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods

Authors: Zhenhua Wang, Guang Xu, Ming Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods(https://arxiv.org/abs/)
Keywords: large language model
Abstract: With the ascent of large language models (LLMs), natural language processing has witnessed enhancements, such as LLM-based data augmentation. Nonetheless, prior research harbors two primary concerns: firstly, a lack of contemplation regarding whether the natural language generated by LLMs (LLMNL) truly aligns with human natural language (HNL), a critical foundational question; secondly, an oversight that augmented data is randomly generated by LLMs, implying that not all data may possess equal training value, which could impede the performance of classifiers. To address these challenges, we introduce the scaling laws to intrinsically calculate LLMNL and HNL. Through extensive experiments, we reveal slight deviations (approximately 0.2 Mandelbrot exponent) from Mandelbrot's law in LLMNL, underscore a complexity advantage in HNL, and supplement an interpretive discussion on language style. This establishes a solid foundation for LLMs' expansion. Further, we introduce a novel data augmentation method for few-shot text classification, termed ZGPTDA, which leverages fuzzy computing mechanisms driven by the conformity to scaling laws to make decisions about GPT-4 augmented data. Extensive experiments, conducted in real-world scenarios, confirm the effectiveness (improving F1 of BERT and RoBERTa by 7-10%) and competitiveness (surpassing recent AugGPT and GENCO methods by about 2% accuracy on DeBERTa) of ZGPTDA. In addition, we reveal some interesting insights, e.g., Hilberg's law and Taylor's law can impart more benefits to text classification.

Title: Dual-view Aware Smart Contract Vulnerability Detection for Ethereum

Authors: Jiacheng Yao, Maolin Wang, Wanqi Chen, Chengxiang Jin, Jiajun Zhou, Shanqing Yu, Qi Xuan
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Dual-view Aware Smart Contract Vulnerability Detection for Ethereum(https://arxiv.org/abs/)
Keywords: security
Abstract: The wide application of Ethereum technology has brought technological innovation to traditional industries. As one of Ethereum's core applications, smart contracts utilize diverse contract codes to meet various functional needs and have gained widespread use. However, the non-tamperability of smart contracts, coupled with vulnerabilities caused by natural flaws or human errors, has brought unprecedented challenges to blockchain security. Therefore, in order to ensure the healthy development of blockchain technology and the stability of the blockchain community, it is particularly important to study the vulnerability detection techniques for smart contracts. In this paper, we propose a Dual-view Aware Smart Contract Vulnerability Detection Framework named DVDet. The framework initially converts the source code and bytecode of smart contracts into weighted graphs and control flow sequences, capturing potential risk features from these two perspectives and integrating them for analysis, ultimately achieving effective contract vulnerability detection. Comprehensive experiments on the Ethereum dataset show that our method outperforms others in detecting vulnerabilities.

Title: Iterative Data Augmentation with Large Language Models for Aspect-based Sentiment Analysis

Authors: Haiyun Li, Qihuang Zhong, Ke Zhu, Juhua Liu, Bo Du, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Iterative Data Augmentation with Large Language Models for Aspect-based Sentiment Analysis(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Aspect-based Sentiment Analysis (ABSA) is an important sentiment analysis task, which aims to determine the sentiment polarity towards an aspect in a sentence. Due to the expensive and limited labeled data, data augmentation (DA) has become the standard for improving the performance of ABSA. However, current DA methods usually have some shortcomings: 1) poor fluency and coherence, 2) lack of diversity of generated data, and 3) reliance on some existing labeled data, hindering their application in real-world scenarios. In response to these problems, we propose a systematic Iterative Data augmentation framework, namely IterD, to boost the performance of ABSA. The core of IterD is to leverage the powerful ability of large language models (LLMs) to iteratively generate more fluent and diverse synthetic labeled data, starting from an unsupervised sentence corpus. Extensive experiments on 4 widely-used ABSA benchmarks show that IterD brings consistent and significant performance gains among 5 baseline ABSA models. More encouragingly, the synthetic data generated by IterD can achieve comparable or even better performance than the manually annotated data.

Title: Resource Allocation and Secure Wireless Communication in the Large Model-based Mobile Edge Computing System

Authors: Zefan Wang, Yitong Wang, Jun Zhao
Subjects: cs.CR, cs.SI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Resource Allocation and Secure Wireless Communication in the Large Model-based Mobile Edge Computing System(https://arxiv.org/abs/)
Keywords: secure, security, privacy
Abstract: With the rapid advancement of large models and mobile edge computing, transfer learning, particularly through fine-tuning, has become crucial for adapting models to downstream tasks. Traditionally, this requires users to share their data with model owners for fine-tuning, which is not only costly but also raises significant privacy concerns. Furthermore, fine-tuning large-scale models is computationally intensive and often impractical for many users. To tackle these challenges, we introduce a system that combines offsite-tuning with physical-layer security, which provides local data owners with a lightweight adapter and a compressed emulator. Data owners then fine-tune the adapter locally and securely send it back to the model owners through a confidential channel for integration, ensuring privacy and resource conservation. Our paper focuses on optimizing computational resource allocation among data owners and the large model owner deployed on edge, and on the compression ratio of adapters. We incorporate a secrecy uplink channel to maximize the utility that we defined while minimizing system costs like energy consumption and delay. The optimization uses the Dinkelbach algorithm, fractional programming, successive convex approximation and alternating optimization. Experiments demonstrate our algorithm's superiority over existing methods.

Title: PhyTracker: An Online Tracker for Phytoplankton

Authors: Yang Yu, Qingxuan Lv, Yuezun Li, Zhiqiang Wei, Junyu Dong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] PhyTracker: An Online Tracker for Phytoplankton(https://arxiv.org/abs/)
Keywords: extraction
Abstract: Phytoplankton, a crucial component of aquatic ecosystems, requires efficient monitoring to understand marine ecological processes and environmental conditions. Traditional phytoplankton monitoring methods, relying on non-in situ observations, are time-consuming and resource-intensive, limiting timely analysis. To address these limitations, we introduce PhyTracker, an intelligent in situ tracking framework designed for automatic tracking of phytoplankton. PhyTracker overcomes significant challenges unique to phytoplankton monitoring, such as constrained mobility within water flow, inconspicuous appearance, and the presence of impurities. Our method incorporates three innovative modules: a Texture-enhanced Feature Extraction (TFE) module, an Attention-enhanced Temporal Association (ATA) module, and a Flow-agnostic Movement Refinement (FMR) module. These modules enhance feature capture, differentiate between phytoplankton and impurities, and refine movement characteristics, respectively. Extensive experiments on the PMOT dataset validate the superiority of PhyTracker in phytoplankton tracking, and additional tests on the MOT dataset demonstrate its general applicability, outperforming conventional tracking methods. This work highlights key differences between phytoplankton and traditional objects, offering an effective solution for phytoplankton monitoring.

Title: Financial Knowledge Large Language Model

Authors: Cehao Yang, Chengjin Xu, Yiyan Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Financial Knowledge Large Language Model(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Artificial intelligence is making significant strides in the finance industry, revolutionizing how data is processed and interpreted. Among these technologies, large language models (LLMs) have demonstrated substantial potential to transform financial services by automating complex tasks, enhancing customer service, and providing detailed financial analysis. First, we introduce IDEA-FinBench, an evaluation benchmark specifically tailored for assessing financial knowledge in LLMs. This benchmark uses questions from two globally respected and authoritative financial professional exams, aiming to comprehensively evaluate the capability of LLMs to directly address exam questions pertinent to the finance sector. Second, we propose IDEA-FinKER, a Financial Knowledge Enhancement framework designed to facilitate the rapid adaptation of general LLMs to the financial domain, introducing a retrieval-based few-shot learning method for real-time context-level knowledge injection, and a set of high-quality financial knowledge instructions for fine-tuning any general LLM. Finally, we present IDEA-FinQA, a financial question-answering system powered by LLMs. This system is structured around a scheme of real-time knowledge injection and factual enhancement using external knowledge. IDEA-FinQA comprises three main modules: the data collector, the data querying module, and LLM-based agents tasked with specific functions.

Title: SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Authors: Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, Yinda Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix(https://arxiv.org/abs/)
Keywords: generative
Abstract: Video generation models have demonstrated great capabilities in producing impressive monocular videos; however, the generation of 3D stereoscopic video remains under-explored. We propose a pose-free and training-free approach for generating 3D stereoscopic videos using an off-the-shelf monocular video generation model. Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth, and employs a novel frame matrix video inpainting framework. The framework leverages the video generation model to inpaint frames observed from different timestamps and views. This effective approach generates consistent and semantically coherent stereoscopic videos without scene optimization or model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, including Sora [4], Lumiere [2], WALT [8], and Zeroscope [42]. The experiments demonstrate that our method achieves a significant improvement over previous methods. The code will be released at this https URL.

Title: How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models

Authors: Jaeyoung Lee, Ximing Lu, Jack Hessel, Faeze Brahman, Youngjae Yu, Yonatan Bisk, Yejin Choi, Saadia Gabriel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Given the growing influx of misinformation across news and social media, there is a critical need for systems that can provide effective real-time verification of news claims. Verification based on large language or multimodal models has been proposed to scale up online policing mechanisms for mitigating the spread of false and harmful content. While these can potentially reduce the burden on human fact-checkers, such efforts may be hampered by foundation model training data becoming outdated. In this work, we test the limits of improving foundation model performance without continual updating through an initial study of knowledge transfer using either existing intra- and inter-domain benchmarks or explanations generated from large language models (LLMs). We evaluate on 12 public benchmarks for fact-checking and misinformation detection as well as two other tasks relevant to content moderation -- toxicity and stance detection. Our results on two recent multi-modal fact-checking benchmarks, Mocheg and Fakeddit, indicate that knowledge transfer strategies can improve Fakeddit performance over the state-of-the-art by up to 1.7% and Mocheg performance by up to 2.9%.

Title: The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention

Authors: Yixin Wan, Di Wu, Haoran Wang, Kai-Wei Chang
Subjects: cs.CL, cs.AI, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention(https://arxiv.org/abs/)
Keywords: fair, large language model
Abstract: Prompt-based "diversity interventions" are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals with various racial or gender traits. However, will this strategy result in a nonfactual demographic distribution, especially when generating real historical figures? In this work, we propose DemOgraphic FActualIty Representation (DoFaiR), a benchmark to systematically quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756 meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported evaluation pipeline. Experiments on DoFaiR reveal that diversity-oriented instructions increase the number of different gender and racial groups in DALLE-3's generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose Fact-Augmented Intervention (FAI), which instructs a Large Language Model (LLM) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orienting model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions while preserving diversity.

Title: Query-Efficient Hard-Label Black-Box Attack against Vision Transformers

Authors: Chao Zhou, Xiaowen Shi, Yuan-Gen Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Query-Efficient Hard-Label Black-Box Attack against Vision Transformers(https://arxiv.org/abs/)
Keywords: security, attack, transformer
Abstract: Recent studies have revealed that vision transformers (ViTs) face security risks from adversarial attacks similar to those faced by deep convolutional neural networks (CNNs). However, directly applying attack methods designed for CNNs to ViTs has been demonstrated to be ineffective, since ViTs typically work on patch-wise encoding. This article explores the vulnerability of ViTs against adversarial attacks under a black-box scenario, and proposes a novel query-efficient hard-label adversarial attack method called AdvViT. Specifically, considering that ViTs are highly sensitive to patch modification, we propose to optimize the adversarial perturbation on the individual patches. To reduce the dimension of the perturbation search space, we modify only a handful of low-frequency components of each patch. Moreover, we design a weight mask matrix for all patches to further optimize the perturbation on different regions of a whole image. We test six mainstream ViT backbones on the ImageNet-1k dataset. Experimental results show that, compared with the state-of-the-art attacks on CNNs, our AdvViT achieves much lower $L_2$-norm distortion under the same query budget, sufficiently validating the vulnerability of ViTs against adversarial attacks.

Title: Advancing Process Verification for Large Language Models via Tree-Based Preference Learning

Authors: Mingqian He, Yongliang Shen, Wenqi Zhang, Zeqi Tan, Weiming Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Advancing Process Verification for Large Language Models via Tree-Based Preference Learning(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable potential in handling complex reasoning tasks by generating step-by-step rationales. Some methods have proven effective in boosting accuracy by introducing extra verifiers to assess these paths. However, existing verifiers, typically trained on binary-labeled reasoning paths, fail to fully utilize the relative merits of intermediate steps, thereby limiting the effectiveness of the feedback provided. To overcome this limitation, we propose Tree-based Preference Learning Verifier (Tree-PLV), a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training. Compared to traditional binary classification, step-level preferences more finely capture the nuances between reasoning steps, allowing for a more precise evaluation of the complete reasoning path. We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing benchmarks. For instance, Tree-PLV achieved substantial performance gains over the Mistral-7B self-consistency baseline on GSM8K (67.55% to 82.79%), MATH (17.00% to 26.80%), CSQA (68.14% to 72.97%), and StrategyQA (82.86% to 83.25%). Additionally, our study explores the appropriate granularity for applying preference learning, revealing that step-level guidance provides feedback that better aligns with the evaluation of the reasoning process.

Title: A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model

Authors: Kausik Bhattacharya, Anubhab Majumder, Amaresh Chakrabarti
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Representation of systems using the SAPPhIRE model of causality can be an inspirational stimulus in design. However, creating a SAPPhIRE model of a technical or a natural system requires sourcing technical knowledge from multiple technical documents regarding how the system works. This research investigates how to generate technical content accurately relevant to the SAPPhIRE model of causality using a Large Language Model (LLM). This paper, which is the first part of a two-part study, presents a method for hallucination suppression using Retrieval-Augmented Generation with an LLM to generate technical content supported by the scientific information relevant to a SAPPhIRE construct. The results show that the selection of reference knowledge used in providing context to the LLM for generating the technical content is very important. The outcome of this research is used to build a software support tool to generate the SAPPhIRE model of a given technical system.

Title: Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

Authors: Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, Reut Tsarfaty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Improvements in language models' capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of "long-context", defined simply by the total length of the model's input, including - for example - Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, are severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can implement more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter context.

Title: Parametric Primitive Analysis of CAD Sketches with Vision Transformer

Authors: Xiaogang Wang, Liang Wang, Hongyu Wu, Guoqiang Xiao, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Parametric Primitive Analysis of CAD Sketches with Vision Transformer(https://arxiv.org/abs/)
Keywords: interpretability, transformer
Abstract: The design and analysis of Computer-Aided Design (CAD) sketches play a crucial role in industrial product design, primarily involving CAD primitives and their inter-primitive constraints. To address challenges related to error accumulation in autoregressive models and the complexities associated with self-supervised model design for this task, we propose a two-stage network framework. This framework consists of a primitive network and a constraint network, transforming the sketch analysis task into a set prediction problem to enhance the effective handling of primitives and constraints. By decoupling target types from parameters, the model gains increased flexibility and optimization while reducing complexity. Additionally, the constraint network incorporates a pointer module to explicitly indicate the relationship between constraint parameters and primitive indices, enhancing interpretability and performance. Qualitative and quantitative analyses on two publicly available datasets demonstrate the superiority of this method.

Title: Explainability of Machine Learning Models under Missing Data

Authors: Tuan L. Vo, Thu Nguyen, Hugo L. Hammer, Michael A. Riegler, Pal Halvorsen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Explainability of Machine Learning Models under Missing Data(https://arxiv.org/abs/)
Keywords: robust, interpretability, explainability
Abstract: Missing data is a prevalent issue that can significantly impair model performance and interpretability. This paper briefly summarizes the development of the field of missing data with respect to Explainable Artificial Intelligence and experimentally investigates the effects of various imputation methods on the calculation of Shapley values, a popular technique for interpreting complex machine learning models. We compare different imputation strategies and assess their impact on feature importance and interaction as determined by Shapley values. We also theoretically analyze the effects of missing values on Shapley values. Importantly, our findings reveal that the choice of imputation method can introduce biases that could lead to changes in the Shapley values, thereby affecting the interpretability of the model. Moreover, a lower test prediction mean square error (MSE) may not imply a lower MSE in Shapley values, and vice versa. Also, while XGBoost can handle missing data directly, using XGBoost directly on missing data can seriously affect interpretability compared to imputing the data before training XGBoost. This study provides a comprehensive evaluation of imputation methods in the context of model interpretation, offering practical guidance for selecting appropriate techniques based on dataset characteristics and analysis objectives. The results underscore the importance of considering imputation effects to ensure robust and reliable insights from machine learning models.

Title: Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs

Authors: Tamzeed Mahfuz, Satak Kumar Dey, Ruwad Naswan, Hasnaen Adil, Khondker Salman Sayeed, Haz Sameen Shahgir
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Each new generation of English-oriented Large Language Models (LLMs) exhibits enhanced cross-lingual transfer capabilities and significantly outperforms older LLMs on low-resource languages. This prompts the question: Is there a need for LLMs dedicated to a particular low-resource language? We aim to explore this question for Bengali, a low-to-moderate resource Indo-Aryan language native to the Bengal region of South Asia. We compare the performance of open-weight and closed-source LLMs such as LLaMA-3 and GPT-4 against fine-tuned encoder-decoder models across a diverse set of Bengali downstream tasks, including translation, summarization, paraphrasing, question-answering, and natural language inference. Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent. Key challenges include inefficient tokenization of Bengali script by existing LLMs, leading to increased computational costs and potential performance degradation. Additionally, we highlight biases in machine-translated datasets commonly used for Bengali NLP tasks. We conclude that there is a significant need for a Bengali-oriented LLM, but the field currently lacks the high-quality pretraining and instruction-tuning datasets necessary to develop a highly effective model.

Title: Obtaining $(\epsilon,\delta)$-differential privacy guarantees when using a Poisson mechanism to synthesize contingency tables

Authors: James Jackson, Robin Mitra, Brian Francis, Iain Dove
Subjects: cs.CR, stat.ME
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Obtaining $(\epsilon,\delta)$-differential privacy guarantees when using a Poisson mechanism to synthesize contingency tables(https://arxiv.org/abs/)
Keywords: privacy, protect
Abstract: We show that differential privacy type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain $(\epsilon, \delta)$-probabilistic differential privacy guarantees via the Poisson distribution's cumulative distribution function. We demonstrate this empirically with the synthesis of an administrative-type confidential database.

Title: eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey

Authors: Krzysztof Nowak, Jędrzej Ziębura, Krzysztof Wróbel, Aleksander Smywiński-Pohl
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey(https://arxiv.org/abs/)
Keywords: transformer
Abstract: This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. Using the Transformers library, these models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models' performance, addressing challenges such as orthographic variations and the integration of Latinized vernacular terms. The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.

Title: Time Series Clustering with General State Space Models via Stochastic Variational Inference

Authors: Ryoichi Ishizuka, Takashi Imai, Kaoru Kawamoto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Time Series Clustering with General State Space Models via Stochastic Variational Inference(https://arxiv.org/abs/)
Keywords: interpretability
Abstract: In this paper, we propose a novel method of model-based time series clustering with mixtures of general state space models (MSSMs). Each component of an MSSM is associated with one cluster. An advantage of the proposed method is that it enables the use of time series models appropriate to the specific time series. This not only improves clustering and prediction accuracy but also enhances the interpretability of the estimated parameters. The parameters of the MSSMs are estimated using stochastic variational inference, a subtype of variational inference. The proposed method estimates the latent variables of an arbitrary state space model by using neural networks with a normalizing flow as a variational estimator. The number of clusters can be estimated using the Bayesian information criterion. In addition, to prevent MSSMs from converging to a local optimum, we propose several optimization tricks, including an additional penalty term called entropy annealing. Experiments on simulated datasets show that the proposed method is effective for clustering, parameter estimation, and estimating the number of clusters.

Title: A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Authors: Peiqin Lin, André F. T. Martins, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.

Title: AI Age Discrepancy: A Novel Parameter for Frailty Assessment in Kidney Tumor Patients

Authors: Jayant Siva, Angelica Bartholomew, Clara Goebel, Gabriel Wallerstein-King, Beatriz López Morato, Nicholas Heller, Jason Scovell, Rebecca Campbell, Andrew Wood, Michal Ozery-Flato, Vesna Barros, Maria Gabrani, Michal Rosen-Zvi, Resha Tejpaul, Vidhyalakshmi Ramesh, Nikolaos Papanikolopoulos, Subodh Regmi, Ryan Ward, Robert Abouassaly, Steven C. Campbell, Erick Remer, Christopher Weight
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] AI Age Discrepancy: A Novel Parameter for Frailty Assessment in Kidney Tumor Patients(https://arxiv.org/abs/)
Keywords: segmentation
Abstract: Kidney cancer is a global health concern, and accurate assessment of patient frailty is crucial for optimizing surgical outcomes. This paper introduces AI Age Discrepancy, a novel metric derived from machine learning analysis of preoperative abdominal CT scans, as a potential indicator of frailty and postoperative risk in kidney cancer patients. This retrospective study of 599 patients from the 2023 Kidney Tumor Segmentation (KiTS) challenge dataset found that a higher AI Age Discrepancy is significantly associated with longer hospital stays and lower overall survival rates, independent of established factors. This suggests that AI Age Discrepancy may provide valuable insights into patient frailty and could thus inform clinical decision-making in kidney cancer treatment.

Title: Self-Translate-Train: A Simple but Strong Baseline for Cross-lingual Transfer of Large Language Models

Authors: Ryokan Ri, Shun Kiyono, Sho Takase
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Self-Translate-Train: A Simple but Strong Baseline for Cross-lingual Transfer of Large Language Models(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Cross-lingual transfer is a promising technique for utilizing data in a source language to improve performance in a target language. However, current techniques often require an external translation system or suffer from suboptimal performance due to over-reliance on cross-lingual generalization of multi-lingual pretrained language models. In this study, we propose a simple yet effective method called Self-Translate-Train. It leverages the translation capability of a large language model to generate synthetic training data in the target language and fine-tunes the model with its own generated data. We evaluate the proposed method on a wide range of tasks and show substantial performance gains across several non-English languages.

Title: pFLFE: Cross-silo Personalized Federated Learning via Feature Enhancement on Medical Image Segmentation

Authors: Luyuan Xie, Manqing Lin, Siyuan Liu, ChenMing Xu, Tianyu Luan, Cong Li, Yuejian Fang, Qingni Shen, Zhonghai Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] pFLFE: Cross-silo Personalized Federated Learning via Feature Enhancement on Medical Image Segmentation(https://arxiv.org/abs/)
Keywords: privacy, federate, segmentation
Abstract: In medical image segmentation, personalized cross-silo federated learning (FL) is becoming popular for utilizing varied data across healthcare settings to overcome data scarcity and privacy concerns. However, existing methods often suffer from client drift, leading to inconsistent performance and delayed training. We propose a new framework, Personalized Federated Learning via Feature Enhancement (pFLFE), designed to mitigate these challenges. pFLFE consists of two main stages: feature enhancement and supervised learning. The first stage improves differentiation between foreground and background features, and the second uses these enhanced features for learning from segmentation masks. We also design an alternative training approach that requires fewer communication rounds without compromising segmentation quality, even with limited communication resources. Through experiments on three medical segmentation tasks, we demonstrate that pFLFE outperforms the state-of-the-art methods.

Title: Open-Source Conversational AI with SpeechBrain 1.0

Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu, Sangeet Sagar, Jarod Duret, Salima Mdhaffar, Gaelle Laperriere, Renato De Mori, Yannick Esteve
Subjects: cs.LG, cs.AI, cs.CL, cs.HC, eess.AS
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Open-Source Conversational AI with SpeechBrain 1.0(https://arxiv.org/abs/)
Keywords: large language model
Abstract: SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

Title: BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science

Authors: Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z. Li, Kaicheng Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Pursuing artificial intelligence for biomedical science, a.k.a. AI Scientist, draws increasing attention, where one common approach is to build a copilot agent driven by Large Language Models (LLMs). However, to evaluate such systems, people rely either on direct Question-Answering (QA) with the LLM itself or on biomedical experiments. How to precisely benchmark biomedical agents from an AI Scientist perspective remains largely unexplored. To this end, we draw inspiration from one of the most important abilities of scientists, understanding the literature, and introduce BioKGBench. In contrast to traditional evaluation benchmarks that only focus on factual QA, where the LLMs are known to have hallucination issues, we first disentangle "Understanding Literature" into two atomic abilities: i) "Understanding" the unstructured text from research papers by performing scientific claim verification, and ii) the ability to interact with structured Knowledge-Graph Question-Answering (KGQA) as a form of "Literature" grounding. We then formulate a novel agent task, dubbed KGCheck, using KGQA and domain-based Retrieval-Augmented Generation (RAG) to identify the factual errors of existing large-scale knowledge graph databases. We collect over two thousand data samples for the two atomic tasks and 225 high-quality annotated samples for the agent task. Surprisingly, we discover that state-of-the-art agents, both those built for daily scenarios and biomedical ones, either fail or perform poorly on our benchmark. We then introduce a simple yet effective baseline, dubbed BKGAgent. On a widely used knowledge graph, we discover over 90 factual errors, which provide scenarios for agents to make discoveries and demonstrate the effectiveness of our approach. The code and data are available at this https URL.

Title: VcLLM: Video Codecs are Secretly Tensor Codecs

Authors: Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, Lisa Wu Wills
Subjects: cs.LG, cs.DC, eess.IV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] VcLLM: Video Codecs are Secretly Tensor Codecs(https://arxiv.org/abs/)
Keywords: large language model
Abstract: As the parameter size of large language models (LLMs) continues to expand, the large memory footprint and high communication bandwidth they require have become significant bottlenecks for the training and inference of LLMs. To mitigate these bottlenecks, various tensor compression techniques have been proposed to reduce the data size, thereby alleviating memory requirements and communication pressure. Our research found that video codecs, despite being originally designed for compressing videos, show excellent efficiency when compressing various types of tensors. We demonstrate that video codecs can be versatile and general-purpose tensor codecs while achieving state-of-the-art compression efficiency in various tasks. We further make use of the hardware video encoding and decoding module available on GPUs to create a framework capable of both inference and training with video codecs repurposed as tensor codecs. This greatly reduces the requirement for memory capacity and communication bandwidth, enabling training and inference of large models on consumer-grade GPUs.

Title: MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises $2,138$ question triplets, totaling $6,414$ distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by $31.73\%$, compared to an average gap of $8.03\%$ in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by $23.09\%$, whereas the gap for previous benchmarks is just $14.64\%$). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
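Note: the abstract does not spell out its "more rigorous metrics". A minimal sketch of one plausible reading follows, assuming a triplet is credited only when the original, perception, and knowledge anchor questions are all answered correctly. The function names and the sample results are hypothetical and are not taken from the paper.

```python
def question_accuracy(results):
    """Fraction of original questions answered correctly, ignoring the anchors."""
    return sum(1 for original, _, _ in results if original) / len(results)


def triplet_accuracy(results):
    """Fraction of triplets in which all three questions are answered correctly.

    Assumption: this all-or-nothing rule stands in for the stricter metric
    that the abstract only mentions in passing.
    """
    return sum(1 for triplet in results if all(triplet)) / len(results)


# Each tuple is (original, perception, knowledge) correctness for one triplet.
# Hypothetical outcomes, for illustration only.
results = [
    (True, True, True),
    (True, False, True),   # right answer without perceiving the image
    (False, False, False),
    (True, True, False),   # right answer without the required knowledge
]

print(question_accuracy(results))  # accuracy on original questions alone
print(triplet_accuracy(results))   # accuracy under the all-or-nothing rule
```

Under this reading, a model that guesses the original question correctly but fails either anchor earns no credit, which is how a stricter metric would separate genuine understanding from lucky answers. The dataset size in the abstract is consistent with this structure: 2,138 triplets of three questions give 6,414 questions.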

Title: MH-pFLGB: Model Heterogeneous personalized Federated Learning via Global Bypass for Medical Image Analysis

Authors: Luyuan Xie, Manqing Lin, ChenMing Xu, Tianyu Luan, Zhipeng Zeng, Wenjun Qian, Cong Li, Yuejian Fang, Qingni Shen, Zhonghai Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MH-pFLGB: Model Heterogeneous personalized Federated Learning via Global Bypass for Medical Image Analysis(https://arxiv.org/abs/)
Keywords: privacy, protect, federate
Abstract: In the evolving application of medical artificial intelligence, federated learning is notable for its ability to protect training data privacy. Federated learning facilitates collaborative model development without the need to share local data from healthcare institutions. Yet, the statistical and system heterogeneity among these institutions poses substantial challenges, which affect the effectiveness of federated learning and hamper the exchange of information between clients. To address these issues, we introduce a novel approach, MH-pFLGB, which employs a global bypass strategy to mitigate the reliance on public datasets and navigate the complexities of non-IID data distributions. Our method enhances traditional federated learning by integrating a global bypass model, which not only shares information among the clients but also serves as part of the network to enhance the performance on each client. Additionally, MH-pFLGB provides a feature fusion module to better combine the local and global features. We validate MH-pFLGB's effectiveness and adaptability through extensive testing on different medical tasks, demonstrating superior performance compared to existing state-of-the-art methods.

Title: Large Language Models for Power Scheduling: A User-Centric Approach

Authors: Thomas Mongaillard, Samson Lasaulce, Othman Hicheur, Chao Zhang, Lina Bariah, Vineeth S. Varma, Hang Zou, Qiyang Zhao, Merouane Debbah
Subjects: cs.CL, eess.SY
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Large Language Models for Power Scheduling: A User-Centric Approach(https://arxiv.org/abs/)
Keywords: large language model
Abstract: While traditional optimization and scheduling schemes are designed to meet fixed, predefined system requirements, future systems are moving toward user-driven approaches and personalized services, aiming to achieve high quality-of-experience (QoE) and flexibility. This challenge is particularly pronounced in wireless and digitalized energy networks, where users' requirements have largely not been taken into consideration due to the lack of a common language between users and machines. The emergence of powerful large language models (LLMs) marks a radical departure from traditional system-centric methods into more advanced user-centric approaches by providing a natural communication interface between users and devices. In this paper, for the first time, we introduce a novel architecture for resource scheduling problems by constructing three LLM agents to convert an arbitrary user's voice request (VRQ) into a resource allocation vector. Specifically, we design an LLM intent recognition agent to translate the request into an optimization problem (OP), an LLM OP parameter identification agent, and an LLM OP solving agent. To evaluate system performance, we construct a database of typical VRQs in the context of electric vehicle (EV) charging. As a proof of concept, we primarily use Llama 3 8B. Through testing with different prompt engineering scenarios, the obtained results demonstrate the efficiency of the proposed architecture. The conducted performance analysis allows key insights to be extracted. For instance, having a larger set of candidate OPs to model the real-world problem might degrade the final performance because of a higher recognition/OP classification noise level. All results and codes are open source.

Title: Navigating the road to automotive cybersecurity compliance

Authors: Franco Oberti, Fabrizio Abrate, Alessandro Savino, Filippo Parisi, Stefano Di Carlo
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Navigating the road to automotive cybersecurity compliance(https://arxiv.org/abs/)
Keywords: security, privacy, protect, attack, robust
Abstract: The automotive industry has evolved significantly since the introduction of the Ford Model T in 1908. Today's vehicles are not merely mechanical constructs; they are integral components of a complex digital ecosystem, equipped with advanced connectivity features powered by Artificial Intelligence and cloud computing technologies. This evolution has enhanced vehicle safety, efficiency, and the overall driving experience. However, it also introduces new challenges, notably in cybersecurity. With the increasing integration of digital technologies, vehicles have become more susceptible to cyber-attacks, prompting significant cybersecurity concerns. These concerns include securing sensitive data, protecting vehicles from unauthorized access, and ensuring user privacy. In response, the automotive industry is compelled to adopt robust cybersecurity measures to safeguard both vehicles and data against potential threats. Legislative frameworks such as UNR155 and UNR156 by the United Nations, along with other international regulations, aim to establish stringent cybersecurity mandates. These regulations require compliance with comprehensive cybersecurity management systems and necessitate regular updates and testing to cope with the evolving nature of cyber threats. The introduction of such regulations highlights the growing recognition of cybersecurity as a critical component of automotive safety and functionality. The future of automotive cybersecurity lies in the continuous development of advanced protective measures and collaborative efforts among all stakeholders, including manufacturers, policymakers, and cybersecurity professionals. Only through such concerted efforts can the industry hope to address the dual goals of innovation in vehicle functionality and stringent security measures against the backdrop of an increasingly interconnected digital landscape.

Title: Towards Massive Multilingual Holistic Bias

Authors: Xiaoqing Ellen Tan, Prangthip Hansanti, Carleigh Wood, Bokai Yu, Christophe Ropers, Marta R. Costa-jussà
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Towards Massive Multilingual Holistic Bias(https://arxiv.org/abs/)
Keywords: robust
Abstract: In the current landscape of automatic language generation, there is a need to understand, evaluate, and mitigate demographic biases as existing models are becoming increasingly multilingual. To address this, we present the initial eight languages from the MASSIVE MULTILINGUAL HOLISTICBIAS (MMHB) dataset and benchmark consisting of approximately 6 million sentences representing 13 demographic axes. We propose an automatic construction methodology to further scale up MMHB sentences in terms of both language coverage and size, leveraging limited human annotation. Our approach utilizes placeholders in multilingual sentence construction and employs a systematic method to independently translate sentence patterns, nouns, and descriptors. Combined with human translation, this technique carefully designs placeholders to dynamically generate multiple sentence variations and significantly reduces the human translation workload. The translation process has been meticulously conducted to avoid an English-centric perspective and include all necessary morphological variations for languages that require them, improving on the original English HOLISTICBIAS. Finally, we utilize MMHB to report results on gender bias and added toxicity in machine translation tasks. On the gender analysis, MMHB unveils: (1) a lack of gender robustness, showing almost +4 chrF points on average for masculine semantic sentences compared to feminine ones, and (2) a preference to overgeneralize to masculine forms, reporting more than +12 chrF points on average when evaluating with masculine compared to feminine references. MMHB triggers added toxicity up to 2.3%.

Title: It's Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization

Authors: Bingdong Li, Zixiang Di, Yanting Yang, Hong Qian, Peng Yang, Hao Hao, Ke Tang, Aimin Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] It's Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization(https://arxiv.org/abs/)
Keywords: robust, large language model
Abstract: In this paper, we introduce a novel approach for large language model merging via black-box multi-objective optimization algorithms. The goal of model merging is to combine multiple models, each excelling in different tasks, into a single model that outperforms any of the individual source models. However, model merging faces two significant challenges: First, existing methods rely heavily on human intuition and customized strategies. Second, parameter conflicts often arise during merging, and while methods like DARE [1] can alleviate this issue, they tend to stochastically drop parameters, risking the loss of important delta parameters. To address these challenges, we propose the MM-MO method, which automates the search for optimal merging configurations using multi-objective optimization algorithms, eliminating the need for human intuition. During the configuration search, we use estimated performance across multiple diverse tasks as optimization objectives in order to alleviate the parameter conflicts between different source models without losing crucial delta parameters. We conducted comparative experiments with other mainstream model merging methods, demonstrating that our method consistently outperforms them. Moreover, our experiments reveal that even task types not explicitly targeted as optimization objectives show performance improvements, indicating that our method enhances the overall potential of the model rather than merely overfitting to specific task types. This approach provides a significant advancement in model merging techniques, offering a robust and plug-and-play solution for integrating diverse models into a unified, high-performing model.
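Note: to make the "stochastically drop parameters" remark concrete, here is a minimal sketch of a DARE-style drop-and-rescale step on flat parameter lists. It illustrates the baseline that the abstract criticizes, not MM-MO itself. The function names, the per-model merge weights, and the toy values are assumptions for illustration; the abstract does not state which configuration variables MM-MO searches over.

```python
import random


def dare_delta(base, finetuned, drop_rate, rng):
    """Randomly zero each delta parameter, then rescale the survivors.

    Each delta (finetuned - base) is dropped with probability drop_rate and
    kept deltas are multiplied by 1 / (1 - drop_rate), so the expected delta
    is unchanged. Requires drop_rate < 1.
    """
    scale = 1.0 / (1.0 - drop_rate)
    return [
        (f - b) * scale if rng.random() >= drop_rate else 0.0
        for b, f in zip(base, finetuned)
    ]


def merge(base, finetuned_models, weights, drop_rate, seed=0):
    """Add weighted, sparsified deltas of several fine-tuned models to the base."""
    rng = random.Random(seed)
    merged = list(base)
    for weight, finetuned in zip(weights, finetuned_models):
        delta = dare_delta(base, finetuned, drop_rate, rng)
        merged = [m + weight * d for m, d in zip(merged, delta)]
    return merged


# Toy parameters (hypothetical): one base model and two fine-tuned variants.
base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 0.0, 0.5, 0.0]
model_b = [0.0, 2.0, 0.5, 0.0]
print(merge(base, [model_a, model_b], weights=[0.5, 0.5], drop_rate=0.5))
```

Because the mask is random, an important delta is as likely to be dropped as an unimportant one, which is the risk the abstract points to. In a search-based approach, quantities such as `weights` and `drop_rate` would be candidate configuration variables, scored by estimated task performance.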

Title: PFME: A Modular Approach for Fine-grained Hallucination Detection and Editing of Large Language Models

Authors: Kunquan Deng, Zeyu Huang, Chen Li, Chenghua Lin, Min Gao, Wenge Rong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] PFME: A Modular Approach for Fine-grained Hallucination Detection and Editing of Large Language Models(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Large Language Models (LLMs) excel in fluency but risk producing inaccurate content, called "hallucinations." This paper outlines a standardized process for categorizing fine-grained hallucination types and proposes an innovative framework--the Progressive Fine-grained Model Editor (PFME)--specifically designed to detect and correct fine-grained hallucinations in LLMs. PFME consists of two collaborative modules: the Real-time Fact Retrieval Module and the Fine-grained Hallucination Detection and Editing Module. The former identifies key entities in the document and retrieves the latest factual evidence from credible sources. The latter further segments the document into sentence-level text and, based on relevant evidence and previously edited context, identifies, locates, and edits each sentence's hallucination type. Experimental results on FavaBench and FActScore demonstrate that PFME outperforms existing methods in fine-grained hallucination detection tasks. Particularly, when using the Llama3-8B-Instruct model, PFME's performance in fine-grained hallucination detection with external knowledge assistance improves by 8.7 percentage points (pp) compared to ChatGPT. In editing tasks, PFME further enhances the FActScore of FActScore-Alpaca13B and FActScore-ChatGPT datasets, increasing by 16.2pp and 4.6pp, respectively.

Title: Graph Neural Networks Gone Hogwild

Authors: Olga Solodova, Nick Richardson, Deniz Oktay, Ryan P. Adams
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Graph Neural Networks Gone Hogwild(https://arxiv.org/abs/)
Keywords: robust
Abstract: Message passing graph neural networks (GNNs) would appear to be powerful tools to learn distributed algorithms via gradient descent, but generate catastrophically incorrect predictions when nodes update asynchronously during inference. This failure under asynchrony effectively excludes these architectures from many potential applications, such as learning local communication policies between resource-constrained agents in, e.g., robotic swarms or sensor networks. In this work we explore why this failure occurs in common GNN architectures, and identify "implicitly-defined" GNNs as a class of architectures which is provably robust to partially asynchronous "hogwild" inference, adapting convergence guarantees from work in asynchronous and distributed optimization, e.g., Bertsekas (1982); Niu et al. (2011). We then propose a novel implicitly-defined GNN architecture, which we call an energy GNN. We show that this architecture outperforms other GNNs from this class on a variety of synthetic tasks inspired by multi-agent systems, and achieves competitive performance on real-world datasets.
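Note: the robustness claim rests on a classical property, namely that when each node's update is a contraction, the fixed point is reached regardless of update order. Below is a minimal sketch of that property, assuming a linear update x_i = sum_j W_ij x_j + b_i with row sums of |W| below 1. It uses randomized one-node-at-a-time updates as a simplified stand-in for true asynchrony (no stale reads are modeled). The weights and function names are hypothetical, and this is not the energy GNN from the paper.

```python
import random


def update_node(i, x, weights, bias):
    """Recompute node i from the current states of all nodes."""
    return sum(w * xj for w, xj in zip(weights[i], x)) + bias[i]


def synchronous(weights, bias, sweeps):
    """All nodes update together from the same snapshot (standard inference)."""
    x = [0.0] * len(bias)
    for _ in range(sweeps):
        x = [update_node(i, x, weights, bias) for i in range(len(bias))]
    return x


def asynchronous(weights, bias, updates, seed=0):
    """One randomly chosen node updates at a time, in no fixed order."""
    rng = random.Random(seed)
    x = [0.0] * len(bias)
    for _ in range(updates):
        i = rng.randrange(len(bias))
        x[i] = update_node(i, x, weights, bias)
    return x


# Hypothetical 3-node graph; every row of absolute weights sums to at most 0.5,
# so the update map is a contraction in the max norm.
weights = [[0.0, 0.3, 0.2], [0.1, 0.0, 0.4], [0.3, 0.2, 0.0]]
bias = [1.0, 2.0, 3.0]

sync_state = synchronous(weights, bias, sweeps=100)
async_state = asynchronous(weights, bias, updates=2000)
print(max(abs(s - a) for s, a in zip(sync_state, async_state)))
```

Both schedules settle on the same state because the output is defined implicitly as the fixed point of the update rather than as the result of a fixed number of synchronized layers.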

Title: LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Authors: Jiahao Ying, Mingbao Lin, Yixin Cao, Wei Tang, Bo Wang, Qianru Sun, Xuanjing Huang, Shuicheng Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement(https://arxiv.org/abs/)
Keywords: large language model
Abstract: This paper introduces the innovative "LLMs-as-Instructors" framework, which leverages the advanced Large Language Models (LLMs) to autonomously enhance the training of smaller target models. Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model, facilitating targeted and efficient training cycles. Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors. Our empirical studies, conducted with several open-source models, demonstrate significant improvements across multiple benchmarks, including mathematical reasoning, coding abilities, and factual knowledge. Notably, the refined Llama-3-8b-Instruction has outperformed ChatGPT, illustrating the effectiveness of our approach. By leveraging the strengths of both strategies, we have attained a more balanced performance improvement on both in-domain and out-of-domain benchmarks. Our code can be found at this https URL.

Title: ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees

Authors: Zhiyuan Wang, Jinhao Duan, Lu Cheng, Yue Zhang, Qingni Wang, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the intricate nature of the recent large language models (LLMs). This study investigates adapting conformal prediction (CP), which can convert any heuristic measure of uncertainty into rigorous theoretical guarantees by constructing prediction sets, for black-box LLMs in open-ended NLG tasks. We propose a sampling-based uncertainty measure leveraging self-consistency and develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the design of the CP algorithm. Experimental results indicate that our uncertainty measure generally surpasses prior state-of-the-art methods. Furthermore, we calibrate the prediction sets within the model's unfixed answer distribution and achieve strict control over the correctness coverage rate across 6 LLMs on 4 free-form NLG datasets, spanning general-purpose and medical domains, while the small average set size further highlights the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.
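Note: the generic split conformal prediction recipe that the abstract builds on can be sketched in a few lines. This is the textbook procedure, not the ConU criterion itself; the uncertainty scores, candidate answers, and function names below are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal threshold for miscoverage level alpha.

    calibration_scores holds the uncertainty score of a correct answer for
    each calibration question. The threshold is the score at the
    finite-sample corrected rank ceil((n + 1) * (1 - alpha)).
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose uncertainty does not exceed the threshold."""
    return [answer for answer, score in candidate_scores.items() if score <= threshold]


# Hypothetical calibration scores and sampled candidate answers.
calibration = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
threshold = conformal_threshold(calibration, alpha=0.5)
candidates = {"answer A": 0.2, "answer B": 0.5, "answer C": 0.9}
print(threshold, prediction_set(candidates, threshold))
```

If calibration and test questions are exchangeable, sets built this way contain a correct answer with probability at least 1 - alpha. The contribution described in the abstract lies in the choice of uncertainty score and in tying the criterion to correctness, neither of which is reproduced here.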

Title: Aeroengine performance prediction using a physical-embedded data-driven method

Authors: Tong Mo, Shiran Dai, An Fu, Xiaomeng Zhu, Shuxiao Li
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Aeroengine performance prediction using a physical-embedded data-driven method(https://arxiv.org/abs/)
Keywords: robust
Abstract: Accurate and efficient prediction of aeroengine performance is of paramount importance for engine design, maintenance, and optimization endeavours. However, existing methodologies often struggle to strike an optimal balance among predictive accuracy, computational efficiency, modelling complexity, and data dependency. To address these challenges, we propose a strategy that synergistically combines domain knowledge from both the aeroengine and neural network realms to enable real-time prediction of engine performance parameters. Leveraging aeroengine domain knowledge, we judiciously design the network structure and regulate the internal information flow. Concurrently, drawing upon neural network domain expertise, we devise four distinct feature fusion methods and introduce an innovative loss function formulation. To rigorously evaluate the effectiveness and robustness of our proposed strategy, we conduct comprehensive validation across two distinct datasets. The empirical results demonstrate: (1) the evident advantages of our tailored loss function; (2) our model's ability to maintain equal or superior performance with a reduced parameter count; (3) our model's reduced data dependency compared to generalized neural network architectures; and (4) our model's greater interpretability compared to traditional black-box machine learning methods.

Title: Toward a Diffusion-Based Generalist for Dense Vision Tasks

Authors: Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Toward a Diffusion-Based Generalist for Dense Vision Tasks(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and have demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance compared to other vision generalists.

Title: Blockchain based Decentralized Petition System

Authors: Jagdeep Kaur, Kevin Antony, Nikhil Pujar, Ankit Jha
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Blockchain based Decentralized Petition System(https://arxiv.org/abs/)
Keywords: secure, security, fair
Abstract: A decentralized online petition system enables individuals or groups to create, sign, and share petitions without a central authority. Using blockchain technology, these systems ensure the integrity and transparency of the petition process by recording every signature or action on the blockchain, making alterations or deletions impossible. This provides a permanent, tamper-proof record of the petition's progress. Such systems allow users to bypass traditional intermediaries like government or social media platforms, fostering more democratic and transparent decision-making. This paper reviews research on petition systems, highlighting the shortcomings of existing systems such as lack of accountability, vulnerability to hacking, and security issues. The proposed blockchain-based implementation aims to overcome these challenges. Decentralized voting systems have garnered interest recently due to their potential to provide secure and transparent voting platforms without intermediaries, addressing issues like voter fraud, manipulation, and trust in the electoral process. We propose a decentralized voting system web application using blockchain technology to ensure the integrity and security of the voting process. This system aims to provide a transparent, decentralized decision-making process that counts every vote while eliminating the need for centralized authorities. The paper presents an overview of the system architecture, design considerations, and implementation details, along with the potential benefits and limitations. Finally, we discuss future research directions, examining the technical aspects of the application, including underlying algorithms and protocols. Our research aims to enhance the integrity and accessibility of democratic processes, improve security, and ensure fairness, transparency, and tamper-proofness.

Title: Privacy-Preserving and Trustworthy Deep Learning for Medical Imaging

Authors: Kiarash Sedghighadikolaei, Attila A Yavuz
Subjects: cs.CR, cs.AI, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Privacy-Preserving and Trustworthy Deep Learning for Medical Imaging(https://arxiv.org/abs/)
Keywords: security, privacy
Abstract: The shift towards efficient and automated data analysis through Machine Learning (ML) has notably impacted healthcare systems, particularly Radiomics. Radiomics leverages ML to analyze medical images accurately and efficiently for precision medicine. Current methods rely on Deep Learning (DL) to improve performance and accuracy (Deep Radiomics). Given the sensitivity of medical images, ensuring privacy throughout the Deep Radiomics pipeline, from data generation and collection to model training and inference, is essential, especially when outsourced. Thus, Privacy-Enhancing Technologies (PETs) are crucial tools for Deep Radiomics. Previous studies and systematization efforts have either broadly overviewed PETs and their applications or mainly focused on subsets of PETs for ML algorithms. In Deep Radiomics, where efficiency, accuracy, and privacy are crucial, many PETs, while theoretically applicable, may not be practical without specialized optimizations or hybrid designs. Additionally, not all DL models are suitable for Radiomics. Consequently, there is a need for specialized studies that investigate and systematize the effective and practical integration of PETs into the Deep Radiomics pipeline. This work addresses this research gap by (1) classifying existing PETs, presenting practical hybrid PET constructions, and providing a taxonomy illustrating their potential integration with the Deep Radiomics pipeline, with comparative analyses detailing assumptions, architectural suitability, and security; (2) offering technical insights into combining PETs into the Deep Radiomics pipeline, including integration strategies, subtleties, and potential challenges; and (3) proposing potential research directions, identifying challenges, and suggesting solutions to enhance the PETs in Deep Radiomics.

Title: Answering real-world clinical questions using large language model based systems

Authors: Yen Sia Low (1), Michael L. Jackson (1), Rebecca J. Hyde (1), Robert E. Brown (1), Neil M. Sanghavi (1), Julian D. Baldwin (1), C. William Pike (1), Jananee Muralidharan (1), Gavin Hui (1 and 2), Natasha Alexander (3), Hadeel Hassan (3), Rahul V. Nene (4), Morgan Pike (5), Courtney J. Pokrzywa (6), Shivam Vedak (7), Adam Paul Yan (3), Dong-han Yao (7), Amy R. Zipursky (3), Christina Dinh (1), Philip Ballentine (1), Dan C. Derieg (1), Vladimir Polony (1), Rehan N. Chawdry (1), Jordan Davies (1), Brigham B. Hyde (1), Nigam H. Shah (1 and 7), Saurabh Gombar (1 and 8) ((1) Atropos Health, New York NY, USA, (2) Department of Medicine, University of California, Los Angeles CA, USA, (3) Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada, (4) Department of Emergency Medicine, University of California, San Diego CA, USA, (5) Department of Emergency Medicine, University of Michigan, Ann Arbor MI, USA, (6) Department of Surgery, Columbia University, New York NY, USA, (7) Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA (8) Department of Pathology, Stanford University, Stanford CA, USA)
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Answering real-world clinical questions using large language model based systems(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.

Title: Explaining Chest X-ray Pathology Models using Textual Concepts

Authors: Vijay Sadashivaiah, Mannudeep K. Kalra, Pingkun Yan, James A. Hendler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Explaining Chest X-ray Pathology Models using Textual Concepts(https://arxiv.org/abs/)
Keywords: interpretability
Abstract: Deep learning models have revolutionized medical imaging and diagnostics, yet their opaque nature poses challenges for clinical adoption and trust. Among approaches to improve model interpretability, concept-based explanations aim to provide concise and human-understandable explanations of any arbitrary classifier. However, such methods usually require a large amount of manually collected data with concept annotation, which is often scarce in the medical domain. In this paper, we propose Conceptual Counterfactual Explanations for Chest X-ray (CoCoX), which leverage the joint embedding space of an existing vision-language model (VLM) to explain black-box classifier outcomes without the need for annotated datasets. Specifically, we utilize textual concepts derived from chest radiography reports and a pre-trained chest radiography-based VLM to explain three common cardiothoracic pathologies. We demonstrate that the explanations generated by our method are semantically meaningful and faithful to underlying pathologies.

Title: Divide And Conquer: Learning Chaotic Dynamical Systems With Multistep Penalty Neural Ordinary Differential Equations

Authors: Dibyajyoti Chakraborty, Seung Whan Chung, Romit Maulik
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Divide And Conquer: Learning Chaotic Dynamical Systems With Multistep Penalty Neural Ordinary Differential Equations(https://arxiv.org/abs/)
Keywords: robust
Abstract: Forecasting high-dimensional dynamical systems is a fundamental challenge in various fields, such as the geosciences and engineering. Neural Ordinary Differential Equations (NODEs), which combine the power of neural networks and numerical solvers, have emerged as a promising algorithm for forecasting complex nonlinear dynamical systems. However, classical techniques used for NODE training are ineffective for learning chaotic dynamical systems. In this work, we propose a novel NODE-training approach that allows for robust learning of chaotic dynamical systems. Our method addresses the challenges of non-convexity and exploding gradients associated with underlying chaotic dynamics. Training data trajectories from such systems are split into multiple, non-overlapping time windows. In addition to the deviation from the training data, the optimization loss term further penalizes the discontinuities of the predicted trajectory between the time windows. The window size is selected based on the fastest Lyapunov time scale of the system. The multistep penalty (MP) method is first demonstrated on the Lorenz equation to illustrate how it improves the loss landscape and thereby accelerates optimization convergence. The MP method can optimize chaotic systems in a manner similar to least-squares shadowing with significantly lower computational costs. Our proposed algorithm, denoted the Multistep Penalty NODE (MP-NODE), is applied to chaotic systems such as the Kuramoto-Sivashinsky equation and the two-dimensional Kolmogorov flow. It is observed that MP-NODE provides viable performance for such chaotic systems, not only for short-term trajectory predictions but also for invariant statistics that are hallmarks of the chaotic nature of these dynamics.
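The windowed loss described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a scalar state, a forward-Euler integrator, and a fixed penalty weight, and the names `euler_rollout`, `multistep_penalty_loss`, `window_starts`, and `mu` are hypothetical. Each window is rolled out from its own initial state, compared with the data inside the window, and penalized for the jump between its predicted end state and the next window's initial state.

```python
def euler_rollout(f, x0, dt, n_steps):
    """Integrate dx/dt = f(x) with forward Euler; returns n_steps + 1 states."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))
    return xs


def multistep_penalty_loss(f, data, window_starts, window_len, dt, mu):
    """Data mismatch inside each window plus a penalty on jumps between windows.

    data          : observed scalar trajectory of length n_windows * window_len
    window_starts : one initial state per window (hypothetical name; in training
                    these would be optimized together with the model)
    mu            : penalty weight on discontinuities (hypothetical name)
    """
    data_term = 0.0
    penalty_term = 0.0
    prev_end = None
    for k, x0 in enumerate(window_starts):
        pred = euler_rollout(f, x0, dt, window_len)
        # Non-overlapping window: data points k*w ... (k+1)*w - 1.
        target = data[k * window_len:(k + 1) * window_len]
        data_term += sum((p - t) ** 2 for p, t in zip(pred[:-1], target))
        # Discontinuity between the previous window's end and this window's start.
        if prev_end is not None:
            penalty_term += (x0 - prev_end) ** 2
        prev_end = pred[-1]
    return data_term + mu * penalty_term
```

When the window starts lie on a single continuous trajectory of the model, the penalty term vanishes and the loss reduces to the ordinary data mismatch.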

Title: OfCaM: Global Human Mesh Recovery via Optimization-free Camera Motion Scale Calibration

Authors: Fengyuan Yang, Kerui Gu, Ha Linh Nguyen, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] OfCaM: Global Human Mesh Recovery via Optimization-free Camera Motion Scale Calibration(https://arxiv.org/abs/)
Keywords: robust
Abstract: Accurate camera motion estimation is critical to estimate human motion in the global space. A standard and widely used method for estimating camera motion is Simultaneous Localization and Mapping (SLAM). However, SLAM only provides a trajectory up to an unknown scale factor. Different from previous attempts that optimize the scale factor, this paper presents Optimization-free Camera Motion Scale Calibration (OfCaM), a novel framework that utilizes prior knowledge from human mesh recovery (HMR) models to directly calibrate the unknown scale factor. Specifically, OfCaM leverages the absolute depth of human-background contact joints from HMR predictions as a calibration reference, enabling the precise recovery of SLAM camera trajectory scale in global space. With this correctly scaled camera motion and HMR's local motion predictions, we achieve more accurate global human motion estimation. To compensate for scenes where we detect SLAM failure, we adopt a local-to-global motion mapping to fuse with previously derived motion to enhance robustness. Simple yet powerful, our method sets a new standard for global human mesh estimation tasks, reducing global human motion error by 60% over the prior SOTA while also demanding orders of magnitude less inference time compared with optimization-based methods.

Title: MasonTigers at SemEval-2024 Task 10: Emotion Discovery and Flip Reasoning in Conversation with Ensemble of Transformers and Prompting

Authors: Al Nahian Bin Emran, Amrita Ganguly, Sadiya Sayara Chowdhury Puspo, Nishat Raihan, Dhiman Goswami
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MasonTigers at SemEval-2024 Task 10: Emotion Discovery and Flip Reasoning in Conversation with Ensemble of Transformers and Prompting(https://arxiv.org/abs/)
Keywords: secure, transformer
Abstract: In this paper, we present MasonTigers' participation in SemEval-2024 Task 10, a shared task aimed at identifying emotions and understanding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for Hindi-English code-mixed dialogues, emotion flip reasoning for Hindi-English code-mixed dialogues, and emotion flip reasoning for English dialogues. Our team, MasonTigers, contributed to each subtask, focusing on developing methods for accurate emotion recognition and reasoning. By leveraging our approaches, we attained impressive F1-scores of 0.78 for the first task and 0.79 for both the second and third tasks. This performance not only underscores the effectiveness of our methods across different aspects of the task but also secured us the top rank in the first and third subtasks, and the 2nd rank in the second subtask. Through extensive experimentation and analysis, we provide insights into our system's performance and contributions to each subtask.

Title: Hyperparameter Optimization for Randomized Algorithms: A Case Study for Random Features

Authors: Oliver R. A. Dunbar, Nicholas H. Nelsen, Maya Mutic
Subjects: cs.LG, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Hyperparameter Optimization for Randomized Algorithms: A Case Study for Random Features(https://arxiv.org/abs/)
Keywords: robust
Abstract: Randomized algorithms exploit stochasticity to reduce computational complexity. One important example is random feature regression (RFR) that accelerates Gaussian process regression (GPR). RFR approximates an unknown function with a random neural network whose hidden weights and biases are sampled from a probability distribution. Only the final output layer is fit to data. In randomized algorithms like RFR, the hyperparameters that characterize the sampling distribution greatly impact performance, yet are not directly accessible from samples. This makes optimization of hyperparameters via standard (gradient-based) optimization tools inapplicable. Inspired by Bayesian ideas from GPR, this paper introduces a random objective function that is tailored for hyperparameter tuning of vector-valued random features. The objective is minimized with ensemble Kalman inversion (EKI). EKI is a gradient-free particle-based optimizer that is scalable to high dimensions and robust to randomness in objective functions. A numerical study showcases the new black-box methodology to learn hyperparameter distributions in several problems that are sensitive to the hyperparameter selection: two global sensitivity analyses, integrating a chaotic dynamical system, and solving a Bayesian inverse problem from atmospheric dynamics. The success of the proposed EKI-based algorithm for RFR suggests its potential for automated optimization of hyperparameters arising in other randomized algorithms.

Title: Your Car Tells Me Where You Drove: A Novel Path Inference Attack via CAN Bus and OBD-II Data

Authors: Tommaso Bianchi, Alessandro Brighente, Mauro Conti, Andrea Valori
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Your Car Tells Me Where You Drove: A Novel Path Inference Attack via CAN Bus and OBD-II Data(https://arxiv.org/abs/)
Keywords: security, privacy, attack
Abstract: Despite its well-known security issues, the Controller Area Network (CAN) is still the main technology for in-vehicle communications. Attackers posing as diagnostic services or accessing the CAN bus can threaten the drivers' location privacy to know the exact location at a certain point in time or to infer the visited areas. This represents a serious threat to users' privacy, but also an advantage for police investigations to gather location-based evidence. In this paper, we present On Path Diagnostic - Intrusion & Inference (OPD-II), a novel path inference attack leveraging a physical car model and a map matching algorithm to infer the path driven by a car based on CAN bus data. Unlike existing attacks, our approach only requires the attacker to know the initial location and heading of the victim's car and is not limited by the availability of training data, road configurations, or the need to access the victim's other devices (e.g., smartphones). We implement our attack on a set of four different cars and a total of 41 tracks in different road and traffic scenarios. We achieve an average of 95% accuracy on reconstructing the coordinates of the recorded path by leveraging a dynamic map-matching algorithm that outperforms the 75% and 89% accuracy values of other proposals while removing their set of assumptions.

Title: GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing

Authors: Yisong Xiao, Aishan Liu, QianJia Cheng, Zhenfei Yin, Siyuan Liang, Jiapeng Li, Jing Shao, Xianglong Liu, Dacheng Tao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing(https://arxiv.org/abs/)
Keywords: fair, diffusion
Abstract: Large Vision-Language Models (LVLMs) have been widely adopted in various applications; however, they exhibit significant gender biases. Existing benchmarks primarily evaluate gender bias at the demographic group level, neglecting individual fairness, which emphasizes equal treatment of similar individuals. This research gap limits the detection of discriminatory behaviors, as individual fairness offers a more granular examination of biases that group fairness may overlook. For the first time, this paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in LVLMs using counterfactual visual questions under individual fairness criteria. To construct this benchmark, we first utilize text-to-image diffusion models to generate occupation images and their gender counterfactuals. Subsequently, we generate corresponding textual occupation options by identifying stereotyped occupation pairs with high semantic similarity but opposite gender proportions in real-world statistics. This method enables the creation of large-scale visual question counterfactuals to expose biases in LVLMs, applicable in both multimodal and unimodal contexts through modifying gender attributes in specific modalities. Overall, our GenderBias-VL benchmark comprises 34,581 visual question counterfactual pairs, covering 177 occupations. Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs (e.g., LLaVA) and state-of-the-art commercial APIs, including GPT-4o and Gemini-Pro. Our findings reveal widespread gender biases in existing LVLMs. Our benchmark offers: (1) a comprehensive dataset for occupation-related gender bias evaluation; (2) an up-to-date leaderboard on LVLM biases; and (3) a nuanced understanding of the biases presented by these models. The dataset and code are available at this https URL.

Title: ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Authors: Quang P.M. Pham, Khoi T.N. Nguyen, Lan C. Ngo, Truong Do, Truong Son Hy
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding(https://arxiv.org/abs/)
Keywords: robust
Abstract: Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision.

Title: Diff-BBO: Diffusion-Based Inverse Modeling for Black-Box Optimization

Authors: Dongxia Wu, Nikki Lijing Kuang, Ruijia Niu, Yi-An Ma, Rose Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Diff-BBO: Diffusion-Based Inverse Modeling for Black-Box Optimization(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Black-box optimization (BBO) aims to optimize an objective function by iteratively querying a black-box oracle. This process demands sample-efficient optimization due to the high computational cost of function evaluations. While prior studies focus on forward approaches to learn surrogates for the unknown objective function, they struggle with high-dimensional inputs where valid inputs form a small subspace (e.g., valid protein sequences), which is common in real-world tasks. Recently, diffusion models have demonstrated impressive capability in learning the high-dimensional data manifold. They have shown promising performance in black-box optimization tasks but only in offline settings. In this work, we propose diffusion-based inverse modeling for black-box optimization (Diff-BBO), the first inverse approach leveraging diffusion models for online BBO problems. Diff-BBO distinguishes itself from forward approaches through the design of its acquisition function. Instead of proposing candidates in the design space, Diff-BBO employs a novel acquisition function, Uncertainty-aware Exploration (UaE), to propose objective function values, which leverages the uncertainty of a conditional diffusion model to generate samples in the design space. Theoretically, we prove that using UaE leads to optimal optimization outcomes. Empirically, we redesign experiments on the Design-Bench benchmark for online settings and show that Diff-BBO achieves state-of-the-art performance.

Title: Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Authors: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu
Subjects: cs.LG, cs.AI, cs.CL, cs.GT
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
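For readers unfamiliar with the Bradley-Terry (BT) assumption that this abstract contrasts with general preferences, the following is a minimal sketch of the standard BT preference probability and its negative log-likelihood. It does not implement INPO; the function names are hypothetical. Under BT, each response has a scalar reward, and the probability that one response is preferred is a logistic function of the reward difference, so preferences are always transitive.

```python
import math


def bt_preference_prob(reward_a, reward_b):
    """P(a preferred over b) under Bradley-Terry: sigmoid(reward_a - reward_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def bt_negative_log_likelihood(pairs):
    """Mean negative log-likelihood over (reward_chosen, reward_rejected) pairs."""
    total = sum(math.log(bt_preference_prob(rc, rr)) for rc, rr in pairs)
    return -total / len(pairs)
```

Because the BT probability depends only on a difference of scalar rewards, no assignment of rewards can represent a cyclic preference such as a over b, b over c, and c over a, which is one motivation for the general preference setting.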

Title: Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness

Authors: Yiquan Li, Zhongzhu Chen, Kun Jin, Jiongxiao Wang, Bo Li, Chaowei Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness(https://arxiv.org/abs/)
Keywords: robust, diffusion, generative
Abstract: Diffusion Purification, purifying noised images with diffusion models, has been widely used for enhancing certified robustness via randomized smoothing. However, existing frameworks often grapple with the balance between efficiency and effectiveness. While the Denoising Diffusion Probabilistic Model (DDPM) offers an efficient single-step purification, it falls short in ensuring purified images reside on the data manifold. Conversely, the Stochastic Diffusion Model effectively places purified images on the data manifold but demands solving cumbersome stochastic differential equations, while its derivative, the Probability Flow Ordinary Differential Equation (PF-ODE), though solving simpler ordinary differential equations, still requires multiple computational steps. In this work, we demonstrate that an ideal purification pipeline should, for effectiveness, generate purified images that lie on the data manifold and are as semantically aligned with the original images as possible, and, for efficiency, do so in one step. Therefore, we introduce Consistency Purification, a purifier that is Pareto-superior in efficiency and effectiveness to previous work. Consistency Purification employs the consistency model, a one-step generative model distilled from PF-ODE, and can thus generate on-manifold purified images with a single network evaluation. However, the consistency model is not designed for purification, so it does not inherently ensure semantic alignment between purified and original images. To resolve this issue, we further refine it through Consistency Fine-tuning with LPIPS loss, which enables more aligned semantic meaning while keeping the purified images on the data manifold. Our comprehensive experiments demonstrate that our Consistency Purification framework achieves state-of-the-art certified robustness and efficiency compared to baseline methods.

Title: Maximum Entropy Inverse Reinforcement Learning of Diffusion Models with Energy-Based Models

Authors: Sangwoong Yoon, Himchan Hwang, Dohyun Kwon, Yung-Kyun Noh, Frank C. Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Maximum Entropy Inverse Reinforcement Learning of Diffusion Models with Energy-Based Models(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: We present a maximum entropy inverse reinforcement learning (IRL) approach for improving the sample quality of diffusion generative models, especially when the number of generation time steps is small. Similar to how IRL trains a policy based on the reward function learned from expert demonstrations, we train (or fine-tune) a diffusion model using the log probability density estimated from training data. Since we employ an energy-based model (EBM) to represent the log density, our approach boils down to the joint training of a diffusion model and an EBM. Our IRL formulation, named Diffusion by Maximum Entropy IRL (DxMI), is a minimax problem that reaches equilibrium when both models converge to the data distribution. The entropy maximization plays a key role in DxMI, facilitating the exploration of the diffusion model and ensuring the convergence of the EBM. We also propose Diffusion by Dynamic Programming (DxDP), a novel reinforcement learning algorithm for diffusion models, as a subroutine in DxMI. DxDP makes the diffusion model update in DxMI efficient by transforming the original problem into an optimal control formulation where value functions replace back-propagation in time. Our empirical studies show that diffusion models fine-tuned using DxMI can generate high-quality samples in as few as 4 and 10 steps. Additionally, DxMI enables the training of an EBM without MCMC, stabilizing EBM training dynamics and enhancing anomaly detection performance.

Title: BAZAM: A Blockchain-Assisted Zero-Trust Authentication in Multi-UAV Wireless Networks

Authors: Mingyue Xie, Zheng Chang, Osama Alfarraj, Keping Yu, Tao Chen, Hongwei Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] BAZAM: A Blockchain-Assisted Zero-Trust Authentication in Multi-UAV Wireless Networks(https://arxiv.org/abs/)
Keywords: security, attack
Abstract: Unmanned aerial vehicles (UAVs) are vulnerable to interception and attacks when operated remotely without a unified and efficient identity authentication. Meanwhile, the openness of wireless communication environments potentially leads to data leakage and system paralysis. However, conventional authentication schemes in the UAV network are system-centric, failing to adapt to the diversity of UAV identities and access and to the resulting changes in network environments and connection statuses. Additionally, UAVs are not subjected to periodic identity compliance checks once authenticated, leading to difficulties in controlling access anomalies. Therefore, in this work, we consider a zero-trust framework for UAV network authentication, aiming to achieve UAV identity authentication through the principle of "never trust and always verify". We introduce a blockchain-assisted zero-trust authentication scheme, namely BAZAM, designed for multi-UAV wireless networks. In this scheme, UAVs follow a key generation approach using physical unclonable functions (PUFs), and cryptographic techniques help verify registration and access requests of UAVs. The blockchain is applied to store UAV authentication information in immutable storage. Through thorough security analysis and extensive evaluation, we demonstrate the effectiveness and efficiency of the proposed BAZAM.

Title: DEAR: Disentangled Environment and Agent Representations for Reinforcement Learning without Reconstruction

Authors: Ameya Pore, Riccardo Muradore, Diego Dall'Alba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] DEAR: Disentangled Environment and Agent Representations for Reinforcement Learning without Reconstruction(https://arxiv.org/abs/)
Keywords: robust, segmentation
Abstract: Reinforcement Learning (RL) algorithms can learn robotic control tasks from visual observations, but they often require a large amount of data, especially when the visual scene is complex and unstructured. In this paper, we explore how the agent's knowledge of its shape can improve the sample efficiency of visual RL methods. We propose a novel method, Disentangled Environment and Agent Representations (DEAR), that uses the segmentation mask of the agent as supervision to learn disentangled representations of the environment and the agent through feature separation constraints. Unlike previous approaches, DEAR does not require reconstruction of visual observations. These representations are then used as an auxiliary loss to the RL objective, encouraging the agent to focus on the relevant features of the environment. We evaluate DEAR on two challenging benchmarks: Distracting DeepMind control suite and Franka Kitchen manipulation tasks. Our findings demonstrate that DEAR surpasses state-of-the-art methods in sample efficiency, achieving comparable or superior performance with reduced parameters. Our results indicate that integrating agent knowledge into visual RL methods has the potential to enhance their learning efficiency and robustness.

Title: DP-MLM: Differentially Private Text Rewriting Using Masked Language Models

Authors: Stephen Meisenbacher, Maulik Chevli, Juraj Vladika, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] DP-MLM: Differentially Private Text Rewriting Using Masked Language Models(https://arxiv.org/abs/)
Keywords: privacy, generative
Abstract: The task of text privatization using Differential Privacy has recently taken the form of text rewriting, in which an input text is obfuscated via the use of generative (large) language models. While these methods have shown promising results in the ability to preserve privacy, they rely on autoregressive models which lack a mechanism to contextualize the private rewriting process. In response to this, we propose DP-MLM, a new method for differentially private text rewriting based on leveraging masked language models (MLMs) to rewrite text in a semantically similar and obfuscated manner. We accomplish this with a simple contextualization technique, whereby we rewrite a text one token at a time. We find that utilizing encoder-only MLMs provides better utility preservation at lower ε levels, as compared to previous methods relying on larger models with a decoder. In addition, MLMs allow for greater customization of the rewriting mechanism, as opposed to generative approaches. We make the code for DP-MLM public and reusable, found at this https URL.
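As a generic illustration of how a replacement token can be drawn under a privacy budget ε, the sketch below applies the standard exponential mechanism to clipped per-token scores. It is not taken from the DP-MLM code and may differ from the authors' mechanism; the function name, the score dictionary, and the clipping bounds `lo` and `hi` are hypothetical. Clipping bounds the sensitivity of the scores by `hi - lo`, and each token is sampled with probability proportional to exp(ε · score / (2 · sensitivity)).

```python
import math
import random


def exponential_mechanism_sample(scores, epsilon, lo, hi, rng=random):
    """Sample one token with the exponential mechanism over clipped scores.

    scores  : dict mapping candidate token -> raw score (e.g. an MLM logit)
    epsilon : privacy budget for this single token draw
    lo, hi  : clipping bounds; the sensitivity is taken as hi - lo
    Returns (sampled_token, probabilities_by_token).
    """
    clipped = {tok: min(max(s, lo), hi) for tok, s in scores.items()}
    sensitivity = hi - lo
    top = max(clipped.values())  # subtracted for numerical stability only
    weights = {
        tok: math.exp(epsilon * (s - top) / (2.0 * sensitivity))
        for tok, s in clipped.items()
    }
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    tokens = list(probs)
    sampled = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
    return sampled, probs
```

A small ε pushes the distribution toward uniform (more obfuscation), while a large ε concentrates it on the highest-scoring token (more utility).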

Title: A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy

Authors: Stephen Meisenbacher, Maulik Chevli, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy(https://arxiv.org/abs/)
Keywords: privacy
Abstract: Applications of Differential Privacy (DP) in NLP must distinguish between the syntactic level on which a proposed mechanism operates, often taking the form of word-level or document-level privatization. Recently, several word-level Metric Differential Privacy approaches have been proposed, which rely on this generalized DP notion for operating in word embedding spaces. These approaches, however, often fail to produce semantically coherent textual outputs, and their application at the sentence- or document-level is only possible by a basic composition of word perturbations. In this work, we strive to address these challenges by operating between the word and sentence levels, namely with collocations. By perturbing n-grams rather than single words, we devise a method where composed privatized outputs have higher semantic coherence and variable length. This is accomplished by constructing an embedding model based on frequently occurring word groups, in which unigram words co-exist with bi- and trigram collocations. We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.

Title: LegalTurk Optimized BERT for Multi-Label Text Classification and NER

Authors: Farnaz Zeidi, Mehmet Fatih Amasyali, Çiğdem Erol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] LegalTurk Optimized BERT for Multi-Label Text Classification and NER(https://arxiv.org/abs/)
Keywords: transformer
Abstract: The introduction of the Transformer neural network, along with techniques like self-supervised pre-training and transfer learning, has paved the way for advanced models like BERT. Despite BERT's impressive performance, opportunities for further enhancement exist. To our knowledge, most efforts focus on improving BERT's performance in English and in general domains, with no study specifically addressing the legal Turkish domain. Our study is primarily dedicated to enhancing the BERT model within the legal Turkish domain through modifications in the pre-training phase. In this work, we introduce our innovative modified pre-training approach by combining diverse masking strategies. In the fine-tuning phase, we focus on two essential downstream tasks in the legal domain: named entity recognition (NER) and multi-label text classification. To evaluate our modified pre-training approach, we fine-tuned all customized models alongside the original BERT models to compare their performance. Our modified approach demonstrated significant improvements in both NER and multi-label text classification tasks compared to the original BERT model. Finally, to showcase the impact of our proposed models, we trained our best models with different corpus sizes and compared them with BERTurk models. The experimental results demonstrate that our innovative approach, despite being pre-trained on a smaller corpus, competes with BERTurk.

Title: Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs

Authors: Yifei Zhang, Xintao Wang, Jiaqing Liang, Sirui Xia, Lida Chen, Yanghua Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Large Language Models (LLMs) have exhibited impressive proficiency in various natural language processing (NLP) tasks, which involve increasingly complex reasoning. Knowledge reasoning, a primary type of reasoning, aims at deriving new knowledge from existing knowledge. While it has been widely studied in the context of knowledge graphs (KGs), knowledge reasoning in LLMs remains underexplored. In this paper, we introduce Chain-of-Knowledge (CoK), a comprehensive framework for knowledge reasoning, including methodologies for both dataset construction and model learning. For dataset construction, we create KnowReason via rule mining on KGs. For model learning, we observe rule overfitting induced by naive training. Hence, we enhance CoK with a trial-and-error mechanism that simulates the human process of internal knowledge exploration. We conduct extensive experiments with KnowReason. Our results show the effectiveness of CoK in refining LLMs in not only knowledge reasoning, but also general reasoning benchmarks.

Title: HRDE: Retrieval-Augmented Large Language Models for Chinese Health Rumor Detection and Explainability

Authors: Yanfang Chen, Ding Chen, Shichao Song, Simin Niu, Hanyu Wang, Zeyun Tang, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] HRDE: Retrieval-Augmented Large Language Models for Chinese Health Rumor Detection and Explainability(https://arxiv.org/abs/)
Keywords: explainability, large language model
Abstract: As people increasingly prioritize their health, the speed and breadth of health information dissemination on the internet have also grown. At the same time, the presence of false health information (health rumors) intermingled with genuine content poses a significant potential threat to public health. However, current research on Chinese health rumors still lacks a large-scale, public, and open-source dataset of health rumor information, as well as effective and reliable rumor detection methods. This paper addresses this gap by constructing a dataset containing 1.12 million health-related rumors (HealthRCN) through web scraping of common health-related questions and a series of data processing steps. HealthRCN is the largest known dataset of Chinese health information rumors to date. Based on this dataset, we propose retrieval-augmented large language models for Chinese health rumor detection and explainability (HRDE). This model leverages retrieved relevant information to accurately determine whether the input health information is a rumor and provides explanatory responses, effectively aiding users in verifying the authenticity of health information. In evaluation experiments, we compared multiple models and found that HRDE outperformed them all, including GPT-4-1106-Preview, in rumor detection accuracy and answer quality. HRDE achieved an average accuracy of 91.04% and an F1 score of 91.58%.

Title: Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation

Authors: Yuchuan Tian, Jianhong Han, Hanting Chen, Yuanyuan Xi, Guoyang Zhang, Jie Hu, Chao Xu, Yunhe Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation(https://arxiv.org/abs/)
Keywords: diffusion, transformer
Abstract: Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have been popular. However, existing All-in-One models are limited in terms of the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT -- an All-in-One Image Processing Transformer that could effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. Rather than popular feature adaptation methods, we propose weight modulation that adapts weights to specific tasks. Firstly, we figure out task-sensitive weights via a toy experiment and introduce task-specific biases on top of them. Secondly, we conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases. Thirdly, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be task experts, Instruct-IPT could better cooperate between tasks with distinct characteristics at humble costs. Further, we propose to maneuver Instruct-IPT with text instructions for better user interfaces. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have effectively extended our method to diffusion denoisers as well. The code is available at this https URL.

Title: UWBAD: Towards Effective and Imperceptible Jamming Attacks Against UWB Ranging Systems with COTS Chips

Authors: Yuqiao Yang, Zhongjie Wu, Yongzhao Zhang, Ting Chen, Jun Li, Jie Yang, Wenhao Liu, Xiaosong Zhang, Ruicong Shi, Jingwei Li, Yu Jiang, Zhuo Su
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] UWBAD: Towards Effective and Imperceptible Jamming Attacks Against UWB Ranging Systems with COTS Chips(https://arxiv.org/abs/)
Keywords: secure, security, attack
Abstract: UWB ranging systems have been adopted in many critical and security-sensitive applications due to their precise positioning and secure ranging capabilities. We present a practical jamming attack, namely UWBAD, against commercial UWB ranging systems, which exploits the vulnerability of the adoption of the normalized cross-correlation process in UWB ranging and can selectively and quickly block ranging sessions without prior knowledge of the configurations of the victim devices, potentially leading to severe consequences such as property loss, unauthorized access, or vehicle theft. UWBAD achieves more effective and more imperceptible jamming because: (i) it efficiently blocks every ranging session by leveraging field-level jamming, thereby exerting a tangible impact on commercial UWB ranging systems, and (ii) its compact, reactive, and selective system design is based on COTS UWB chips, making it affordable and more imperceptible. We successfully conducted real attacks against commercial UWB ranging systems from the three largest UWB chip vendors on the market, namely Apple, NXP, and Qorvo. We reported our findings to Apple, related Original Equipment Manufacturers (OEM), and the Automotive Security Research Group, triggering internal security incident response procedures at Volkswagen, Audi, Bosch, and NXP. As of the writing of this paper, the related OEM has acknowledged this vulnerability in their automotive systems and has offered a $5,000 reward as a bounty.

Title: CaFNet: A Confidence-Driven Framework for Radar Camera Depth Estimation

Authors: Huawei Sun, Hao Feng, Julius Ott, Lorenzo Servadei, Robert Wille
Subjects: cs.CV, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] CaFNet: A Confidence-Driven Framework for Radar Camera Depth Estimation(https://arxiv.org/abs/)
Keywords: robust
Abstract: Depth estimation is critical in autonomous driving for interpreting 3D scenes accurately. Recently, radar-camera depth estimation has attracted considerable interest due to the robustness and low cost of radar. This paper introduces a two-stage, end-to-end trainable Confidence-aware Fusion Net (CaFNet) for dense depth estimation, combining RGB imagery with sparse and noisy radar point cloud data. The first stage addresses radar-specific challenges, such as ambiguous elevation and noisy measurements, by predicting a radar confidence map and a preliminary coarse depth map. A novel approach is presented for generating the ground truth for the confidence map, which involves associating each radar point with its corresponding object to identify potential projection surfaces. These maps, together with the initial radar input, are processed by a second encoder. For the final depth estimation, we introduce a confidence-aware gated fusion mechanism to integrate radar and image features effectively, thereby enhancing the reliability of the depth map by filtering out radar noise. Our methodology, evaluated on the nuScenes dataset, demonstrates superior performance, improving upon the current leading model by 3.2% in Mean Absolute Error (MAE) and 2.7% in Root Mean Square Error (RMSE).

Title: NourishNet: Proactive Severity State Forecasting of Food Commodity Prices for Global Warning Systems

Authors: Sydney Balboni, Grace Ivey, Brett Storoe, John Cisler, Tyge Plater, Caitlyn Grant, Ella Bruce, Benjamin Paulson
Subjects: cs.LG, cs.AI, econ.GN, math.NA
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] NourishNet: Proactive Severity State Forecasting of Food Commodity Prices for Global Warning Systems(https://arxiv.org/abs/)
Keywords: security, robust
Abstract: Price volatility in global food commodities is a critical signal indicating potential disruptions in the food market. Understanding forthcoming changes in these prices is essential for bolstering food security, particularly for nations at risk. The Food and Agriculture Organization of the United Nations (FAO) previously developed sophisticated statistical frameworks for the proactive prediction of food commodity prices, aiding in the creation of global early warning systems. These frameworks utilize food security indicators to produce accurate forecasts, thereby facilitating preparations against potential food shortages. Our research builds on these foundations by integrating robust price security indicators with cutting-edge deep learning (DL) methodologies to reveal complex interdependencies. DL techniques examine intricate dynamics among diverse factors affecting food prices. Through sophisticated time-series forecasting models coupled with a classification model, our approach enhances existing models to better support communities worldwide in advancing their food security initiatives.

Title: Scaling Technology Acceptance Analysis with Large Language Model (LLM) Annotation Systems

Authors: Pawel Robert Smolinski, Joseph Januszewicz, Jacek Winiarski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Scaling Technology Acceptance Analysis with Large Language Model (LLM) Annotation Systems(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Technology acceptance models effectively predict how users will adopt new technology products. Traditional surveys, often expensive and cumbersome, are commonly used for this assessment. As an alternative to surveys, we explore the use of large language models for annotating online user-generated content, like digital reviews and comments. Our research involved designing an LLM annotation system that transforms reviews into structured data based on the Unified Theory of Acceptance and Use of Technology (UTAUT) model. We conducted two studies to validate the consistency and accuracy of the annotations. Results showed moderate-to-strong consistency of LLM annotation systems, improving further by lowering the model temperature. LLM annotations achieved close agreement with human expert annotations and outperformed the agreement between experts for UTAUT variables. These results suggest that LLMs can be an effective tool for analyzing user sentiment, offering a practical alternative to traditional survey methods and enabling deeper insights into technology design and adoption.

Title: Detection of Dark Web Threats Using Machine Learning and Image Processing

Authors: Swetha Medipelly, Nasr Abosata
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Detection of Dark Web Threats Using Machine Learning and Image Processing(https://arxiv.org/abs/)
Keywords: protect
Abstract: This paper aimed to discover the risks associated with the dark web and to detect the threats related to human trafficking using image processing with OpenCV and Python. Apart from that, a development environment was set up by installing TensorFlow, OpenCV and Python. Through exploratory data analysis (EDA), significant insights into the distribution and interactions of dataset features were obtained, which are crucial for evaluating various cyberthreats. The construction and evaluation of logistic regression and support vector machine (SVM) models revealed that the SVM model outperforms logistic regression in accuracy. The paper delves into the intricacies of data preprocessing, EDA, and model development, offering valuable insights into network protection and cyberthreat response.

Title: Weighted Missing Linear Discriminant Analysis: An Explainable Approach for Classification with Missing Data

Authors: Tuan L. Vo, Uyen Dang, Thu Nguyen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Weighted Missing Linear Discriminant Analysis: An Explainable Approach for Classification with Missing Data(https://arxiv.org/abs/)
Keywords: explainability
Abstract: As Artificial Intelligence (AI) models are gradually being adopted in real-life applications, the explainability of the model used is critical, especially in high-stakes areas such as medicine and finance. Among the commonly used models, Linear Discriminant Analysis (LDA) is a widely used classification tool that is also explainable thanks to its ability to model class distributions and maximize class separation through linear feature combinations. Nevertheless, real-world data is frequently incomplete, presenting significant challenges for classification tasks and model explanations. In this paper, we propose a novel approach to LDA under missing data, termed Weighted missing Linear Discriminant Analysis (WLDA), which classifies observations in data that contains missing values directly and effectively, without imputation, by estimating the parameters directly on the incomplete data and using a weight matrix to penalize missing entries during classification. Furthermore, we analyze the theoretical properties and examine the explainability of the proposed technique in a comprehensive manner. Experimental results demonstrate that WLDA outperforms conventional methods by a significant margin, particularly in scenarios where missing values are present in both training and test sets.

Title: A Whole-Process Certifiably Robust Aggregation Method Against Backdoor Attacks in Federated Learning

Authors: Anqi Zhou, Yezheng Liu, Yidong Chai, Hongyi Zhu, Xinyue Ge, Yuanchun Jiang, Meng Wang
Subjects: cs.CR, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Whole-Process Certifiably Robust Aggregation Method Against Backdoor Attacks in Federated Learning(https://arxiv.org/abs/)
Keywords: security, attack, robust, federate
Abstract: Federated Learning (FL) has garnered widespread adoption across various domains such as finance, healthcare, and cybersecurity. Nonetheless, FL remains under significant threat from backdoor attacks, wherein malicious actors insert triggers into trained models, enabling them to perform certain tasks while still meeting FL's primary objectives. In response, robust aggregation methods have been proposed, which can be divided into three types: ex-ante, ex-durante, and ex-post methods. Given the complementary nature of these methods, combining all three types is promising yet unexplored. Such a combination is non-trivial because it requires leveraging their advantages while overcoming their disadvantages. Our study proposes a novel whole-process certifiably robust aggregation (WPCRA) method for FL, which enhances robustness against backdoor attacks across three phases: ex-ante, ex-durante, and ex-post. Moreover, since the current geometric median estimation method fails to consider differences among clients, we propose a novel weighted geometric median estimation algorithm (WGME). This algorithm estimates the geometric median of model updates from clients based on each client's weight, further improving the robustness of WPCRA against backdoor attacks. We also theoretically prove that WPCRA offers improved certified robustness guarantees with a larger certified radius. We evaluate the advantages of our methods based on the task of loan status prediction. Comparison with baselines shows that our methods significantly improve FL's robustness against backdoor attacks. This study contributes to the literature with a novel WPCRA method and a novel WGME algorithm. Our code is available at this https URL.
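
As a rough illustration of the idea of a client-weighted geometric median, the following is a minimal sketch using the classic Weiszfeld iteration. It is not the paper's WGME algorithm: the function name `weighted_geometric_median`, the stopping rule, and the way weights are supplied are all assumptions made for this example.

```python
import math


def weighted_geometric_median(points, weights, iters=200, eps=1e-9):
    """Approximate the weighted geometric median with Weiszfeld iterations.

    points  : list of equal-length lists of floats (e.g. flattened client updates)
    weights : list of positive floats, one per point (hypothetical client weights)
    """
    dim = len(points[0])
    total = sum(weights)
    # Start from the weighted mean, which the iteration then pulls toward the median.
    est = [sum(w * p[d] for p, w in zip(points, weights)) / total
           for d in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p, w in zip(points, weights):
            # Clamp the distance to avoid division by zero at a data point.
            coef = w / max(math.dist(p, est), eps)
            den += coef
            for d in range(dim):
                num[d] += coef * p[d]
        new = [n / den for n in num]
        step = math.dist(new, est)
        est = new
        if step < eps:
            break
    return est
```

With three clustered updates and one distant update that carries a small weight, the estimate stays near the cluster, whereas a weighted mean would be pulled noticeably toward the outlier. That resistance to a minority of extreme updates is the property that makes median-style aggregation attractive against poisoned clients.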

Title: Large Language Models Struggle in Token-Level Clinical Named Entity Recognition

Authors: Qiuhao Lu, Rui Li, Andrew Wen, Jinlian Wang, Liwei Wang, Hongfang Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Large Language Models Struggle in Token-Level Clinical Named Entity Recognition(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Large Language Models (LLMs) have revolutionized various sectors, including healthcare where they are employed in diverse applications. Their utility is particularly significant in the context of rare diseases, where data scarcity, complexity, and specificity pose considerable challenges. In the clinical domain, Named Entity Recognition (NER) stands out as an essential task and it plays a crucial role in extracting relevant information from clinical texts. Despite the promise of LLMs, current research mostly concentrates on document-level NER, identifying entities in a more general context across entire documents, without extracting their precise location. Additionally, efforts have been directed towards adapting ChatGPT for token-level NER. However, there is a significant research gap when it comes to employing token-level NER for clinical texts, especially with the use of local open-source LLMs. This study aims to bridge this gap by investigating the effectiveness of both proprietary and local LLMs in token-level clinical NER. Essentially, we delve into the capabilities of these models through a series of experiments involving zero-shot prompting, few-shot prompting, retrieval-augmented generation (RAG), and instruction-fine-tuning. Our exploration reveals the inherent challenges LLMs face in token-level NER, particularly in the context of rare diseases, and suggests possible improvements for their application in healthcare. This research contributes to narrowing a significant gap in healthcare informatics and offers insights that could lead to a more refined application of LLMs in the healthcare sector.

Title: LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Authors: Mushui Liu, Yuhang Ma, Xinfeng Zhang, Yang Zhen, Zeng Zhao, Zhipeng Hu, Bai Liu, Changjie Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation(https://arxiv.org/abs/)
Keywords: diffusion, large language model
Abstract: Diffusion Models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts that involve multiple objects, attribute binding, and long descriptions. This paper proposes a framework called LLM4GEN, which enhances the semantic understanding ability of text-to-image diffusion models by leveraging the semantic representation of Large Language Models (LLMs). Through a specially designed Cross-Adapter Module (CAM) that combines the original text features of text-to-image models with LLM features, LLM4GEN can be easily incorporated into various diffusion models as a plug-and-play component and enhances text-to-image generation. Additionally, to facilitate semantic understanding of complex and dense prompts, we develop a LAION-refined dataset, consisting of 1 million (M) text-image pairs with improved image descriptions. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. With just 10% of the training data required by the recent ELLA, LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 7.69% and 9.60% in color on T2I-CompBench, respectively. The extensive experiments on DensePrompts also demonstrate that LLM4GEN surpasses existing state-of-the-art models in terms of sample quality, image-text alignment, and human evaluation. The project website is at: this https URL

Title: Engineering an Efficient Object Tracker for Non-Linear Motion

Authors: Momir Adžemović, Predrag Tadić, Andrija Petrović, Mladen Nikolić
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Engineering an Efficient Object Tracker for Non-Linear Motion(https://arxiv.org/abs/)
Keywords: transformer
Abstract: The goal of multi-object tracking is to detect and track all objects in a scene while maintaining unique identifiers for each, by associating their bounding boxes across video frames. This association relies on matching motion and appearance patterns of detected objects. This task is especially hard in case of scenarios involving dynamic and non-linear motion patterns. In this paper, we introduce DeepMoveSORT, a novel, carefully engineered multi-object tracker designed specifically for such scenarios. In addition to standard methods of appearance-based association, we improve motion-based association by employing deep learnable filters (instead of the most commonly used Kalman filter) and a rich set of newly proposed heuristics. Our improvements to motion-based association methods are severalfold. First, we propose a new transformer-based filter architecture, TransFilter, which uses an object's motion history for both motion prediction and noise filtering. We further enhance the filter's performance by careful handling of its motion history and accounting for camera motion. Second, we propose a set of heuristics that exploit cues from the position, shape, and confidence of detected bounding boxes to improve association performance. Our experimental evaluation demonstrates that DeepMoveSORT outperforms existing trackers in scenarios featuring non-linear motion, surpassing state-of-the-art results on three such datasets. We also perform a thorough ablation study to evaluate the contributions of different tracker components which we proposed. Based on our study, we conclude that using a learnable filter instead of the Kalman filter, along with appearance-based association is key to achieving strong general tracking performance.

Title: Posterior Sampling with Denoising Oracles via Tilted Transport

Authors: Joan Bruna, Jiequn Han
Subjects: cs.LG, math.PR, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Posterior Sampling with Denoising Oracles via Tilted Transport(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Score-based diffusion models have significantly advanced high-dimensional data generation across various domains, by learning a denoising oracle (or score) from datasets. From a Bayesian perspective, they offer a realistic modeling of data priors and facilitate solving inverse problems through posterior sampling. Although many heuristic methods have been developed recently for this purpose, they lack the quantitative guarantees needed in many scientific applications. In this work, we introduce the tilted transport technique, which leverages the quadratic structure of the log-likelihood in linear inverse problems in combination with the prior denoising oracle to transform the original posterior sampling problem into a new 'boosted' posterior that is provably easier to sample from. We quantify the conditions under which this boosted posterior is strongly log-concave, highlighting the dependencies on the condition number of the measurement matrix and the signal-to-noise ratio. The resulting posterior sampling scheme is shown to reach the computational threshold predicted for sampling Ising models [Kunisky'23] with a direct analysis, and is further validated on high-dimensional Gaussian mixture models and scalar field $\varphi^4$ models.

Title: A Comparative Study of Quality Evaluation Methods for Text Summarization

Authors: Huyen Nguyen, Haihua Chen, Lavanya Pobbathi, Junhua Ding
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Comparative Study of Quality Evaluation Methods for Text Summarization(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Evaluating text summarization has been a challenging task in natural language processing (NLP). Automatic metrics that rely heavily on reference summaries are not suitable in many situations, while human evaluation is time-consuming and labor-intensive. To bridge this gap, this paper proposes a novel method based on large language models (LLMs) for evaluating text summarization. We also conduct a comparative study on eight automatic metrics, human evaluation, and our proposed LLM-based method. Seven different types of state-of-the-art (SOTA) summarization models were evaluated. We perform extensive experiments and analysis on datasets with patent documents. Our results show that LLM evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency. Based on the empirical comparison, we propose an LLM-powered framework for automatically evaluating and improving text summarization, which is beneficial and could attract wide attention among the community.

Title: Physical Layer Deception with Non-Orthogonal Multiplexing

Authors: Wenwen Chen, Bin Han, Yao Zhu, Anke Schmeink, Giuseppe Caire, Hans D. Schotten
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Physical Layer Deception with Non-Orthogonal Multiplexing(https://arxiv.org/abs/)
Keywords: secure, security, protect
Abstract: Physical layer security (PLS) is a promising technology to secure wireless communications by exploiting the physical properties of the wireless channel. However, the passive nature of PLS creates a significant imbalance between the effort required by eavesdroppers and legitimate users to secure data. To address this imbalance, in this article, we propose a novel framework of physical layer deception (PLD), which combines PLS with deception technologies to actively counteract wiretapping attempts. Combining a two-stage encoder with randomized ciphering and non-orthogonal multiplexing, the PLD approach enables the wireless communication system to proactively counter eavesdroppers with deceptive messages. Relying solely on the superiority of the legitimate channel over the eavesdropping channel, the PLD framework can effectively protect the confidentiality of the transmitted messages, even against eavesdroppers who possess knowledge equivalent to that of the legitimate receiver. We prove the validity of the PLD framework with in-depth analyses and demonstrate its superiority over conventional PLS approaches with comprehensive numerical benchmarks.

Title: Chest-Diffusion: A Light-Weight Text-to-Image Model for Report-to-CXR Generation

Authors: Peng Huang, Xue Gao, Lihong Huang, Jing Jiao, Xiaokang Li, Yuanyuan Wang, Yi Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Chest-Diffusion: A Light-Weight Text-to-Image Model for Report-to-CXR Generation(https://arxiv.org/abs/)
Keywords: diffusion, transformer
Abstract: Text-to-image generation has important implications for the generation of diverse and controllable images. Several attempts have been made to adapt Stable Diffusion (SD) to the medical domain. However, the large distribution difference between medical reports and natural texts, as well as the high computational complexity of common Stable Diffusion, limit the authenticity and feasibility of the generated medical images. To solve the above problems, we propose a novel light-weight transformer-based diffusion model learning framework, Chest-Diffusion, for report-to-CXR generation. Chest-Diffusion employs a domain-specific text encoder to obtain accurate and expressive text features to guide image generation, improving the authenticity of the generated images. Meanwhile, we introduce a light-weight transformer architecture as the denoising model, reducing the computational complexity of the diffusion model. Experiments demonstrate that Chest-Diffusion achieves the lowest FID score of 24.456 under a computation budget of 118.918 GFLOPs, which is nearly one-third of the computational complexity of SD.

Title: Improved Graph-based semi-supervised learning Schemes

Authors: Farid Bozorgnia
Subjects: cs.LG, math.AP
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Improved Graph-based semi-supervised learning Schemes(https://arxiv.org/abs/)
Keywords: robust
Abstract: In this work, we improve the accuracy of several known algorithms to address the classification of large datasets when few labels are available. Our framework lies in the realm of graph-based semi-supervised learning. With novel modifications on Gaussian Random Fields Learning and Poisson Learning algorithms, we increase the accuracy and create more robust algorithms. Experimental results demonstrate the efficiency and superiority of the proposed methods over conventional graph-based semi-supervised techniques, especially in the context of imbalanced datasets.
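
For readers unfamiliar with the baseline, the sketch below shows the classic Gaussian Random Fields (harmonic function) label propagation on a small graph, solved by simple Gauss-Seidel sweeps. It is only the textbook baseline, not the modified schemes proposed in the paper; the function name `harmonic_labels` and the dense adjacency-matrix input are assumptions made for this example.

```python
def harmonic_labels(weights, labels, iters=1000, tol=1e-10):
    """Propagate labels so each unlabeled node equals the weighted mean of its neighbors.

    weights : symmetric n x n list of lists with non-negative edge weights (zero diagonal)
    labels  : dict mapping labeled node index -> known value in [0, 1]
    """
    n = len(weights)
    # Labeled nodes are clamped; unlabeled nodes start from a neutral guess.
    f = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(iters):
        delta = 0.0
        for i in range(n):
            if i in labels:
                continue
            degree = sum(weights[i])
            if degree == 0:
                continue  # isolated node: nothing to propagate
            new = sum(weights[i][j] * f[j] for j in range(n)) / degree
            delta = max(delta, abs(new - f[i]))
            f[i] = new
        if delta < tol:
            break
    return f


# Path graph 0 - 1 - 2 - 3 with only the two end nodes labeled.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = harmonic_labels(path, {0: 0.0, 3: 1.0})
```

On this path graph the interior scores interpolate between the two labeled ends, so thresholding at 0.5 assigns each unlabeled node to the nearer labeled node.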

Title: Improving the performance of Stein variational inference through extreme sparsification of physically-constrained neural network models

Authors: Govinda Anantha Padmanabha, Jan Niklas Fuhg, Cosmin Safta, Reese E. Jones, Nikolaos Bouklas
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Improving the performance of Stein variational inference through extreme sparsification of physically-constrained neural network models(https://arxiv.org/abs/)
Keywords: robust
Abstract: Most scientific machine learning (SciML) applications of neural networks involve hundreds to thousands of parameters, and hence, uncertainty quantification for such models is plagued by the curse of dimensionality. Using physical applications, we show that $L_0$ sparsification prior to Stein variational gradient descent ($L_0$+SVGD) is a more robust and efficient means of uncertainty quantification, in terms of computational cost and performance, than the direct application of SVGD or projected SVGD methods. Specifically, $L_0$+SVGD demonstrates superior resilience to noise, the ability to perform well in extrapolated regions, and a faster convergence rate to an optimal solution.

Title: Characterizing Stereotypical Bias from Privacy-preserving Pre-Training

Authors: Stefan Arnold, Rene Gröbner, Annika Schreiner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Characterizing Stereotypical Bias from Privacy-preserving Pre-Training(https://arxiv.org/abs/)
Keywords: privacy
Abstract: Differential Privacy (DP) can be applied to raw text by exploiting the spatial arrangement of words in an embedding space. We investigate the implications of such text privatization on Language Models (LMs) and their tendency towards stereotypical associations. Since previous studies documented that linguistic proficiency correlates with stereotypical bias, one could assume that techniques for text privatization, which are known to degrade language modeling capabilities, would cancel out undesirable biases. By testing BERT models trained on texts containing biased statements primed with varying degrees of privacy, our study reveals that while stereotypical bias generally diminishes when privacy is tightened, text privatization does not uniformly equate to diminishing bias across all social domains. This highlights the need for careful diagnosis of bias in LMs that undergo text privatization.

Title: Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions, empirically showing that it consistently improves the performance compared to naive DPO on three different SFT models, including one existing SFT model and two models we finetuned. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves high scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method.
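
For orientation, the standard DPO objective that SCDPO builds on can be sketched for a single preference pair as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `dpo_loss`, the default `beta`, and the scalar log-probability inputs are hypothetical, and the step-controlled construction of negative samples described in the abstract is not shown.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    logp_* come from the policy being trained, ref_logp_* from the frozen
    reference model. beta (assumed value) scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # answer over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))
```

In a step-controlled setting, the rejected sequence would presumably be a rationale that begins to err at a chosen step, while the loss itself keeps this form.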

Title: Diffusion Models and Representation Learning: A Survey

Authors: Michael Fuest, Pingchuan Ma, Ming Gui, Johannes S. Fischer, Vincent Tao Hu, Bjorn Ommer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Diffusion Models and Representation Learning: A Survey(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models' essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy between diffusion models and representation learning, identifying key areas of existing concerns and potential exploration. Github link: this https URL

Title: CSUM: A Novel Mechanism for Updating CubeSat while Preserving Authenticity and Integrity

Authors: Ankit Gangwal, Aashish Paliwal
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] CSUM: A Novel Mechanism for Updating CubeSat while Preserving Authenticity and Integrity(https://arxiv.org/abs/)
Keywords: security
Abstract: The recent rise of CubeSat has revolutionized global space explorations, as it offers cost-effective solutions for low-orbit space applications (including climate monitoring, weather measurements, communications, and earth observation). A salient feature of CubeSat is that applications currently on-boarded can either be updated or entirely replaced by new applications via software updates, which allows reusing in-orbit hardware, reduces space debris, and saves cost as well as time. Securing software updates employing traditional methods (e.g., encryption) remains impractical mainly due to the low-resource capabilities of CubeSat. Therefore, the security of software updates for CubeSats remains a critical issue. In this paper, we propose CubeSat Update Mechanism (CSUM), a lightweight scheme to provide integrity, authentication, and data freshness guarantees for software update broadcasts to CubeSats using a hash chain. We empirically evaluate our proof of concept implementation to demonstrate the feasibility and effectiveness of our approach. CSUM can validate 50,000 consecutive updates successfully in less than a second. We also perform a comparative analysis of different cryptographic primitives. Our empirical evaluations show that the hash-based approach is at least 61$\times$ faster than the conventional mechanisms, even in resource-constrained environments. Finally, we discuss the limitations, challenges, and potential future research directions for CubeSat software update procedures.
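
The hash-chain idea mentioned in the abstract can be illustrated with a generic one-way chain check. This is a minimal sketch, not the CSUM protocol itself: the names `build_chain` and `Receiver`, the use of SHA-256, and the token-per-update scheme are assumptions, and binding a token to the update payload is deliberately left out.

```python
import hashlib


def build_chain(seed: bytes, length: int) -> list:
    """Return [seed, H(seed), H(H(seed)), ...] with length + 1 elements.

    The last element is the anchor that would be preloaded on the receiver.
    """
    chain = [seed]
    for _ in range(length):
        chain.append(hashlib.sha256(chain[-1]).digest())
    return chain


class Receiver:
    """Accepts tokens revealed in reverse chain order (hypothetical receiver)."""

    def __init__(self, anchor: bytes):
        self.last = anchor

    def accept(self, token: bytes) -> bool:
        # A token is valid only if hashing it yields the last accepted value.
        # Replayed or forged tokens fail this check.
        if hashlib.sha256(token).digest() == self.last:
            self.last = token
            return True
        return False
```

Each update costs the receiver a single hash evaluation, which is the property that makes this family of schemes attractive on constrained hardware.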

Title: InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation

Authors: Haofan Wang, Peng Xing, Renyuan Huang, Hao Ai, Qixun Wang, Xu Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: Style transfer is an inventive process designed to create an image that maintains the essence of the original while embracing the visual style of another. Although diffusion models have demonstrated impressive generative power in personalized subject-driven or style-driven applications, existing state-of-the-art methods still encounter difficulties in achieving a seamless balance between content preservation and style enhancement. For example, amplifying the style's influence can often undermine the structural integrity of the content. To address these challenges, we deconstruct the style transfer task into three core elements: 1) Style, focusing on the image's aesthetic characteristics; 2) Spatial Structure, concerning the geometric arrangement and composition of visual elements; and 3) Semantic Content, which captures the conceptual meaning of the image. Guided by these principles, we introduce InstantStyle-Plus, an approach that prioritizes the integrity of the original content while seamlessly integrating the target style. Specifically, our method accomplishes style injection through an efficient, lightweight process, utilizing the cutting-edge InstantStyle framework. To reinforce the content preservation, we initiate the process with an inverted content latent noise and a versatile plug-and-play tile ControlNet for preserving the original image's intrinsic layout. We also incorporate a global semantic adapter to enhance the semantic content's fidelity. To safeguard against the dilution of style information, a style extractor is employed as a discriminator for providing supplementary style guidance. Codes will be available at this https URL.

Title: NAIST Simultaneous Speech Translation System for IWSLT 2024

Authors: Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Haotian Tan, Makoto Sakai, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] NAIST Simultaneous Speech Translation System for IWSLT 2024(https://arxiv.org/abs/)
Keywords: transformer
Abstract: This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-{German, Japanese, Chinese} speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.

Title: Towards Robust Speech Representation Learning for Thousands of Languages

Authors: William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Towards Robust Speech Representation Learning for Thousands of Languages(https://arxiv.org/abs/)
Keywords: robust
Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are found in this https URL.

Title: Towards Understanding Sensitive and Decisive Patterns in Explainable AI: A Case Study of Model Interpretation in Geometric Deep Learning

Authors: Jiajun Zhu, Siqi Miao, Rex Ying, Pan Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Towards Understanding Sensitive and Decisive Patterns in Explainable AI: A Case Study of Model Interpretation in Geometric Deep Learning(https://arxiv.org/abs/)
Keywords: interpretability
Abstract: The interpretability of machine learning models has gained increasing attention, particularly in scientific domains where high precision and accountability are crucial. This research focuses on distinguishing between two critical data patterns -- sensitive patterns (model-related) and decisive patterns (task-related) -- which are commonly used as model interpretations but often lead to confusion. Specifically, this study compares the effectiveness of two main streams of interpretation methods: post-hoc methods and self-interpretable methods, in detecting these patterns. Recently, geometric deep learning (GDL) has shown superior predictive performance in various scientific applications, creating an urgent need for principled interpretation methods. Therefore, we conduct our study using several representative GDL applications as case studies. We evaluate thirteen interpretation methods applied to three major GDL backbone models, using four scientific datasets to assess how well these methods identify sensitive and decisive patterns. Our findings indicate that post-hoc methods tend to provide interpretations better aligned with sensitive patterns, whereas certain self-interpretable methods exhibit strong and stable performance in detecting decisive patterns. Additionally, our study offers valuable insights into improving the reliability of these interpretation methods. For example, ensembling post-hoc interpretations from multiple models trained on the same task can effectively uncover the task's decisive patterns.

Title: SAFE: a SAR Feature Extractor based on self-supervised learning and masked Siamese ViTs

Authors: Max Muzeau, Joana Frontera-Pons, Chengfang Ren, Jean-Philippe Ovarlez
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] SAFE: a SAR Feature Extractor based on self-supervised learning and masked Siamese ViTs(https://arxiv.org/abs/)
Keywords: robust, transformer, segmentation
Abstract: Due to its all-weather and day-and-night capabilities, Synthetic Aperture Radar imagery is essential for various applications such as disaster management, earth monitoring, change detection and target recognition. However, the scarcity of labeled SAR data limits the performance of most deep learning algorithms. To address this issue, we propose a novel self-supervised learning framework based on masked Siamese Vision Transformers to create a General SAR Feature Extractor coined SAFE. Our method leverages contrastive learning principles to train a model on unlabeled SAR data, extracting robust and generalizable features. SAFE is applicable across multiple SAR acquisition modes and resolutions. We introduce tailored data augmentation techniques specific to SAR imagery, such as sub-aperture decomposition and despeckling. Comprehensive evaluations on various downstream tasks, including few-shot classification, segmentation, visualization, and pattern detection, demonstrate the effectiveness and versatility of the proposed approach. Our network competes with or surpasses other state-of-the-art methods in few-shot classification and segmentation tasks, even without being trained on the sensors used for the evaluation.

Title: Dynamically Modulating Visual Place Recognition Sequence Length For Minimum Acceptable Performance Scenarios

Authors: Connor Malone, Ankit Vora, Thierry Peynot, Michael Milford
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Dynamically Modulating Visual Place Recognition Sequence Length For Minimum Acceptable Performance Scenarios(https://arxiv.org/abs/)
Keywords: robust
Abstract: Mobile robots and autonomous vehicles are often required to function in environments where critical position estimates from sensors such as GPS become uncertain or unreliable. Single image visual place recognition (VPR) provides an alternative for localization but often requires techniques such as sequence matching to improve robustness, which incurs additional computation and latency costs. Even then, the sequence length required to localize at an acceptable performance level varies widely, and simply setting overly long fixed sequence lengths creates unnecessary latency and computational overhead, and can even degrade performance. In these scenarios it is often more desirable to meet or exceed a set target performance at minimal expense. In this paper we present an approach which uses a calibration set of data to fit a model that modulates sequence length for VPR as needed to exceed a target localization performance. We make use of a coarse position prior, which could be provided by any other localization system, and capture the variation in appearance across this region. We use the correlation between appearance variation and sequence length to curate VPR features and fit a multilayer perceptron (MLP) for selecting the optimal length. We demonstrate that this method is effective at modulating sequence length to maximize the number of sections in a dataset which meet or exceed a target performance whilst minimizing the median length used. We show applicability across several datasets and reveal key phenomena like generalization capabilities, the benefits of curating features and the utility of non-state-of-the-art feature extractors with nuanced properties.

Title: Silver Linings in the Shadows: Harnessing Membership Inference for Machine Unlearning

Authors: Nexhi Sula, Abhinav Kumar, Jie Hou, Han Wang, Reza Tourani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Silver Linings in the Shadows: Harnessing Membership Inference for Machine Unlearning(https://arxiv.org/abs/)
Keywords: secure, security, privacy, attack, membership infer
Abstract: With the continued advancement and widespread adoption of machine learning (ML) models across various domains, ensuring user privacy and data security has become a paramount concern. In compliance with data privacy regulations, such as GDPR, a secure machine learning framework should not only grant users the right to request the removal of their contributed data used for model training but also facilitate the elimination of sensitive data fingerprints within machine learning models to mitigate potential attacks - a process referred to as machine unlearning. In this study, we present a novel unlearning mechanism designed to effectively remove the impact of specific data samples from a neural network while considering the performance of the unlearned model on the primary task. In achieving this goal, we crafted a novel loss function tailored to eliminate privacy-sensitive information from weights and activation values of the target model by combining target classification loss and membership inference loss. Our adaptable framework can easily incorporate various privacy leakage approximation mechanisms to guide the unlearning process. We provide empirical evidence of the effectiveness of our unlearning approach with a theoretical upper-bound analysis through a membership inference mechanism as a proof of concept. Our results showcase the superior performance of our approach in terms of unlearning efficacy and latency as well as the fidelity of the primary task, across four datasets and four deep learning architectures.

Title: Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks

Authors: Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, Yang Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks(https://arxiv.org/abs/)
Keywords: attack, large language model
Abstract: We find that language models have difficulties generating fallacious and deceptive reasoning. When asked to generate deceptive outputs, language models tend to leak honest counterparts but believe them to be false. Exploiting this deficiency, we propose a jailbreak attack method that elicits an aligned language model for malicious output. Specifically, we query the model to generate a fallacious yet deceptively real procedure for the harmful behavior. Since a fallacious procedure is generally considered fake and thus harmless by LLMs, it helps bypass the safeguard mechanism. Yet the output is factually harmful since the LLM cannot fabricate fallacious solutions but proposes truthful ones. We evaluate our approach over five safety-aligned large language models, comparing four previous jailbreak methods, and show that our approach achieves competitive performance with more harmful outputs. We believe the findings could be extended beyond model safety, such as self-verification and hallucination.

Title: Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles

Authors: Ryan Louie (1), Ananjan Nandi (1), William Fang (1), Cheng Chang (1), Emma Brunskill (1), Diyi Yang (1) ((1) Stanford University)
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles(https://arxiv.org/abs/)
Keywords: privacy
Abstract: Recent works leverage LLMs to roleplay realistic social scenarios, aiding novices in practicing their social skills. However, simulating sensitive interactions, such as in mental health, is challenging. Privacy concerns restrict data access, and collecting expert feedback, although vital, is laborious. To address this, we develop Roleplay-doh, a novel human-LLM collaboration pipeline that elicits qualitative feedback from a domain-expert, which is transformed into a set of principles, or natural language rules, that govern an LLM-prompted roleplay. We apply this pipeline to enable senior mental health supporters to create customized AI patients for simulated practice partners for novice counselors. After uncovering issues in GPT-4 simulations not adhering to expert-defined principles, we also introduce a novel principle-adherence prompting pipeline which shows 30% improvements in response quality and principle following for the downstream task. Via a user study with 25 counseling experts, we demonstrate that the pipeline makes it easy and effective to create AI patients that more faithfully resemble real patients, as judged by creators and third-party counselors.

Title: Privacy-First Crowdsourcing: Blockchain and Local Differential Privacy in Crowdsourced Drone Services

Authors: Junaid Akram, Ali Anaissi
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Privacy-First Crowdsourcing: Blockchain and Local Differential Privacy in Crowdsourced Drone Services(https://arxiv.org/abs/)
Keywords: privacy, protect, fair
Abstract: We introduce a privacy-preserving framework for integrating consumer-grade drones into bushfire management. This system creates a marketplace where bushfire management authorities obtain essential data from drone operators. Key features include local differential privacy to protect data providers and a blockchain-based solution ensuring fair data exchanges and accountability. The framework is validated through a proof-of-concept implementation, demonstrating its scalability and potential for various large-scale data collection scenarios. This approach addresses privacy concerns and compliance with regulations like Australia's Privacy Act 1988, offering a practical solution for enhancing bushfire detection and management through crowdsourced drone services.
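
The abstract does not say which local differential privacy mechanism is used. As a generic illustration only, binary randomized response (a textbook epsilon-LDP mechanism) can be sketched as below. All names (`keep_probability`, `randomize_bit`, `estimate_fraction`) are hypothetical and this is not the framework's implementation.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true bit under epsilon-LDP randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomize_bit(bit, epsilon, rng):
    """Perturb one private bit locally, before it leaves the data provider."""
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit


def estimate_fraction(reports, epsilon):
    """Unbiased estimate of the true fraction of 1s from perturbed reports.

    Requires epsilon > 0; at epsilon = 0 the reports carry no information.
    """
    p = keep_probability(epsilon)
    observed = sum(reports) / len(reports)
    # E[observed] = (1 - p) + f * (2p - 1), solved here for f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```

A collector would call `estimate_fraction` on many reports. Individual reports stay deniable, while the aggregate estimate tightens as the number of providers grows.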

Title: MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Authors: Tianhao Li, Shangjie Li, Binbin Xie, Deyi Xiong, Baosong Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting(https://arxiv.org/abs/)
Keywords: large language model
Abstract: The advent of large language models (LLMs) has predominantly catered to high-resource languages, leaving a disparity in performance for low-resource languages. Conventional Continual Training (CT) approaches to bridge this gap often undermine a model's original linguistic proficiency when expanding to multilingual contexts. Addressing this issue, we introduce a novel MoE-CT architecture, a paradigm that innovatively separates the base model's learning from the multilingual expansion process. Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency. Our approach significantly outperforms conventional CT methods, as evidenced by our experiments, which show marked improvements in multilingual benchmarks without sacrificing the model's original language performance. Moreover, our MoE-CT framework demonstrates enhanced resistance to forgetting and superior transfer learning capabilities. By preserving the base model's integrity and focusing on strategic parameter expansion, our methodology advances multilingual language modeling and represents a significant step forward for low-resource language inclusion in LLMs, indicating a fruitful direction for future research in language technologies.

Title: Decentralized PKI Framework for Data Integrity in Spatial Crowdsourcing Drone Services

Authors: Junaid Akram, Ali Anaissi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Decentralized PKI Framework for Data Integrity in Spatial Crowdsourcing Drone Services(https://arxiv.org/abs/)
Keywords: secure, security, defense
Abstract: In the domain of spatial crowdsourcing drone services, which includes tasks like delivery, surveillance, and data collection, secure communication is paramount. The Public Key Infrastructure (PKI) ensures this by providing a system for digital certificates that authenticate the identities of entities involved, securing data and command transmissions between drones and their operators. However, the centralized trust model of traditional PKI, dependent on Certificate Authorities (CAs), presents a vulnerability due to its single point of failure, risking security breaches. To counteract this, the paper presents D2XChain, a blockchain-based PKI framework designed for the Internet of Drone Things (IoDT). By decentralizing the CA infrastructure, D2XChain eliminates this single point of failure, thereby enhancing the security and reliability of drone communications. Fully compatible with the X.509 standard, it integrates seamlessly with existing PKI systems, supporting all key operations such as certificate registration, validation, verification, and revocation in a distributed manner. This innovative approach not only strengthens the defense of drone services against various security threats but also showcases its practical application through deployment on a private Ethereum testbed, representing a significant advancement in addressing the unique security challenges of drone-based services and ensuring their trustworthy operation in critical tasks.

Title: How to Leverage Digit Embeddings to Represent Numbers?

Authors: Jasivan Alex Sivakumar, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] How to Leverage Digit Embeddings to Represent Numbers?(https://arxiv.org/abs/)
Keywords: transformer
Abstract: Apart from performing arithmetic operations, understanding numbers themselves is still a challenge for existing language models. Simple generalisations, such as solving 100+200 instead of 1+2, can substantially affect model performance (Sivakumar and Moosavi, 2023). Among various techniques, character-level embeddings of numbers have emerged as a promising approach to improve number representation. However, this method has limitations as it leaves the task of aggregating digit representations to the model, which lacks direct supervision for this process. In this paper, we explore the use of mathematical priors to compute aggregated digit embeddings and explicitly incorporate these aggregates into transformer models. This can be achieved either by adding a special token to the input embeddings or by introducing an additional loss function to enhance correct predictions. We evaluate the effectiveness of incorporating this explicit aggregation, analysing its strengths and shortcomings, and discuss future directions to better benefit from this approach. Our methods, while simple, are compatible with any pretrained model and require only a few lines of code, which we have made publicly available.

Title: From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

Authors: Nan Xu, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Motivated by the in-context learning (ICL) capabilities of Large Language Models (LLMs), multimodal LLMs with an additional visual modality also exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Considering such modality impact, we further utilize modality-driven demonstration strategies to boost ICL performance. We also identify that demonstration selection is closely related to the models' ability to capture task inductive biases from multimodal ICL. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks even if those tasks are not seen in or even contradict pretraining data.

Title: Learning Robust 3D Representation from CLIP via Dual Denoising

Authors: Shuqing Luo, Bowen Qu, Wei Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Learning Robust 3D Representation from CLIP via Dual Denoising(https://arxiv.org/abs/)
Keywords: attack, robust
Abstract: In this paper, we explore a critical yet under-investigated issue: how to learn robust and well-generalized 3D representation from pre-trained vision language models such as CLIP. Previous works have demonstrated that cross-modal distillation can provide rich and useful knowledge for 3D data. However, like most deep learning models, the resultant 3D learning network is still vulnerable to adversarial attacks, especially iterative attacks. In this work, we propose Dual Denoising, a novel framework for learning robust and well-generalized 3D representations from CLIP. It combines a denoising-based proxy task with a novel feature denoising network for 3D pre-training. Additionally, we propose utilizing parallel noise inference to enhance the generalization of point cloud features under cross domain settings. Experiments show that our model can effectively improve the representation learning performance and adversarial robustness of the 3D learning network under zero-shot settings without adversarial training. Our code is available at this https URL.

Title: FineSurE: Fine-grained Summarization Evaluation using LLMs

Authors: Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] FineSurE: Fine-grained Summarization Evaluation using LLMs(https://arxiv.org/abs/)
Keywords: large language model
Abstract: Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at this https URL.

Title: SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures

Authors: Oguzhan Baser, Kaan Kale, Sandeep P. Chinchali
Subjects: cs.CR, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures(https://arxiv.org/abs/)
Keywords: secure, security, privacy, protect, defense
Abstract: Advancements in DeepFake (DF) audio models pose a significant threat to voice authentication systems, leading to unauthorized access and the spread of misinformation. We introduce a defense mechanism, SecureSpectra, addressing DF threats by embedding orthogonal, irreversible signatures within audio. SecureSpectra leverages the inability of DF models to replicate high-frequency content, which we empirically identify across diverse datasets and DF models. Integrating differential privacy into the pipeline protects signatures from reverse engineering and strikes a delicate balance between enhanced security and minimal performance compromises. Our evaluations on Mozilla Common Voice, LibriSpeech, and VoxCeleb datasets showcase SecureSpectra's superior performance, outperforming recent works by up to 71% in detection accuracy. We open-source SecureSpectra to benefit the research community.

Title: Robust and Reliable Early-Stage Website Fingerprinting Attacks via Spatial-Temporal Distribution Analysis

Authors: Xinhao Deng, Qi Li, Ke Xu
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Robust and Reliable Early-Stage Website Fingerprinting Attacks via Spatial-Temporal Distribution Analysis(https://arxiv.org/abs/)
Keywords: privacy, defense, attack, robust
Abstract: Website Fingerprinting (WF) attacks identify the websites visited by users by performing traffic analysis, compromising user privacy. Particularly, DL-based WF attacks demonstrate impressive attack performance. However, the effectiveness of DL-based WF attacks relies on the collected complete and pure traffic during the page loading, which impacts the practicality of these attacks. The WF performance is rather low under dynamic network conditions and various WF defenses, particularly when the analyzed traffic is only a small part of the complete traffic. In this paper, we propose Holmes, a robust and reliable early-stage WF attack. Holmes utilizes temporal and spatial distribution analysis of website traffic to effectively identify websites in the early stages of page loading. Specifically, Holmes develops adaptive data augmentation based on the temporal distribution of website traffic and utilizes a supervised contrastive learning method to extract the correlations between the early-stage traffic and the pre-collected complete traffic. Holmes accurately identifies traffic in the early stages of page loading by computing the correlation of the traffic with the spatial distribution information, which ensures robust and reliable detection according to early-stage traffic. We extensively evaluate Holmes using six datasets. Compared to nine existing DL-based WF attacks, Holmes improves the F1-score of identifying early-stage traffic by an average of 169.18%. Furthermore, we replay the traffic of visiting real-world dark web websites. Holmes successfully identifies dark web websites when the ratio of page loading on average is only 21.71%, with an average precision improvement of 169.36% over the existing WF attacks.

Title: PointViG: A Lightweight GNN-based Model for Efficient Point Cloud Analysis

Authors: Qiang Zheng, Yafei Qi, Chen Wang, Chao Zhang, Jian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] PointViG: A Lightweight GNN-based Model for Efficient Point Cloud Analysis(https://arxiv.org/abs/)
Keywords: segmentation
Abstract: In the domain of point cloud analysis, despite the significant capabilities of Graph Neural Networks (GNNs) in managing complex 3D datasets, existing approaches encounter challenges like high computational costs and scalability issues with extensive scenarios. These limitations restrict the practical deployment of GNNs, notably in resource-constrained environments. To address these issues, this study introduces Point Vision GNN (PointViG), an efficient framework for point cloud analysis. PointViG incorporates a lightweight graph convolutional module to efficiently aggregate local features and mitigate over-smoothing. For large-scale point cloud scenes, we propose an adaptive dilated graph convolution technique that searches for sparse neighboring nodes within a dilated neighborhood based on semantic correlation, thereby expanding the receptive field and ensuring computational efficiency. Experiments demonstrate that PointViG achieves performance comparable to state-of-the-art models while balancing performance and complexity. On the ModelNet40 classification task, PointViG achieved 94.3% accuracy with 1.5M parameters. For the S3DIS segmentation task, it achieved an mIoU of 71.7% with 5.3M parameters. These results underscore the potential and efficiency of PointViG in point cloud analysis.

Title: EXCGEC: A Benchmark of Edit-wise Explainable Chinese Grammatical Error Correction

Authors: Jingheng Ye, Shang Qin, Yinghui Li, Xuxin Cheng, Libo Qin, Hai-Tao Zheng, Peng Xing, Zishan Xu, Guo Cheng, Zhao Wei

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] EXCGEC: A Benchmark of Edit-wise Explainable Chinese Grammatical Error Correction(https://arxiv.org/abs/)

Keywords: explainability

Abstract: Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited scenario, where they ignore the interaction between corrections and explanations. To bridge the gap, this paper introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of both correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples featuring the design of hybrid edit-wise explanations. We benchmark several series of LLMs in multiple settings, covering post-explaining and pre-explaining. To promote the development of the task, we introduce a comprehensive suite of automatic metrics and conduct human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. All the codes and data will be released after the review.

Title: FoldGPT: Simple and Effective Large Language Model Compression Scheme

Authors: Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen

Subjects: cs.LG, cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] FoldGPT: Simple and Effective Large Language Model Compression Scheme(https://arxiv.org/abs/)

Keywords: security, large language model

Abstract: The demand for deploying large language models (LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices. In this study, we investigate the outputs of different layers across various scales of LLMs and find that the outputs of most layers exhibit significant similarity. Moreover, this similarity becomes more pronounced as the model size increases, indicating substantial redundancy in the depth direction of the LLMs. Based on this observation, we propose an efficient model volume compression strategy, termed FoldGPT, which combines block removal and block parameter sharing. This strategy consists of three parts: (1) Based on the learnable gating parameters, we determine the block importance ranking while modeling the coupling effect between blocks. Then we delete some redundant layers based on the given removal rate. (2) For the retained blocks, we apply a specially designed group parameter sharing strategy, where blocks within the same group share identical weights, significantly compressing the number of parameters and slightly reducing latency overhead. (3) After sharing these blocks, we "cure" the mismatch caused by sparsity with a minor amount of fine-tuning and introduce a tail-layer distillation strategy to improve the performance. Experiments demonstrate that FoldGPT outperforms previous state-of-the-art (SOTA) methods in efficient model compression, demonstrating the feasibility of achieving model lightweighting through straightforward block removal and parameter sharing.
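
The arithmetic behind combining block removal with group parameter sharing can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the floor-based rounding of the removal count, and the fixed group size are all assumptions made for the example.

```python
import math


def unique_blocks_after_folding(n_blocks: int, removal_rate: float, group_size: int) -> int:
    """Count distinct weight sets left after block removal and group sharing.

    Hypothetical helper: assumes floor(n_blocks * removal_rate) blocks are
    removed, and that the retained blocks are split into consecutive groups
    of `group_size` blocks that share a single set of weights.
    """
    if not 0.0 <= removal_rate < 1.0:
        raise ValueError("removal_rate must be in [0, 1)")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    removed = math.floor(n_blocks * removal_rate)
    kept = n_blocks - removed
    # Each group stores one copy of the weights; a final partial group
    # still needs its own copy, hence the ceiling.
    return math.ceil(kept / group_size)


if __name__ == "__main__":
    # Illustrative numbers only: 32 blocks, 25% removed, groups of 2.
    print(unique_blocks_after_folding(32, 0.25, 2))
```

Under these assumptions, removal reduces both compute and stored weights, while sharing reduces stored weights only, since every retained block is still executed.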

Title: CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction

Authors: Jingheng Ye, Zishan Xu, Yinghui Li, Xuxin Cheng, Linlin Song, Qingyu Zhou, Hai-Tao Zheng, Ying Shen, Xin Su

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction(https://arxiv.org/abs/)

Keywords: robust, interpretability

Abstract: The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics, which has received little attention in previous studies. To bridge the gap, we propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems, namely hit-correction, error-correction, under-correction, and over-correction. They collectively contribute to revealing the critical characteristics and locating drawbacks of GEC systems. Evaluating systems by combining these dimensions leads to high human consistency over other reference-based and reference-less metrics. Extensive experiments on 2 human judgement datasets and 6 reference datasets demonstrate the effectiveness and robustness of our method. All the codes will be released after the peer review.

Title: Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining

Authors: Qi Zhang, Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang

Subjects: cs.LG, cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining(https://arxiv.org/abs/)

Keywords: generative

Abstract: In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. Through establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL within the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections compared to the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To leverage each other's strengths and mitigate weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at this https URL.

Title: Large Language Model Enhanced Knowledge Representation Learning: A Survey

Authors: Xin Wang, Zirui Chen, Haofen Wang, Leong Hou U, Zhao Li, Wenbin Guo

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Large Language Model Enhanced Knowledge Representation Learning: A Survey(https://arxiv.org/abs/)

Keywords: transformer, large language model

Abstract: The integration of Large Language Models (LLMs) with Knowledge Representation Learning (KRL) signifies a pivotal advancement in the field of artificial intelligence, enhancing the ability to capture and utilize complex knowledge structures. This synergy leverages the advanced linguistic and contextual understanding capabilities of LLMs to improve the accuracy, adaptability, and efficacy of KRL, thereby expanding its applications and potential. Despite the increasing volume of research focused on embedding LLMs within the domain of knowledge representation, a thorough review that examines the fundamental components and processes of these enhanced models is conspicuously absent. Our survey addresses this by categorizing these models based on three distinct Transformer architectures, and by analyzing experimental data from various KRL downstream tasks to evaluate the strengths and weaknesses of each approach. Finally, we identify and explore potential future research directions in this emerging yet underexplored domain, proposing pathways for continued progress.

Title: MalAlgoQA: A Pedagogical Approach for Evaluating Counterfactual Reasoning Abilities

Authors: Naiming Liu, Shashank Sonkar, Myco Le, Richard Baraniuk

Subjects: cs.CL, cs.CY

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] MalAlgoQA: A Pedagogical Approach for Evaluating Counterfactual Reasoning Abilities(https://arxiv.org/abs/)

Keywords: large language model

Abstract: This paper introduces MalAlgoQA, a novel dataset designed to evaluate the counterfactual reasoning capabilities of Large Language Models (LLMs) through a pedagogical approach. The dataset comprises mathematics and reading comprehension questions, each accompanied by four answer choices and their corresponding rationales. We focus on the incorrect answer rationales, termed "malgorithms", which highlight flawed reasoning steps leading to incorrect answers and offer valuable insights into erroneous thought processes. We also propose the Malgorithm Identification task, where LLMs are assessed based on their ability to identify the corresponding malgorithm given an incorrect answer choice. To evaluate the model performance, we introduce two metrics: Algorithm Identification Accuracy (AIA) for correct answer rationale identification, and Malgorithm Identification Accuracy (MIA) for incorrect answer rationale identification. The task is challenging since state-of-the-art LLMs exhibit significant drops in MIA as compared to AIA. Moreover, we find that the chain-of-thought prompting technique not only fails to consistently enhance MIA, but can also lead to underperformance compared to simple prompting. These findings hold significant implications for the development of more cognitively-inspired LLMs to improve their counterfactual reasoning abilities, particularly through a pedagogical perspective where understanding and rectifying student misconceptions are crucial.
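
The two metrics named in the abstract can be sketched as plain accuracies over two subsets of items. This is only an illustration under stated assumptions: the `Item` record, its field names, and exact-match scoring of rationale identifiers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    """One evaluation item (hypothetical schema for this sketch)."""
    choice_is_correct: bool    # True if the answer choice shown is the correct one
    gold_rationale: str        # identifier of the rationale that matches the choice
    predicted_rationale: str   # identifier of the rationale the model selected


def identification_accuracy(items: List[Item]) -> Tuple[float, float]:
    """Return (AIA, MIA) as exact-match accuracy on each subset.

    AIA is computed over items whose answer choice is correct, MIA over
    items whose answer choice is incorrect. An empty subset yields NaN.
    """
    def accuracy(subset: List[Item]) -> float:
        if not subset:
            return float("nan")
        hits = sum(i.predicted_rationale == i.gold_rationale for i in subset)
        return hits / len(subset)

    correct = [i for i in items if i.choice_is_correct]
    incorrect = [i for i in items if not i.choice_is_correct]
    return accuracy(correct), accuracy(incorrect)
```

Reporting the two numbers separately is what makes the gap between them visible; pooling all items into one accuracy would hide it.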

Title: Diffusion Transformer Model With Compact Prior for Low-dose PET Reconstruction

Authors: Bin Huang, Xubiao Liu, Lei Fang, Qiegen Liu, Bingxuan Li

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Diffusion Transformer Model With Compact Prior for Low-dose PET Reconstruction(https://arxiv.org/abs/)

Keywords: diffusion, transformer

Abstract: Positron emission tomography (PET) is an advanced medical imaging technique that plays a crucial role in non-invasive clinical diagnosis. However, while reducing radiation exposure through low-dose PET scans is beneficial for patient safety, it often results in insufficient statistical data. This scarcity of data poses significant challenges for accurately reconstructing high-quality images, which are essential for reliable diagnostic outcomes. In this research, we propose a diffusion transformer model (DTM) guided by joint compact prior (JCP) to enhance the reconstruction quality of low-dose PET imaging. In light of current research findings, we present a pioneering PET reconstruction model that integrates diffusion and transformer models for joint optimization. This model combines the powerful distribution mapping abilities of diffusion models with the capacity of transformers to capture long-range dependencies, offering significant advantages for low-dose PET reconstruction. Additionally, the incorporation of the lesion refining block and penalized weighted least squares (PWLS) enhances the recovery capability of lesion regions and preserves detail information, solving blurring problems in lesion areas and texture details of most deep learning frameworks. Experimental results demonstrate the effectiveness of DTM in enhancing image quality and preserving critical clinical information for low-dose PET scans. Our approach not only reduces radiation exposure risks but also provides a more reliable PET imaging tool for early disease detection and patient management.

Title: Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

Authors: Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Subjects: cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs(https://arxiv.org/abs/)

Keywords: large language model

Abstract: The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per token, thereby achieving faster inference while maintaining performance. However, SMoE models still face limitations in broader deployment due to their large parameter counts and significant GPU memory requirements. In this work, we introduce a gradient-free evolutionary strategy named EEP (Efficient Expert Pruning) to enhance the pruning of experts in SMoE models. EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks. EEP can be used to reduce both the total number of experts (thus saving GPU memory) and the number of active experts (thus accelerating inference). For example, we demonstrate that pruning up to 75% of experts in Mixtral $8\times7$B-Instruct results in a substantial reduction in parameters with minimal performance loss. Remarkably, we observe improved performance on certain tasks, such as a significant increase in accuracy on the SQuAD dataset (from 53.4% to 75.4%), when pruning half of the experts. With these results, EEP not only lowers the barrier to deploying SMoE models, but also challenges the conventional understanding of model pruning by showing that fewer experts can lead to better task-specific performance without any fine-tuning. Code is available at this https URL.

Title: The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

Authors: Tanush Chopra, Michael Li

Subjects: cs.CL, cs.AI, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs(https://arxiv.org/abs/)

Keywords: fair, large language model

Abstract: We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because neither its action space nor its strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.
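
One common way to compare observed game outcomes against an expected fair-play distribution is Pearson's chi-square statistic. The sketch below is an assumption about how such a comparison could be made, not the authors' procedure; the outcome labels and probabilities in the usage example are placeholders, not real blackjack odds.

```python
from typing import Dict


def chi_square_statistic(observed: Dict[str, int], expected_probs: Dict[str, float]) -> float:
    """Pearson chi-square statistic of observed counts vs. expected probabilities.

    `expected_probs` maps each outcome to its fair-play probability and is
    assumed to sum to 1 with every probability strictly positive.
    """
    total = sum(observed.values())
    stat = 0.0
    for outcome, p in expected_probs.items():
        expected = total * p
        # Outcomes never observed contribute with a count of zero.
        stat += (observed.get(outcome, 0) - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    observed = {"house_win": 60, "player_win": 40}
    fair = {"house_win": 0.5, "player_win": 0.5}
    print(chi_square_statistic(observed, fair))
```

A larger statistic indicates a larger deviation from fair play; turning it into a p-value requires the chi-square distribution with (number of outcomes minus one) degrees of freedom.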

Title: SpectralKAN: Kolmogorov-Arnold Network for Hyperspectral Images Change Detection

Authors: Yanheng Wang, Xiaohan Yu, Yongsheng Gao, Jianjun Sha, Jian Wang, Lianru Gao, Yonggang Zhang, Xianhui Rong

Subjects: cs.CV, eess.IV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] SpectralKAN: Kolmogorov-Arnold Network for Hyperspectral Images Change Detection(https://arxiv.org/abs/)

Keywords: transformer

Abstract: It has been verified that deep learning methods, including convolutional neural networks (CNNs), graph neural networks (GNNs), and transformers, can accurately extract features from hyperspectral images (HSIs). These algorithms perform exceptionally well on HSIs change detection (HSIs-CD). However, the downside of these impressive results is the enormous number of parameters, FLOPs, GPU memory, training and test times required. In this paper, we propose a spectral Kolmogorov-Arnold Network for HSIs-CD (SpectralKAN). SpectralKAN represents a multivariate continuous function with a composition of activation functions to extract HSI features and perform classification. These activation functions are b-spline functions with different parameters that can simulate various functions. In SpectralKAN, a KAN encoder is proposed to enhance computational efficiency for HSIs. In addition, a spatial-spectral KAN encoder is introduced, where the spatial KAN encoder extracts spatial features and compresses the spatial dimensions from patch size to one. The spectral KAN encoder then extracts spectral features and classifies them into changed and unchanged categories. We use five HSIs-CD datasets to verify the effectiveness of SpectralKAN. Experimental verification has shown that SpectralKAN maintains high HSIs-CD accuracy while requiring fewer parameters, FLOPs, GPU memory, training and testing times, thereby increasing the efficiency of HSIs-CD. The code will be available at this https URL.

Title: SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models

Authors: Zheng Lin, Xuanjie Hu, Yuxin Zhang, Zhe Chen, Zihan Fang, Xianhao Chen, Ang Li, Praneeth Vepakomma, Yue Gao

Subjects: cs.LG, cs.CL, cs.DC

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models(https://arxiv.org/abs/)

Keywords: federate, large language model

Abstract: The scalability of large language models (LLMs) in handling high-complexity models and large-scale datasets has led to tremendous successes in pivotal domains. While there is an urgent need to acquire more training data for LLMs, a concerning reality is the depletion of high-quality public datasets within a few years. In view of this, the federated learning (FL) LLM fine-tuning paradigm recently has been proposed to facilitate collaborative LLM fine-tuning on distributed private data, where multiple data owners collaboratively fine-tune a shared LLM without sharing raw data. However, the staggering model size of LLMs imposes heavy computing and communication burdens on clients, posing significant barriers to the democratization of the FL LLM fine-tuning paradigm. To address this issue, split learning (SL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while exchanging activation/activation's gradients with smaller data sizes rather than the entire LLM. Unfortunately, research on the SL LLM fine-tuning paradigm is still in its nascent stage. To fill this gap, in this paper, we propose the first SL LLM fine-tuning framework, named SplitLoRA. SplitLoRA is built on the split federated learning (SFL) framework, amalgamating the advantages of parallel training from FL and model splitting from SL and thus greatly enhancing the training efficiency. It is worth noting that SplitLoRA is the inaugural open-source benchmark for SL LLM fine-tuning, providing a foundation for research efforts dedicated to advancing SL LLM fine-tuning. Extensive simulations validate that SplitLoRA achieves target accuracy in significantly less time than state-of-the-art LLM fine-tuning frameworks, demonstrating the superior training performance of SplitLoRA. The project page is available at this https URL.

Title: Smoothed Analysis for Learning Concepts with Low Intrinsic Dimension

Authors: Gautam Chandrasekaran, Adam Klivans, Vasilis Kontonis, Raghu Meka, Konstantinos Stavropoulos

Subjects: cs.LG, cs.CC

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Smoothed Analysis for Learning Concepts with Low Intrinsic Dimension(https://arxiv.org/abs/)

Keywords: robust

Abstract: In traditional models of supervised learning, the goal of a learner -- given examples from an arbitrary joint distribution on $\mathbb{R}^d \times \{\pm 1\}$ -- is to output a hypothesis that is competitive (to within $\epsilon$) of the best fitting concept from some class. In order to escape strong hardness results for learning even simple concept classes, we introduce a smoothed-analysis framework that requires a learner to compete only with the best classifier that is robust to small random Gaussian perturbation. This subtle change allows us to give a wide array of learning results for any concept that (1) depends on a low-dimensional subspace (aka multi-index model) and (2) has a bounded Gaussian surface area. This class includes functions of halfspaces and (low-dimensional) convex sets, cases that are only known to be learnable in non-smoothed settings with respect to highly structured distributions such as Gaussians. Surprisingly, our analysis also yields new results for traditional non-smoothed frameworks such as learning with margin. In particular, we obtain the first algorithm for agnostically learning intersections of $k$-halfspaces in time $k^{poly(\frac{\log k}{\epsilon \gamma}) }$ where $\gamma$ is the margin parameter. Before our work, the best-known runtime was exponential in $k$ (Arriaga and Vempala, 1999).

Title: Deep learning for automated detection of breast cancer in deep ultraviolet fluorescence images with diffusion probabilistic model

Authors: Sepehr Salem Ghahfarokhi, Tyrell To, Julie Jorns, Tina Yen, Bing Yu, Dong Hye Ye

Subjects: cs.CV, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Deep learning for automated detection of breast cancer in deep ultraviolet fluorescence images with diffusion probabilistic model(https://arxiv.org/abs/)

Keywords: diffusion

Abstract: Data limitation is a significant challenge in applying deep learning to medical images. Recently, the diffusion probabilistic model (DPM) has shown the potential to generate high-quality images by converting Gaussian random noise into realistic images. In this paper, we apply the DPM to augment the deep ultraviolet fluorescence (DUV) image dataset with an aim to improve breast cancer classification for intraoperative margin assessment. For classification, we divide the whole surface DUV image into small patches and extract convolutional features for each patch by utilizing the pre-trained ResNet. Then, we feed them into an XGBoost classifier for patch-level decisions and then fuse them with a regional importance map computed by Grad-CAM++ for whole surface-level prediction. Our experimental results show that augmenting the training dataset with the DPM significantly improves breast cancer detection performance in DUV images, increasing accuracy from 93% to 97%, compared to using Affine transformations and ProGAN.

Title: How Does Overparameterization Affect Features?

Authors: Ahmet Cagri Duzgun, Samy Jelassi, Yuanzhi Li

Subjects: cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] How Does Overparameterization Affect Features?(https://arxiv.org/abs/)

Keywords: transformer

Abstract: Overparameterization, the condition where models have more parameters than necessary to fit their training loss, is a crucial factor for the success of deep learning. However, the characteristics of the features learned by overparameterized networks are not well understood. In this work, we explore this question by comparing models with the same architecture but different widths. We first examine the expressivity of the features of these models, and show that the feature space of overparameterized networks cannot be spanned by concatenating many underparameterized features, and vice versa. This reveals that both overparameterized and underparameterized networks acquire some distinctive features. We then evaluate the performance of these models, and find that overparameterized networks outperform underparameterized networks, even when many of the latter are concatenated. We corroborate these findings using a VGG-16 and ResNet18 on CIFAR-10 and a Transformer on the MNLI classification dataset. Finally, we propose a toy setting to explain how overparameterized networks can learn some important features that the underparameterized networks cannot learn.

Title: FALCON: Frequency Adjoint Link with CONtinuous Density Mask for Fast Single Image Dehazing

Authors: Donghyun Kim, Seil Kang, Seong Jae Hwang

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] FALCON: Frequency Adjoint Link with CONtinuous Density Mask for Fast Single Image Dehazing(https://arxiv.org/abs/)

Keywords: robust

Abstract: Image dehazing, addressing atmospheric interference like fog and haze, remains a pervasive challenge crucial for robust vision applications such as surveillance and remote sensing under adverse visibility. While various methodologies have evolved from early works predicting transmission matrix and atmospheric light features to deep learning and dehazing networks, they innately prioritize dehazing quality metrics, neglecting the need for real-time applicability in time-sensitive domains like autonomous driving. This work introduces FALCON (Frequency Adjoint Link with CONtinuous density mask), a single-image dehazing system achieving state-of-the-art performance on both quality and speed. Particularly, we develop a novel bottleneck module, namely, Frequency Adjoint Link, operating in the frequency space to globally expand the receptive field with minimal growth in network size. Further, we leverage the underlying haze distribution based on the atmospheric scattering model via a Continuous Density Mask (CDM) which serves as a continuous-valued mask input prior and a differentiable auxiliary loss. Comprehensive experiments involving multiple state-of-the-art methods and ablation analysis demonstrate FALCON's exceptional performance in both dehazing quality and speed (i.e., >180 frames-per-second), quantified by metrics such as FPS, PSNR, and SSIM.

Title: Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval

Authors: Hanwen Su, Ge Song, Kai Huang, Jiyan Wang, Ming Yang

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval(https://arxiv.org/abs/)

Keywords: extraction, transformer

Abstract: In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR). Prior methods tackle the problem in a two-modality setting with only category labels or even no textual information involved. However, the growing prevalence of Large-scale pre-trained Language Models (LLMs), which have demonstrated great knowledge learned from web-scale data, provides an opportunity to gather collective textual information. Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers. To this end, we propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval. The network consists of three components: (i) a Description Generation Module that generates textual descriptions for each training category by prompting an LLM with several interrogative sentences, (ii) a Feature Extraction Module that includes two ViTs for sketch and image data and a transformer for extracting sentence tokens for each training category, and (iii) a Cross-modal Alignment Module that exchanges the token features of both text-sketch and text-image pairs using a cross-attention mechanism and aligns the tokens locally and globally. Extensive experiments on three benchmark datasets show superior performance over state-of-the-art ZS-SBIR methods.

Title: FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

Authors: Ruinan Jin, Zikang Xu, Yuan Zhong, Qiongsong Yao, Qi Dou, S. Kevin Zhou, Xiaoxiao Li

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models(https://arxiv.org/abs/)

Keywords: fair, segmentation

Abstract: The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant concerns about their fairness, especially when applied to diverse and underrepresented populations in healthcare applications. Currently, there is a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries to evaluate and understand the fairness performance of FMs in medical imaging, leading to considerable challenges in formulating and implementing solutions that ensure equitable outcomes across diverse patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates with 17 popular medical imaging datasets, encompassing different modalities, dimensionalities, and sensitive attributes. It explores 20 widely used FMs, with various usages such as zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting in various downstream tasks -- classification and segmentation. Our exhaustive analysis evaluates the fairness performance over different evaluation metrics from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs on different FMs, consistent disparities on the same datasets regardless of FMs, and limited effectiveness of existing unfairness mitigation methods. Check out FairMedFM's project page and open-sourced codebase, which supports extensible functionalities and applications and is inclusive of studies on FMs in medical imaging over the long term.

Title: LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation

Authors: Longchao Da, Tiejin Chen, Lu Cheng, Hua Wei

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Large language models (LLMs) have showcased superior capabilities in sophisticated tasks across various domains. Starting from basic question-answering (QA), they are nowadays used as decision assistants or explainers for unfamiliar content. However, they are not always correct, due to data sparsity in domain-specific corpora or the model's hallucination problems. Given this, how much should we trust the responses from LLMs? This paper presents a novel way to evaluate uncertainty that captures directional instability by constructing a directional graph from entailment probabilities. Given the asymmetric property of the constructed directed graph, we apply the Random Walk Laplacian and aggregate the uncertainty from the eigenvalues derived in the Laplacian process. We also provide a way to incorporate the existing work's semantic uncertainty with our proposed layer. In addition, this paper identifies vagueness issues in the raw response set and proposes an augmentation approach to mitigate them. We conducted extensive empirical experiments and demonstrated the superiority of our proposed solutions.

Title: Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?

Authors: Nicy Scaria, Silvester John Joseph Kennedy, Deepak Subramani

Subjects: cs.CL, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Small Language Models (SLMs) are generally considered to be more compact versions of large language models (LLMs), typically having fewer than 7 billion parameters. This study investigates the ability of small language models to learn, retain, and subsequently eliminate noise that is typically not found on the internet, where most pretraining datasets are sourced. For this, four pre-trained SLMs were utilized: Olmo 1B, Qwen1.5 1.8B, Gemma 2B, and Phi2 2.7B. The models were instruction-tuned without noise and tested for task execution with in-context learning. Afterward, noise patterns were introduced to evaluate the models' learning and unlearning capabilities. We evaluated the models' performance at various training levels. Phi consistently excelled with word-level noise but performed the worst with character-level noise. Despite being the smallest with approximately 1 billion parameters, Olmo performed consistently well on tasks.

Title: Engineering Conversational Search Systems: A Review of Applications, Architectures, and Functional Components

Authors: Phillip Schneider, Wessel Poelman, Michael Rovatsos, Florian Matthes

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Engineering Conversational Search Systems: A Review of Applications, Architectures, and Functional Components(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Conversational search systems enable information retrieval via natural language interactions, with the goal of maximizing users' information gain over multiple dialogue turns. The increasing prevalence of conversational interfaces adopting this search paradigm challenges traditional information retrieval approaches, stressing the importance of better understanding the engineering process of developing these systems. We undertook a systematic literature review to investigate the links between theoretical studies and technical implementations of conversational search systems. Our review identifies real-world application scenarios, system architectures, and functional components. We consolidate our results by presenting a layered architecture framework and explaining the core functions of conversational search systems. Furthermore, we reflect on our findings in light of the rapid progress in large language models, discussing their capabilities, limitations, and directions for future research.

Title: Embedded Prompt Tuning: Towards Enhanced Calibration of Pretrained Models for Medical Images

Authors: Wenqiang Zu, Shenghao Xie, Qing Zhao, Guoqi Li, Lei Ma

Subjects: cs.CV, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Embedded Prompt Tuning: Towards Enhanced Calibration of Pretrained Models for Medical Images(https://arxiv.org/abs/)

Keywords: transformer

Abstract: Foundation models pre-trained on large-scale data have been widely shown to succeed in various natural imaging downstream tasks. Parameter-efficient fine-tuning (PEFT) methods aim to adapt foundation models to new domains by updating only a small portion of parameters in order to reduce computational overhead. However, the effectiveness of these PEFT methods, especially in cross-domain few-shot scenarios, e.g., medical image analysis, has not been fully explored. In this work, we facilitate the study of the performance of PEFT when adapting foundation models to medical image classification tasks. Furthermore, to alleviate the limitations of mainstream prompt tuning methods in how prompts are introduced and in their approximation capabilities on Transformer architectures, we propose the Embedded Prompt Tuning (EPT) method by embedding prompt tokens into the expanded channels. We also find that there are anomalies in the feature space distribution of foundation models during the pre-training process, and prompt tuning can help mitigate this negative impact. To explain this phenomenon, we also introduce a novel perspective to understand prompt tuning: prompt tuning is a distribution calibrator. We support it by analyzing patch-wise scaling and feature separation operations contained in EPT. Our experiments show that EPT outperforms several state-of-the-art fine-tuning methods by a significant margin on few-shot medical image classification tasks, and completes the fine-tuning process within highly competitive time, indicating EPT is an effective PEFT method. Our code will be released once accepted.

Title: CURLS: Causal Rule Learning for Subgroups with Significant Treatment Effect

Authors: Jiehui Zhou, Linxiao Yang, Xingyu Liu, Xinyue Gu, Liang Sun, Wei Chen

Subjects: cs.LG, stat.ME

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] CURLS: Causal Rule Learning for Subgroups with Significant Treatment Effect(https://arxiv.org/abs/)

Keywords: interpretability

Abstract: In causal inference, estimating heterogeneous treatment effects (HTE) is critical for identifying how different subgroups respond to interventions, with broad applications in fields such as precision medicine and personalized advertising. Although HTE estimation methods aim to improve accuracy, how to provide explicit subgroup descriptions remains unclear, hindering data interpretation and strategic intervention management. In this paper, we propose CURLS, a novel rule learning method leveraging HTE, which can effectively describe subgroups with significant treatment effects. Specifically, we frame causal rule learning as a discrete optimization problem, finely balancing treatment effect with variance and considering the rule interpretability. We design an iterative procedure based on the minorize-maximization algorithm and solve a submodular lower bound as an approximation for the original. Quantitative experiments and qualitative case studies verify that compared with state-of-the-art methods, CURLS can find subgroups where the estimated and true effects are 16.1% and 13.8% higher and the variance is 12.0% smaller, while maintaining similar or better estimation accuracy and rule interpretability. Code is available at this https URL.

Title: GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking

Authors: Huijie Fan, Tinghui Zhao, Qiang Wang, Baojie Fan, Yandong Tang, LianQing Liu

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking(https://arxiv.org/abs/)

Keywords: robust, extraction, transformer

Abstract: In the task of multi-target multi-camera (MTMC) tracking of pedestrians, the data association problem is a key issue and main challenge, especially with complications arising from camera movements, lighting variations, and obstructions. However, most MTMC models adopt two-step approaches, thus heavily depending on the results of the first-step tracking in practical applications. Moreover, the same targets crossing different cameras may exhibit significant appearance variations, which further increases the difficulty of cross-camera matching. To address the aforementioned issues, we propose a global online MTMC tracking model that addresses the dependency on the first tracking stage in two-step methods and enhances cross-camera matching. Specifically, we propose a transformer-based global MTMC association module to explore target associations across different cameras and frames, generating global trajectories directly. Additionally, to integrate the appearance and spatio-temporal features of targets, we propose a feature extraction and fusion module for MTMC tracking. This module enhances feature representation and establishes correlations between the features of targets across multiple cameras. To accommodate high scene diversity and complex lighting condition variations, we have established the VisionTrack dataset, which enables the development of models that are more generalized and robust to various environments. Our model demonstrates significant improvements over comparison methods on the VisionTrack dataset and others.

Title: DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models

Authors: Jiabao Pan, Yan Zhang, Chen Zhang, Zuozhu Liu, Hongwei Wang, Haizhou Li

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Large language models (LLMs) have demonstrated emergent capabilities across diverse reasoning tasks via popular Chains-of-Thought (COT) prompting. However, such a simple and fast COT approach often encounters limitations in dealing with complicated problems, while a thorough method, which considers multiple reasoning pathways and verifies each step carefully, results in slower inference. This paper addresses the challenge of enabling LLMs to autonomously select between fast and slow inference methods, thereby optimizing both efficiency and effectiveness. We introduce a dynamic decision-making framework that categorizes tasks into two distinct pathways: 'Fast', designated for tasks where the LLM quickly identifies a high-confidence solution, and 'Slow', allocated for tasks that the LLM perceives as complex, for which it has low confidence in immediate solutions and requires more reasoning paths to verify. Experiments on five popular reasoning benchmarks demonstrated the superiority of DynaThink over baselines.

Title: An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations

Authors: Weimin Bai, Yifei Wang, Wenzheng Chen, He Sun

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations(https://arxiv.org/abs/)

Keywords: diffusion

Abstract: Diffusion models excel in solving imaging inverse problems due to their ability to model complex image priors. However, their reliance on large, clean datasets for training limits their practical use where clean data is scarce. In this paper, we propose EMDiffusion, an expectation-maximization (EM) approach to train diffusion models from corrupted observations. Our method alternates between reconstructing clean images from corrupted data using a known diffusion model (E-step) and refining diffusion model weights based on these reconstructions (M-step). This iterative process leads the learned diffusion model to gradually converge to the true clean data distribution. We validate our method through extensive experiments on diverse computational imaging tasks, including random inpainting, denoising, and deblurring, achieving new state-of-the-art performance.

Title: Augmenting Document-level Relation Extraction with Efficient Multi-Supervision

Authors: Xiangyu Lin, Weijia Jia, Zhiguo Gong

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Augmenting Document-level Relation Extraction with Efficient Multi-Supervision(https://arxiv.org/abs/)

Keywords: robust, extraction

Abstract: Despite its popularity in sentence-level relation extraction, distantly supervised data is rarely utilized by existing work in document-level relation extraction due to its noisy nature and low information density. Among its current applications, distantly supervised data is mostly used as a whole for pretraining, which is of low time efficiency. To fill in the gap of efficient and robust utilization of distantly supervised training data, we propose Efficient Multi-Supervision for document-level relation extraction, in which we first select a subset of informative documents from the massive dataset by combining distant supervision with expert supervision, then train the model with Multi-Supervision Ranking Loss that integrates the knowledge from multiple sources of supervision to alleviate the effects of noise. The experiments demonstrate the effectiveness of our method in improving the model performance with higher time efficiency than existing baselines.

Title: Blind Inversion using Latent Diffusion Priors

Authors: Weimin Bai, Siyi Chen, Wenzheng Chen, He Sun

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Blind Inversion using Latent Diffusion Priors(https://arxiv.org/abs/)

Keywords: diffusion

Abstract: Diffusion models have emerged as powerful tools for solving inverse problems due to their exceptional ability to model complex prior distributions. However, existing methods predominantly assume known forward operators (i.e., non-blind), limiting their applicability in practical settings where acquiring such operators is costly. Additionally, many current approaches rely on pixel-space diffusion models, leaving the potential of more powerful latent diffusion models (LDMs) underexplored. In this paper, we introduce LatentDEM, an innovative technique that addresses more challenging blind inverse problems using latent diffusion priors. At the core of our method is solving blind inverse problems within an iterative Expectation-Maximization (EM) framework: (1) the E-step recovers clean images from corrupted observations using LDM priors and a known forward model, and (2) the M-step estimates the forward operator based on the recovered images. Additionally, we propose two novel optimization techniques tailored for LDM priors and EM frameworks, yielding more accurate and efficient blind inversion results. As a general framework, LatentDEM supports both linear and non-linear inverse problems. Beyond common 2D image restoration tasks, it enables new capabilities in non-linear 3D inverse rendering problems. We validate LatentDEM's performance on representative 2D blind deblurring and 3D sparse-view reconstruction tasks, demonstrating its superior efficacy over prior art.

Title: EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting

Authors: Chenxin Li, Brandon Y. Feng, Yifan Liu, Hengyu Liu, Cheng Wang, Weihao Yu, Yixuan Yuan

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting(https://arxiv.org/abs/)

Keywords: robust

Abstract: 3D reconstruction of biological tissues from a collection of endoscopic images is a key to unlock various important downstream surgical applications with 3D capabilities. Existing methods employ various advanced neural rendering techniques for photorealistic view synthesis, but they often struggle to recover accurate 3D representations when only sparse observations are available, which is usually the case in real-world clinical scenarios. To tackle this sparsity challenge, we propose a framework leveraging the prior knowledge from multiple foundation models during the reconstruction process, dubbed EndoSparse. Experimental results indicate that our proposed strategy significantly improves the geometric and appearance quality under challenging sparse-view conditions, including using only three views. In rigorous benchmarking experiments against state-of-the-art methods, EndoSparse achieves superior results in terms of accurate geometry, realistic appearance, and rendering efficiency, confirming the robustness to sparse-view limitations in endoscopic reconstruction. EndoSparse signifies a steady step towards the practical deployment of neural 3D reconstruction in real-world clinical scenarios. Project page: this https URL.

Title: PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs

Authors: Dan Peng, Zhihui Fu, Jun Wang

Subjects: cs.LG, cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs(https://arxiv.org/abs/)

Keywords: privacy, large language model

Abstract: Recent advancements in large language models (LLMs) have showcased their impressive capabilities. On mobile devices, the wealth of valuable, non-public data generated daily holds great promise for locally fine-tuning personalized LLMs, while maintaining privacy through on-device processing. However, the constraints of mobile device resources pose challenges to direct on-device LLM fine-tuning, mainly due to the memory-intensive nature of derivative-based optimization required for saving gradients and optimizer states. To tackle this, we propose employing derivative-free optimization techniques to enable on-device fine-tuning of LLMs, even on memory-limited mobile devices. Empirical results demonstrate that the RoBERTa-large model and OPT-1.3B can be fine-tuned locally on the OPPO Reno 6 smartphone using around 4GB and 6.5GB of memory respectively, using derivative-free optimization techniques. This highlights the feasibility of on-device LLM fine-tuning on mobile devices, paving the way for personalized LLMs on resource-constrained devices while safeguarding data privacy.

Title: Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Authors: Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, Paul F Jaeger

Subjects: cs.LG, cs.CV, stat.ME

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Overcoming Common Flaws in the Evaluation of Selective Classification Systems(https://arxiv.org/abs/)

Keywords: interpretability

Abstract: Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve (AUGRC), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.

Title: SE(3)-Hyena Operator for Scalable Equivariant Learning

Authors: Artem Moskalev, Mangal Prakash, Rui Liao, Tommaso Mansi

Subjects: cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] SE(3)-Hyena Operator for Scalable Equivariant Learning(https://arxiv.org/abs/)

Keywords: transformer

Abstract: Modeling global geometric context while maintaining equivariance is crucial for accurate predictions in many fields such as biology, chemistry, or vision. Yet, this is challenging due to the computational demands of processing high-dimensional data at scale. Existing approaches, such as equivariant self-attention or distance-based message passing, suffer from quadratic complexity with respect to sequence length, while localized methods sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, in this work, we introduce the SE(3)-Hyena operator, an equivariant long-convolutional model based on the Hyena operator. SE(3)-Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on equivariant associative recall and n-body modeling, SE(3)-Hyena matches or outperforms equivariant self-attention while requiring significantly less memory and computational resources for long sequences. Our model processes the geometric context of 20k tokens 3.5x faster than the equivariant transformer and allows a 175x longer context within the same memory budget.

Title: Improve ROI with Causal Learning and Conformal Prediction

Authors: Meng Ai, Zhuo Chen, Jibin Wang, Jing Shang, Tao Tao, Zhen Li

Subjects: cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Improve ROI with Causal Learning and Conformal Prediction(https://arxiv.org/abs/)

Keywords: robust

Abstract: In the commercial sphere, such as operations and maintenance, advertising, and marketing recommendations, intelligent decision-making utilizing data mining and neural network technologies is crucial, especially in resource allocation to optimize ROI. This study delves into the Cost-aware Binary Treatment Assignment Problem (C-BTAP) across different industries, with a focus on the state-of-the-art Direct ROI Prediction (DRP) method. However, the DRP model confronts issues like covariate shift and insufficient training data, hindering its real-world effectiveness. Addressing these challenges is essential for ensuring dependable and robust predictions in varied operational contexts. This paper presents a robust Direct ROI Prediction (rDRP) method, designed to address challenges in real-world deployment of neural network-based uplift models, particularly under conditions of covariate shift and insufficient training data. The rDRP method, enhancing the standard DRP model, does not alter the model's structure or require retraining. It utilizes conformal prediction and Monte Carlo dropout for interval estimation, adapting to model uncertainty and data distribution shifts. A heuristic calibration method, inspired by a Kaggle competition, combines point and interval estimates. The effectiveness of these approaches is validated through offline tests and online A/B tests in various settings, demonstrating significant improvements in target rewards compared to the state-of-the-art method.
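Note: the abstract above names conformal prediction as one source of interval estimates. As a generic illustration only, and not the rDRP procedure itself, the sketch below shows standard split conformal prediction for a point predictor. The function names (split_conformal_halfwidth, predict_interval) and the parameter alpha are assumptions introduced here. The coverage guarantee of this plain variant assumes that calibration and test data are exchangeable, so it does not by itself handle covariate shift.

```python
import math


def split_conformal_halfwidth(cal_preds, cal_targets, alpha=0.1):
    """Return the interval half-width from a held-out calibration set.

    Scores are absolute residuals |y - y_hat|. The half-width is the
    k-th smallest score with k = ceil((n + 1) * (1 - alpha)), which
    targets roughly (1 - alpha) coverage under exchangeability.
    """
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # (infinite) interval carries the guarantee.
        return math.inf
    return scores[k - 1]


def predict_interval(pred, halfwidth):
    """Turn a point prediction into a symmetric interval."""
    return pred - halfwidth, pred + halfwidth


if __name__ == "__main__":
    # Hypothetical calibration data: point predictions and observed targets.
    cal_preds = [0.0] * 9
    cal_targets = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    q = split_conformal_halfwidth(cal_preds, cal_targets, alpha=0.5)
    print(predict_interval(10.0, q))
```

In a pipeline like the one the abstract describes, an interval of this kind would be combined with the point estimate by a separate calibration step; that step is not sketched here.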

Title: Multimodal Conditional 3D Face Geometry Generation

Authors: Christopher Otto, Prashanth Chandran, Sebastian Weiss, Markus Gross, Gaspard Zoss, Derek Bradley

Subjects: cs.CV, cs.GR

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Multimodal Conditional 3D Face Geometry Generation(https://arxiv.org/abs/)

Keywords: diffusion

Abstract: We present a new method for multimodal conditional 3D face geometry generation that allows user-friendly control over the output identity and expression via a number of different conditioning signals. Within a single model, we demonstrate 3D faces generated from artistic sketches, 2D face landmarks, Canny edges, FLAME face model parameters, portrait photos, or text prompts. Our approach is based on a diffusion process that generates 3D geometry in a 2D parameterized UV domain. Geometry generation passes each conditioning signal through a set of cross-attention layers (IP-Adapter), one set for each user-defined conditioning signal. The result is an easy-to-use 3D face generation tool that produces high resolution geometry with fine-grain user control.

Title: Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese

Authors: Yunqi Xu, Tianchi Cai, Jiyan Jiang, Xierui Song

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese(https://arxiv.org/abs/)

Keywords: large language model

Abstract: The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE). Despite the various FCE methods proposed earlier, these methods are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, as these methods may fail to detect the error types generated by other LLMs. To fill this gap, in this paper, we propose the first comprehensive FCE benchmark Face4RAG for RAG independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology for factuality inconsistency error and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or real-world error distributions. On the proposed benchmark, we discover the failure of existing FCE methods to detect the logical fallacy, which refers to a mismatch of logic structures between the answer and the retrieved reference. To fix this issue, we further propose a new method called L-Face4RAG with two novel designs of logic-preserving answer decomposition and fact-logic FCE. Extensive experiments show L-Face4RAG substantially outperforms previous methods for factual inconsistency detection on a wide range of tasks, notably beyond the RAG task from which it is originally motivated. Both the benchmark and our proposed method are publicly available at this https URL.

Title: Min P Sampling: Balancing Creativity and Coherence at High Temperature

Authors: Minh Nguyen, Andrew Baker, Andreas Kirsch, Clement Neo

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Min P Sampling: Balancing Creativity and Coherence at High Temperature(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Large Language Models (LLMs) generate longform text by successively sampling the next token based on the probability distribution of the token vocabulary at each decoding step. Current popular truncation sampling methods such as top-$p$ sampling, also known as nucleus sampling, often struggle to balance coherence and creativity in generating text, particularly when using higher temperatures. To address this issue, we propose min-$p$, a dynamic truncation sampling method that establishes a minimum base percentage threshold for tokens, which then scales according to the probability of the top candidate token. Through experiments on several benchmarks, such as GPQA, GSM8K and AlpacaEval Creative Writing, we demonstrate that min-$p$ improves the coherence and quality of generated text even at high temperatures, while also facilitating more creative and diverse outputs compared to top-$p$ and other sampling methods. As of writing, min-$p$ has been adopted by multiple open-source LLM implementations, and has been independently assessed by members of the open-source LLM community, further validating its practical utility and potential.
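Note: a minimal sketch of the min-p truncation rule described in the abstract above, written in plain Python and not taken from the paper's code. The names min_p_filter, sample_min_p, and p_base are assumptions introduced here for illustration. The sketch assumes the input is already a normalized probability vector (for example, after temperature scaling and softmax). It keeps every token whose probability is at least p_base times the top token's probability, then renormalizes.

```python
import random


def min_p_filter(probs, p_base=0.1):
    """Zero out tokens below the scaled threshold and renormalize.

    The threshold is p_base * max(probs), so the cut-off tightens when
    the model is confident and loosens when the distribution is flat.
    """
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token is kept
    return [p / total for p in kept]


def sample_min_p(probs, p_base=0.1, rng=random):
    """Sample one token index from the min-p truncated distribution."""
    filtered = min_p_filter(probs, p_base)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


if __name__ == "__main__":
    # Hypothetical next-token distribution over a 5-token vocabulary.
    probs = [0.5, 0.3, 0.15, 0.04, 0.01]
    print(min_p_filter(probs, p_base=0.1))
    print(sample_min_p(probs, p_base=0.1, rng=random.Random(0)))
```

With p_base = 0.1 and a top probability of 0.5, the threshold is 0.05, so the two least likely tokens are dropped and the remaining three are rescaled to sum to one.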

Title: Rethinking LLM-based Preference Evaluation

Authors: Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Jingang Wang, Zhenyu Chen, Jieyu Zhao, Hui Xiong

Subjects: cs.LG, cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Rethinking LLM-based Preference Evaluation(https://arxiv.org/abs/)

Keywords: fair, large language model

Abstract: Recently, large language model (LLM)-based preference evaluation has been widely adopted to compare pairs of model responses. However, a severe bias towards lengthy responses has been observed, raising concerns about the reliability of this evaluation method. In this work, we design a series of controlled experiments to study the major impacting factors of the metric of LLM-based preference evaluation, i.e., win rate, and conclude that the win rate is affected by two axes of model response: desirability and information mass, where the former is length-independent and related to trustworthiness, and the latter is length-dependent and can be represented by conditional entropy. We find that length impacts the existing evaluations by influencing information mass. However, a reliable evaluation metric should not only assess content quality but also ensure that the assessment is not confounded by extraneous factors such as response length. Therefore, we propose a simple yet effective adjustment, AdapAlpaca, to the existing practice of win rate measurement. Specifically, by adjusting the lengths of reference answers to match the test model's answers within the same interval, we debias information mass relative to length, ensuring a fair model evaluation.
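
The length-matching idea in the last sentence can be sketched roughly as follows. This is a hypothetical illustration only: the word-count length measure, the interval width of 200, the nearest-interval fallback and the helper names are all assumptions, not details of AdapAlpaca.

```python
def length_interval(text, width=200):
    """Index of the word-count interval that a response falls into."""
    return len(text.split()) // width


def pick_reference(test_answer, references, width=200):
    """Choose the reference answer whose length interval matches the test answer.

    Falls back to the nearest interval when no reference matches exactly
    (the fallback is an assumption made for this sketch).
    """
    target = length_interval(test_answer, width)
    for ref in references:
        if length_interval(ref, width) == target:
            return ref
    return min(references, key=lambda r: abs(length_interval(r, width) - target))
```

A judge would then compare the test answer against the selected reference, so that both sides of the comparison have similar length.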

Title: M2QA: Multi-domain Multilingual Question Answering

Authors: Leon Engländer, Hannah Sterz, Clifton Poth, Jonas Pfeiffer, Ilia Kuznetsov, Iryna Gurevych

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] M2QA: Multi-domain Multilingual Question Answering(https://arxiv.org/abs/)

Keywords: robust

Abstract: Generalization and robustness to input variation are core desiderata of machine learning research. Language varies along several axes, most importantly, language instance (e.g. French) and domain (e.g. news). While adapting NLP models to new languages within a single domain, or to new domains within a single language, is widely studied, research in joint adaptation is hampered by the lack of evaluation datasets. This prevents the transfer of NLP systems from well-resourced languages and domains to non-dominant language-domain combinations. To address this gap, we introduce M2QA, a multi-domain multilingual question answering benchmark. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. We use M2QA to explore cross-lingual cross-domain performance of fine-tuned models and state-of-the-art LLMs and investigate modular approaches to domain and language adaptation. We witness 1) considerable performance variations across domain-language combinations within model classes and 2) considerable performance drops between source and target language-domain combinations across all model sizes. We demonstrate that M2QA is far from solved, and new methods to effectively transfer both linguistic and domain-specific information are necessary. We make M2QA publicly available at this https URL.

Title: Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies

Authors: Ivan Drokin

Subjects: cs.CV, cs.AI, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies(https://arxiv.org/abs/)

Keywords: segmentation

Abstract: The emergence of Kolmogorov-Arnold Networks (KANs) has sparked significant interest and debate within the scientific community. This paper explores the application of KANs in the domain of computer vision (CV). We examine the convolutional version of KANs, considering various nonlinearity options beyond splines, such as Wavelet transforms and a range of polynomials. We propose a parameter-efficient design for Kolmogorov-Arnold convolutional layers and a parameter-efficient finetuning algorithm for pre-trained KAN models, as well as KAN convolutional versions of self-attention and focal modulation layers. We provide empirical evaluations conducted on MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, and HAM10000 datasets for image classification tasks. Additionally, we explore segmentation tasks, proposing U-Net-like architectures with KAN convolutions, and achieving state-of-the-art results on BUSI, GlaS, and CVC datasets. We summarize all of our findings in a preliminary design guide of KAN convolutional models for computer vision tasks. Furthermore, we investigate regularization techniques for KANs. All experimental code and implementations of convolutional layers and models, along with weights pre-trained on ImageNet1k, are available on GitHub via this https URL

Title: IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation

Authors: Senyu Han, Lu Chen, Li-Min Lin, Zhengshan Xu, Kai Yu

Subjects: cs.CL, cs.AI, cs.MA

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Large language models have demonstrated their capabilities in storyline creation and human-like character role-playing. Current language model agents mainly focus on reasonable behaviors from the level of individuals, and their behaviors might be hard to constrain on the level of the whole storyline. In this paper we introduce IBSEN, a director-actor coordinate agent framework that generates drama scripts and makes the plot played by agents more controllable. The director agent writes plot outlines that the user desires to see, instructs the actor agents to role-play their characters, and reschedules the plot when human players participate in the scenario to ensure the plot is progressing towards the objective. To evaluate the framework, we create a novel drama plot that involves several actor agents and check the interactions between them under the instruction of the director agent. Evaluation results show that our framework could generate complete, diverse drama scripts from only a rough outline of plot objectives, while maintaining the characteristics of characters in the drama. Our codes and prompts are available at this https URL.

Title: Eliminating Position Bias of Language Models: A Mechanistic Approach

Authors: Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham M. Kakade, Hao Peng, Heng Ji

Subjects: cs.CL, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Eliminating Position Bias of Language Models: A Mechanistic Approach(https://arxiv.org/abs/)

Keywords: robust

Abstract: Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Specifically, we find that causal attention generally causes models to favor distant content, while relative positional encodings like RoPE prefer nearby ones based on the analysis of retrieval-augmented question answering (QA). Further, our empirical study on object detection reveals that position bias is also present in vision-language models (VLMs). Based on the above analyses, we propose to ELIMINATE position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a TRAINING-FREE ZERO-SHOT manner. Our method changes the causal attention to bidirectional attention between segments and utilizes model attention values to decide the relative orders of segments instead of using the order provided in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the segment level. By eliminating position bias, models achieve better performance and reliability in downstream tasks where position bias widely exists, such as LM-as-a-judge and retrieval-augmented QA. Notably, PINE is especially useful when adapting LMs for evaluating reasoning pairs: it consistently provides 8 to 10 percentage points performance gains in most cases, and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench reasoning subset.

Title: BERGEN: A Benchmarking Library for Retrieval-Augmented Generation

Authors: David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, Stéphane Clinchant

Subjects: cs.CL, cs.IR

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] BERGEN: A Benchmarking Library for Retrieval-Augmented Generation(https://arxiv.org/abs/)

Keywords: generative, large language model

Abstract: Retrieval-Augmented Generation makes it possible to enhance Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve an intricate number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available at this https URL.

Title: Semantic-guided Adversarial Diffusion Model for Self-supervised Shadow Removal

Authors: Ziqi Zeng, Chen Zhao, Weiling Cai, Chenyu Dong

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Semantic-guided Adversarial Diffusion Model for Self-supervised Shadow Removal(https://arxiv.org/abs/)

Keywords: diffusion, generative

Abstract: Existing unsupervised methods have addressed the challenges of inconsistent paired data and tedious acquisition of ground-truth labels in shadow removal tasks. However, GAN-based training often faces issues such as mode collapse and unstable optimization. Furthermore, due to the complex mapping between shadow and shadow-free domains, merely relying on adversarial learning is not enough to capture the underlying relationship between the two domains, resulting in low quality of the generated images. To address these problems, we propose a semantic-guided adversarial diffusion framework for self-supervised shadow removal, which consists of two stages. In the first stage, a semantic-guided generative adversarial network (SG-GAN) is proposed to produce a coarse result and construct paired synthetic data through a cycle-consistent structure. In the second stage, the coarse result is refined with a diffusion-based restoration module (DBRM) to enhance the texture details and edge artifacts. Meanwhile, we propose a multi-modal semantic prompter (MSP) that aids in extracting accurate semantic information from real images and text, guiding the shadow removal network to restore images better in SG-GAN. We conduct experiments on multiple public datasets, and the experimental results demonstrate the effectiveness of our method.

Title: SecGenAI: Enhancing Security of Cloud-based Generative AI Applications within Australian Critical Technologies of National Interest

Authors: Christoforus Yoga Haryanto, Minh Hieu Vu, Trung Duc Nguyen, Emily Lomempow, Yulia Nurliana, Sona Taheri

Subjects: cs.CR, cs.AI, cs.CY, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] SecGenAI: Enhancing Security of Cloud-based Generative AI Applications within Australian Critical Technologies of National Interest(https://arxiv.org/abs/)

Keywords: secure, security, privacy, attack, robust, generative

Abstract: The rapid advancement of Generative AI (GenAI) technologies offers transformative opportunities within Australia's critical technologies of national interest while introducing unique security challenges. This paper presents SecGenAI, a comprehensive security framework for cloud-based GenAI applications, with a focus on Retrieval-Augmented Generation (RAG) systems. SecGenAI addresses functional, infrastructure, and governance requirements, integrating end-to-end security analysis to generate specifications emphasizing data privacy, secure deployment, and shared responsibility models. Aligned with Australian Privacy Principles, AI Ethics Principles, and guidelines from the Australian Cyber Security Centre and Digital Transformation Agency, SecGenAI mitigates threats such as data leakage, adversarial attacks, and model inversion. The framework's novel approach combines advanced machine learning techniques with robust security measures, ensuring compliance with Australian regulations while enhancing the reliability and trustworthiness of GenAI systems. This research contributes to the field of intelligent systems by providing actionable strategies for secure GenAI implementation in industry, fostering innovation in AI applications, and safeguarding national interests.

Title: Enabling Mixed Effects Neural Networks for Diverse, Clustered Data Using Monte Carlo Methods

Authors: Andrej Tschalzev, Paul Nitschke, Lukas Kirchdorfer, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt

Subjects: cs.LG, stat.ML

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Enabling Mixed Effects Neural Networks for Diverse, Clustered Data Using Monte Carlo Methods(https://arxiv.org/abs/)

Keywords: interpretability

Abstract: Neural networks often assume independence among input data samples, disregarding correlations arising from inherent clustering patterns in real-world datasets (e.g., due to different sites or repeated measurements). Recently, mixed effects neural networks (MENNs) which separate cluster-specific 'random effects' from cluster-invariant 'fixed effects' have been proposed to improve generalization and interpretability for clustered data. However, existing methods only allow for approximate quantification of cluster effects and are limited to regression and binary targets with only one clustering feature. We present MC-GMENN, a novel approach employing Monte Carlo methods to train Generalized Mixed Effects Neural Networks. We empirically demonstrate that MC-GMENN outperforms existing mixed effects deep learning models in terms of generalization performance, time complexity, and quantification of inter-cluster variance. Additionally, MC-GMENN is applicable to a wide range of datasets, including multi-class classification tasks with multiple high-cardinality categorical features. For these datasets, we show that MC-GMENN outperforms conventional encoding and embedding methods, simultaneously offering a principled methodology for interpreting the effects of clustering patterns.

Title: Comprehensive Dataset for Urban Streetlight Analysis

Authors: Eliza Femi Sherley S, Sanjay T, Shri Kaanth P, Jeffrey Samuel S

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Comprehensive Dataset for Urban Streetlight Analysis(https://arxiv.org/abs/)

Keywords: robust

Abstract: This article includes a comprehensive collection of over 800 high-resolution streetlight images taken systematically from India's major streets, primarily in the Chennai region. The images were methodically collected following standardized methods to assure uniformity and quality. Each image has been labelled and grouped into directories based on binary class labels, which indicate whether each streetlight is functional or not. This organized dataset is intended to make it easier to train and evaluate deep neural networks, allowing for the creation of pre-trained models that have robust feature representations. Such models have several potential uses, such as improving smart city surveillance systems, automating street infrastructure monitoring, and increasing urban management efficiency. The availability of this dataset is intended to inspire future research and development in computer vision and smart city technologies, supporting innovation and practical solutions to urban infrastructure concerns. The dataset can be accessed at this https URL.

Title: Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?

Authors: Guillermo Marco, Julio Gonzalo, Ramón del Castillo, María Teresa Mateo Girona

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?(https://arxiv.org/abs/)

Keywords: large language model

Abstract: It has become routine to report research results where Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks, and creative text writing is no exception. It seems natural, then, to raise the bid: Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer to this question, we have carried out a contest between Patricio Pron (an award-winning novelist, considered one of the best of his generation) and GPT-4 (one of the top performing LLMs), in the spirit of AI-human duels such as Deep Blue vs Kasparov and AlphaGo vs Lee Sedol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their titles and their opponent's. Then, we prepared an evaluation rubric inspired by Boden's definition of creativity, and we collected 5,400 manual assessments provided by literature critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer, and that such a level of autonomous creative writing skills probably cannot be reached simply with larger language models.

Title: Calibrated Large Language Models for Binary Question Answering

Authors: Patrizio Giovannotti, Alexander Gammerman

Subjects: cs.CL, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Calibrated Large Language Models for Binary Question Answering(https://arxiv.org/abs/)

Keywords: interpretability, large language model

Abstract: Quantifying the uncertainty of predictions made by large language models (LLMs) in binary text classification tasks remains a challenge. Calibration, in the context of LLMs, refers to the alignment between the model's predicted probabilities and the actual correctness of its predictions. A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct. We propose a novel approach that utilizes the inductive Venn-Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels. Our experiments on the BoolQ dataset using the Llama 2 model demonstrate that IVAP consistently outperforms the commonly used temperature scaling method for various label token choices, achieving well-calibrated probabilities while maintaining high predictive quality. Our findings contribute to the understanding of calibration techniques for LLMs and provide a practical solution for obtaining reliable uncertainty estimates in binary question answering tasks, enhancing the interpretability and trustworthiness of LLM predictions.
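
To make the notion of calibration concrete, the sketch below shows the temperature scaling baseline mentioned in the abstract and a standard binned expected calibration error (ECE) for binary predictions. It does not implement IVAP; the function names, the use of a single logit margin between the two label tokens, and the ten-bin default are assumptions made for illustration.

```python
import math


def temperature_scale(logit_margin, temperature):
    """Positive-label probability after dividing the logit margin by T."""
    return 1.0 / (1.0 + math.exp(-logit_margin / temperature))


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted gap between accuracy and mean confidence."""
    # Each bin holds [count, summed confidence, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted label
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(pred == y)
    n = len(probs)
    ece = 0.0
    for count, conf_sum, correct in bins:
        if count:
            ece += (count / n) * abs(correct / count - conf_sum / count)
    return ece
```

A temperature above 1 pulls probabilities toward 0.5, which lowers ECE for an overconfident model; the temperature itself would normally be fitted on a held-out calibration set.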

Title: Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation

Authors: Nadezhda Chirkova, Vassilina Nikoulina, Jean-Luc Meunier, Alexandre Bérard

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation(https://arxiv.org/abs/)

Keywords: robust, transformer

Abstract: We focus on multi-domain Neural Machine Translation, with the goal of developing efficient models which can handle data from various domains seen during training and are robust to domains unseen during training. We hypothesize that Sparse Mixture-of-Experts (SMoE) models are a good fit for this task, as they enable efficient model scaling, which helps to accommodate a variety of multi-domain data, and allow flexible sharing of parameters between domains, potentially enabling knowledge transfer between similar domains and limiting negative transfer. We conduct a series of experiments aimed at validating the utility of SMoE for the multi-domain scenario, and find that a straightforward width scaling of Transformer is a simpler and surprisingly more efficient approach in practice, and reaches the same performance level as SMoE. We also search for a better recipe for robustness of multi-domain systems, highlighting the importance of mixing-in a generic domain, i.e. Paracrawl, and introducing a simple technique, domain randomization.

Title: RMS-FlowNet++: Efficient and Robust Multi-Scale Scene Flow Estimation for Large-Scale Point Clouds

Authors: Ramy Battrawy, René Schuster, Didier Stricker

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] RMS-FlowNet++: Efficient and Robust Multi-Scale Scene Flow Estimation for Large-Scale Point Clouds(https://arxiv.org/abs/)

Keywords: robust

Abstract: The proposed RMS-FlowNet++ is a novel end-to-end learning-based architecture for accurate and efficient scene flow estimation that can operate on high-density point clouds. For hierarchical scene flow estimation, existing methods rely on expensive Farthest-Point-Sampling (FPS) to sample the scenes, must find large correspondence sets across the consecutive frames and/or must search for correspondences at a full input resolution. While this can improve the accuracy, it reduces the overall efficiency of these methods and limits their ability to handle large numbers of points due to memory requirements. In contrast to these methods, our architecture is based on an efficient design for hierarchical prediction of multi-scale scene flow. To this end, we develop a special flow embedding block that has two advantages over the current methods: First, a smaller correspondence set is used, and second, the use of Random-Sampling (RS) is possible. In addition, our architecture does not need to search for correspondences at a full input resolution. Exhibiting high accuracy, our RMS-FlowNet++ provides a faster prediction than state-of-the-art methods, avoids high memory requirements and enables efficient scene flow on dense point clouds of more than 250K points at once. Our comprehensive experiments verify the accuracy of RMS-FlowNet++ on the established FlyingThings3D data set with different point cloud densities and validate our design choices. Furthermore, we demonstrate that our model has a competitive ability to generalize to the real-world scenes of the KITTI data set without fine-tuning.

Title: An Empirical Comparison of Generative Approaches for Product Attribute-Value Identification

Authors: Kassem Sabeh, Robert Litschko, Mouna Kacimi, Barbara Plank, Johann Gamper

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] An Empirical Comparison of Generative Approaches for Product Attribute-Value Identification(https://arxiv.org/abs/)

Keywords: generative

Abstract: Product attributes are crucial for e-commerce platforms, supporting applications like search, recommendation, and question answering. The task of Product Attribute and Value Identification (PAVI) involves identifying both attributes and their values from product information. In this paper, we formulate PAVI as a generation task and provide, to the best of our knowledge, the most comprehensive evaluation of PAVI so far. We compare three different attribute-value generation (AVG) strategies based on fine-tuning encoder-decoder models on three datasets. Experiments show that the end-to-end AVG approach, which is computationally efficient, outperforms the other strategies. However, there are differences depending on model sizes and the underlying language model. The code to reproduce all experiments is available at: this https URL

Title: Integrated feature analysis for deep learning interpretation and class activation maps

Authors: Yanli Li, Tahereh Hassanzadeh, Denis P. Shamonin, Monique Reijnierse, Annette H.M. van der Helm-van Mil, Berend C. Stoel

Subjects: cs.CV, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Integrated feature analysis for deep learning interpretation and class activation maps(https://arxiv.org/abs/)

Keywords: interpretability

Abstract: Understanding the decisions of deep learning (DL) models is essential for the acceptance of DL in risk-sensitive applications. Although methods like class activation maps (CAMs) give a glimpse into the black box, they miss some crucial information, thereby limiting their interpretability and merely providing the considered locations of objects. To provide more insight into the models and the influence of datasets, we propose an integrated feature analysis method, which consists of feature distribution analysis and feature decomposition, to look closer into the intermediate features extracted by DL models. This integrated feature analysis could provide information on overfitting, confounders, outliers in datasets, model redundancies and principal features extracted by the models, and provide distribution information to form a common intensity scale, which are missing in current CAM algorithms. The integrated feature analysis was applied to eight different datasets for general validation: photographs of handwritten digits, two datasets of natural images and five medical datasets, including skin photography, ultrasound, CT, X-rays and MRIs. The method was evaluated by calculating the consistency between the CAMs' average class activation levels and the logits of the model. Based on the eight datasets, the correlation coefficients through our method were all very close to 100%, and based on the feature decomposition, 5%-25% of features could generate equally informative saliency maps and obtain the same model performances as using all features. This proves the reliability of the integrated feature analysis. As the proposed methods rely on very few assumptions, this is a step towards better model interpretation and a useful extension to existing CAM algorithms. Codes: this https URL

Title: CPT: Consistent Proxy Tuning for Black-box Optimization

Authors: Yuanyang He, Zitong Huang, Xinxing Xu, Rick Siow Mong Goh, Salman Khan, Wangmeng Zuo, Yong Liu, Chun-Mei Feng

Subjects: cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] CPT: Consistent Proxy Tuning for Black-box Optimization(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Black-box tuning has attracted recent attention because the structure or inner parameters of advanced proprietary models are not accessible. Proxy-tuning provides a test-time output adjustment for tuning black-box language models. It applies the difference of the output logits before and after tuning a smaller white-box "proxy" model to improve the black-box model. However, this technique serves only as a decoding-time algorithm, leading to an inconsistency between training and testing which potentially limits overall performance. To address this problem, we introduce Consistent Proxy Tuning (CPT), a simple yet effective black-box tuning method. Different from Proxy-tuning, CPT additionally exploits the frozen large black-box model and another frozen small white-box model, ensuring consistency between the training-stage optimization objective and test-time proxies. This consistency benefits Proxy-tuning and enhances model performance. Note that our method focuses solely on logit-level computation, which makes it model-agnostic and applicable to any task involving logit classification. Extensive experimental results demonstrate the superiority of our CPT in both black-box tuning of Large Language Models (LLMs) and Vision-Language Models (VLMs) across various datasets. The code is available at this https URL.
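
The decoding-time logit adjustment that the abstract attributes to Proxy-tuning can be sketched as below. This is a minimal sketch of that baseline only, not of CPT's training objective; the function and argument names are hypothetical, and plain Python lists stand in for model logits.

```python
def proxy_tuned_logits(large_logits, small_tuned_logits, small_base_logits):
    """Shift the black-box logits by the proxy's tuned-minus-untuned offset."""
    if not (len(large_logits) == len(small_tuned_logits) == len(small_base_logits)):
        raise ValueError("all logit vectors must share one vocabulary size")
    return [
        large + (tuned - base)
        for large, tuned, base in zip(large_logits, small_tuned_logits, small_base_logits)
    ]


def argmax(values):
    """Index of the largest entry, i.e. the greedy prediction."""
    return max(range(len(values)), key=lambda i: values[i])
```

Only output logits of the large model are needed, which is what makes the adjustment usable when its parameters are not accessible.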

Title: Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models

Authors: Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu

Subjects: cs.CV, cs.AI, cs.CL, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models(https://arxiv.org/abs/)

Keywords: attack, robust

Abstract: Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.

Title: Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation

Authors: Takyoung Kim, Kyungjae Lee, Young Rok Jang, Ji Yong Cho, Gangwoo Kim, Minseok Cho, Moontae Lee

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Interactions with billion-scale large language models typically yield long-form responses due to their extensive parametric capacities, along with retrieval-augmented features. While detailed responses provide insightful viewpoints on a specific subject, they frequently generate redundant and less engaging content that does not meet user interests. In this work, we focus on the role of query outlining (i.e., selected sequence of queries) in scenarios in which users request a specific range of information, namely coverage-conditioned ($C^2$) scenarios. For simulating $C^2$ scenarios, we construct QTree, 10K sets of information-seeking queries decomposed with various perspectives on certain topics. By utilizing QTree, we train QPlanner, a 7B language model generating customized query outlines that follow coverage-conditioned queries. We analyze the effectiveness of generated outlines through automatic and human evaluation, targeting retrieval-augmented generation (RAG). Moreover, the experimental results demonstrate that QPlanner with alignment training can further provide outlines satisfying diverse user interests. Our resources are available at this https URL.

Title: Multi-View Black-Box Physical Attacks on Infrared Pedestrian Detectors Using Adversarial Infrared Grid

Authors: Kalibinuer Tiliwalidi, Chengyin Hu, Weiwen Shi

Subjects: cs.CV, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Multi-View Black-Box Physical Attacks on Infrared Pedestrian Detectors Using Adversarial Infrared Grid(https://arxiv.org/abs/)

Keywords: security, defense, attack, robust, steal

Abstract: While extensive research exists on physical adversarial attacks within the visible spectrum, studies on such techniques in the infrared spectrum are limited. Infrared object detectors are vital in modern technological applications but are susceptible to adversarial attacks, posing significant security threats. Previous studies using physical perturbations like light bulb arrays and aerogels for white-box attacks, or hot and cold patches for black-box attacks, have proven impractical or limited in multi-view support. To address these issues, we propose the Adversarial Infrared Grid (AdvGrid), which models perturbations in a grid format and uses a genetic algorithm for black-box optimization. These perturbations are cyclically applied to various parts of a pedestrian's clothing to facilitate multi-view black-box physical attacks on infrared pedestrian detectors. Extensive experiments validate AdvGrid's effectiveness, stealthiness, and robustness. The method achieves attack success rates of 80.00\% in digital environments and 91.86\% in physical environments, outperforming baseline methods. Additionally, the average attack success rate exceeds 50\% against mainstream detectors, demonstrating AdvGrid's robustness. Our analyses include ablation studies, transfer attacks, and adversarial defenses, confirming the method's superiority.

Title: $\text{Memory}^3$: Language Modeling with Explicit Memory

Authors: Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, Chenyang Xi, Yu Yu, Kai Chen, Feiyu Xiong, Linpeng Tang, Weinan E

Subjects: cs.CL, cs.AI, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] $\text{Memory}^3$: Language Modeling with Explicit Memory(https://arxiv.org/abs/)

Keywords: large language model

Abstract: The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining "abstract knowledge". As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named $\text{Memory}^3$, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.

Title: Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection

Authors: Francesco Barbato, Umberto Michieli, Jijoong Moon, Pietro Zanuttigh, Mete Ozay

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection(https://arxiv.org/abs/)

Keywords: privacy

Abstract: Recent years have seen object detection robotic systems deployed in several personal devices (e.g., home robots and appliances). This has highlighted a challenge in their design, i.e., they cannot efficiently update their knowledge to distinguish between general classes and user-specific instances (e.g., a dog vs. user's dog). We refer to this challenging task as Instance-level Personalized Object Detection (IPOD). The personalization task requires many samples for model tuning and optimization in a centralized server, raising privacy concerns. An alternative is provided by approaches based on recent large-scale Foundation Models, but their compute costs preclude on-device applications. In our work we tackle both problems at the same time, designing a Few-Shot IPOD strategy called AuXFT. We introduce a conditional coarse-to-fine few-shot learner to refine the coarse predictions made by an efficient object detector, showing that using an off-the-shelf model leads to poor personalization due to neural collapse. Therefore, we introduce a Translator block that generates an auxiliary feature space where features generated by a self-supervised model (e.g., DINOv2) are distilled without impacting the performance of the detector. We validate AuXFT on three publicly available datasets and one in-house benchmark designed for the IPOD task, achieving remarkable gains in all considered scenarios with excellent time-complexity trade-off: AuXFT reaches a performance of 80% of its upper bound at just 32% of the inference time, 13% of VRAM and 19% of the model size.

Title: A Learned Generalized Geodesic Distance Function-Based Approach for Node Feature Augmentation on Graphs

Authors: Amitoz Azad, Yuan Fang

Subjects: cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] A Learned Generalized Geodesic Distance Function-Based Approach for Node Feature Augmentation on Graphs(https://arxiv.org/abs/)

Keywords: robust

Abstract: Geodesic distances on manifolds have numerous applications in image processing, computer graphics and computer vision. In this work, we introduce an approach called `LGGD' (Learned Generalized Geodesic Distances). This method involves generating node features by learning a generalized geodesic distance function through a training pipeline that incorporates training data, graph topology and the node content features. The strength of this method lies in the proven robustness of the generalized geodesic distances to noise and outliers. Our contributions encompass improved performance in node classification tasks, competitive results with state-of-the-art methods on real-world graph datasets, the demonstration of the learnability of parameters within the generalized geodesic equation on graph, and dynamic inclusion of new labels.

Title: SCIF: A Language for Compositional Smart Contract Security

Authors: Siqiu Yao, Haobin Ni, Andrew C. Myers, Ethan Cecchetti

Subjects: cs.CR, cs.PL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] SCIF: A Language for Compositional Smart Contract Security(https://arxiv.org/abs/)

Keywords: secure, security, protect, attack

Abstract: Securing smart contracts remains a fundamental challenge. At its core, it is about building software that is secure in composition with untrusted code, a challenge that extends far beyond blockchains. We introduce SCIF, a language for building smart contracts that are compositionally secure. SCIF is based on the fundamentally compositional principle of secure information flow, but extends this core mechanism to include protection against reentrancy attacks, confused deputy attacks, and improper error handling, even in the presence of malicious contracts that do not follow SCIF's rules. SCIF supports a rich ecosystem of interacting principals with partial trust through its mechanisms for dynamic trust management. SCIF has been implemented as a compiler to Solidity. We describe the SCIF language, including its static checking rules and runtime. Finally, we implement several applications with intricate security reasoning, showing how SCIF supports building complex smart contracts securely and gives programmers accurate diagnostics about potential security bugs.

Title: Efficient Cutting Tool Wear Segmentation Based on Segment Anything Model

Authors: Zongshuo Li, Ding Huo, Markus Meurer, Thomas Bergs

Subjects: cs.CV, cs.AI, cs.LG, eess.IV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Efficient Cutting Tool Wear Segmentation Based on Segment Anything Model(https://arxiv.org/abs/)

Keywords: segmentation

Abstract: Tool wear conditions impact the surface quality of the workpiece and its final geometric precision. In this research, we propose an efficient tool wear segmentation approach based on Segment Anything Model, which integrates U-Net as an automated prompt generator to streamline the processes of tool wear detection. Our evaluation covered three Point-of-Interest generation methods and further investigated the effects of variations in training dataset sizes and U-Net training intensities on resultant wear segmentation outcomes. The results consistently highlight our approach's advantage over U-Net, emphasizing its ability to achieve accurate wear segmentation even with limited training datasets. This feature underscores its potential applicability in industrial scenarios where datasets may be limited.

Title: EconNLI: Evaluating Large Language Models on Economics Reasoning

Authors: Yue Guo, Yi Yang

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] EconNLI: Evaluating Large Language Models on Economics Reasoning(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Large Language Models (LLMs) are widely used for writing economic analysis reports or providing financial advice, but their ability to understand economic knowledge and reason about potential results of specific economic events lacks systematic evaluation. To address this gap, we propose a new dataset, natural language inference on economic events (EconNLI), to evaluate LLMs' knowledge and reasoning abilities in the economic domain. We evaluate LLMs on (1) their ability to correctly classify whether a premise event will cause a hypothesis event and (2) their ability to generate reasonable events resulting from a given premise. Our experiments reveal that LLMs are not sophisticated in economic reasoning and may generate wrong or hallucinated answers. Our study raises awareness of the limitations of using LLMs for critical decision-making involving economic reasoning and analysis. The dataset and codes are available at this https URL.

Title: Searching for Best Practices in Retrieval-Augmented Generation

Authors: Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Searching for Best Practices in Retrieval-Augmented Generation(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a "retrieval as generation" strategy.

Title: Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation

Authors: Zihan Gao, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Yuwei Guo, Shuyuan Yang

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation(https://arxiv.org/abs/)

Keywords: segmentation

Abstract: Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enable open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. While effective, the per-pixel distillation of high-dimensional CLIP features introduces ambiguity and necessitates complex regularization strategies, adding inefficiencies during training. This paper presents MaskField, which enables fast and efficient 3D open-vocabulary segmentation with neural fields under weak supervision. Unlike previous methods, MaskField distills masks rather than dense high-dimensional CLIP features. MaskField employs neural fields as binary mask generators and supervises them with masks generated by SAM and classified by coarse CLIP features. MaskField overcomes the ambiguous object boundaries by naturally introducing SAM segmented object shapes without extra regularization during training. By circumventing the direct handling of high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence, outperforming previous methods with just 5 minutes of training. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.

Title: DaBiT: Depth and Blur informed Transformer for Joint Refocusing and Super-Resolution

Authors: Crispian Morris, Nantheera Anantrasirichai, Fan Zhang, David Bull

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] DaBiT: Depth and Blur informed Transformer for Joint Refocusing and Super-Resolution(https://arxiv.org/abs/)

Keywords: transformer, segmentation

Abstract: In many real-world scenarios, recorded videos suffer from accidental focus blur, and while video deblurring methods exist, most specifically target motion blur. This paper introduces a framework optimised for the joint task of focal deblurring (refocusing) and video super-resolution (VSR). The proposed method employs novel map guided transformers, in addition to image propagation, to effectively leverage the continuous spatial variance of focal blur and restore the footage. We also introduce a flow re-focusing module to efficiently align relevant features between the blurry and sharp domains. Additionally, we propose a novel technique for generating synthetic focal blur data, broadening the model's learning capabilities to include a wider array of content. We have made a new benchmark dataset, DAVIS-Blur, available. This dataset, a modified extension of the popular DAVIS video segmentation set, provides realistic out-of-focus blur degradations as well as the corresponding blur maps. Comprehensive experiments on DAVIS-Blur demonstrate the superiority of our approach. We achieve state-of-the-art results with an average PSNR performance over 1.9dB greater than comparable existing video restoration methods. Our source code will be made available at this https URL

Title: MIRAI: Evaluating LLM Agents for Event Forecasting

Authors: Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] MIRAI: Evaluating LLM Agents for Event Forecasting(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information, over which to conduct reasoning to solve complex problems. Given this capability, there is increasing interest in employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite this growing interest, there is a lack of a rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code using domain-specific APIs and libraries for tool-use; and 3) jointly reasoning over historical knowledge from diverse formats and time to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relation analysis.

Title: A Fingerprint for Large Language Models

Authors: Zhiguang Yang, Hanzhou Wu

Subjects: cs.CR

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] A Fingerprint for Large Language Models(https://arxiv.org/abs/)

Keywords: protect, attack, robust, large language model

Abstract: Recent advances show that scaling a pre-trained language model could achieve state-of-the-art performance on many downstream tasks, prompting large language models (LLMs) to become a hot research topic in the field of artificial intelligence. However, due to the resource-intensive nature of training LLMs from scratch, it is urgent and crucial to protect the intellectual property of LLMs against infringement. This has motivated the authors in this paper to propose a novel black-box fingerprinting technique for LLMs, which requires neither model training nor model fine-tuning. We first demonstrate that the outputs of LLMs span a unique vector space associated with each model. We model the problem of ownership authentication as the task of evaluating the similarity between the victim model's space and the suspect model's output space. To deal with this problem, we propose two solutions: the first verifies whether the outputs of the suspect model are in the same space as those of the victim model, enabling rapid identification of model infringement, and the second reconstructs the union of the vector spaces for LLM outputs and the victim model to address situations where the victim model has undergone Parameter-Efficient Fine-Tuning (PEFT) attacks. Experimental results indicate that the proposed technique achieves superior performance in ownership verification and robustness against PEFT attacks. This work reveals inherent characteristics of LLMs and provides a promising solution for ownership verification of LLMs in black-box scenarios, ensuring efficiency, generality and practicality.
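
The first solution described above amounts to asking whether a suspect model's output vectors lie in the space spanned by the victim model's outputs. The sketch below illustrates that idea with a generic linear-algebra subspace test and is not the authors' procedure: it builds an orthonormal basis with Gram-Schmidt and measures how much of a vector falls outside the span. The function names and the toy 3-dimensional vectors are assumptions made for illustration.

```python
import math

def orthonormal_basis(vectors, tol=1e-9):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        # Remove the components of w that already lie along the basis.
        for b in basis:
            coeff = sum(x * y for x, y in zip(w, b))
            w = [x - coeff * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm > tol:  # skip vectors that add no new direction
            basis.append([x / norm for x in w])
    return basis

def residual_norm(vector, basis):
    """Length of the part of `vector` that lies outside the span of `basis`."""
    w = list(vector)
    for b in basis:
        coeff = sum(x * y for x, y in zip(w, b))
        w = [x - coeff * y for x, y in zip(w, b)]
    return math.sqrt(sum(x * x for x in w))

# Hypothetical "victim" outputs spanning the x-y plane of a 3-D space.
victim_outputs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
victim_basis = orthonormal_basis(victim_outputs)
```

A residual near zero means the vector is consistent with the victim's span; a large residual means it is not. Any decision threshold would have to be chosen empirically.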

Title: SGCCNet: Single-Stage 3D Object Detector With Saliency-Guided Data Augmentation and Confidence Correction Mechanism

Authors: Ao Liang, Wenyu Chen, Jian Fang, Huaici Zhao

Subjects: cs.CV, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] SGCCNet: Single-Stage 3D Object Detector With Saliency-Guided Data Augmentation and Confidence Correction Mechanism(https://arxiv.org/abs/)

Keywords: robust

Abstract: Single-stage point-based 3D object detectors have attracted widespread research interest due to their lightweight design and fast inference speed. However, they still face challenges such as inadequate learning of low-quality objects (ILQ) and misalignment between localization accuracy and classification confidence (MLC). In this paper, we propose SGCCNet to alleviate these two issues. For ILQ, SGCCNet adopts a Saliency-Guided Data Augmentation (SGDA) strategy to enhance the robustness of the model on low-quality objects by reducing its reliance on salient features. Specifically, we construct a classification task and then approximate the saliency scores of points by moving points towards the point cloud centroid in a differentiable process. During the training process, SGCCNet will be forced to learn from low saliency features through dropping points. Meanwhile, to avoid internal covariate shift and contextual features forgetting caused by dropping points, we add a geometric normalization module and skip connection block in each stage. For MLC, we design a Confidence Correction Mechanism (CCM) specifically for point-based multi-class detectors. This mechanism corrects the confidence of the current proposal by utilizing the predictions of other key points within the local region in the post-processing stage. Extensive experiments on the KITTI dataset demonstrate the generality and effectiveness of our SGCCNet. On the KITTI \textit{test} set, SGCCNet achieves $80.82\%$ for the metric of $AP_{3D}$ on the \textit{Moderate} level, outperforming all other point-based detectors, surpassing IA-SSD and Fast Point R-CNN by $2.35\%$ and $3.42\%$, respectively. Additionally, SGCCNet demonstrates excellent portability for other point-based detectors.

Title: CLHOP: Combined Audio-Video Learning for Horse 3D Pose and Shape Estimation

Authors: Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] CLHOP: Combined Audio-Video Learning for Horse 3D Pose and Shape Estimation(https://arxiv.org/abs/)

Keywords: robust

Abstract: In the monocular setting, predicting 3D pose and shape of animals typically relies solely on visual information, which is highly under-constrained. In this work, we explore using audio to enhance 3D shape and motion recovery of horses from monocular video. We test our approach on two datasets: an indoor treadmill dataset for 3D evaluation and an outdoor dataset capturing diverse horse movements, the latter being a contribution to this study. Our results show that incorporating sound with visual data leads to more accurate and robust motion regression. This study is the first to investigate audio's role in 3D animal motion recovery.

Title: QUEEN: Query Unlearning against Model Extraction

Authors: Huajie Chen, Tianqing Zhu, Lefeng Zhang, Bo Liu, Derui Wang, Wanlei Zhou, Minhui Xue

Subjects: cs.CR, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] QUEEN: Query Unlearning against Model Extraction(https://arxiv.org/abs/)

Keywords: security, privacy, protect, defense, attack, steal, extraction, watermark

Abstract: Model extraction attacks currently pose a non-negligible threat to the security and privacy of deep learning models. By querying the model with a small dataset and using the query results as the ground-truth labels, an adversary can steal a piracy model with performance comparable to the original model. Two key issues that cause the threat are, on the one hand, accurate and unlimited queries can be obtained by the adversary; on the other hand, the adversary can aggregate the query results to train the model step by step. The existing defenses usually employ model watermarking or fingerprinting to protect the ownership. However, these methods cannot proactively prevent the violation from happening. To mitigate the threat, we propose QUEEN (QUEry unlEarNing) that proactively launches counterattacks on potential model extraction attacks from the very beginning. To limit the potential threat, QUEEN uses sensitivity measurement and output perturbation to prevent the adversary from training a piracy model with high performance. In sensitivity measurement, QUEEN measures the single query sensitivity by its distance from the center of its cluster in the feature space. To reduce the learning accuracy of attacks, for the highly sensitive query batch, QUEEN applies query unlearning, which is implemented by gradient reverse to perturb the softmax output such that the piracy model will generate reverse gradients to worsen its performance unconsciously. Experiments show that QUEEN outperforms the state-of-the-art defenses against various model extraction attacks with a relatively low cost to the model accuracy. The artifact is publicly available at https://anonymous.4open.science/r/queen implementation-5408/.
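
The sensitivity measurement described above, a query's distance from the center of its cluster in feature space, can be illustrated with a minimal sketch. This is only a toy reading of that one sentence, not QUEEN's implementation: the Euclidean metric, the function names, and the 2-D feature vectors are all assumptions, and how the distance is mapped to a "highly sensitive" decision is not specified here.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def distance_to_cluster_center(query_feature, cluster_points):
    """Euclidean distance from a query's feature vector to its cluster center."""
    return math.dist(query_feature, centroid(cluster_points))

# Hypothetical 2-D features of one cluster; its center is (1.0, 1.0).
cluster = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
score = distance_to_cluster_center([4.0, 5.0], cluster)
```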

Title: DeepiSign-G: Generic Watermark to Stamp Hidden DNN Parameters for Self-contained Tracking

Authors: Alsharif Abuadbba, Nicholas Rhodes, Kristen Moore, Bushra Sabir, Shuo Wang, Yansong Gao

Subjects: cs.CR

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] DeepiSign-G: Generic Watermark to Stamp Hidden DNN Parameters for Self-contained Tracking(https://arxiv.org/abs/)

Keywords: security, defense, attack, robust, watermark

Abstract: Deep learning solutions in critical domains like autonomous vehicles, facial recognition, and sentiment analysis require caution due to the severe consequences of errors. Research shows these models are vulnerable to adversarial attacks, such as data poisoning and neural trojaning, which can covertly manipulate model behavior, compromising reliability and safety. Current defense strategies like watermarking have limitations: they fail to detect all model modifications and primarily focus on attacks on CNNs in the image domain, neglecting other critical architectures like RNNs. To address these gaps, we introduce DeepiSign-G, a versatile watermarking approach designed for comprehensive verification of leading DNN architectures, including CNNs and RNNs. DeepiSign-G enhances model security by embedding an invisible watermark within the Walsh-Hadamard transform coefficients of the model's parameters. This watermark is highly sensitive and fragile, ensuring prompt detection of any modifications. Unlike traditional hashing techniques, DeepiSign-G allows substantial metadata incorporation directly within the model, enabling detailed, self-contained tracking and verification. We demonstrate DeepiSign-G's applicability across various architectures, including CNN models (VGG, ResNets, DenseNet) and RNNs (Text sentiment classifier). We experiment with four popular datasets: VGG Face, CIFAR10, GTSRB Traffic Sign, and Large Movie Review. We also evaluate DeepiSign-G under five potential attacks. Our comprehensive evaluation confirms that DeepiSign-G effectively detects these attacks without compromising CNN and RNN model performance, highlighting its efficacy as a robust security measure for deep learning applications. Detection of integrity breaches is nearly perfect, while hiding only a bit in approximately 1% of the Walsh-Hadamard coefficients.
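
For readers unfamiliar with the Walsh-Hadamard transform mentioned above, the following sketch shows the standard fast (butterfly) algorithm in its unnormalized, natural-order form. It illustrates the transform only and says nothing about how DeepiSign-G selects or modifies coefficients; the function name `fwht` is an assumption for this example.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural/Hadamard order).

    The input length must be a power of two. Returns a new list.
    """
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Combine blocks of size h pairwise with sum/difference butterflies.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

signal = [1, 0, 1, 0, 0, 1, 1, 0]
coeffs = fwht(signal)  # [4, 2, 0, -2, 0, 2, 0, 2]
# Applying the transform twice returns the input scaled by its length.
recovered = [c // len(signal) for c in fwht(coeffs)]
```

Because the unnormalized transform is its own inverse up to a factor of the input length, values recovered after a round trip match the original exactly for integer inputs.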

Title: Complementary Fusion of Deep Network and Tree Model for ETA Prediction

Authors: YuRui Huang, Jie Zhang, HengDa Bao, Yang Yang, Jian Yang

Subjects: cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Complementary Fusion of Deep Network and Tree Model for ETA Prediction(https://arxiv.org/abs/)

Keywords: robust

Abstract: Estimated time of arrival (ETA) is a very important factor in the transportation system. It has attracted increasing attention and has been widely used as a basic service in navigation systems and intelligent transportation systems. In this paper, we propose a novel solution to the ETA estimation problem, which is an ensemble of tree models and neural networks. We proved the accuracy and robustness of the solution on the A/B list and finally won first place in the SIGSPATIAL 2021 GISCUP competition.

Title: The African Woman is Rhythmic and Soulful: Evaluation of Open-ended Generation for Implicit Biases

Authors: Serene Lim

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] The African Woman is Rhythmic and Soulful: Evaluation of Open-ended Generation for Implicit Biases(https://arxiv.org/abs/)

Keywords: large language model

Abstract: This study investigates the subtle and often concealed biases present in Large Language Models (LLMs), which, despite passing explicit bias tests, can still exhibit implicit biases akin to those observed in humans who profess egalitarian beliefs yet demonstrate underlying prejudices. The challenge of measuring such biases is exacerbated as LLMs become increasingly proprietary, restricting access to their internal mechanisms such as embeddings, which are crucial for applying traditional bias measures. To tackle these issues, this study introduces innovative measures of bias inspired by psychological methodologies: the LLM Implicit Association Test (IAT) Bias and the LLM Decision Bias. The LLM IAT Bias is a prompt-based method designed to unearth implicit biases by simulating the well-known psychological IAT but adapted for use with LLMs. The LLM Decision Bias measure is developed to detect subtle discrimination in decision-making tasks, focusing on how LLMs choose between individuals in various scenarios. Open-ended generation is also utilised through thematic analysis of word generations and storytelling. The experiments revealed biases across gender and racial domains, from discriminatory categorisations to exoticisation. Our findings indicate that the prompt-based measure of implicit bias not only correlates with traditional embedding-based methods but also more effectively predicts downstream behaviors, which are crucially measured by the LLM Decision Bias. This relationship underscores the importance of relative, rather than absolute, evaluations in assessing implicit biases, reflecting psychological insights into human bias assessment. This research contributes to the broader understanding of AI ethics and provides suggestions for continually assessing and mitigating biases in advanced AI systems, emphasising the need for more qualitative and downstream focus.

Title: Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER

Authors: Andrew Zamai, Andrea Zugarini, Leonardo Rigutini, Marco Ernandes, Marco Maggini

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER(https://arxiv.org/abs/)

Keywords: robust, large language model

Abstract: Recently, several specialized instruction-tuned Large Language Models (LLMs) for Named Entity Recognition (NER) have emerged. Compared to traditional NER approaches, these models have strong generalization capabilities. Existing LLMs mainly focus on zero-shot NER in out-of-domain distributions, being fine-tuned on an extensive number of entity classes that often highly or completely overlap with test sets. In this work instead, we propose SLIMER, an approach designed to tackle never-seen-before named entity tags by instructing the model on fewer examples, and by leveraging a prompt enriched with definition and guidelines. Experiments demonstrate that definition and guidelines yield better performance, faster and more robust learning, particularly when labelling unseen Named Entities. Furthermore, SLIMER performs comparably to state-of-the-art approaches in out-of-domain zero-shot NER, while being trained on a reduced tag set.

Title: Small Aerial Target Detection for Airborne Infrared Detection Systems using LightGBM and Trajectory Constraints

Authors: Xiaoliang Sun, Liangchao Guo, Wenlong Zhang, Zi Wang, Qifeng Yu

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Small Aerial Target Detection for Airborne Infrared Detection Systems using LightGBM and Trajectory Constraints(https://arxiv.org/abs/)

Keywords: robust

Abstract: Factors such as rapid relative motion and cluttered backgrounds make robust small aerial target detection for airborne infrared detection systems a challenge, and existing methods struggle in such cases. We consider that a continuous and smooth trajectory is critical in boosting small infrared aerial target detection performance. This article proposes a simple and effective small aerial target detection method for airborne infrared detection systems using a light gradient boosting model (LightGBM) and trajectory constraints. First, we formulate target candidate detection as a binary classification problem. Target candidates in every individual frame are detected via interesting pixel detection and a trained LightGBM model. Then, the local smoothness and global continuity of the target trajectory are modeled as short-strict and long-loose constraints. The trajectory constraints are used efficiently for detecting the true small infrared aerial targets from numerous target candidates. Experiments on public datasets demonstrate that the proposed method performs better than other existing methods. Furthermore, a public dataset for small aerial target detection in airborne infrared detection systems is constructed. To the best of our knowledge, this dataset has the largest data scale and richest scene types within this field.

Title: Hypformer: Exploring Efficient Hyperbolic Transformer Fully in Hyperbolic Space

Authors: Menglin Yang, Harshit Verma, Delvin Ce Zhang, Jiahong Liu, Irwin King, Rex Ying

Subjects: cs.LG, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Hypformer: Exploring Efficient Hyperbolic Transformer Fully in Hyperbolic Space(https://arxiv.org/abs/)

Keywords: transformer

Abstract: Hyperbolic geometry has shown significant potential in modeling complex structured data, particularly those with underlying tree-like and hierarchical structures. Despite the impressive performance of various hyperbolic neural networks across numerous domains, research on adapting the Transformer to hyperbolic space remains limited. Previous attempts have mainly focused on modifying self-attention modules in the Transformer. However, these efforts have fallen short of developing a complete hyperbolic Transformer. This stems primarily from: (i) the absence of well-defined modules in hyperbolic space, including linear transformation layers, LayerNorm layers, activation functions, dropout operations, etc.; (ii) the quadratic time complexity of the existing hyperbolic self-attention module w.r.t. the number of input tokens, which hinders its scalability. To address these challenges, we propose Hypformer, a novel hyperbolic Transformer based on the Lorentz model of hyperbolic geometry. In Hypformer, we introduce two foundational blocks that define the essential modules of the Transformer in hyperbolic space. Furthermore, we develop a linear self-attention mechanism in hyperbolic space, enabling the hyperbolic Transformer to process billion-scale graph data and long-sequence inputs for the first time. Our experimental results confirm the effectiveness and efficiency of Hypformer across various datasets, demonstrating its potential as an effective and scalable solution for large-scale data representation and large models.

Title: Formal Verification of Object Detection

Authors: Avraham Raviv, Yizhak Y. Elboher, Michelle Aluf-Medina, Yael Leibovich Weiss, Omer Cohen, Roy Assa, Guy Katz, Hillel Kugler

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Formal Verification of Object Detection(https://arxiv.org/abs/)

Keywords: attack, robust

Abstract: Deep Neural Networks (DNNs) are ubiquitous in real-world applications, yet they remain vulnerable to errors and adversarial attacks. This work tackles the challenge of applying formal verification to ensure the safety of computer vision models, extending verification beyond image classification to object detection. We propose a general formulation for certifying the robustness of object detection models using formal verification and outline implementation strategies compatible with state-of-the-art verification tools. Our approach enables the application of these tools, originally designed for verifying classification models, to object detection. We define various attacks for object detection, illustrating the diverse ways adversarial inputs can compromise neural network outputs. Our experiments, conducted on several common datasets and networks, reveal potential errors in object detection models, highlighting system vulnerabilities and emphasizing the need for expanding formal verification to these new domains. This work paves the way for further research in integrating formal verification across a broader range of computer vision applications.

Title: Preserving Full Degradation Details for Blind Image Super-Resolution

Authors: Hongda Liu, Longguang Wang, Ye Zhang, Kaiwen Xue, Shunbo Zhou, Yulan Guo

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Preserving Full Degradation Details for Blind Image Super-Resolution(https://arxiv.org/abs/)

Keywords: robust

Abstract: The performance of image super-resolution relies heavily on the accuracy of degradation information, especially under blind settings. Due to the absence of true degradation models in real-world scenarios, previous methods learn distinct representations by distinguishing different degradations in a batch. However, the most significant degradation differences may provide shortcuts for the learning of representations, such that subtle differences may be discarded. In this paper, we propose an alternative: learning degradation representations by reproducing degraded low-resolution (LR) images. By guiding the degrader to reconstruct input LR images, full degradation information can be encoded into the representations. In addition, we develop an energy distance loss to facilitate the learning of the degradation representations by introducing a bounded constraint. Experiments show that our representations can extract accurate and highly robust degradation information. Moreover, evaluations on both synthetic and real images demonstrate that our ReDSR achieves state-of-the-art performance for the blind SR tasks.
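
The energy distance mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's loss: it only computes the standard sample estimate 2E|X-Y| - E|X-X'| - E|Y-Y'| for scalar samples, and the function names and toy data are assumptions made for illustration. How the paper applies the distance to degradation representations, and the form of its bounded constraint, are not given in the abstract.

```python
def mean_abs_diff(xs, ys):
    """Average |x - y| over all pairs drawn from the two samples."""
    return sum(abs(x - y) for x in xs for y in ys) / (len(xs) * len(ys))

def energy_distance(xs, ys):
    """Sample estimate of 2E|X-Y| - E|X-X'| - E|Y-Y'| for scalar samples."""
    return (2.0 * mean_abs_diff(xs, ys)
            - mean_abs_diff(xs, xs)
            - mean_abs_diff(ys, ys))

# Toy data (hypothetical): the second sample is the first shifted by 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x + 2.0 for x in xs]

print("same sample:   ", energy_distance(xs, xs))
print("shifted sample:", energy_distance(xs, ys))
```

The estimate is zero when both samples coincide and grows as the two distributions move apart, which is the property that makes it usable as a training signal.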

Title: Collaborative Performance Prediction for Large Language Models

Authors: Qiyuan Zhang, Fuyuan Lyu, Xue Liu, Chen Ma

Subjects: cs.CL, cs.AI, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Collaborative Performance Prediction for Large Language Models(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. Pioneering works on scaling laws for downstream tasks demonstrated intrinsic similarities within model families and utilized such similarities for performance prediction. However, they tend to overlook the similarities between model families and only consider design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks and other design factors for both model and task. We also collect collaborative data sourced from online platforms containing both historical performance and additional design factors. With the support of the collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance, an area previously overlooked.

Title: GaussianStego: A Generalizable Stenography Pipeline for Generative 3D Gaussians Splatting

Authors: Chenxin Li, Hengyu Liu, Zhiwen Fan, Wuyang Li, Yifan Liu, Panwang Pan, Yixuan Yuan

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] GaussianStego: A Generalizable Stenography Pipeline for Generative 3D Gaussians Splatting(https://arxiv.org/abs/)

Keywords: extraction, generative

Abstract: Recent advancements in large generative models and real-time neural rendering using point-based techniques pave the way for a future of widespread visual data distribution through sharing synthesized 3D assets. However, while standardized methods for embedding proprietary or copyright information, either overtly or subtly, exist for conventional visual content such as images and videos, this issue remains unexplored for emerging generative 3D formats like Gaussian Splatting. We present GaussianStego, a method for embedding steganographic information in the rendering of generated 3D assets. Our approach employs an optimization framework that enables the accurate extraction of hidden information from images rendered using Gaussian assets derived from large models, while maintaining their original visual quality. We conduct preliminary evaluations of our method across several potential deployment scenarios and discuss issues identified through analysis. GaussianStego represents an initial exploration into the novel challenge of embedding customizable, imperceptible, and recoverable information within the renders produced by current 3D generative models, while ensuring minimal impact on the rendered content's quality.

Title: Robot Instance Segmentation with Few Annotations for Grasping

Authors: Moshe Kimhi, David Vainshtein, Chaim Baskin, Dotan Di Castro

Subjects: cs.CV, cs.AI, cs.RO

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Robot Instance Segmentation with Few Annotations for Grasping(https://arxiv.org/abs/)

Keywords: segmentation

Abstract: The ability of robots to manipulate objects relies heavily on their aptitude for visual perception. In domains characterized by cluttered scenes and high object variability, most methods call for vast labeled datasets, laboriously hand-annotated, with the aim of training capable models. Once deployed, the challenge of generalizing to unfamiliar objects implies that the model must evolve alongside its domain. To address this, we propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI), allowing a model to learn by observing scene alterations and leverage visual consistency despite temporal gaps without requiring curated data of interaction sequences. As a result, our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance. Notably, on ARMBench, we attain an $\text{AP}_{50}$ of $86.37$, almost a $20\%$ improvement over existing work, and obtain remarkable results in scenarios with extremely low annotation, achieving an $\text{AP}_{50}$ score of $84.89$ with just $1 \%$ of annotated data compared to $72$ presented in ARMBench on the fully annotated counterpart.
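
The relative gains quoted in the abstract can be checked with a few lines of arithmetic. This sketch assumes that the 72 quoted for ARMBench is the baseline behind the "almost 20%" claim, which the abstract does not state explicitly; the variable and function names are illustrative.

```python
baseline_ap50 = 72.0     # ARMBench figure quoted in the abstract (assumed baseline)
full_ap50 = 86.37        # reported AP50 on ARMBench
low_label_ap50 = 84.89   # reported AP50 with 1% of annotated data

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

print(f"full result vs. baseline: {relative_gain(full_ap50, baseline_ap50):.2f}%")
print(f"1% labels vs. baseline:   {relative_gain(low_label_ap50, baseline_ap50):.2f}%")
```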

Title: Unveiling the Unseen: Exploring Whitebox Membership Inference through the Lens of Explainability

Authors: Chenxi Li, Abhinav Kumar, Zhen Guo, Jie Hou, Reza Tourani

Subjects: cs.LG, cs.CR

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Unveiling the Unseen: Exploring Whitebox Membership Inference through the Lens of Explainability(https://arxiv.org/abs/)

Keywords: privacy, attack, membership infer, explainability

Abstract: The increasing prominence of deep learning applications and reliance on personalized data underscore the urgent need to address privacy vulnerabilities, particularly Membership Inference Attacks (MIAs). Despite numerous MIA studies, significant knowledge gaps persist, particularly regarding the impact of hidden features (in isolation) on attack efficacy and insufficient justification for the root causes of attacks based on raw data features. In this paper, we aim to address these knowledge gaps by first exploring statistical approaches to identify the most informative neurons and quantifying the significance of the hidden activations from the selected neurons on attack accuracy, in isolation and combination. Additionally, we propose an attack-driven explainable framework by integrating the target and attack models to identify the most influential features of raw data that lead to successful membership inference attacks. Our proposed MIA shows an improvement of up to 26% on state-of-the-art MIA.

Title: Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Action Spaces

Authors: Perusha Moodley, Pramod Kaushik, Dhillu Thambi, Mark Trovinger, Praveen Paruchuri, Xia Hong, Benjamin Rosman

Subjects: cs.LG, cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Action Spaces(https://arxiv.org/abs/)

Keywords: interpretability, transformer

Abstract: Decision Transformers, in their vanilla form, struggle to perform on image-based environments with multi-discrete action spaces. Although enhanced Decision Transformer architectures have been developed to improve performance, these methods have not specifically addressed this problem of multi-discrete action spaces which hampers existing Decision Transformer architectures from learning good representations. To mitigate this, we propose Multi-State Action Tokenisation (M-SAT), an approach for tokenising actions in multi-discrete action spaces that enhances the model's performance in such environments. Our approach involves two key changes: disentangling actions to the individual action level and tokenising the actions with auxiliary state information. These two key changes also improve individual action level interpretability and visibility within the attention layers. We demonstrate the performance gains of M-SAT on challenging ViZDoom environments with multi-discrete action spaces and image-based state spaces, including the Deadly Corridor and My Way Home scenarios, where M-SAT outperforms the baseline Decision Transformer without any additional data or heavy computational overheads. Additionally, we find that removing positional encoding does not adversely affect M-SAT's performance and, in some cases, even improves it.

Title: Evaluating Model Performance Under Worst-case Subpopulations

Authors: Mike Li, Hongseok Namkoong, Shangzhou Xia

Subjects: cs.LG, cs.CY, stat.ML

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Evaluating Model Performance Under Worst-case Subpopulations(https://arxiv.org/abs/)

Keywords: robust

Abstract: The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
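
To make the worst-case subpopulation notion concrete, here is a minimal sketch, not the authors' estimator. It assumes that per-example estimates of the loss conditional on Z (the hypothetical list `cond_loss`, standing in for the output of a first-stage model) are already available, and it approximates the worst case over subpopulations of proportion at least `alpha` by averaging the largest `alpha` fraction of those estimates, a CVaR-style quantity. All names and data are illustrative.

```python
import math

def worst_case_subpop_loss(cond_loss, alpha):
    """Average of the worst ceil(alpha * n) conditional-loss estimates.

    cond_loss: estimated expected loss given Z, one value per example
               (assumed to come from a separately fitted first-stage model).
    alpha:     minimum subpopulation proportion, 0 < alpha <= 1.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    n = len(cond_loss)
    k = max(1, math.ceil(alpha * n))  # size of the worst subpopulation
    worst = sorted(cond_loss, reverse=True)[:k]
    return sum(worst) / k

# Toy conditional-loss estimates (hypothetical).
cond_loss = [0.1, 0.2, 0.1, 0.9, 0.7, 0.3, 0.2, 0.1]
average = sum(cond_loss) / len(cond_loss)
worst_quarter = worst_case_subpop_loss(cond_loss, alpha=0.25)
print(f"average loss: {average:.3f}, worst 25% subpopulation: {worst_quarter:.3f}")
```

A model can look acceptable on average while the worst quarter of the population sees a much higher loss, which is the gap this kind of evaluation is meant to expose.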

Title: Gradient-based Class Weighting for Unsupervised Domain Adaptation in Dense Prediction Visual Tasks

Authors: Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescós

Subjects: cs.CV, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Gradient-based Class Weighting for Unsupervised Domain Adaptation in Dense Prediction Visual Tasks(https://arxiv.org/abs/)

Keywords: transformer, segmentation

Abstract: In unsupervised domain adaptation (UDA), where models are trained on source data (e.g., synthetic) and adapted to target data (e.g., real-world) without target annotations, addressing the challenge of significant class imbalance remains an open issue. Despite considerable progress in bridging the domain gap, existing methods often experience performance degradation when confronted with highly imbalanced dense prediction visual tasks like semantic and panoptic segmentation. This discrepancy becomes especially pronounced due to the lack of equivalent priors between the source and target domains, rendering class-imbalance techniques used in other areas (e.g., image classification) ineffective in UDA scenarios. This paper proposes a class-imbalance mitigation strategy that incorporates class weights into the UDA learning losses, with the novelty of estimating these weights dynamically through the loss gradient, defining Gradient-based class weighting (GBW) learning. GBW naturally increases the contribution of classes whose learning is hindered by highly represented classes, and has the advantage of being able to automatically and quickly adapt to the iteration training outcomes, avoiding the explicit curricular learning patterns common in loss-weighting strategies. Extensive experimentation validates the effectiveness of GBW across architectures (convolutional and transformer), UDA strategies (adversarial, self-training and entropy minimization), tasks (semantic and panoptic segmentation), and datasets (GTA and Synthia). Analysing the source of advantage, GBW consistently increases the recall of low-represented classes.

Title: CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes

Authors: Danial Qashqai, Emad Mousavian, Shahriar Baradaran Shokouhi, Sattar Mirzakuchaki

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes(https://arxiv.org/abs/)

Keywords: segmentation

Abstract: Semantic segmentation, as a crucial component of complex visual interpretation, plays a fundamental role in autonomous vehicle vision systems. Recent studies have significantly improved the accuracy of semantic segmentation by exploiting complementary information and developing multimodal methods. Despite the gains in accuracy, multimodal semantic segmentation methods suffer from high computational complexity and low inference speed. Therefore, it is a challenging task to implement multimodal methods in driving applications. To address this problem, we propose the Cosine Similarity Fusion Network (CSFNet) as a real-time RGB-X semantic segmentation model. Specifically, we design a Cosine Similarity Attention Fusion Module (CS-AFM) that effectively rectifies and fuses features of two modalities. The CS-AFM module leverages cross-modal similarity to achieve high generalization ability. By enhancing the fusion of cross-modal features at lower levels, CS-AFM paves the way for the use of a single-branch network at higher levels. Therefore, we use dual and single-branch architectures in an encoder, along with an efficient context module and a lightweight decoder for fast and accurate predictions. To verify the effectiveness of CSFNet, we use the Cityscapes, MFNet, and ZJU datasets for the RGB-D/T/P semantic segmentation. According to the results, CSFNet has competitive accuracy with state-of-the-art methods while being state-of-the-art in terms of speed among multimodal semantic segmentation models. It also achieves high efficiency due to its low parameter count and computational complexity. The source code for CSFNet will be available at this https URL.

Title: Learning Unsigned Distance Fields from Local Shape Functions for 3D Surface Reconstruction

Authors: Jiangbei Hu, Yanggeng Li, Fei Hou, Junhui Hou, Zhebin Zhang, Shengfa Wang, Na Lei, Ying He

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Learning Unsigned Distance Fields from Local Shape Functions for 3D Surface Reconstruction(https://arxiv.org/abs/)

Keywords: robust

Abstract: Unsigned distance fields (UDFs) provide a versatile framework for representing a diverse array of 3D shapes, encompassing both watertight and non-watertight geometries. Traditional UDF learning methods typically require extensive training on large datasets of 3D shapes, which is costly and often necessitates hyperparameter adjustments for new datasets. This paper presents a novel neural framework, LoSF-UDF, for reconstructing surfaces from 3D point clouds by leveraging local shape functions to learn UDFs. We observe that 3D shapes manifest simple patterns within localized areas, prompting us to create a training dataset of point cloud patches characterized by mathematical functions that represent a continuum from smooth surfaces to sharp edges and corners. Our approach learns features within a specific radius around each query point and utilizes an attention mechanism to focus on the crucial features for UDF estimation. This method enables efficient and robust surface reconstruction from point clouds without the need for shape-specific training. Additionally, our method exhibits enhanced resilience to noise and outliers in point clouds compared to existing methods. We present comprehensive experiments and comparisons across various datasets, including synthetic and real-scanned point clouds, to validate our method's efficacy.

Title: Restyling Unsupervised Concept Based Interpretable Networks with Generative Models

Authors: Jayneel Parekh, Quentin Bouniot, Pavlo Mozharovskyi, Alasdair Newson, Florence d'Alché-Buc

Subjects: cs.CV, cs.AI, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Restyling Unsupervised Concept Based Interpretable Networks with Generative Models(https://arxiv.org/abs/)

Keywords: generative

Abstract: Developing inherently interpretable models for prediction has gained prominence in recent years. A subclass of these models, wherein the interpretable network relies on learning high-level concepts, is valued because of the closeness of concept representations to human communication. However, the visualization and understanding of the learnt unsupervised dictionary of concepts encounter major limitations, especially for large-scale images. We propose here a novel method that relies on mapping the concept features to the latent space of a pretrained generative model. The use of a generative model enables high quality visualization, and naturally lays out an intuitive and interactive procedure for better interpretation of the learnt concepts. Furthermore, leveraging pretrained generative models has the additional advantage of making the training of the system more efficient. We quantitatively ascertain the efficacy of our method in terms of accuracy of the interpretable prediction network, fidelity of reconstruction, as well as faithfulness and consistency of learnt concepts. The experiments are conducted on multiple image recognition benchmarks for large-scale images. Project page available at this https URL

Title: Protecting Privacy in Classifiers by Token Manipulation

Authors: Re'em Harel, Yair Elboher, Yuval Pinter

Subjects: cs.CL, cs.CR

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Protecting Privacy in Classifiers by Token Manipulation(https://arxiv.org/abs/)

Keywords: privacy, protect, attack

Abstract: Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the prospects of avoiding such data exposure at the level of text manipulation. We focus on text classification models, examining various token mapping and contextualized manipulation functions in order to see whether classifier accuracy may be maintained while keeping the original text unrecoverable. We find that although some token mapping functions are easy and straightforward to implement, they heavily influence performance on the downstream task and can be reconstructed by a sophisticated attacker. In comparison, the contextualized manipulation provides an improvement in performance.

Title: PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction

Authors: Xuan Yu, Yili Liu, Chenrui Han, Sitong Mao, Shunbo Zhou, Rong Xiong, Yiyi Liao, Yue Wang

Subjects: cs.CV, cs.RO

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction(https://arxiv.org/abs/)

Keywords: segmentation

Abstract: Panoptic reconstruction is a challenging task in 3D scene understanding. However, most existing methods heavily rely on pre-trained semantic segmentation models and known 3D object bounding boxes for 3D panoptic segmentation, which are not available for in-the-wild scenes. In this paper, we propose a novel zero-shot panoptic reconstruction method from RGB-D images of scenes. For zero-shot segmentation, we leverage open-vocabulary instance segmentation, which faces partial labeling and instance association challenges. We tackle both challenges by propagating partial labels with the aid of dense generalized features and building a 3D instance graph for associating 2D instance IDs. Specifically, we exploit partial labels to learn a classifier for generalized semantic features to provide complete labels for scenes with dense distilled features. Moreover, we formulate instance association as a 3D instance graph segmentation problem, allowing us to fully utilize the scene geometry prior and all 2D instance masks to infer globally unique pseudo 3D instance IDs. Our method outperforms state-of-the-art methods on the indoor dataset ScanNet V2 and the outdoor dataset KITTI-360, demonstrating the effectiveness of our graph segmentation method and reconstruction network.

Title: Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models

Authors: Xiaolin Xing, Zhiwei He, Haoyu Xu, Xing Wang, Rui Wang, Yu Hong

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models(https://arxiv.org/abs/)

Keywords: robust, interpretability, large language model

Abstract: This paper investigates the cross-lingual inconsistencies observed in Large Language Models (LLMs), such as ChatGPT, Llama, and Baichuan, which have shown exceptional performance in various Natural Language Processing (NLP) tasks. Despite their successes, these models often exhibit significant inconsistencies when processing the same concepts across different languages. This study focuses on three primary questions: the existence of cross-lingual inconsistencies in LLMs, the specific aspects in which these inconsistencies manifest, and the correlation between cross-lingual consistency and multilingual capabilities of LLMs. To address these questions, we propose an innovative evaluation method for Cross-lingual Semantic Consistency (xSC) using the LaBSE model. We further introduce metrics for Cross-lingual Accuracy Consistency (xAC) and Cross-lingual Timeliness Consistency (xTC) to comprehensively assess the models' performance regarding semantic, accuracy, and timeliness inconsistencies. By harmonizing these metrics, we provide a holistic measurement of LLMs' cross-lingual consistency. Our findings aim to enhance the understanding and improvement of multilingual capabilities and interpretability in LLMs, contributing to the development of more robust and reliable multilingual language models.

Title: TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation

Authors: André Sacilotti, Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation(https://arxiv.org/abs/)

Keywords: transformer

Abstract: Unsupervised domain adaptation (UDA) in videos is a challenging task that remains underexplored compared to image-based UDA techniques. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has been little explored. Our key idea is to use the transformer layers as a feature encoder and incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge from different backbones. To improve the transferability of ViT, we introduce a novel and effective module named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by changing the self-attention mechanism to a transferability attention mechanism. Extensive experiments on UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC Drone datasets with different backbones, like ResNet101, I3D, and STAM, verify the effectiveness of TransferAttn compared with state-of-the-art approaches. Also, we demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both video and image domains. The code will be made freely available.

Title: Badllama 3: removing safety finetuning from Llama 3 in minutes

Authors: Dmitrii Volkov

Subjects: cs.LG, cs.AI, cs.CL, cs.CR

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Badllama 3: removing safety finetuning from Llama 3 in minutes(https://arxiv.org/abs/)

Keywords: attack

Abstract: We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

Title: Free-text Rationale Generation under Readability Level Control

Authors: Yi-Sheng Hsu, Nils Feldhus, Sherzod Hakimov

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Free-text Rationale Generation under Readability Level Control(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Free-text rationales justify model decisions in natural language and thus become likable and accessible among approaches to explanation across many tasks. However, their effectiveness can be hindered by misinterpretation and hallucination. As a perturbation test, we investigate how large language models (LLMs) perform the task of natural language explanation (NLE) under the effects of readability level control, i.e., being prompted for a rationale targeting a specific expertise level, such as sixth grade or college. We find that explanations are adaptable to such instruction, but the requested readability is often misaligned with the measured text complexity according to traditional readability metrics. Furthermore, the quality assessment shows that LLMs' ratings of rationales across text complexity exhibit a similar pattern of preference as observed in natural language generation (NLG). Finally, our human evaluation suggests a generally satisfactory impression on rationales at all readability levels, with high-school-level readability being most commonly perceived and favored.

Title: Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Authors: Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann

Subjects: cs.LG, cs.CV, cs.RO

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion(https://arxiv.org/abs/)

Keywords: diffusion, generative

Abstract: This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling-out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution. Project website: https://boyuan.space/diffusion-forcing
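
The idea of independent per-token noise levels can be illustrated with a minimal sketch. It shows only the corruption step on a toy sequence of scalar "tokens", not the training of the causal denoiser. The function name `noise_tokens`, the linear schedule, and the number of levels are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
import random


def noise_tokens(tokens, num_levels=10, rng=None):
    """Corrupt each token with its own independently sampled noise level.

    tokens: list of floats standing in for continuous tokens.
    Returns (noisy_tokens, levels) with levels[i] in 0..num_levels.
    """
    if rng is None:
        rng = random.Random(0)
    noisy, levels = [], []
    for x in tokens:
        k = rng.randint(0, num_levels)    # level drawn independently per token
        alpha = 1.0 - k / num_levels      # assumed linear schedule: fraction of signal kept
        eps = rng.gauss(0.0, 1.0)         # standard Gaussian noise
        noisy.append(math.sqrt(alpha) * x + math.sqrt(1.0 - alpha) * eps)
        levels.append(k)
    return noisy, levels


sequence = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0]
noisy_sequence, noise_levels = noise_tokens(sequence, rng=random.Random(1))
```

A token drawn at level 0 is left clean and a token at the top level is pure noise, so a single sequence can mix fully observed past tokens with heavily noised future ones.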

Title: Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

Authors: Pooya Fayyazsanavi, Antonios Anastasopoulos, Jana Košecká

Subjects: cs.CV, cs.CL, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on the Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and a novel label-smoothing loss function that exploits gloss translation ambiguities, significantly improving on the performance of state-of-the-art approaches. Through extensive experiments and ablation studies on the PHOENIX Weather 2014T dataset, our approach surpasses state-of-the-art performance in Gloss2Text translation, indicating its efficacy in addressing sign language translation and suggesting promising avenues for future research and development.

Title: GalLoP: Learning Global and Local Prompts for Vision-Language Models

Authors: Marc Lafon, Elias Ramzi, Clément Rambour, Nicolas Audebert, Nicolas Thome

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] GalLoP: Learning Global and Local Prompts for Vision-Language Models(https://arxiv.org/abs/)

Keywords: robust

Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade off classification accuracy against robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning method that learns multiple diverse prompts leveraging both global and local visual features. The training of the local prompts relies on local features with an enhanced vision-text alignment. To focus only on pertinent features, this local alignment is coupled with a sparsity strategy in the selection of the local features. We enforce diversity on the set of prompts using a new "prompt dropout" technique and a multiscale strategy on the local prompts. GalLoP outperforms previous prompt learning methods in accuracy on eleven datasets in different few-shot settings and with various backbones. Furthermore, GalLoP shows strong robustness in both domain generalization and OOD detection, even outperforming dedicated OOD detection methods. Code and instructions to reproduce our results will be open-sourced.

Title: Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters

Authors: Daniil Gurgurov, Mareike Hartmann, Simon Ostermann

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters(https://arxiv.org/abs/)

Keywords: large language model

Abstract: This paper explores the integration of graph knowledge from linguistic ontologies into multilingual Large Language Models (LLMs) using adapters to improve performance for low-resource languages (LRLs) in sentiment analysis (SA) and named entity recognition (NER). Building upon successful parameter-efficient fine-tuning techniques, such as K-ADAPTER and MAD-X, we propose a similar approach for incorporating knowledge from multilingual graphs, connecting concepts in various languages with each other through linguistic relationships, into multilingual LLMs for LRLs. Specifically, we focus on eight LRLs -- Maltese, Bulgarian, Indonesian, Nepali, Javanese, Uyghur, Tibetan, and Sinhala -- and employ language-specific adapters fine-tuned on data extracted from the language-specific section of ConceptNet, aiming to enable knowledge transfer across the languages covered by the knowledge graph. We compare various fine-tuning objectives, including standard Masked Language Modeling (MLM), MLM with full-word masking, and MLM with targeted masking, to analyse their effectiveness in learning and integrating the extracted graph data. Through empirical evaluation on language-specific tasks, we assess how structured graph knowledge affects the performance of multilingual LLMs for LRLs in SA and NER, providing insights into the potential benefits of adapting language models for low-resource scenarios.

Title: Dynamic Few-Shot Learning for Knowledge Graph Question Answering

Authors: Jacopo D'Abramo, Andrea Zugarini, Paolo Torroni

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Dynamic Few-Shot Learning for Knowledge Graph Question Answering(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Large language models present opportunities for innovative Question Answering over Knowledge Graphs (KGQA). However, they are not inherently designed for query generation. To bridge this gap, solutions have been proposed that rely on fine-tuning or ad-hoc architectures, achieving good results but limited out-of-domain distribution generalization. In this study, we introduce a novel approach called Dynamic Few-Shot Learning (DFSL). DFSL integrates the efficiency of in-context learning and semantic similarity and provides a generally applicable solution for KGQA with state-of-the-art performance. We run an extensive evaluation across multiple benchmark datasets and architecture configurations.

Title: HyperLoader: Integrating Hypernetwork-Based LoRA and Adapter Layers into Multi-Task Transformers for Sequence Labelling

Authors: Jesus-German Ortiz-Barajas, Helena Gomez-Adorno, Thamar Solorio

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] HyperLoader: Integrating Hypernetwork-Based LoRA and Adapter Layers into Multi-Task Transformers for Sequence Labelling(https://arxiv.org/abs/)

Keywords: transformer

Abstract: We present HyperLoader, a simple approach that combines different parameter-efficient fine-tuning methods in a multi-task setting. To achieve this goal, our model uses a hypernetwork to generate the weights of these modules based on the task, the transformer layer, and its position within this layer. Our method combines the benefits of multi-task learning, capturing the structure of all tasks while reducing task interference by encapsulating the task-specific knowledge in the generated weights, with the benefits of combining different parameter-efficient methods to outperform full fine-tuning. We provide empirical evidence that HyperLoader outperforms previous approaches in most datasets and obtains the best average performance across tasks in high-resource and low-resource scenarios.

Title: A Global-Local Attention Mechanism for Relation Classification

Authors: Yiping Sun

Subjects: cs.CL, cs.IR

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] A Global-Local Attention Mechanism for Relation Classification(https://arxiv.org/abs/)

Keywords: extraction

Abstract: Relation classification, a crucial component of relation extraction, involves identifying connections between two entities. Previous studies have predominantly focused on integrating the attention mechanism into relation classification at a global scale, overlooking the importance of the local context. To address this gap, this paper introduces a novel global-local attention mechanism for relation classification, which enhances global attention with a localized focus. Additionally, we propose innovative hard and soft localization mechanisms to identify potential keywords for local attention. By incorporating both hard and soft localization strategies, our approach offers a more nuanced and comprehensive understanding of the contextual cues that contribute to effective relation classification. Our experimental results on the SemEval-2010 Task 8 dataset highlight the superior performance of our method compared to previous attention-based approaches in relation classification.

Title: FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

Authors: Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Luming Liang

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] FORA: Fast-Forward Caching in Diffusion Transformer Acceleration(https://arxiv.org/abs/)

Keywords: diffusion, transformer

Abstract: Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos, largely due to their scalability, which enables the construction of larger models for enhanced performance. However, the increased size of these models leads to higher inference costs, making them less attractive for real-time applications. We present Fast-FORward CAching (FORA), a simple yet effective approach designed to accelerate DiT by exploiting the repetitive nature of the diffusion process. FORA implements a caching mechanism that stores and reuses intermediate outputs from the attention and MLP layers across denoising steps, thereby reducing computational overhead. This approach does not require model retraining and seamlessly integrates with existing transformer-based diffusion models. Experiments show that FORA can speed up diffusion transformers several times over while only minimally affecting performance metrics such as the IS Score and FID. By enabling faster processing with minimal trade-offs in quality, FORA represents a significant advancement in deploying diffusion transformers for real-time applications. Code will be made publicly available at: this https URL.
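
The caching idea described in this abstract can be sketched in a few lines. The example below is a toy illustration, not FORA's implementation: the wrapper class `CachedLayer`, the fixed recompute interval, and the scalar stand-ins for the attention and MLP layers are all assumptions made for this sketch.

```python
class CachedLayer:
    """Wrap an expensive layer and recompute it only every `interval` steps."""

    def __init__(self, layer_fn, interval):
        self.layer_fn = layer_fn
        self.interval = interval
        self.cache = None
        self.calls = 0  # number of real (uncached) evaluations

    def __call__(self, x, step):
        # Recompute on the first step and on every interval-th step;
        # otherwise reuse the output stored at the last recompute.
        if self.cache is None or step % self.interval == 0:
            self.cache = self.layer_fn(x)
            self.calls += 1
        return self.cache


def run_denoising(num_steps, interval):
    """Toy denoising loop over a scalar state with cached layer outputs."""
    attn = CachedLayer(lambda x: 0.5 * x, interval)  # stand-in for attention
    mlp = CachedLayer(lambda x: x + 1.0, interval)   # stand-in for the MLP
    x = 8.0
    for step in range(num_steps):
        x = x - 0.1 * (attn(x, step) + mlp(x, step))
    return x, attn.calls, mlp.calls


final_state, attn_calls, mlp_calls = run_denoising(num_steps=12, interval=3)
```

With 12 steps and an interval of 3, each wrapped layer is evaluated 4 times instead of 12. The final state differs from an uncached run, which is the quality trade-off the abstract refers to.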

Title: Maximizing Blockchain Performance: Mitigating Conflicting Transactions through Parallelism and Dependency Management

Authors: Faisal Haque Bappy, Tariqul Islam, Tarannum Shaila Zaman, Md Sajidul Islam Sajid, Mir Mehedi Ahsan Pritom

Subjects: cs.CR, cs.DC

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Maximizing Blockchain Performance: Mitigating Conflicting Transactions through Parallelism and Dependency Management(https://arxiv.org/abs/)

Keywords: secure, security

Abstract: While blockchains initially gained popularity in the realm of cryptocurrencies, their widespread adoption is expanding beyond conventional applications, driven by the imperative need for enhanced data security. Despite providing a secure network, blockchains come with certain tradeoffs, including high latency, lower throughput, and an increased number of transaction failures. A pivotal issue contributing to these challenges is the improper management of "conflicting transactions", commonly referred to as "contention". When a number of pending transactions within a blockchain collide with each other, this results in a state of contention. This situation worsens network latency, wastes system resources, and ultimately contributes to reduced throughput and higher transaction failures. In response to this issue, we present a novel blockchain scheme that integrates transaction parallelism and an intelligent dependency manager, aiming to reduce the occurrence of conflicting transactions within blockchain networks. In terms of effectiveness and efficiency, experimental results show that our scheme not only mitigates the challenges posed by conflicting transactions, but also outperforms both existing parallel and non-parallel Hyperledger Fabric blockchain networks, achieving a better transaction success rate, throughput, and latency. The integration of our scheme with Hyperledger Fabric appears to be a promising solution for improving the overall performance and stability of blockchain networks in real-world applications.

Title: POST: Email Archival, Processing and Flagging Stack for Incident Responders

Authors: Jeffrey Fairbanks

Subjects: cs.CR, cs.IR, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] POST: Email Archival, Processing and Flagging Stack for Incident Responders(https://arxiv.org/abs/)

Keywords: security

Abstract: Phishing is one of the main points of compromise, with email security and awareness being estimated at $50-100B in 2022. There is great need for email forensics capability to quickly search for malicious content. A novel solution, POST, is proposed. POST is an API-driven serverless email archival, processing, and flagging workflow for both large and small organizations that collects and parses all email, flags emails using state-of-the-art Natural Language Processing and Machine Learning, allows full email searching on every aspect of an email, and provides a cost savings of up to 68.6%.

Title: Scarecrow monitoring system: employing mobilenet ssd for enhanced animal supervision

Authors: Balaji VS, Mahi AR, Anirudh Ganapathy PS, Manju M

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Scarecrow monitoring system: employing mobilenet ssd for enhanced animal supervision(https://arxiv.org/abs/)

Keywords: protect, robust

Abstract: Agriculture faces a growing challenge with wildlife wreaking havoc on crops, threatening sustainability. The project employs advanced object detection: the system uses the MobileNet SSD model for real-time animal classification. The methodology begins with the creation of a dataset in which each animal is represented by annotated images. The MobileNet SSD architecture supports both image classification and object detection. The model undergoes fine-tuning and optimization during training, enhancing accuracy for precise animal classification. Real-time detection is achieved through a webcam and the OpenCV library, enabling prompt identification and categorization of approaching animals. By seamlessly integrating intelligent scarecrow technology with object detection, this system offers a robust solution to field protection, minimizing crop damage and promoting precision farming. It represents a valuable contribution to agricultural sustainability, addressing the challenge of wildlife interference with crops. The implementation of the Intelligent Scarecrow Monitoring System stands as a progressive tool for proactive field management and protection, empowering farmers with an advanced solution for precision agriculture. Keywords: Machine learning, Deep Learning, Computer Vision, MobileNet SSD

Title: AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction

Authors: Dubing Chen, Wencheng Han, Jin Fang, Jianbing Shen

Subjects: cs.CV, cs.RO

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction(https://arxiv.org/abs/)

Keywords: robust

Abstract: In this technical report, we present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ Dataset Challenge at CVPR 2024. Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling. Initially, we independently train the occupancy model, followed by flow prediction using sequential frame integration. Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth. Experimental results on the nuScenes dataset demonstrate significant improvements in accuracy and robustness, showcasing the effectiveness of our approach in real-world scenarios. Our single model based on Swin-Base ranks second on the public leaderboard, validating the potential of our method in advancing autonomous car perception systems.

Title: Needle in the Haystack for Memory Based Large Language Models

Authors: Subhajit Chaudhury, Soham Dan, Payel Das, Georgios Kollias, Elliot Nelson

Subjects: cs.CL, cs.AI, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Needle in the Haystack for Memory Based Large Language Models(https://arxiv.org/abs/)

Keywords: large language model

Abstract: In this paper, we demonstrate the benefits of using a memory-augmented Large Language Model (LLM) architecture in improving the recall of facts from a potentially long context. As a case study we test LARIMAR, a recently proposed LLM architecture which augments an LLM decoder with an external associative memory, on several long-context recall tasks, including passkey and needle-in-the-haystack tests. We demonstrate that the external memory can be adapted at test time to handle contexts much longer than those seen during training, while keeping readouts from the memory recognizable to the trained decoder and without increasing GPU memory footprint. Compared to alternative architectures for long-context recall tasks with models of a comparable parameter count, LARIMAR is able to maintain strong performance without any task-specific training.

Title: TimeToM: Temporal Space is the Key to Unlocking the Door of Large Language Models' Theory-of-Mind

Authors: Guiyang Hou, Wenqi Zhang, Yongliang Shen, Linjuan Wu, Weiming Lu

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] TimeToM: Temporal Space is the Key to Unlocking the Door of Large Language Models' Theory-of-Mind(https://arxiv.org/abs/)

Keywords: robust, large language model

Abstract: Theory of Mind (ToM), the cognitive ability to reason about the mental states of ourselves and others, is the foundation of social interaction. Although ToM comes naturally to humans, it poses a significant challenge to even the most advanced Large Language Models (LLMs). Due to the complex logical chains in ToM reasoning, especially in higher-order ToM questions, simply utilizing reasoning methods like Chain of Thought (CoT) will not improve the ToM capabilities of LLMs. We present TimeToM, which constructs a temporal space and uses it as the foundation to improve the ToM capabilities of LLMs in multiple scenarios. Specifically, within the temporal space, we construct a Temporal Belief State Chain (TBSC) for each character and, inspired by the cognition perspective of the social world model, divide the TBSC into self-world beliefs and social world beliefs, aligning with first-order ToM (first-order beliefs) and higher-order ToM (higher-order beliefs) questions, respectively. Moreover, we design a novel tool-belief solver that, by considering belief communication between characters in temporal space, can transform a character's higher-order beliefs into another character's first-order beliefs during the belief communication period. Experimental results indicate that TimeToM can dramatically improve the reasoning performance of LLMs on ToM questions while taking a big step towards coherent and robust ToM reasoning.

Title: Contractual Reinforcement Learning: Pulling Arms with Invisible Hands

Authors: Jibang Wu, Siyu Chen, Mengdi Wang, Huazheng Wang, Haifeng Xu

Subjects: cs.LG, cs.AI, cs.GT, econ.TH

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Contractual Reinforcement Learning: Pulling Arms with Invisible Hands(https://arxiv.org/abs/)

Keywords: robust

Abstract: The agency problem emerges in today's large-scale machine learning tasks, where the learners are unable to direct content creation or enforce data collection. In this work, we propose a theoretical framework for aligning the economic interests of different stakeholders in online learning problems through contract design. The problem, termed contractual reinforcement learning, naturally arises from the classic model of Markov decision processes, where a learning principal seeks to optimally influence the agent's action policy for their common interests through a set of payment rules contingent on the realization of the next state. For the planning problem, we design an efficient dynamic programming algorithm to determine the optimal contracts against the far-sighted agent. For the learning problem, we introduce a generic design of no-regret learning algorithms to untangle the challenges from robust design of contracts to the balance of exploration and exploitation, reducing the complexity analysis to the construction of efficient search algorithms. For several natural classes of problems, we design tailored search algorithms that provably achieve $\tilde{O}(\sqrt{T})$ regret. We also present an algorithm with $\tilde{O}(T^{2/3})$ regret for the general problem that improves the existing analysis in online contract design with mild technical assumptions.

Title: Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement

Authors: Zisu Huang, Xiaohua Wang, Feiran Zhang, Zhibo Xu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement(https://arxiv.org/abs/)

Keywords: attack, robust, large language model

Abstract: The capacity of large language models (LLMs) to generate honest, harmless, and helpful responses heavily relies on the quality of user prompts. However, these prompts often tend to be brief and vague, thereby significantly limiting the full potential of LLMs. Moreover, harmful prompts can be meticulously crafted and manipulated by adversaries to jailbreak LLMs, inducing them to produce potentially toxic content. To enhance the capabilities of LLMs while maintaining strong robustness against harmful jailbreak inputs, this study proposes a transferable and pluggable framework that refines user prompts before they are input into LLMs. This strategy improves the quality of the queries, empowering LLMs to generate more truthful, benign and useful responses. Specifically, a lightweight query refinement model is introduced and trained using a specially designed reinforcement learning approach that incorporates multiple objectives to enhance particular capabilities of LLMs. Extensive experiments demonstrate that the refinement model not only improves the quality of responses but also strengthens their robustness against jailbreak attacks. Code is available at: this https URL.

Title: Retrieval-augmented generation in multilingual settings

Authors: Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Retrieval-augmented generation in multilingual settings(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e. with user queries and the datastore in 13 languages, and investigate which components, and which adjustments to them, are needed to build a well-performing mRAG pipeline that can be used as a strong baseline in future works. Our findings highlight that despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for the multilingual setting, to account for variations in the spelling of named entities. The main limitations to be addressed in future works include frequent code-switching in non-Latin alphabet languages, occasional fluency errors, wrong reading of the provided documents, or irrelevant retrieval. We release the code for the resulting mRAG baseline pipeline at this https URL.

Title: DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

Authors: Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. The experiments demonstrate that DogeRM enhances performance across different benchmarks, and we provide a detailed analysis showcasing the effects of model merging, showing the great potential of facilitating model alignment.
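
The abstract does not say which merging rule is used. As a rough illustration only, one common form of model merging is linear interpolation of the parameters the two models share. In the sketch below, the function `merge_params`, the weight `lam`, and the toy parameter dictionaries are all hypothetical and should not be read as the authors' method.

```python
def merge_params(reward_params, domain_params, lam=0.5):
    """Linearly interpolate parameters that appear in both models.

    reward_params, domain_params: dicts mapping a parameter name to a list of floats.
    lam: assumed interpolation weight given to the domain model.
    Parameters found only in the reward model (e.g. its scalar head) are kept unchanged.
    """
    merged = {name: list(values) for name, values in reward_params.items()}
    for name, domain_values in domain_params.items():
        if name in reward_params:
            merged[name] = [
                (1.0 - lam) * r + lam * d
                for r, d in zip(reward_params[name], domain_values)
            ]
    return merged


# Toy parameters: both models share "w"; only the reward model has "head".
general_rm = {"w": [0.0, 2.0], "head": [1.0]}
domain_lm = {"w": [2.0, 4.0]}
merged_rm = merge_params(general_rm, domain_lm, lam=0.5)
```

Here the shared weights are averaged while the reward head is carried over as-is, so the merged model still produces a scalar reward.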

Title: Survey and Analysis of IoT Operating Systems: A Comparative Study on the Effectiveness and Acquisition Time of Open Source Digital Forensics Tools

Authors: Jeffrey Fairbanks, Md Mashrur Arifin, Sadia Afreen, Alex Curtis

Subjects: cs.CR

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Survey and Analysis of IoT Operating Systems: A Comparative Study on the Effectiveness and Acquisition Time of Open Source Digital Forensics Tools(https://arxiv.org/abs/)

Keywords: security

Abstract: The main goal of this research project is to evaluate the effectiveness and speed of open-source forensic tools for collecting digital evidence from various Internet-of-Things (IoT) devices. The project will create and configure many IoT environments, across popular IoT operating systems, and run common forensics tasks in order to accomplish this goal. To validate these forensic analysis operations, a variety of open-source forensic tools covering four standard digital forensics tasks will be used. These tasks will be run on each sample IoT operating system, and the time spent on each will be carefully recorded and examined, allowing for a thorough evaluation of the effectiveness and speed of performing forensics on each type of IoT device. The research also aims to offer recommendations to IoT security experts and digital forensic practitioners about the most efficient open-source tools for forensic investigations with IoT devices while maintaining the integrity of gathered evidence and identifying challenges that exist with these new device types. The results will be shared widely and well-documented in order to provide significant contributions to the field of internet-of-things device makers and digital forensics.

Title: LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Authors: Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, Sara Hooker

Subjects: cs.CL, cs.AI, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives(https://arxiv.org/abs/)

Keywords: large language model

Abstract: The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to date of how the source of synthetic data shapes models' internal biases, calibration and generations' textual attributes and preferences. We find that models are surprisingly sensitive towards certain attributes even when the synthetic data prompts appear "neutral", which invites the question of whether this sensitivity can be exploited for good. Our findings invite the question: can we explicitly steer the models towards the properties we want at test time by exploiting the data generation process? This would have historically been considered infeasible due to the cost of collecting data with a specific characteristic or objective in mind. However, improvement in the quality of synthetic data, as well as a shift towards general-purpose models designed to follow a diverse range of instructions, means this question is timely. We propose active inheritance as a term to describe intentionally constraining synthetic data according to a non-differentiable objective. We demonstrate how active inheritance can steer the generation profiles of models towards desirable non-differentiable attributes, e.g. high lexical diversity or low toxicity.

Title: Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

Authors: Siwei Li, Yifan Yang, Yifei Shen, Fangyun Wei, Zongqing Lu, Lili Qiu, Yuqing Yang

Subjects: cs.CL, cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning(https://arxiv.org/abs/)

Keywords: robust

Abstract: Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA's expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model's ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. Extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be released at this https URL very soon.

Title: RegMix: Data Mixture as Regression for Language Model Pre-training

Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] RegMix: Data Mixture as Regression for Language Model Pre-training(https://arxiv.org/abs/)

Keywords: large language model

Abstract: The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance given their respective mixtures. With the fitted regression model, we simulate the top-ranked mixture and use it to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens of different mixtures to fit the regression model and find the optimal mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Further, our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi, while utilizing only 10% of the compute budget. Our experiments also show that (1) Data mixtures significantly impact performance with single-task performance variations of up to 14.6%; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws, and our approach captures the complexity by considering all domains together. Our code is available at this https URL.
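
The regression step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a plain linear least-squares model as the regressor, uses a lower-is-better loss as the target, and draws candidate mixtures uniformly from the simplex. All function names and the toy numbers in the demo are hypothetical.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_mixture_regression(mixtures, losses):
    """Least-squares fit of loss ~ sum_k w[k] * mixture[k] (normal equations).

    No intercept is needed: mixture weights sum to 1, so a constant
    offset is absorbed into the per-domain coefficients.
    """
    k = len(mixtures[0])
    xtx = [[sum(row[i] * row[j] for row in mixtures) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(mixtures, losses)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict_loss(weights, mixture):
    """Predicted loss of a model trained on the given domain mixture."""
    return sum(w * p for w, p in zip(weights, mixture))

def random_mixture(n_domains, rng):
    """Draw a mixture uniformly from the simplex (normalised exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(n_domains)]
    total = sum(draws)
    return [d / total for d in draws]

def best_simulated_mixture(weights, n_domains, n_candidates, rng):
    """Simulate many candidate mixtures and keep the lowest predicted loss."""
    candidates = (random_mixture(n_domains, rng) for _ in range(n_candidates))
    return min(candidates, key=lambda mix: predict_loss(weights, mix))

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for "train small models on diverse mixtures":
    # losses come from a made-up linear rule plus noise, not from real runs.
    hidden = [3.0, 2.5, 3.5, 4.0]
    train = [random_mixture(4, rng) for _ in range(64)]
    losses = [predict_loss(hidden, m) + rng.gauss(0, 0.01) for m in train]
    fitted = fit_mixture_regression(train, losses)
    print(best_simulated_mixture(fitted, 4, 5000, rng))
```

In practice the training pairs would come from the small proxy models the abstract describes, and the chosen mixture would then be used for the large-scale run. A linear regressor cannot capture the domain interactions the abstract mentions, so a more flexible model would likely be needed there.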

Title: Self-Cognition in Large Language Models: An Exploratory Study

Authors: Dongping Chen, Jiawen Shi, Yao Wan, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun

Subjects: cs.CL, cs.AI

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Self-Cognition in Large Language Models: An Exploratory Study(https://arxiv.org/abs/)

Keywords: large language model

Abstract: While Large Language Models (LLMs) have achieved remarkable success across various applications, they also raise concerns regarding self-cognition. In this paper, we perform a pioneering study to explore self-cognition in LLMs. Specifically, we first construct a pool of self-cognition instruction prompts to evaluate where an LLM exhibits self-cognition and four well-designed principles to quantify LLMs' self-cognition. Our study reveals that 4 of the 48 models on Chatbot Arena--specifically Command R, Claude3-Opus, Llama-3-70b-Instruct, and Reka-core--demonstrate some level of detectable self-cognition. We observe a positive correlation between model size, training data quality, and self-cognition level. Additionally, we explore the utility and trustworthiness of LLMs in the self-cognition state, revealing that the self-cognition state enhances some specific tasks such as creative writing and exaggeration. We believe that our work can serve as an inspiration for further research on self-cognition in LLMs.

Title: MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan

Subjects: cs.CV, cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs(https://arxiv.org/abs/)

Keywords: large language model

Abstract: We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

Title: E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness

Authors: Robin Courant, Nicolas Dufour, Xi Wang, Marc Christie, Vicky Kalogeiton

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness(https://arxiv.org/abs/)

Keywords: robust, diffusion

Abstract: Stories and emotions in movies emerge through the effect of well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera trajectories remains a complex iterative process, even for skilful artists. To tackle this, in this paper, we propose a dataset called the Exceptional Trajectories (E.T.) with camera trajectories along with character information and textual captions encompassing descriptions of both camera and character. To our knowledge, this is the first dataset of its kind. To show the potential applications of the E.T. dataset, we propose a diffusion-based approach, named DIRECTOR, which generates complex camera trajectories from textual captions that describe the relation and synchronisation between the camera and characters. To ensure robust and accurate evaluations, we train CLaTr, a Contrastive Language-Trajectory embedding for evaluation metrics, on the E.T. dataset. We posit that our proposed dataset and method significantly advance the democratization of cinematography, making it more accessible to common users.

Title: DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

Authors: Chang-Han Yeh, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Ting-Hsuan Chen, Yu-Lun Liu

Subjects: cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models(https://arxiv.org/abs/)

Keywords: diffusion

Abstract: This paper introduces a method for zero-shot video restoration using pre-trained image restoration diffusion models. Traditional video restoration methods often need retraining for different settings and struggle with limited generalization across various degradation types and datasets. Our approach uses a hierarchical token merging strategy for keyframes and local frames, combined with a hybrid correspondence mechanism that blends optical flow and feature-based nearest neighbor matching (latent merging). We show that our method not only achieves top performance in zero-shot video restoration but also significantly surpasses trained models in generalization across diverse datasets and extreme degradations (8$\times$ super-resolution and high-standard deviation video denoising). We present evidence through quantitative metrics and visual comparisons on various challenging datasets. Additionally, our technique works with any 2D restoration diffusion model, offering a versatile and powerful tool for video enhancement tasks without extensive retraining. This research leads to more efficient and widely applicable video restoration technologies, supporting advancements in fields that require high-quality video output. See our project page for video results at this https URL.

Title: Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

Authors: Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, Yang Song

Subjects: cs.LG, cs.AI, cs.CV

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing(https://arxiv.org/abs/)

Keywords: diffusion

Abstract: Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems. For example, we achieve a PSNR of 30.72dB on the FFHQ 256 dataset for phase retrieval, which is an improvement of 9.12dB compared to existing methods.

Title: Empowering 3D Visual Grounding with Reasoning Capabilities

Authors: Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu

Subjects: cs.CV, cs.AI, cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Empowering 3D Visual Grounding with Reasoning Capabilities(https://arxiv.org/abs/)

Keywords: large language model

Abstract: Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs from five reasoning types that require the synergy of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back to the enhanced geometry and fine-grained details from the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our proposed approach.

Title: Scalable Nested Optimization for Deep Learning

Authors: Jonathan Lorraine

Subjects: cs.LG, cs.AI, cs.NE, math.OC, stat.ML

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] Scalable Nested Optimization for Deep Learning(https://arxiv.org/abs/)

Keywords: generative

Abstract: Gradient-based optimization has been critical to the success of machine learning, updating a single set of parameters to minimize a single loss. A growing number of applications rely on a generalization of this, a bilevel or nested optimization in which subsets of parameters are updated on different objectives nested inside each other. We focus on the motivating examples of hyperparameter optimization and generative adversarial networks. However, naively applying classical methods often fails when solving these nested problems at a large scale. In this thesis, we build tools for nested optimization that scale to deep learning setups.

Title: KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

Authors: Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu

Subjects: cs.CL

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches(https://arxiv.org/abs/)

Keywords: transformer, large language model

Abstract: Long context capability is a crucial competency for large language models (LLMs) as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many more tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs. Multiple schools of efficiency-driven approaches -- such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures -- have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this gap by providing a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks. Our work reveals numerous previously unknown phenomena and offers insights -- as well as a friendly workbench -- for the future development of long context-capable LLMs. The source code will be available at this https URL.
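
To make the "growing size of the KV cache" concrete, the following back-of-the-envelope sketch estimates cache size for a standard transformer decoder, which stores one key and one value vector per token, per layer, per KV head. The function name and the model shape in the demo are illustrative assumptions and are not taken from the benchmark.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bits_per_value=16):
    """Approximate KV cache size in bytes for a transformer decoder.

    The factor 2 accounts for storing both keys and values.
    """
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return n_values * bits_per_value // 8

if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)
    for seq_len in (4_096, 32_768, 131_072):
        fp16 = kv_cache_bytes(seq_len=seq_len, **shape)
        int4 = kv_cache_bytes(seq_len=seq_len, bits_per_value=4, **shape)
        print(f"{seq_len:>7} tokens: fp16 {fp16 / 2**30:.1f} GiB, "
              f"4-bit {int4 / 2**30:.1f} GiB")
```

The cache grows linearly with sequence length and batch size. In this simple model, quantization lowers `bits_per_value`, while token dropping and prompt compression reduce the effective `seq_len`.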

Title: On the Abuse and Detection of Polyglot Files

Authors: Luke Koch, Sean Oesch, Amul Chaulagain, Jared Dixon, Matthew Dixon, Mike Huettal, Amir Sadovnik, Cory Watson, Brian Weber, Jacob Hartman, Richard Patulski

Subjects: cs.CR, cs.LG

Abstract URL: https://arxiv.org/abs/

Pdf URL: https://arxiv.org/pdf/

Copy Paste: [[]] On the Abuse and Detection of Polyglot Files(https://arxiv.org/abs/)

Keywords: attack, robust

Abstract: A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild -- the first of its kind -- we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20% for polyglot detection and 99.47% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
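
As a rough illustration of the definition in the first sentence, the sketch below flags byte strings that carry the signatures of two formats at once. It relies on two well-known facts: JPEG data starts with the bytes FF D8 FF, and a ZIP archive is located via an end-of-central-directory record (signature `PK\x05\x06`) near the end of the file. This is a naive signature check with hypothetical function names. It is not PolyConv or ImSan, and the abstract reports that existing detection tools fail on real-world samples, so a check this simple should not be relied on.

```python
JPEG_SOI = b"\xff\xd8\xff"   # JPEG start-of-image marker plus next marker byte
ZIP_EOCD = b"PK\x05\x06"     # ZIP end-of-central-directory signature
ZIP_EOCD_MIN_SIZE = 22       # fixed part of the EOCD record
ZIP_MAX_COMMENT = 65535      # maximum length of the trailing archive comment

def looks_like_jpeg(data: bytes) -> bool:
    """JPEG parsers read from the start of the file."""
    return data.startswith(JPEG_SOI)

def looks_like_zip(data: bytes) -> bool:
    """ZIP parsers search backwards from the end for the EOCD record."""
    tail = data[-(ZIP_EOCD_MIN_SIZE + ZIP_MAX_COMMENT):]
    return ZIP_EOCD in tail

def matching_formats(data: bytes) -> list[str]:
    """Names of every format whose signature check passes."""
    checks = {"jpeg": looks_like_jpeg, "zip": looks_like_zip}
    return [name for name, check in checks.items() if check(data)]

def is_candidate_polyglot(data: bytes) -> bool:
    """True when more than one format signature matches."""
    return len(matching_formats(data)) > 1

if __name__ == "__main__":
    # Synthetic bytes: a JPEG-style header followed by a bare EOCD record.
    sample = JPEG_SOI + b"\x00" * 16 + b"\xff\xd9" + ZIP_EOCD + b"\x00" * 18
    print(matching_formats(sample), is_candidate_polyglot(sample))
```

Because one format is parsed from the front and the other from the back, the same bytes can satisfy both parsers, which is why routing a file to a single format-specific detector can miss content.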