
Title: Visual Language Model based Cross-modal Semantic Communication Systems

Title: LMVD: A Large-Scale Multimodal Vlog Dataset for Depression Detection in the Wild

Title: Provably Secure Non-interactive Key Exchange Protocol for Group-Oriented Applications in Scenarios with Low-Quality Networks

Title: Curriculum Learning with Quality-Driven Data Selection

Title: AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability

Title: Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction

Title: UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

Title: Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models

Title: Personalized Federated Continual Learning via Multi-granularity Prompt

Title: OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Title: Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges

Title: From Efficient Multimodal Models to World Models: A Survey

Title: Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks

Title: RepAct: The Re-parameterizable Adaptive Activation Function

Title: Towards Secure and Efficient Data Scheduling for Vehicular Social Networks

Title: Localizing Anomalies via Multiscale Score Matching Analysis

Title: Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach

Title: Dataset Representativeness and Downstream Task Fairness

Title: MetaKP: On-Demand Keyphrase Generation

Title: PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration

Title: Evaluating Human Alignment and Model Faithfulness of LLM Rationale

Title: Multimodal Prototyping for cancer survival prediction

Title: Transformer-based Image and Video Inpainting: Current Challenges and Future Directions

Title: EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models

Title: SBOM.EXE: Countering Dynamic Code Injection based on Software Bill of Materials in Java

Title: DiffuseDef: Improved Robustness to Adversarial Attacks

Title: Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription

Title: Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review

Title: A deep neural network framework for dynamic multi-valued mapping estimation and its applications

Title: SolarSAM: Building-scale Photovoltaic Potential Assessment Based on Segment Anything Model (SAM) and Remote Sensing for Emerging City

Title: Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck

Title: OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

Title: LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods

Title: Dual-view Aware Smart Contract Vulnerability Detection for Ethereum

Title: Iterative Data Augmentation with Large Language Models for Aspect-based Sentiment Analysis

Title: Resource Allocation and Secure Wireless Communication in the Large Model-based Mobile Edge Computing System

Title: PhyTracker: An Online Tracker for Phytoplankton

Title: Financial Knowledge Large Language Model

Title: SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Title: How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models

Title: The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention

Title: Query-Efficient Hard-Label Black-Box Attack against Vision Transformers

Title: Advancing Process Verification for Large Language Models via Tree-Based Preference Learning

Title: A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model

Title: Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

Title: Parametric Primitive Analysis of CAD Sketches with Vision Transformer

Title: Explainability of Machine Learning Models under Missing Data

Title: Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs

Title: Obtaining $(\epsilon,\delta)$-differential privacy guarantees when using a Poisson mechanism to synthesize contingency tables

Title: eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey

Title: Time Series Clustering with General State Space Models via Stochastic Variational Inference

Title: A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Title: AI Age Discrepancy: A Novel Parameter for Frailty Assessment in Kidney Tumor Patients

Title: Self-Translate-Train: A Simple but Strong Baseline for Cross-lingual Transfer of Large Language Models

Title: pFLFE: Cross-silo Personalized Federated Learning via Feature Enhancement on Medical Image Segmentation

Title: Open-Source Conversational AI with SpeechBrain 1.0

Title: BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science

Title: VcLLM: Video Codecs are Secretly Tensor Codecs

Title: MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Title: MH-pFLGB: Model Heterogeneous personalized Federated Learning via Global Bypass for Medical Image Analysis

Title: Large Language Models for Power Scheduling: A User-Centric Approach

Title: Navigating the road to automotive cybersecurity compliance

Title: Towards Massive Multilingual Holistic Bias

Title: It's Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization

Title: PFME: A Modular Approach for Fine-grained Hallucination Detection and Editing of Large Language Models

Title: Graph Neural Networks Gone Hogwild

Title: LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Title: ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees

Title: Aeroengine performance prediction using a physical-embedded data-driven method

Title: Toward a Diffusion-Based Generalist for Dense Vision Tasks

Title: Blockchain based Decentralized Petition System

Title: Privacy-Preserving and Trustworthy Deep Learning for Medical Imaging

Title: Answering real-world clinical questions using large language model based systems

Title: Explaining Chest X-ray Pathology Models using Textual Concepts

Title: Divide And Conquer: Learning Chaotic Dynamical Systems With Multistep Penalty Neural Ordinary Differential Equations

Title: OfCaM: Global Human Mesh Recovery via Optimization-free Camera Motion Scale Calibration

Title: MasonTigers at SemEval-2024 Task 10: Emotion Discovery and Flip Reasoning in Conversation with Ensemble of Transformers and Prompting

Title: Hyperparameter Optimization for Randomized Algorithms: A Case Study for Random Features

Title: Your Car Tells Me Where You Drove: A Novel Path Inference Attack via CAN Bus and OBD-II Data

Title: GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing

Title: ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Title: Diff-BBO: Diffusion-Based Inverse Modeling for Black-Box Optimization

Title: Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Title: Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness

Title: Maximum Entropy Inverse Reinforcement Learning of Diffusion Models with Energy-Based Models

Title: BAZAM: A Blockchain-Assisted Zero-Trust Authentication in Multi-UAV Wireless Networks

Title: DEAR: Disentangled Environment and Agent Representations for Reinforcement Learning without Reconstruction

Title: DP-MLM: Differentially Private Text Rewriting Using Masked Language Models

Title: A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy

Title: LegalTurk Optimized BERT for Multi-Label Text Classification and NER

Title: Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs

Title: HRDE: Retrieval-Augmented Large Language Models for Chinese Health Rumor Detection and Explainability

Title: Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation

Title: UWBAD: Towards Effective and Imperceptible Jamming Attacks Against UWB Ranging Systems with COTS Chips

Title: CaFNet: A Confidence-Driven Framework for Radar Camera Depth Estimation

Title: NourishNet: Proactive Severity State Forecasting of Food Commodity Prices for Global Warning Systems

Title: Scaling Technology Acceptance Analysis with Large Language Model (LLM) Annotation Systems

Title: Detection of Dark Web Threats Using Machine Learning and Image Processing

Title: Weighted Missing Linear Discriminant Analysis: An Explainable Approach for Classification with Missing Data

Title: A Whole-Process Certifiably Robust Aggregation Method Against Backdoor Attacks in Federated Learning

Title: Large Language Models Struggle in Token-Level Clinical Named Entity Recognition

Title: LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Title: Engineering an Efficient Object Tracker for Non-Linear Motion

Title: Posterior Sampling with Denoising Oracles via Tilted Transport

Title: A Comparative Study of Quality Evaluation Methods for Text Summarization

Title: Physical Layer Deception with Non-Orthogonal Multiplexing

Title: Chest-Diffusion: A Light-Weight Text-to-Image Model for Report-to-CXR Generation

Title: Improved Graph-based semi-supervised learning Schemes

Title: Improving the performance of Stein variational inference through extreme sparsification of physically-constrained neural network models

Title: Characterizing Stereotypical Bias from Privacy-preserving Pre-Training

Title: Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Title: Diffusion Models and Representation Learning: A Survey

Title: CSUM: A Novel Mechanism for Updating CubeSat while Preserving Authenticity and Integrity

Title: InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation

Title: NAIST Simultaneous Speech Translation System for IWSLT 2024

Title: Towards Robust Speech Representation Learning for Thousands of Languages

Title: Towards Understanding Sensitive and Decisive Patterns in Explainable AI: A Case Study of Model Interpretation in Geometric Deep Learning

Title: SAFE: a SAR Feature Extractor based on self-supervised learning and masked Siamese ViTs

Title: Dynamically Modulating Visual Place Recognition Sequence Length For Minimum Acceptable Performance Scenarios

Title: Silver Linings in the Shadows: Harnessing Membership Inference for Machine Unlearning

Title: Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks

Title: Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles

Title: Privacy-First Crowdsourcing: Blockchain and Local Differential Privacy in Crowdsourced Drone Services

Title: MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Title: Decentralized PKI Framework for Data Integrity in Spatial Crowdsourcing Drone Services

Title: How to Leverage Digit Embeddings to Represent Numbers?

Title: From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

Title: Learning Robust 3D Representation from CLIP via Dual Denoising

Title: FineSurE: Fine-grained Summarization Evaluation using LLMs

Title: SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent Signatures

Title: Robust and Reliable Early-Stage Website Fingerprinting Attacks via Spatial-Temporal Distribution Analysis

Title: PointViG: A Lightweight GNN-based Model for Efficient Point Cloud Analysis

Title: EXCGEC: A Benchmark of Edit-wise Explainable Chinese Grammatical Error Correction

  • Authors: Jingheng Ye, Shang Qin, Yinghui Li, Xuxin Cheng, Libo Qin, Hai-Tao Zheng, Peng Xing, Zishan Xu, Guo Cheng, Zhao Wei
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] EXCGEC: A Benchmark of Edit-wise Explainable Chinese Grammatical Error Correction(https://arxiv.org/abs/)
  • Keywords: explainability
  • Abstract: Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited scenario, where they ignore the interaction between corrections and explanations. To bridge the gap, this paper introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of both correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples featuring the design of hybrid edit-wise explanations. We benchmark several series of LLMs in multiple settings, covering post-explaining and pre-explaining. To promote the development of the task, we introduce a comprehensive suite of automatic metrics and conduct human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. All the codes and data will be released after the review.

Title: FoldGPT: Simple and Effective Large Language Model Compression Scheme

  • Authors: Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen
  • Subjects: cs.LG, cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] FoldGPT: Simple and Effective Large Language Model Compression Scheme(https://arxiv.org/abs/)
  • Keywords: security, large language model
  • Abstract: The demand for deploying large language models (LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices. In this study, we investigate the outputs of different layers across various scales of LLMs and find that the outputs of most layers exhibit significant similarity. Moreover, this similarity becomes more pronounced as the model size increases, indicating substantial redundancy in the depth direction of the LLMs. Based on this observation, we propose an efficient model volume compression strategy, termed FoldGPT, which combines block removal and block parameter sharing. This strategy consists of three parts: (1) Based on the learnable gating parameters, we determine the block importance ranking while modeling the coupling effect between blocks. Then we delete some redundant layers based on the given removal rate. (2) For the retained blocks, we apply a specially designed group parameter sharing strategy, where blocks within the same group share identical weights, significantly compressing the number of parameters and slightly reducing latency overhead. (3) After sharing these blocks, we "cure" the mismatch caused by sparsity with a minor amount of fine-tuning and introduce a tail-layer distillation strategy to improve the performance. Experiments demonstrate that FoldGPT outperforms previous state-of-the-art (SOTA) methods in efficient model compression, demonstrating the feasibility of achieving model lightweighting through straightforward block removal and parameter sharing.
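
The block-removal and group-sharing steps described in the FoldGPT abstract can be illustrated with a small sketch. It is not the authors' implementation: the importance scores are passed in directly (the paper learns them through gating parameters), and grouping consecutive retained blocks is an assumption made here for illustration. The names `fold_blocks` and `count_unique_blocks` are hypothetical.

```python
from typing import Dict, List


def fold_blocks(importance: List[float], removal_rate: float, group_size: int) -> Dict[int, int]:
    """Map each retained block index to the block whose weights it reuses.

    Assumption: importance scores are given; lower means more redundant.
    """
    n_remove = int(len(importance) * removal_rate)
    # Rank blocks from least to most important and drop the first n_remove.
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    removed = set(ranked[:n_remove])
    kept = [i for i in range(len(importance)) if i not in removed]
    # Assumption: consecutive retained blocks form a group, and every
    # member of a group reuses the weights of the group's first block.
    sharing: Dict[int, int] = {}
    for pos, block in enumerate(kept):
        leader = kept[(pos // group_size) * group_size]
        sharing[block] = leader
    return sharing


def count_unique_blocks(sharing: Dict[int, int]) -> int:
    """Number of distinct weight sets that must actually be stored."""
    return len(set(sharing.values()))


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.6, 0.5, 0.4]  # illustrative values only
    plan = fold_blocks(scores, removal_rate=0.25, group_size=2)
    print(plan, count_unique_blocks(plan))
```

In this toy setting, eight blocks shrink to six retained blocks backed by three stored weight sets; the fine-tuning and tail-layer distillation steps from the abstract are not modeled.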

Title: CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction

  • Authors: Jingheng Ye, Zishan Xu, Yinghui Li, Xuxin Cheng, Linlin Song, Qingyu Zhou, Hai-Tao Zheng, Ying Shen, Xin Su
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction(https://arxiv.org/abs/)
  • Keywords: robust, interpretability
  • Abstract: The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics, which has received little attention in previous studies. To bridge the gap, we propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems, namely hit-correction, error-correction, under-correction, and over-correction. They collectively contribute to revealing the critical characteristics and locating drawbacks of GEC systems. Evaluating systems by combining these dimensions leads to higher human consistency than other reference-based and reference-less metrics. Extensive experiments on 2 human judgement datasets and 6 reference datasets demonstrate the effectiveness and robustness of our method. All the codes will be released after the peer review.

Title: Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining

  • Authors: Qi Zhang, Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang
  • Subjects: cs.LG, cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining(https://arxiv.org/abs/)
  • Keywords: generative
  • Abstract: In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. Through establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL within the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections compared to the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To leverage each other's strengths and mitigate weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at this https URL.

Title: Large Language Model Enhanced Knowledge Representation Learning: A Survey

  • Authors: Xin Wang, Zirui Chen, Haofen Wang, Leong Hou U, Zhao Li, Wenbin Guo
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Large Language Model Enhanced Knowledge Representation Learning: A Survey(https://arxiv.org/abs/)
  • Keywords: transformer, large language model
  • Abstract: The integration of Large Language Models (LLMs) with Knowledge Representation Learning (KRL) signifies a pivotal advancement in the field of artificial intelligence, enhancing the ability to capture and utilize complex knowledge structures. This synergy leverages the advanced linguistic and contextual understanding capabilities of LLMs to improve the accuracy, adaptability, and efficacy of KRL, thereby expanding its applications and potential. Despite the increasing volume of research focused on embedding LLMs within the domain of knowledge representation, a thorough review that examines the fundamental components and processes of these enhanced models is conspicuously absent. Our survey addresses this by categorizing these models based on three distinct Transformer architectures, and by analyzing experimental data from various KRL downstream tasks to evaluate the strengths and weaknesses of each approach. Finally, we identify and explore potential future research directions in this emerging yet underexplored domain, proposing pathways for continued progress.

Title: MalAlgoQA: A Pedagogical Approach for Evaluating Counterfactual Reasoning Abilities

  • Authors: Naiming Liu, Shashank Sonkar, Myco Le, Richard Baraniuk
  • Subjects: cs.CL, cs.CY
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] MalAlgoQA: A Pedagogical Approach for Evaluating Counterfactual Reasoning Abilities(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: This paper introduces MalAlgoQA, a novel dataset designed to evaluate the counterfactual reasoning capabilities of Large Language Models (LLMs) through a pedagogical approach. The dataset comprises mathematics and reading comprehension questions, each accompanied by four answer choices and their corresponding rationales. We focus on the incorrect answer rationales, termed "malgorithms", which highlight flawed reasoning steps leading to incorrect answers and offer valuable insights into erroneous thought processes. We also propose the Malgorithm Identification task, where LLMs are assessed based on their ability to identify the corresponding malgorithm given an incorrect answer choice. To evaluate the model performance, we introduce two metrics: Algorithm Identification Accuracy (AIA) for correct answer rationale identification, and Malgorithm Identification Accuracy (MIA) for incorrect answer rationale identification. The task is challenging since state-of-the-art LLMs exhibit significant drops in MIA as compared to AIA. Moreover, we find that the chain-of-thought prompting technique not only fails to consistently enhance MIA, but can also lead to underperformance compared to simple prompting. These findings hold significant implications for the development of more cognitively-inspired LLMs to improve their counterfactual reasoning abilities, particularly through a pedagogical perspective where understanding and rectifying student misconceptions are crucial.
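
The two metrics named in the MalAlgoQA abstract can be read as plain accuracies computed over two disjoint subsets of items. The sketch below assumes that reading; the record layout and the function name `identification_accuracies` are hypothetical and not taken from the paper.

```python
from typing import Iterable, Tuple

# (answer_choice_is_correct, predicted_rationale_id, gold_rationale_id)
Record = Tuple[bool, str, str]


def identification_accuracies(records: Iterable[Record]) -> Tuple[float, float]:
    """Return (AIA, MIA) under the assumption that both are simple accuracies.

    AIA is computed over items whose answer choice is correct,
    MIA over items whose answer choice is incorrect.
    """
    aia_hits = aia_total = mia_hits = mia_total = 0
    for is_correct, predicted, gold in records:
        hit = int(predicted == gold)
        if is_correct:
            aia_hits += hit
            aia_total += 1
        else:
            mia_hits += hit
            mia_total += 1
    # An empty subset yields 0.0 rather than raising ZeroDivisionError.
    aia = aia_hits / aia_total if aia_total else 0.0
    mia = mia_hits / mia_total if mia_total else 0.0
    return aia, mia
```

Reporting the two numbers side by side makes the gap the abstract describes (MIA falling below AIA) directly visible for any set of model predictions.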

Title: Diffusion Transformer Model With Compact Prior for Low-dose PET Reconstruction

  • Authors: Bin Huang, Xubiao Liu, Lei Fang, Qiegen Liu, Bingxuan Li
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Diffusion Transformer Model With Compact Prior for Low-dose PET Reconstruction(https://arxiv.org/abs/)
  • Keywords: diffusion, transformer
  • Abstract: Positron emission tomography (PET) is an advanced medical imaging technique that plays a crucial role in non-invasive clinical diagnosis. However, while reducing radiation exposure through low-dose PET scans is beneficial for patient safety, it often results in insufficient statistical data. This scarcity of data poses significant challenges for accurately reconstructing high-quality images, which are essential for reliable diagnostic outcomes. In this research, we propose a diffusion transformer model (DTM) guided by joint compact prior (JCP) to enhance the reconstruction quality of low-dose PET imaging. In light of current research findings, we present a pioneering PET reconstruction model that integrates diffusion and transformer models for joint optimization. This model combines the powerful distribution mapping abilities of diffusion models with the capacity of transformers to capture long-range dependencies, offering significant advantages for low-dose PET reconstruction. Additionally, the incorporation of the lesion refining block and penalized weighted least squares (PWLS) enhances the recovery capability of lesion regions and preserves detail information, solving blurring problems in lesion areas and texture details of most deep learning frameworks. Experimental results demonstrate the effectiveness of DTM in enhancing image quality and preserving critical clinical information for low-dose PET scans. Our approach not only reduces radiation exposure risks but also provides a more reliable PET imaging tool for early disease detection and patient management.

Title: Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

  • Authors: Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
  • Subjects: cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per token, thereby achieving faster inference while maintaining performance. However, SMoE models still face limitations in broader deployment due to their large parameter counts and significant GPU memory requirements. In this work, we introduce a gradient-free evolutionary strategy named EEP (Efficient Expert Pruning) to enhance the pruning of experts in SMoE models. EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks. EEP can be used to reduce both the total number of experts (thus saving GPU memory) and the number of active experts (thus accelerating inference). For example, we demonstrate that pruning up to 75% of experts in Mixtral $8\times7$B-Instruct results in a substantial reduction in parameters with minimal performance loss. Remarkably, we observe improved performance on certain tasks, such as a significant increase in accuracy on the SQuAD dataset (from 53.4% to 75.4%), when pruning half of the experts. With these results, EEP not only lowers the barrier to deploying SMoE models, but also challenges the conventional understanding of model pruning by showing that fewer experts can lead to better task-specific performance without any fine-tuning. Code is available at this https URL.
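
To make the idea of a gradient-free search over expert subsets concrete, here is a minimal sketch of an elitist, mutation-only search. It is not the EEP algorithm itself, whose details the abstract does not give: the swap mutation, the function name `evolve_expert_mask`, and the `fitness` callback (standing in for an inference-only evaluation of the pruned model) are all assumptions.

```python
import random
from typing import Callable, FrozenSet, Tuple


def evolve_expert_mask(
    num_experts: int,
    keep: int,
    fitness: Callable[[FrozenSet[int]], float],
    generations: int = 50,
    population: int = 8,
    seed: int = 0,
) -> Tuple[FrozenSet[int], float]:
    """Search for a set of `keep` experts that maximizes `fitness`.

    Only fitness evaluations are used; no gradients are computed.
    """
    if not 0 < keep < num_experts:
        raise ValueError("keep must satisfy 0 < keep < num_experts")
    rng = random.Random(seed)
    best = frozenset(rng.sample(range(num_experts), keep))
    best_score = fitness(best)
    for _ in range(generations):
        for _ in range(population):
            # Mutation (assumed): swap one kept expert for one pruned expert.
            drop = rng.choice(sorted(best))
            add = rng.choice(sorted(set(range(num_experts)) - best))
            child = (best - {drop}) | {add}
            score = fitness(child)
            # Elitism: a child replaces the incumbent only if it scores higher.
            if score > best_score:
                best, best_score = child, score
    return best, best_score
```

In practice `fitness` would run the pruned model on a held-out task and return its accuracy, which is the expensive part; the search loop itself stays cheap.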

Title: The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs

  • Authors: Tanush Chopra, Michael Li
  • Subjects: cs.CL, cs.AI, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs(https://arxiv.org/abs/)
  • Keywords: fair, large language model
  • Abstract: We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because neither the action space nor the strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the "house." Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.
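
One common way to compare observed game outcomes against an expected fair-play distribution is Pearson's chi-square goodness-of-fit statistic. The abstract does not say which test the authors use, so the sketch below is only an assumed illustration; the function name and the outcome labels are hypothetical, and any probabilities supplied to it must come from a real fair-play model.

```python
from typing import Dict


def chi_square_statistic(observed_counts: Dict[str, int], expected_probs: Dict[str, float]) -> float:
    """Pearson chi-square statistic for observed counts vs. expected probabilities."""
    total = sum(observed_counts.values())
    stat = 0.0
    for outcome, prob in expected_probs.items():
        expected = total * prob
        stat += (observed_counts.get(outcome, 0) - expected) ** 2 / expected
    return stat


# Critical value of the chi-square distribution with 2 degrees of freedom
# (three outcome categories) at the 0.05 significance level.
CRITICAL_DF2_P05 = 5.991


def deviates_from_fair_play(observed_counts: Dict[str, int], expected_probs: Dict[str, float]) -> bool:
    """True if the deviation is significant at the 0.05 level (three categories assumed)."""
    return chi_square_statistic(observed_counts, expected_probs) > CRITICAL_DF2_P05
```

A statistic above the critical value would indicate that the dealt outcomes are unlikely under fair play, which is the kind of deviation the abstract reports for implicit randomness instructions.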

Title: SpectralKAN: Kolmogorov-Arnold Network for Hyperspectral Images Change Detection

  • Authors: Yanheng Wang, Xiaohan Yu, Yongsheng Gao, Jianjun Sha, Jian Wang, Lianru Gao, Yonggang Zhang, Xianhui Rong
  • Subjects: cs.CV, eess.IV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] SpectralKAN: Kolmogorov-Arnold Network for Hyperspectral Images Change Detection(https://arxiv.org/abs/)
  • Keywords: transformer
  • Abstract: It has been verified that deep learning methods, including convolutional neural networks (CNNs), graph neural networks (GNNs), and transformers, can accurately extract features from hyperspectral images (HSIs). These algorithms perform exceptionally well on HSIs change detection (HSIs-CD). However, the downside of these impressive results is the enormous number of parameters, FLOPs, GPU memory, training and test times required. In this paper, we propose a spectral Kolmogorov-Arnold Network for HSIs-CD (SpectralKAN). SpectralKAN represents a multivariate continuous function as a composition of activation functions to extract HSI features and perform classification. These activation functions are B-spline functions with different parameters that can simulate various functions. In SpectralKAN, a KAN encoder is proposed to enhance computational efficiency for HSIs. A spatial-spectral KAN encoder is also introduced, where the spatial KAN encoder extracts spatial features and compresses the spatial dimensions from patch size to one. The spectral KAN encoder then extracts spectral features and classifies them into changed and unchanged categories. We use five HSIs-CD datasets to verify the effectiveness of SpectralKAN. Experimental verification has shown that SpectralKAN maintains high HSIs-CD accuracy while requiring fewer parameters, FLOPs, GPU memory, training and testing times, thereby increasing the efficiency of HSIs-CD. The code will be available at this https URL.

Title: SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models

  • Authors: Zheng Lin, Xuanjie Hu, Yuxin Zhang, Zhe Chen, Zihan Fang, Xianhao Chen, Ang Li, Praneeth Vepakomma, Yue Gao
  • Subjects: cs.LG, cs.CL, cs.DC
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models(https://arxiv.org/abs/)
  • Keywords: federate, large language model
  • Abstract: The scalability of large language models (LLMs) in handling high-complexity models and large-scale datasets has led to tremendous successes in pivotal domains. While there is an urgent need to acquire more training data for LLMs, a concerning reality is the depletion of high-quality public datasets within a few years. In view of this, the federated learning (FL) LLM fine-tuning paradigm has recently been proposed to facilitate collaborative LLM fine-tuning on distributed private data, where multiple data owners collaboratively fine-tune a shared LLM without sharing raw data. However, the staggering model size of LLMs imposes heavy computing and communication burdens on clients, posing significant barriers to the democratization of the FL LLM fine-tuning paradigm. To address this issue, split learning (SL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while exchanging activations and activation gradients, which have smaller data sizes, rather than the entire LLM. Unfortunately, research on the SL LLM fine-tuning paradigm is still in its nascent stage. To fill this gap, in this paper, we propose the first SL LLM fine-tuning framework, named SplitLoRA. SplitLoRA is built on the split federated learning (SFL) framework, amalgamating the advantages of parallel training from FL and model splitting from SL and thus greatly enhancing the training efficiency. It is worth noting that SplitLoRA is the inaugural open-source benchmark for SL LLM fine-tuning, providing a foundation for research efforts dedicated to advancing SL LLM fine-tuning. Extensive simulations validate that SplitLoRA achieves target accuracy in significantly less time than state-of-the-art LLM fine-tuning frameworks, demonstrating the superior training performance of SplitLoRA. The project page is available at this https URL.
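
The combination of model splitting with a low-rank adapter can be sketched on a toy two-layer model. The abstract does not describe the adapter, so the standard LoRA parameterization (a frozen weight W plus a scaled low-rank update B·A) is assumed from the framework's name; `lora_forward`, `split_forward`, and the layer layout are hypothetical and do not reflect the SplitLoRA codebase.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]
Layer = Tuple[Matrix, Matrix, Matrix]  # (frozen W, trainable A, trainable B)


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank (rows of A)."""
    rank = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def split_forward(client_layer: Layer, server_layer: Layer, x: Vector, alpha: float) -> Vector:
    """Forward pass split at one cut point between client and server."""
    # Client side: runs its layer locally; only the activations leave the device.
    activations = lora_forward(*client_layer, x, alpha)
    # Server side: continues the forward pass from the received activations.
    return lora_forward(*server_layer, activations, alpha)
```

Only the activation vector crosses the cut in the forward direction, and its gradient would cross it in the backward direction, which is why the exchanged data is much smaller than the full model.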

Title: Smoothed Analysis for Learning Concepts with Low Intrinsic Dimension

  • Authors: Gautam Chandrasekaran, Adam Klivans, Vasilis Kontonis, Raghu Meka, Konstantinos Stavropoulos
  • Subjects: cs.LG, cs.CC
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Smoothed Analysis for Learning Concepts with Low Intrinsic Dimension(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: In traditional models of supervised learning, the goal of a learner -- given examples from an arbitrary joint distribution on $\mathbb{R}^d \times \{\pm 1\}$ -- is to output a hypothesis that is competitive (to within $\epsilon$) of the best fitting concept from some class. In order to escape strong hardness results for learning even simple concept classes, we introduce a smoothed-analysis framework that requires a learner to compete only with the best classifier that is robust to small random Gaussian perturbation. This subtle change allows us to give a wide array of learning results for any concept that (1) depends on a low-dimensional subspace (aka multi-index model) and (2) has a bounded Gaussian surface area. This class includes functions of halfspaces and (low-dimensional) convex sets, cases that are only known to be learnable in non-smoothed settings with respect to highly structured distributions such as Gaussians. Surprisingly, our analysis also yields new results for traditional non-smoothed frameworks such as learning with margin. In particular, we obtain the first algorithm for agnostically learning intersections of $k$-halfspaces in time $k^{poly(\frac{\log k}{\epsilon \gamma}) }$ where $\gamma$ is the margin parameter. Before our work, the best-known runtime was exponential in $k$ (Arriaga and Vempala, 1999).

Title: Deep learning for automated detection of breast cancer in deep ultraviolet fluorescence images with diffusion probabilistic model

  • Authors: Sepehr Salem Ghahfarokhi, Tyrell To, Julie Jorns, Tina Yen, Bing Yu, Dong Hye Ye
  • Subjects: cs.CV, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Deep learning for automated detection of breast cancer in deep ultraviolet fluorescence images with diffusion probabilistic model(https://arxiv.org/abs/)
  • Keywords: diffusion
  • Abstract: Data limitation is a significant challenge in applying deep learning to medical images. Recently, the diffusion probabilistic model (DPM) has shown the potential to generate high-quality images by converting Gaussian random noise into realistic images. In this paper, we apply the DPM to augment the deep ultraviolet fluorescence (DUV) image dataset with an aim to improve breast cancer classification for intraoperative margin assessment. For classification, we divide the whole surface DUV image into small patches and extract convolutional features for each patch by utilizing the pre-trained ResNet. Then, we feed them into an XGBoost classifier for patch-level decisions and then fuse them with a regional importance map computed by Grad-CAM++ for whole surface-level prediction. Our experimental results show that augmenting the training dataset with the DPM significantly improves breast cancer detection performance in DUV images, increasing accuracy from 93% to 97%, compared to using Affine transformations and ProGAN.
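
The abstract does not spell out how the patch-level decisions are fused with the Grad-CAM++ importance map. A minimal sketch, assuming the fusion is an importance-weighted average of patch probabilities followed by a threshold (the function name, the weighting rule, and the 0.5 threshold are illustrative assumptions, not the authors' method):

```python
def fuse_patch_predictions(patch_probs, importance, threshold=0.5):
    """Fuse patch-level malignancy probabilities into one surface-level decision.

    Assumed rule (not from the paper): importance-weighted mean, then threshold.
    """
    if len(patch_probs) != len(importance):
        raise ValueError("need one importance weight per patch")
    total = sum(importance)
    if total <= 0:
        raise ValueError("importance weights must sum to a positive value")
    # Patches in high-importance regions contribute more to the final score.
    score = sum(p * w for p, w in zip(patch_probs, importance)) / total
    return score, score >= threshold


# Two patches: the one classified as likely malignant lies in a high-importance region.
score, is_malignant = fuse_patch_predictions([0.9, 0.1], [3.0, 1.0])
```

With uniform importance weights this reduces to a plain average of the patch probabilities, which is the natural baseline to compare against.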

Title: How Does Overparameterization Affect Features?

  • Authors: Ahmet Cagri Duzgun, Samy Jelassi, Yuanzhi Li
  • Subjects: cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] How Does Overparameterization Affect Features?(https://arxiv.org/abs/)
  • Keywords: transformer
  • Abstract: Overparameterization, the condition where models have more parameters than necessary to fit their training loss, is a crucial factor for the success of deep learning. However, the characteristics of the features learned by overparameterized networks are not well understood. In this work, we explore this question by comparing models with the same architecture but different widths. We first examine the expressivity of the features of these models, and show that the feature space of overparameterized networks cannot be spanned by concatenating many underparameterized features, and vice versa. This reveals that both overparameterized and underparameterized networks acquire some distinctive features. We then evaluate the performance of these models, and find that overparameterized networks outperform underparameterized networks, even when many of the latter are concatenated. We corroborate these findings using a VGG-16 and ResNet18 on CIFAR-10 and a Transformer on the MNLI classification dataset. Finally, we propose a toy setting to explain how overparameterized networks can learn some important features that the underparameterized networks cannot learn.

Title: FALCON: Frequency Adjoint Link with CONtinuous Density Mask for Fast Single Image Dehazing

  • Authors: Donghyun Kim, Seil Kang, Seong Jae Hwang
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] FALCON: Frequency Adjoint Link with CONtinuous Density Mask for Fast Single Image Dehazing(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: Image dehazing, addressing atmospheric interference like fog and haze, remains a pervasive challenge crucial for robust vision applications such as surveillance and remote sensing under adverse visibility. While various methodologies have evolved from early works predicting transmission matrix and atmospheric light features to deep learning and dehazing networks, they innately prioritize dehazing quality metrics, neglecting the need for real-time applicability in time-sensitive domains like autonomous driving. This work introduces FALCON (Frequency Adjoint Link with CONtinuous density mask), a single-image dehazing system achieving state-of-the-art performance on both quality and speed. Particularly, we develop a novel bottleneck module, namely, Frequency Adjoint Link, operating in the frequency space to globally expand the receptive field with minimal growth in network size. Further, we leverage the underlying haze distribution based on the atmospheric scattering model via a Continuous Density Mask (CDM) which serves as a continuous-valued mask input prior and a differentiable auxiliary loss. Comprehensive experiments involving multiple state-of-the-art methods and ablation analysis demonstrate FALCON's exceptional performance in both dehazing quality and speed (i.e., >180 frames per second), quantified by metrics such as FPS, PSNR, and SSIM.

Title: Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval

  • Authors: Hanwen Su, Ge Song, Kai Huang, Jiyan Wang, Ming Yang
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval(https://arxiv.org/abs/)
  • Keywords: extraction, transformer
  • Abstract: In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR). The prior methods tackle the problem in a two-modality setting with only category labels or even no textual information involved. However, the growing prevalence of Large-scale pre-trained Language Models (LLMs), which have demonstrated great knowledge learned from web-scale data, can provide us with an opportunity to conclude collective textual information. Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers. To this end, we propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval. The network consists of three components: (i) a Description Generation Module that generates textual descriptions for each training category by prompting an LLM with several interrogative sentences, (ii) a Feature Extraction Module that includes two ViTs for sketch and image data and a transformer for extracting sentence tokens for each training category, and (iii) a Cross-modal Alignment Module that exchanges the token features of both text-sketch and text-image pairs using a cross-attention mechanism and aligns the tokens locally and globally. Extensive experiments on three benchmark datasets show superior performance over the state-of-the-art ZS-SBIR methods.

Title: FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

  • Authors: Ruinan Jin, Zikang Xu, Yuan Zhong, Qiongsong Yao, Qi Dou, S. Kevin Zhou, Xiaoxiao Li
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models(https://arxiv.org/abs/)
  • Keywords: fair, segmentation
  • Abstract: The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant concerns about their fairness, especially when applied to diverse and underrepresented populations in healthcare applications. Currently, there is a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries to evaluate and understand the fairness performance of FMs in medical imaging, leading to considerable challenges in formulating and implementing solutions that ensure equitable outcomes across diverse patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates with 17 popular medical imaging datasets, encompassing different modalities, dimensionalities, and sensitive attributes. It explores 20 widely used FMs, with various usages such as zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting in various downstream tasks -- classification and segmentation. Our exhaustive analysis evaluates the fairness performance over different evaluation metrics from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs on different FMs, consistent disparities on the same datasets regardless of FMs, and limited effectiveness of existing unfairness mitigation methods. Check out FairMedFM's project page and open-sourced codebase, which supports extensible functionalities and applications and is inclusive of long-term studies on FMs in medical imaging.

Title: LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation

  • Authors: Longchao Da, Tiejin Chen, Lu Cheng, Hua Wei
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Large language models (LLMs) have showcased superior capabilities in sophisticated tasks across various domains. Beyond basic question answering (QA), they are nowadays used as decision assistants or explainers for unfamiliar content. However, they are not always correct, owing to data sparsity in domain-specific corpora or to the model's hallucination problems. Given this, how much should we trust the responses from LLMs? This paper presents a novel way to evaluate uncertainty that captures directional instability by constructing a directed graph from entailment probabilities. Given the asymmetric property of the constructed directed graph, we apply a Random Walk Laplacian, and the uncertainty is then aggregated from the eigenvalues derived from the Laplacian process. We also provide a way to incorporate existing work's semantic uncertainty with our proposed layer. In addition, this paper identifies vagueness issues in the raw response set and proposes an augmentation approach to mitigate them. We conducted extensive empirical experiments and demonstrated the superiority of our proposed solutions.

Title: Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?

  • Authors: Nicy Scaria, Silvester John Joseph Kennedy, Deepak Subramani
  • Subjects: cs.CL, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Small Language Models (SLMs) are generally considered to be more compact versions of large language models (LLMs), typically having fewer than 7 billion parameters. This study investigates the ability of small language models to learn, retain, and subsequently eliminate noise that is typically not found on the internet, where most pretraining datasets are sourced. For this, four pre-trained SLMs were utilized: Olmo 1B, Qwen1.5 1.8B, Gemma 2B, and Phi2 2.7B. The models were instruction-tuned without noise and tested for task execution with in-context learning. Afterward, noise patterns were introduced to evaluate the models' learning and unlearning capabilities. We evaluated the models' performance at various training levels. Phi consistently excelled with word-level noise but performed the worst with character-level noise. Despite being the smallest with approximately 1 billion parameters, Olmo performed consistently well on tasks.

Title: Engineering Conversational Search Systems: A Review of Applications, Architectures, and Functional Components

  • Authors: Phillip Schneider, Wessel Poelman, Michael Rovatsos, Florian Matthes
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Engineering Conversational Search Systems: A Review of Applications, Architectures, and Functional Components(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Conversational search systems enable information retrieval via natural language interactions, with the goal of maximizing users' information gain over multiple dialogue turns. The increasing prevalence of conversational interfaces adopting this search paradigm challenges traditional information retrieval approaches, stressing the importance of better understanding the engineering process of developing these systems. We undertook a systematic literature review to investigate the links between theoretical studies and technical implementations of conversational search systems. Our review identifies real-world application scenarios, system architectures, and functional components. We consolidate our results by presenting a layered architecture framework and explaining the core functions of conversational search systems. Furthermore, we reflect on our findings in light of the rapid progress in large language models, discussing their capabilities, limitations, and directions for future research.

Title: Embedded Prompt Tuning: Towards Enhanced Calibration of Pretrained Models for Medical Images

  • Authors: Wenqiang Zu, Shenghao Xie, Qing Zhao, Guoqi Li, Lei Ma
  • Subjects: cs.CV, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Embedded Prompt Tuning: Towards Enhanced Calibration of Pretrained Models for Medical Images(https://arxiv.org/abs/)
  • Keywords: transformer
  • Abstract: Foundation models pre-trained on large-scale data have been widely witnessed to achieve success in various natural imaging downstream tasks. Parameter-efficient fine-tuning (PEFT) methods aim to adapt foundation models to new domains by updating only a small portion of parameters in order to reduce computational overhead. However, the effectiveness of these PEFT methods, especially in cross-domain few-shot scenarios, e.g., medical image analysis, has not been fully explored. In this work, we facilitate the study of the performance of PEFT when adapting foundation models to medical image classification tasks. Furthermore, to alleviate the limitations of mainstream prompt tuning methods on Transformer architectures, both in how prompts are introduced and in their approximation capabilities, we propose the Embedded Prompt Tuning (EPT) method by embedding prompt tokens into the expanded channels. We also find that there are anomalies in the feature space distribution of foundation models during the pre-training process, and prompt tuning can help mitigate this negative impact. To explain this phenomenon, we also introduce a novel perspective to understand prompt tuning: prompt tuning is a distribution calibrator. We support it by analyzing patch-wise scaling and feature separation operations contained in EPT. Our experiments show that EPT outperforms several state-of-the-art fine-tuning methods by a significant margin on few-shot medical image classification tasks, and completes the fine-tuning process within highly competitive time, indicating EPT is an effective PEFT method. Our code will be released once accepted.

Title: CURLS: Causal Rule Learning for Subgroups with Significant Treatment Effect

  • Authors: Jiehui Zhou, Linxiao Yang, Xingyu Liu, Xinyue Gu, Liang Sun, Wei Chen
  • Subjects: cs.LG, stat.ME
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] CURLS: Causal Rule Learning for Subgroups with Significant Treatment Effect(https://arxiv.org/abs/)
  • Keywords: interpretability
  • Abstract: In causal inference, estimating heterogeneous treatment effects (HTE) is critical for identifying how different subgroups respond to interventions, with broad applications in fields such as precision medicine and personalized advertising. Although HTE estimation methods aim to improve accuracy, how to provide explicit subgroup descriptions remains unclear, hindering data interpretation and strategic intervention management. In this paper, we propose CURLS, a novel rule learning method leveraging HTE, which can effectively describe subgroups with significant treatment effects. Specifically, we frame causal rule learning as a discrete optimization problem, finely balancing treatment effect with variance and considering the rule interpretability. We design an iterative procedure based on the minorize-maximization algorithm and solve a submodular lower bound as an approximation for the original. Quantitative experiments and qualitative case studies verify that compared with state-of-the-art methods, CURLS can find subgroups where the estimated and true effects are 16.1% and 13.8% higher and the variance is 12.0% smaller, while maintaining similar or better estimation accuracy and rule interpretability. Code is available at this https URL.

Title: GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking

  • Authors: Huijie Fan, Tinghui Zhao, Qiang Wang, Baojie Fan, Yandong Tang, LianQing Liu
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking(https://arxiv.org/abs/)
  • Keywords: robust, extraction, transformer
  • Abstract: In the task of multi-target multi-camera (MTMC) tracking of pedestrians, the data association problem is a key issue and main challenge, especially with complications arising from camera movements, lighting variations, and obstructions. However, most MTMC models adopt two-step approaches, thus heavily depending on the results of the first-step tracking in practical applications. Moreover, the same targets crossing different cameras may exhibit significant appearance variations, which further increases the difficulty of cross-camera matching. To address the aforementioned issues, we propose a global online MTMC tracking model that addresses the dependency on the first tracking stage in two-step methods and enhances cross-camera matching. Specifically, we propose a transformer-based global MTMC association module to explore target associations across different cameras and frames, generating global trajectories directly. Additionally, to integrate the appearance and spatio-temporal features of targets, we propose a feature extraction and fusion module for MTMC tracking. This module enhances feature representation and establishes correlations between the features of targets across multiple cameras. To accommodate high scene diversity and complex lighting condition variations, we have established the VisionTrack dataset, which enables the development of models that are more generalized and robust to various environments. Our model demonstrates significant improvements over comparison methods on the VisionTrack dataset and others.

Title: DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models

  • Authors: Jiabao Pan, Yan Zhang, Chen Zhang, Zuozhu Liu, Hongwei Wang, Haizhou Li
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Large language models (LLMs) have demonstrated emergent capabilities across diverse reasoning tasks via popular Chains-of-Thought (COT) prompting. However, such a simple and fast COT approach often encounters limitations in dealing with complicated problems, while a thorough method, which considers multiple reasoning pathways and verifies each step carefully, results in slower inference. This paper addresses the challenge of enabling LLMs to autonomously select between fast and slow inference methods, thereby optimizing both efficiency and effectiveness. We introduce a dynamic decision-making framework that categorizes tasks into two distinct pathways: 'Fast', designated for tasks where the LLM quickly identifies a high-confidence solution, and 'Slow', allocated for tasks that the LLM perceives as complex, for which it has low confidence in immediate solutions and requires more reasoning paths to verify. Experiments on five popular reasoning benchmarks demonstrated the superiority of DynaThink over baselines.
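
The abstract does not state how confidence is measured when routing a task. A minimal sketch, assuming confidence is approximated by agreement among several sampled answers (the `route` function, the agreement measure, and the 0.5 cut-off are hypothetical choices for illustration, not the criteria used in the paper):

```python
from collections import Counter


def route(sampled_answers, min_agreement=0.5):
    """Send a task down the 'fast' or 'slow' pathway based on answer agreement.

    Assumption: a task is high-confidence when more than `min_agreement`
    of the sampled answers coincide.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    top_answer, top_count = Counter(sampled_answers).most_common(1)[0]
    agreement = top_count / len(sampled_answers)
    pathway = "fast" if agreement > min_agreement else "slow"
    return pathway, top_answer, agreement


# Three of four samples agree, so the task is answered on the fast pathway.
fast_case = route(["42", "42", "42", "17"])
# No clear majority, so the task would be passed on for more reasoning paths.
slow_case = route(["a", "b", "c", "a"])
```

Tasks routed to "slow" would then receive additional reasoning paths before an answer is committed, which is where the extra inference cost is spent.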

Title: An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations

  • Authors: Weimin Bai, Yifei Wang, Wenzheng Chen, He Sun
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations(https://arxiv.org/abs/)
  • Keywords: diffusion
  • Abstract: Diffusion models excel in solving imaging inverse problems due to their ability to model complex image priors. However, their reliance on large, clean datasets for training limits their practical use where clean data is scarce. In this paper, we propose EMDiffusion, an expectation-maximization (EM) approach to train diffusion models from corrupted observations. Our method alternates between reconstructing clean images from corrupted data using a known diffusion model (E-step) and refining diffusion model weights based on these reconstructions (M-step). This iterative process leads the learned diffusion model to gradually converge to the true clean data distribution. We validate our method through extensive experiments on diverse computational imaging tasks, including random inpainting, denoising, and deblurring, achieving new state-of-the-art performance.
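
The E-step/M-step alternation can be illustrated with a deliberately simplified analogue. The sketch below replaces the diffusion model with a one-dimensional Gaussian prior N(mu, tau2) and assumes the corruption is additive Gaussian noise of known variance; these simplifications and all names are assumptions made for illustration and are not part of EMDiffusion itself:

```python
def em_gaussian_prior(observations, noise_var, iterations=100):
    """Learn a clean-data prior N(mu, tau2) from noisy observations y = x + noise."""
    mu, tau2 = 0.0, 1.0  # initial guess for the prior over clean values
    n = len(observations)
    for _ in range(iterations):
        # E-step: reconstruct each clean value under the current prior
        # (posterior mean and variance of x given its noisy observation y).
        gain = tau2 / (tau2 + noise_var)
        post_means = [mu + gain * (y - mu) for y in observations]
        post_var = gain * noise_var
        # M-step: refit the prior parameters to the reconstructions.
        mu = sum(post_means) / n
        tau2 = sum((m - mu) ** 2 for m in post_means) / n + post_var
    return mu, tau2


# Noisy observations with noise variance 1; the prior is learned without clean data.
mu, tau2 = em_gaussian_prior([0.0, 2.0, 4.0, 6.0, 8.0], noise_var=1.0)
```

In this toy setting the iteration converges to the maximum-likelihood prior, whose mean is the sample mean of the observations and whose variance is the observed variance minus the noise variance.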

Title: Augmenting Document-level Relation Extraction with Efficient Multi-Supervision

  • Authors: Xiangyu Lin, Weijia Jia, Zhiguo Gong
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Augmenting Document-level Relation Extraction with Efficient Multi-Supervision(https://arxiv.org/abs/)
  • Keywords: robust, extraction
  • Abstract: Despite its popularity in sentence-level relation extraction, distantly supervised data is rarely utilized by existing work in document-level relation extraction due to its noisy nature and low information density. Among its current applications, distantly supervised data is mostly used as a whole for pretraining, which is of low time efficiency. To fill in the gap of efficient and robust utilization of distantly supervised training data, we propose Efficient Multi-Supervision for document-level relation extraction, in which we first select a subset of informative documents from the massive dataset by combining distant supervision with expert supervision, then train the model with Multi-Supervision Ranking Loss that integrates the knowledge from multiple sources of supervision to alleviate the effects of noise. The experiments demonstrate the effectiveness of our method in improving the model performance with higher time efficiency than existing baselines.

Title: Blind Inversion using Latent Diffusion Priors

  • Authors: Weimin Bai, Siyi Chen, Wenzheng Chen, He Sun
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Blind Inversion using Latent Diffusion Priors(https://arxiv.org/abs/)
  • Keywords: diffusion
  • Abstract: Diffusion models have emerged as powerful tools for solving inverse problems due to their exceptional ability to model complex prior distributions. However, existing methods predominantly assume known forward operators (i.e., non-blind), limiting their applicability in practical settings where acquiring such operators is costly. Additionally, many current approaches rely on pixel-space diffusion models, leaving the potential of more powerful latent diffusion models (LDMs) underexplored. In this paper, we introduce LatentDEM, an innovative technique that addresses more challenging blind inverse problems using latent diffusion priors. At the core of our method is solving blind inverse problems within an iterative Expectation-Maximization (EM) framework: (1) the E-step recovers clean images from corrupted observations using LDM priors and a known forward model, and (2) the M-step estimates the forward operator based on the recovered images. Additionally, we propose two novel optimization techniques tailored for LDM priors and EM frameworks, yielding more accurate and efficient blind inversion results. As a general framework, LatentDEM supports both linear and non-linear inverse problems. Beyond common 2D image restoration tasks, it enables new capabilities in non-linear 3D inverse rendering problems. We validate LatentDEM's performance on representative 2D blind deblurring and 3D sparse-view reconstruction tasks, demonstrating its superior efficacy over prior arts.

Title: EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting

  • Authors: Chenxin Li, Brandon Y. Feng, Yifan Liu, Hengyu Liu, Cheng Wang, Weihao Yu, Yixuan Yuan
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: 3D reconstruction of biological tissues from a collection of endoscopic images is a key to unlock various important downstream surgical applications with 3D capabilities. Existing methods employ various advanced neural rendering techniques for photorealistic view synthesis, but they often struggle to recover accurate 3D representations when only sparse observations are available, which is usually the case in real-world clinical scenarios. To tackle this sparsity challenge, we propose a framework leveraging the prior knowledge from multiple foundation models during the reconstruction process, dubbed EndoSparse. Experimental results indicate that our proposed strategy significantly improves the geometric and appearance quality under challenging sparse-view conditions, including using only three views. In rigorous benchmarking experiments against state-of-the-art methods, EndoSparse achieves superior results in terms of accurate geometry, realistic appearance, and rendering efficiency, confirming the robustness to sparse-view limitations in endoscopic reconstruction. EndoSparse signifies a steady step towards the practical deployment of neural 3D reconstruction in real-world clinical scenarios. Project page: this https URL.

Title: PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs

  • Authors: Dan Peng, Zhihui Fu, Jun Wang
  • Subjects: cs.LG, cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs(https://arxiv.org/abs/)
  • Keywords: privacy, large language model
  • Abstract: Recent advancements in large language models (LLMs) have indeed showcased their impressive capabilities. On mobile devices, the wealth of valuable, non-public data generated daily holds great promise for locally fine-tuning personalized LLMs, while maintaining privacy through on-device processing. However, the constraints of mobile device resources pose challenges to direct on-device LLM fine-tuning, mainly due to the memory-intensive nature of derivative-based optimization required for saving gradients and optimizer states. To tackle this, we propose employing derivative-free optimization techniques to enable on-device fine-tuning of LLM, even on memory-limited mobile devices. Empirical results demonstrate that the RoBERTa-large model and OPT-1.3B can be fine-tuned locally on the OPPO Reno 6 smartphone using around 4GB and 6.5GB of memory respectively, using derivative-free optimization techniques. This highlights the feasibility of on-device LLM fine-tuning on mobile devices, paving the way for personalized LLMs on resource-constrained devices while safeguarding data privacy.
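
The abstract does not name the derivative-free optimizer. A minimal sketch, assuming a two-point zeroth-order update with a random sign perturbation (an SPSA-style rule chosen here for illustration; `zeroth_order_step` and `quadratic_loss` are hypothetical names). The point it shows is that each step needs only two forward evaluations of the loss, with no stored gradients or optimizer states:

```python
import random


def zeroth_order_step(weights, loss_fn, lr=0.1, eps=1e-3, rng=random):
    """One derivative-free update using two loss evaluations and no backpropagation."""
    # Random +/-1 perturbation direction, one entry per weight.
    z = [rng.choice((-1.0, 1.0)) for _ in weights]
    plus = [w + eps * d for w, d in zip(weights, z)]
    minus = [w - eps * d for w, d in zip(weights, z)]
    # Finite-difference estimate of the directional derivative along z.
    g = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)
    return [w - lr * g * d for w, d in zip(weights, z)]


def quadratic_loss(w, target=(1.0, -2.0, 3.0)):
    """Toy stand-in for a fine-tuning loss (assumption for the demo)."""
    return sum((a - b) ** 2 for a, b in zip(w, target))


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [0.0, 0.0, 0.0]
initial_loss = quadratic_loss(weights)
for _ in range(200):
    weights = zeroth_order_step(weights, quadratic_loss, rng=rng)
final_loss = quadratic_loss(weights)
```

The trade-off is variance: each step follows a single random direction, so more steps are typically needed than with backpropagated gradients, in exchange for a much smaller memory footprint.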

Title: Overcoming Common Flaws in the Evaluation of Selective Classification Systems

  • Authors: Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, Paul F Jaeger
  • Subjects: cs.LG, cs.CV, stat.ME
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Overcoming Common Flaws in the Evaluation of Selective Classification Systems(https://arxiv.org/abs/)
  • Keywords: interpretability
  • Abstract: Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.
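
One way to compute such a metric empirically can be sketched as follows, assuming the generalized risk at a given coverage is the fraction of all samples that are both accepted and misclassified, and approximating the area by averaging over the coverage levels obtained by accepting the k most confident samples. The function name and this discretization are assumptions, and ties in confidence are not treated specially:

```python
def augrc(confidences, failures):
    """Approximate the area under the generalized risk-coverage curve.

    `failures[i]` is 1 if prediction i is wrong and 0 otherwise.
    """
    n = len(confidences)
    if n == 0 or n != len(failures):
        raise ValueError("need equally sized, non-empty inputs")
    # Accept samples from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    accepted_failures = 0
    total = 0.0
    for i in order:
        accepted_failures += failures[i]
        # Generalized risk: undetected failures relative to ALL samples.
        total += accepted_failures / n
    return total / n


# Same error rate, different confidence rankings.
well_ranked = augrc([0.9, 0.8, 0.7, 0.6], [0, 0, 1, 1])
badly_ranked = augrc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
```

Lower is better: when failures receive the lowest confidence they are rejected first, so fewer of them remain undetected across coverage levels.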

Title: SE(3)-Hyena Operator for Scalable Equivariant Learning

  • Authors: Artem Moskalev, Mangal Prakash, Rui Liao, Tommaso Mansi
  • Subjects: cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] SE(3)-Hyena Operator for Scalable Equivariant Learning(https://arxiv.org/abs/)
  • Keywords: transformer
  • Abstract: Modeling global geometric context while maintaining equivariance is crucial for accurate predictions in many fields such as biology, chemistry, or vision. Yet, this is challenging due to the computational demands of processing high-dimensional data at scale. Existing approaches, such as equivariant self-attention or distance-based message passing, suffer from quadratic complexity with respect to sequence length, while localized methods sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, in this work, we introduce the SE(3)-Hyena operator, an equivariant long-convolutional model based on the Hyena operator. SE(3)-Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on equivariant associative recall and n-body modeling, SE(3)-Hyena matches or outperforms equivariant self-attention while requiring significantly less memory and computational resources for long sequences. Our model processes the geometric context of 20k tokens 3.5x faster than the equivariant transformer and allows a 175x longer context within the same memory budget.

Title: Improve ROI with Causal Learning and Conformal Prediction

  • Authors: Meng Ai, Zhuo Chen, Jibin Wang, Jing Shang, Tao Tao, Zhen Li
  • Subjects: cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Improve ROI with Causal Learning and Conformal Prediction(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: In the commercial sphere, such as operations and maintenance, advertising, and marketing recommendations, intelligent decision-making utilizing data mining and neural network technologies is crucial, especially in resource allocation to optimize ROI. This study delves into the Cost-aware Binary Treatment Assignment Problem (C-BTAP) across different industries, with a focus on the state-of-the-art Direct ROI Prediction (DRP) method. However, the DRP model confronts issues like covariate shift and insufficient training data, hindering its real-world effectiveness. Addressing these challenges is essential for ensuring dependable and robust predictions in varied operational contexts. This paper presents a robust Direct ROI Prediction (rDRP) method, designed to address challenges in real-world deployment of neural network-based uplift models, particularly under conditions of covariate shift and insufficient training data. The rDRP method, enhancing the standard DRP model, does not alter the model's structure or require retraining. It utilizes conformal prediction and Monte Carlo dropout for interval estimation, adapting to model uncertainty and data distribution shifts. A heuristic calibration method, inspired by a Kaggle competition, combines point and interval estimates. The effectiveness of these approaches is validated through offline tests and online A/B tests in various settings, demonstrating significant improvements in target rewards compared to the state-of-the-art method.
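
The abstract mentions conformal prediction for interval estimation without giving details. As a hedged illustration of the general idea only, the sketch below shows standard split conformal prediction for a regression-style point estimate. The function names and the symmetric absolute-residual score are assumptions and are not taken from the rDRP method.

```python
import math

def conformal_quantile(residuals, alpha=0.1):
    """Finite-sample-corrected quantile of calibration residuals |y - y_hat|.

    Returns the k-th smallest residual with k = ceil((n + 1) * (1 - alpha)),
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]

def conformal_interval(prediction, q):
    """Symmetric prediction interval around a point estimate."""
    return prediction - q, prediction + q

# Toy usage: three calibration residuals and a 50% target level.
q = conformal_quantile([3.0, 1.0, 2.0], alpha=0.5)
low, high = conformal_interval(10.0, q)
```

Because the procedure only wraps the outputs of an already trained model, it is consistent with the abstract's statement that the model structure is unchanged and no retraining is required.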

Title: Multimodal Conditional 3D Face Geometry Generation

  • Authors: Christopher Otto, Prashanth Chandran, Sebastian Weiss, Markus Gross, Gaspard Zoss, Derek Bradley
  • Subjects: cs.CV, cs.GR
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Multimodal Conditional 3D Face Geometry Generation(https://arxiv.org/abs/)
  • Keywords: diffusion
  • Abstract: We present a new method for multimodal conditional 3D face geometry generation that allows user-friendly control over the output identity and expression via a number of different conditioning signals. Within a single model, we demonstrate 3D faces generated from artistic sketches, 2D face landmarks, Canny edges, FLAME face model parameters, portrait photos, or text prompts. Our approach is based on a diffusion process that generates 3D geometry in a 2D parameterized UV domain. Geometry generation passes each conditioning signal through a set of cross-attention layers (IP-Adapter), one set for each user-defined conditioning signal. The result is an easy-to-use 3D face generation tool that produces high resolution geometry with fine-grain user control.

Title: Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese

  • Authors: Yunqi Xu, Tianchi Cai, Jiyan Jiang, Xierui Song
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE). Despite the various FCE methods proposed earlier, these methods are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, as these methods may fail to detect the error types generated by other LLMs. To fill this gap, in this paper, we propose the first comprehensive FCE benchmark Face4RAG for RAG independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology for factuality inconsistency error and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or real-world error distributions. On the proposed benchmark, we discover the failure of existing FCE methods to detect the logical fallacy, which refers to a mismatch of logic structures between the answer and the retrieved reference. To fix this issue, we further propose a new method called L-Face4RAG with two novel designs of logic-preserving answer decomposition and fact-logic FCE. Extensive experiments show L-Face4RAG substantially outperforms previous methods for factual inconsistency detection on a wide range of tasks, notably beyond the RAG task from which it is originally motivated. Both the benchmark and our proposed method are publicly available at this https URL.

Title: Min P Sampling: Balancing Creativity and Coherence at High Temperature

  • Authors: Minh Nguyen, Andrew Baker, Andreas Kirsch, Clement Neo
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Min P Sampling: Balancing Creativity and Coherence at High Temperature(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Large Language Models (LLMs) generate long-form text by successively sampling the next token based on the probability distribution of the token vocabulary at each decoding step. Current popular truncation sampling methods such as top-$p$ sampling, also known as nucleus sampling, often struggle to balance coherence and creativity in generating text, particularly when using higher temperatures. To address this issue, we propose min-$p$, a dynamic truncation sampling method that establishes a minimum base percentage threshold for tokens, which scales according to the probability of the top candidate token. Through experiments on several benchmarks, such as GPQA, GSM8K and AlpacaEval Creative Writing, we demonstrate that min-$p$ improves the coherence and quality of generated text even at high temperatures, while also facilitating more creative and diverse outputs compared to top-$p$ and other sampling methods. As of writing, min-$p$ has been adopted by multiple open-source LLM implementations, and has been independently assessed by members of the open-source LLM community, further validating its practical utility and potential.
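
The truncation rule described in the abstract can be sketched as follows: keep only tokens whose probability is at least a base fraction of the top token's probability, then renormalize and sample. This is a minimal sketch, not the authors' reference implementation. The names `min_p_filter`, `sample_min_p`, and `p_base` are illustrative, and applying temperature before truncation is an assumption that implementations may handle differently.

```python
import math
import random

def min_p_filter(logits, p_base=0.1, temperature=1.0):
    """Return renormalized token probabilities after min-p truncation."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = p_base * max(probs)           # threshold scales with the top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)                    # > 0: the top token always survives
    return [p / kept_total for p in kept]

def sample_min_p(logits, p_base=0.1, temperature=1.0, rng=random):
    """Sample one token index from the min-p truncated distribution."""
    probs = min_p_filter(logits, p_base, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = min_p_filter([2.0, 1.0, -3.0], p_base=0.1)
token = sample_min_p([2.0, 1.0, -3.0], p_base=0.1, rng=random.Random(0))
```

When the model is confident, the top probability is high and the threshold prunes aggressively. When the distribution is flat, the threshold drops and more candidates remain, which matches the abstract's description of the method as dynamic.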

Title: Rethinking LLM-based Preference Evaluation

  • Authors: Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Jingang Wang, Zhenyu Chen, Jieyu Zhao, Hui Xiong
  • Subjects: cs.LG, cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Rethinking LLM-based Preference Evaluation(https://arxiv.org/abs/)
  • Keywords: fair, large language model
  • Abstract: Recently, large language model (LLM)-based preference evaluation has been widely adopted to compare pairs of model responses. However, a severe bias towards lengthy responses has been observed, raising concerns about the reliability of this evaluation method. In this work, we designed a series of controlled experiments to study the major impacting factors of the metric of LLM-based preference evaluation, i.e., win rate, and conclude that the win rate is affected by two axes of model response: desirability and information mass, where the former is length-independent and related to trustworthiness, and the latter is length-dependent and can be represented by conditional entropy. We find that length impacts the existing evaluations by influencing information mass. However, a reliable evaluation metric should not only assess content quality but also ensure that the assessment is not confounded by extraneous factors such as response length. Therefore, we propose a simple yet effective adjustment, AdapAlpaca, to the existing practice of win rate measurement. Specifically, by adjusting the lengths of reference answers to match the test model's answers within the same interval, we debias information mass relative to length, ensuring a fair model evaluation.

Title: M2QA: Multi-domain Multilingual Question Answering

  • Authors: Leon Engländer, Hannah Sterz, Clifton Poth, Jonas Pfeiffer, Ilia Kuznetsov, Iryna Gurevych
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] M2QA: Multi-domain Multilingual Question Answering(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: Generalization and robustness to input variation are core desiderata of machine learning research. Language varies along several axes, most importantly, language instance (e.g. French) and domain (e.g. news). While adapting NLP models to new languages within a single domain, or to new domains within a single language, is widely studied, research in joint adaptation is hampered by the lack of evaluation datasets. This prevents the transfer of NLP systems from well-resourced languages and domains to non-dominant language-domain combinations. To address this gap, we introduce M2QA, a multi-domain multilingual question answering benchmark. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. We use M2QA to explore cross-lingual cross-domain performance of fine-tuned models and state-of-the-art LLMs and investigate modular approaches to domain and language adaptation. We witness 1) considerable performance variations across domain-language combinations within model classes and 2) considerable performance drops between source and target language-domain combinations across all model sizes. We demonstrate that M2QA is far from solved, and new methods to effectively transfer both linguistic and domain-specific information are necessary. We make M2QA publicly available at this https URL.

Title: Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies

  • Authors: Ivan Drokin
  • Subjects: cs.CV, cs.AI, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies(https://arxiv.org/abs/)
  • Keywords: segmentation
  • Abstract: The emergence of Kolmogorov-Arnold Networks (KANs) has sparked significant interest and debate within the scientific community. This paper explores the application of KANs in the domain of computer vision (CV). We examine the convolutional version of KANs, considering various nonlinearity options beyond splines, such as Wavelet transforms and a range of polynomials. We propose a parameter-efficient design for Kolmogorov-Arnold convolutional layers and a parameter-efficient finetuning algorithm for pre-trained KAN models, as well as KAN convolutional versions of self-attention and focal modulation layers. We provide empirical evaluations conducted on MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, and HAM10000 datasets for image classification tasks. Additionally, we explore segmentation tasks, proposing U-Net-like architectures with KAN convolutions, and achieving state-of-the-art results on BUSI, GlaS, and CVC datasets. We summarize all of our findings in a preliminary design guide of KAN convolutional models for computer vision tasks. Furthermore, we investigate regularization techniques for KANs. All experimental code and implementations of convolutional layers and models, along with weights pre-trained on ImageNet1k, are available on GitHub via this https URL

Title: IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation

  • Authors: Senyu Han, Lu Chen, Li-Min Lin, Zhengshan Xu, Kai Yu
  • Subjects: cs.CL, cs.AI, cs.MA
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Large language models have demonstrated their capabilities in storyline creation and human-like character role-playing. Current language model agents mainly focus on reasonable behaviors at the level of individuals, and their behaviors might be hard to constrain at the level of the whole storyline. In this paper we introduce IBSEN, a director-actor coordinate agent framework that generates drama scripts and makes the plot played by agents more controllable. The director agent writes plot outlines that the user desires to see, instructs the actor agents to role-play their characters, and reschedules the plot when human players participate in the scenario to ensure the plot is progressing towards the objective. To evaluate the framework, we create a novel drama plot that involves several actor agents and check the interactions between them under the instruction of the director agent. Evaluation results show that our framework could generate complete, diverse drama scripts from only a rough outline of plot objectives, while maintaining the characteristics of characters in the drama. Our code and prompts are available at this https URL.

Title: Eliminating Position Bias of Language Models: A Mechanistic Approach

  • Authors: Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham M. Kakade, Hao Peng, Heng Ji
  • Subjects: cs.CL, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Eliminating Position Bias of Language Models: A Mechanistic Approach(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Specifically, we find that causal attention generally causes models to favor distant content, while relative positional encodings like RoPE prefer nearby ones based on the analysis of retrieval-augmented question answering (QA). Further, our empirical study on object detection reveals that position bias is also present in vision-language models (VLMs). Based on the above analyses, we propose to ELIMINATE position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a TRAINING-FREE ZERO-SHOT manner. Our method changes the causal attention to bidirectional attention between segments and utilizes model attention values to decide the relative orders of segments instead of using the order provided in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the segment level. By eliminating position bias, models achieve better performance and reliability in downstream tasks where position bias widely exists, such as LM-as-a-judge and retrieval-augmented QA. Notably, PINE is especially useful when adapting LMs for evaluating reasoning pairs: it consistently provides 8 to 10 percentage points performance gains in most cases, and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench reasoning subset.

Title: BERGEN: A Benchmarking Library for Retrieval-Augmented Generation

  • Authors: David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, Stéphane Clinchant
  • Subjects: cs.CL, cs.IR
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] BERGEN: A Benchmarking Library for Retrieval-Augmented Generation(https://arxiv.org/abs/)
  • Keywords: generative, large language model
  • Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve a large number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available under this https URL.

Title: Semantic-guided Adversarial Diffusion Model for Self-supervised Shadow Removal

  • Authors: Ziqi Zeng, Chen Zhao, Weiling Cai, Chenyu Dong
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Semantic-guided Adversarial Diffusion Model for Self-supervised Shadow Removal(https://arxiv.org/abs/)
  • Keywords: diffusion, generative
  • Abstract: Existing unsupervised methods have addressed the challenges of inconsistent paired data and tedious acquisition of ground-truth labels in shadow removal tasks. However, GAN-based training often faces issues such as mode collapse and unstable optimization. Furthermore, due to the complex mapping between shadow and shadow-free domains, merely relying on adversarial learning is not enough to capture the underlying relationship between the two domains, resulting in low quality of the generated images. To address these problems, we propose a semantic-guided adversarial diffusion framework for self-supervised shadow removal, which consists of two stages. In the first stage, a semantic-guided generative adversarial network (SG-GAN) is proposed to produce a coarse result and construct paired synthetic data through a cycle-consistent structure. In the second stage, the coarse result is refined with a diffusion-based restoration module (DBRM) to improve texture details and edge artifacts. Meanwhile, we propose a multi-modal semantic prompter (MSP) that aids in extracting accurate semantic information from real images and text, guiding the shadow removal network to restore images better in SG-GAN. We conduct experiments on multiple public datasets, and the experimental results demonstrate the effectiveness of our method.

Title: SecGenAI: Enhancing Security of Cloud-based Generative AI Applications within Australian Critical Technologies of National Interest

  • Authors: Christoforus Yoga Haryanto, Minh Hieu Vu, Trung Duc Nguyen, Emily Lomempow, Yulia Nurliana, Sona Taheri
  • Subjects: cs.CR, cs.AI, cs.CY, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] SecGenAI: Enhancing Security of Cloud-based Generative AI Applications within Australian Critical Technologies of National Interest(https://arxiv.org/abs/)
  • Keywords: secure, security, privacy, attack, robust, generative
  • Abstract: The rapid advancement of Generative AI (GenAI) technologies offers transformative opportunities within Australia's critical technologies of national interest while introducing unique security challenges. This paper presents SecGenAI, a comprehensive security framework for cloud-based GenAI applications, with a focus on Retrieval-Augmented Generation (RAG) systems. SecGenAI addresses functional, infrastructure, and governance requirements, integrating end-to-end security analysis to generate specifications emphasizing data privacy, secure deployment, and shared responsibility models. Aligned with Australian Privacy Principles, AI Ethics Principles, and guidelines from the Australian Cyber Security Centre and Digital Transformation Agency, SecGenAI mitigates threats such as data leakage, adversarial attacks, and model inversion. The framework's novel approach combines advanced machine learning techniques with robust security measures, ensuring compliance with Australian regulations while enhancing the reliability and trustworthiness of GenAI systems. This research contributes to the field of intelligent systems by providing actionable strategies for secure GenAI implementation in industry, fostering innovation in AI applications, and safeguarding national interests.

Title: Enabling Mixed Effects Neural Networks for Diverse, Clustered Data Using Monte Carlo Methods

  • Authors: Andrej Tschalzev, Paul Nitschke, Lukas Kirchdorfer, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt
  • Subjects: cs.LG, stat.ML
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Enabling Mixed Effects Neural Networks for Diverse, Clustered Data Using Monte Carlo Methods(https://arxiv.org/abs/)
  • Keywords: interpretability
  • Abstract: Neural networks often assume independence among input data samples, disregarding correlations arising from inherent clustering patterns in real-world datasets (e.g., due to different sites or repeated measurements). Recently, mixed effects neural networks (MENNs) which separate cluster-specific 'random effects' from cluster-invariant 'fixed effects' have been proposed to improve generalization and interpretability for clustered data. However, existing methods only allow for approximate quantification of cluster effects and are limited to regression and binary targets with only one clustering feature. We present MC-GMENN, a novel approach employing Monte Carlo methods to train Generalized Mixed Effects Neural Networks. We empirically demonstrate that MC-GMENN outperforms existing mixed effects deep learning models in terms of generalization performance, time complexity, and quantification of inter-cluster variance. Additionally, MC-GMENN is applicable to a wide range of datasets, including multi-class classification tasks with multiple high-cardinality categorical features. For these datasets, we show that MC-GMENN outperforms conventional encoding and embedding methods, simultaneously offering a principled methodology for interpreting the effects of clustering patterns.

Title: Comprehensive Dataset for Urban Streetlight Analysis

  • Authors: Eliza Femi Sherley S, Sanjay T, Shri Kaanth P, Jeffrey Samuel S
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Comprehensive Dataset for Urban Streetlight Analysis(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: This article includes a comprehensive collection of over 800 high-resolution streetlight images taken systematically from India's major streets, primarily in the Chennai region. The images were methodically collected following standardized methods to assure uniformity and quality. Each image has been labelled and grouped into directories based on binary class labels, which indicate whether each streetlight is functional or not. This organized dataset is intended to make it easier to train and evaluate deep neural networks, allowing for the creation of pre-trained models that have robust feature representations. Such models have several potential uses, such as improving smart city surveillance systems, automating street infrastructure monitoring, and increasing urban management efficiency. The availability of this dataset is intended to inspire future research and development in computer vision and smart city technologies, supporting innovation and practical solutions to urban infrastructure concerns. The dataset can be accessed at this https URL.

Title: Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?

  • Authors: Guillermo Marco, Julio Gonzalo, Ramón del Castillo, María Teresa Mateo Girona
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: It has become routine to report research results where Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks, and creative text writing is no exception. It seems natural, then, to raise the bid: Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer to this question, we have carried out a contest between Patricio Pron (an award-winning novelist, considered one of the best of his generation) and GPT-4 (one of the top performing LLMs), in the spirit of AI-human duels such as Deep Blue vs Kasparov and AlphaGo vs Lee Sedol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their titles and their opponent's. Then, we prepared an evaluation rubric inspired by Boden's definition of creativity, and we collected 5,400 manual assessments provided by literature critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer, and that such a level of autonomous creative writing skills probably cannot be reached simply with larger language models.

Title: Calibrated Large Language Models for Binary Question Answering

  • Authors: Patrizio Giovannotti, Alexander Gammerman
  • Subjects: cs.CL, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Calibrated Large Language Models for Binary Question Answering(https://arxiv.org/abs/)
  • Keywords: interpretability, large language model
  • Abstract: Quantifying the uncertainty of predictions made by large language models (LLMs) in binary text classification tasks remains a challenge. Calibration, in the context of LLMs, refers to the alignment between the model's predicted probabilities and the actual correctness of its predictions. A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct. We propose a novel approach that utilizes the inductive Venn-Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels. Our experiments on the BoolQ dataset using the Llama 2 model demonstrate that IVAP consistently outperforms the commonly used temperature scaling method for various label token choices, achieving well-calibrated probabilities while maintaining high predictive quality. Our findings contribute to the understanding of calibration techniques for LLMs and provide a practical solution for obtaining reliable uncertainty estimates in binary question answering tasks, enhancing the interpretability and trustworthiness of LLM predictions.

Title: Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation

  • Authors: Nadezhda Chirkova, Vassilina Nikoulina, Jean-Luc Meunier, Alexandre Bérard
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation(https://arxiv.org/abs/)
  • Keywords: robust, transformer
  • Abstract: We focus on multi-domain Neural Machine Translation, with the goal of developing efficient models which can handle data from various domains seen during training and are robust to domains unseen during training. We hypothesize that Sparse Mixture-of-Experts (SMoE) models are a good fit for this task, as they enable efficient model scaling, which helps to accommodate a variety of multi-domain data, and allow flexible sharing of parameters between domains, potentially enabling knowledge transfer between similar domains and limiting negative transfer. We conduct a series of experiments aimed at validating the utility of SMoE for the multi-domain scenario, and find that a straightforward width scaling of Transformer is a simpler and surprisingly more efficient approach in practice, and reaches the same performance level as SMoE. We also search for a better recipe for robustness of multi-domain systems, highlighting the importance of mixing-in a generic domain, i.e. Paracrawl, and introducing a simple technique, domain randomization.

Title: RMS-FlowNet++: Efficient and Robust Multi-Scale Scene Flow Estimation for Large-Scale Point Clouds

  • Authors: Ramy Battrawy, René Schuster, Didier Stricker
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] RMS-FlowNet++: Efficient and Robust Multi-Scale Scene Flow Estimation for Large-Scale Point Clouds(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: The proposed RMS-FlowNet++ is a novel end-to-end learning-based architecture for accurate and efficient scene flow estimation that can operate on high-density point clouds. For hierarchical scene flow estimation, existing methods rely on expensive Farthest-Point-Sampling (FPS) to sample the scenes, must find large correspondence sets across the consecutive frames and/or must search for correspondences at a full input resolution. While this can improve the accuracy, it reduces the overall efficiency of these methods and limits their ability to handle large numbers of points due to memory requirements. In contrast to these methods, our architecture is based on an efficient design for hierarchical prediction of multi-scale scene flow. To this end, we develop a special flow embedding block that has two advantages over the current methods: First, a smaller correspondence set is used, and second, the use of Random-Sampling (RS) is possible. In addition, our architecture does not need to search for correspondences at a full input resolution. Exhibiting high accuracy, our RMS-FlowNet++ provides a faster prediction than state-of-the-art methods, avoids high memory requirements and enables efficient scene flow on dense point clouds of more than 250K points at once. Our comprehensive experiments verify the accuracy of RMS-FlowNet++ on the established FlyingThings3D data set with different point cloud densities and validate our design choices. Furthermore, we demonstrate that our model has a competitive ability to generalize to the real-world scenes of the KITTI data set without fine-tuning.

Title: An Empirical Comparison of Generative Approaches for Product Attribute-Value Identification

  • Authors: Kassem Sabeh, Robert Litschko, Mouna Kacimi, Barbara Plank, Johann Gamper
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] An Empirical Comparison of Generative Approaches for Product Attribute-Value Identification(https://arxiv.org/abs/)
  • Keywords: generative
  • Abstract: Product attributes are crucial for e-commerce platforms, supporting applications like search, recommendation, and question answering. The task of Product Attribute and Value Identification (PAVI) involves identifying both attributes and their values from product information. In this paper, we formulate PAVI as a generation task and provide, to the best of our knowledge, the most comprehensive evaluation of PAVI so far. We compare three different attribute-value generation (AVG) strategies based on fine-tuning encoder-decoder models on three datasets. Experiments show that the end-to-end AVG approach, which is computationally efficient, outperforms other strategies. However, there are differences depending on model sizes and the underlying language model. The code to reproduce all experiments is available at: this https URL

Title: Integrated feature analysis for deep learning interpretation and class activation maps

  • Authors: Yanli Li, Tahereh Hassanzadeh, Denis P. Shamonin, Monique Reijnierse, Annette H.M. van der Helm-van Mil, Berend C. Stoel
  • Subjects: cs.CV, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Integrated feature analysis for deep learning interpretation and class activation maps(https://arxiv.org/abs/)
  • Keywords: interpretability
  • Abstract: Understanding the decisions of deep learning (DL) models is essential for the acceptance of DL in risk-sensitive applications. Although methods like class activation maps (CAMs) give a glimpse into the black box, they miss some crucial information, thereby limiting their interpretability and merely providing the considered locations of objects. To provide more insight into the models and the influence of datasets, we propose an integrated feature analysis method, which consists of feature distribution analysis and feature decomposition, to look closer into the intermediate features extracted by DL models. This integrated feature analysis could provide information on overfitting, confounders, outliers in datasets, model redundancies and principal features extracted by the models, and provide distribution information to form a common intensity scale, which are missing in current CAM algorithms. The integrated feature analysis was applied to eight different datasets for general validation: photographs of handwritten digits, two datasets of natural images and five medical datasets, including skin photography, ultrasound, CT, X-rays and MRIs. The method was evaluated by calculating the consistency between the CAMs' average class activation levels and the logits of the model. Based on the eight datasets, the correlation coefficients through our method were all very close to 100%, and based on the feature decomposition, 5%-25% of features could generate equally informative saliency maps and obtain the same model performances as using all features. This proves the reliability of the integrated feature analysis. As the proposed methods rely on very few assumptions, this is a step towards better model interpretation and a useful extension to existing CAM algorithms. Codes: this https URL

Title: CPT: Consistent Proxy Tuning for Black-box Optimization

  • Authors: Yuanyang He, Zitong Huang, Xinxing Xu, Rick Siow Mong Goh, Salman Khan, Wangmeng Zuo, Yong Liu, Chun-Mei Feng
  • Subjects: cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] CPT: Consistent Proxy Tuning for Black-box Optimization(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Black-box tuning has attracted recent attention because the structure and inner parameters of advanced proprietary models are not accessible. Proxy-tuning provides a test-time output adjustment for tuning black-box language models. It applies the difference of the output logits before and after tuning a smaller white-box "proxy" model to improve the black-box model. However, this technique serves only as a decoding-time algorithm, leading to an inconsistency between training and testing which potentially limits overall performance. To address this problem, we introduce Consistent Proxy Tuning (CPT), a simple yet effective black-box tuning method. Different from Proxy-tuning, CPT additionally exploits the frozen large black-box model and another frozen small white-box model, ensuring consistency between training-stage optimization objective and test-time proxies. This consistency benefits Proxy-tuning and enhances model performance. Note that our method focuses solely on logit-level computation, which makes it model-agnostic and applicable to any task involving logit classification. Extensive experimental results demonstrate the superiority of our CPT in both black-box tuning of Large Language Models (LLMs) and Vision-Language Models (VLMs) across various datasets. The code is available at this https URL.
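
The decoding-time logit adjustment that the abstract attributes to Proxy-tuning can be sketched as below. This is an illustrative sketch only, not the CPT training procedure and not the authors' code: the function names and the three-class toy logits are assumptions. It adds the difference between the tuned and untuned small-model logits to the black-box model's logits.

```python
import math


def proxy_tuned_logits(large_logits, small_tuned_logits, small_base_logits):
    """Shift the black-box logits by (tuned proxy - untuned proxy)."""
    if not (len(large_logits) == len(small_tuned_logits) == len(small_base_logits)):
        raise ValueError("all logit vectors must cover the same classes/tokens")
    return [
        large + (tuned - base)
        for large, tuned, base in zip(large_logits, small_tuned_logits, small_base_logits)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over three classes (assumed values).
large = [2.0, 1.0, 0.0]        # frozen black-box model
small_tuned = [0.0, 2.0, 0.0]  # small white-box proxy after tuning
small_base = [0.0, 0.0, 0.0]   # same proxy before tuning

adjusted = proxy_tuned_logits(large, small_tuned, small_base)
print(adjusted)  # -> [2.0, 3.0, 0.0]
print(softmax(adjusted))
```

In this toy case the preferred class moves from index 0 to index 1 because tuning raised the proxy's logit for that class. According to the abstract, CPT's contribution is to use the same combined logits in the training objective, so that training matches what is done at test time.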

Title: Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models

  • Authors: Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu
  • Subjects: cs.CV, cs.AI, cs.CL, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models(https://arxiv.org/abs/)
  • Keywords: attack, robust
  • Abstract: Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.

Title: Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation

  • Authors: Takyoung Kim, Kyungjae Lee, Young Rok Jang, Ji Yong Cho, Gangwoo Kim, Minseok Cho, Moontae Lee
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Interactions with billion-scale large language models typically yield long-form responses due to their extensive parametric capacities, along with retrieval-augmented features. While detailed responses provide insightful viewpoints on a specific subject, they frequently generate redundant and less engaging content that does not meet user interests. In this work, we focus on the role of query outlining (i.e., selected sequence of queries) in scenarios in which users request a specific range of information, namely coverage-conditioned ($C^2$) scenarios. For simulating $C^2$ scenarios, we construct QTree, 10K sets of information-seeking queries decomposed with various perspectives on certain topics. By utilizing QTree, we train QPlanner, a 7B language model generating customized query outlines that follow coverage-conditioned queries. We analyze the effectiveness of generated outlines through automatic and human evaluation, targeting retrieval-augmented generation (RAG). Moreover, the experimental results demonstrate that QPlanner with alignment training can further provide outlines satisfying diverse user interests. Our resources are available at this https URL.

Title: Multi-View Black-Box Physical Attacks on Infrared Pedestrian Detectors Using Adversarial Infrared Grid

  • Authors: Kalibinuer Tiliwalidi, Chengyin Hu, Weiwen Shi
  • Subjects: cs.CV, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Multi-View Black-Box Physical Attacks on Infrared Pedestrian Detectors Using Adversarial Infrared Grid(https://arxiv.org/abs/)
  • Keywords: security, defense, attack, robust, steal
  • Abstract: While extensive research exists on physical adversarial attacks within the visible spectrum, studies on such techniques in the infrared spectrum are limited. Infrared object detectors are vital in modern technological applications but are susceptible to adversarial attacks, posing significant security threats. Previous studies using physical perturbations like light bulb arrays and aerogels for white-box attacks, or hot and cold patches for black-box attacks, have proven impractical or limited in multi-view support. To address these issues, we propose the Adversarial Infrared Grid (AdvGrid), which models perturbations in a grid format and uses a genetic algorithm for black-box optimization. These perturbations are cyclically applied to various parts of a pedestrian's clothing to facilitate multi-view black-box physical attacks on infrared pedestrian detectors. Extensive experiments validate AdvGrid's effectiveness, stealthiness, and robustness. The method achieves attack success rates of 80.00% in digital environments and 91.86% in physical environments, outperforming baseline methods. Additionally, the average attack success rate exceeds 50% against mainstream detectors, demonstrating AdvGrid's robustness. Our analyses include ablation studies, transfer attacks, and adversarial defenses, confirming the method's superiority.

Title: $\text{Memory}^3$: Language Modeling with Explicit Memory

  • Authors: Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, Chenyang Xi, Yu Yu, Kai Chen, Feiyu Xiong, Linpeng Tang, Weinan E
  • Subjects: cs.CL, cs.AI, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] $\text{Memory}^3$: Language Modeling with Explicit Memory(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining "abstract knowledge". As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named $\text{Memory}^3$, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.

Title: Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection

  • Authors: Francesco Barbato, Umberto Michieli, Jijoong Moon, Pietro Zanuttigh, Mete Ozay
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection(https://arxiv.org/abs/)
  • Keywords: privacy
  • Abstract: Recent years have seen object detection robotic systems deployed in several personal devices (e.g., home robots and appliances). This has highlighted a challenge in their design, i.e., they cannot efficiently update their knowledge to distinguish between general classes and user-specific instances (e.g., a dog vs. user's dog). We refer to this challenging task as Instance-level Personalized Object Detection (IPOD). The personalization task requires many samples for model tuning and optimization in a centralized server, raising privacy concerns. An alternative is provided by approaches based on recent large-scale Foundation Models, but their compute costs preclude on-device applications. In our work we tackle both problems at the same time, designing a Few-Shot IPOD strategy called AuXFT. We introduce a conditional coarse-to-fine few-shot learner to refine the coarse predictions made by an efficient object detector, showing that using an off-the-shelf model leads to poor personalization due to neural collapse. Therefore, we introduce a Translator block that generates an auxiliary feature space where features generated by a self-supervised model (e.g., DINOv2) are distilled without impacting the performance of the detector. We validate AuXFT on three publicly available datasets and one in-house benchmark designed for the IPOD task, achieving remarkable gains in all considered scenarios with excellent time-complexity trade-off: AuXFT reaches a performance of 80% of its upper bound at just 32% of the inference time, 13% of VRAM and 19% of the model size.

Title: A Learned Generalized Geodesic Distance Function-Based Approach for Node Feature Augmentation on Graphs

  • Authors: Amitoz Azad, Yuan Fang
  • Subjects: cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] A Learned Generalized Geodesic Distance Function-Based Approach for Node Feature Augmentation on Graphs(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: Geodesic distances on manifolds have numerous applications in image processing, computer graphics and computer vision. In this work, we introduce an approach called 'LGGD' (Learned Generalized Geodesic Distances). This method involves generating node features by learning a generalized geodesic distance function through a training pipeline that incorporates training data, graph topology and the node content features. The strength of this method lies in the proven robustness of the generalized geodesic distances to noise and outliers. Our contributions encompass improved performance in node classification tasks, competitive results with state-of-the-art methods on real-world graph datasets, the demonstration of the learnability of parameters within the generalized geodesic equation on graphs, and dynamic inclusion of new labels.

Title: SCIF: A Language for Compositional Smart Contract Security

  • Authors: Siqiu Yao, Haobin Ni, Andrew C. Myers, Ethan Cecchetti
  • Subjects: cs.CR, cs.PL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] SCIF: A Language for Compositional Smart Contract Security(https://arxiv.org/abs/)
  • Keywords: secure, security, protect, attack
  • Abstract: Securing smart contracts remains a fundamental challenge. At its core, it is about building software that is secure in composition with untrusted code, a challenge that extends far beyond blockchains. We introduce SCIF, a language for building smart contracts that are compositionally secure. SCIF is based on the fundamentally compositional principle of secure information flow, but extends this core mechanism to include protection against reentrancy attacks, confused deputy attacks, and improper error handling, even in the presence of malicious contracts that do not follow SCIF's rules. SCIF supports a rich ecosystem of interacting principals with partial trust through its mechanisms for dynamic trust management. SCIF has been implemented as a compiler to Solidity. We describe the SCIF language, including its static checking rules and runtime. Finally, we implement several applications with intricate security reasoning, showing how SCIF supports building complex smart contracts securely and gives programmers accurate diagnostics about potential security bugs.

Title: Efficient Cutting Tool Wear Segmentation Based on Segment Anything Model

  • Authors: Zongshuo Li, Ding Huo, Markus Meurer, Thomas Bergs
  • Subjects: cs.CV, cs.AI, cs.LG, eess.IV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Efficient Cutting Tool Wear Segmentation Based on Segment Anything Model(https://arxiv.org/abs/)
  • Keywords: segmentation
  • Abstract: Tool wear conditions impact the surface quality of the workpiece and its final geometric precision. In this research, we propose an efficient tool wear segmentation approach based on Segment Anything Model, which integrates U-Net as an automated prompt generator to streamline the processes of tool wear detection. Our evaluation covered three Point-of-Interest generation methods and further investigated the effects of variations in training dataset sizes and U-Net training intensities on resultant wear segmentation outcomes. The results consistently highlight our approach's advantage over U-Net, emphasizing its ability to achieve accurate wear segmentation even with limited training datasets. This feature underscores its potential applicability in industrial scenarios where datasets may be limited.

Title: EconNLI: Evaluating Large Language Models on Economics Reasoning

  • Authors: Yue Guo, Yi Yang
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] EconNLI: Evaluating Large Language Models on Economics Reasoning(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Large Language Models (LLMs) are widely used for writing economic analysis reports or providing financial advice, but their ability to understand economic knowledge and reason about potential results of specific economic events lacks systematic evaluation. To address this gap, we propose a new dataset, natural language inference on economic events (EconNLI), to evaluate LLMs' knowledge and reasoning abilities in the economic domain. We evaluate LLMs on (1) their ability to correctly classify whether a premise event will cause a hypothesis event and (2) their ability to generate reasonable events resulting from a given premise. Our experiments reveal that LLMs are not sophisticated in economic reasoning and may generate wrong or hallucinated answers. Our study raises awareness of the limitations of using LLMs for critical decision-making involving economic reasoning and analysis. The dataset and codes are available at this https URL.

Title: Searching for Best Practices in Retrieval-Augmented Generation

  • Authors: Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Searching for Best Practices in Retrieval-Augmented Generation(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a "retrieval as generation" strategy.

Title: Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation

  • Authors: Zihan Gao, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Yuwei Guo, Shuyuan Yang
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation(https://arxiv.org/abs/)
  • Keywords: segmentation
  • Abstract: Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enable open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. While effective, the per-pixel distillation of high-dimensional CLIP features introduces ambiguity and necessitates complex regularization strategies, adding inefficiencies during training. This paper presents MaskField, which enables fast and efficient 3D open-vocabulary segmentation with neural fields under weak supervision. Unlike previous methods, MaskField distills masks rather than dense high-dimensional CLIP features. MaskField employs neural fields as binary mask generators and supervises them with masks generated by SAM and classified by coarse CLIP features. MaskField overcomes the ambiguous object boundaries by naturally introducing SAM segmented object shapes without extra regularization during training. By circumventing the direct handling of high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence, outperforming previous methods with just 5 minutes of training. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.

Title: DaBiT: Depth and Blur informed Transformer for Joint Refocusing and Super-Resolution

  • Authors: Crispian Morris, Nantheera Anantrasirichai, Fan Zhang, David Bull
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] DaBiT: Depth and Blur informed Transformer for Joint Refocusing and Super-Resolution(https://arxiv.org/abs/)
  • Keywords: transformer, segmentation
  • Abstract: In many real-world scenarios, recorded videos suffer from accidental focus blur, and while video deblurring methods exist, most specifically target motion blur. This paper introduces a framework optimised for the joint task of focal deblurring (refocusing) and video super-resolution (VSR). The proposed method employs novel map guided transformers, in addition to image propagation, to effectively leverage the continuous spatial variance of focal blur and restore the footage. We also introduce a flow re-focusing module to efficiently align relevant features between the blurry and sharp domains. Additionally, we propose a novel technique for generating synthetic focal blur data, broadening the model's learning capabilities to include a wider array of content. We have made a new benchmark dataset, DAVIS-Blur, available. This dataset, a modified extension of the popular DAVIS video segmentation set, provides realistic out-of-focus blur degradations as well as the corresponding blur maps. Comprehensive experiments on DAVIS-Blur demonstrate the superiority of our approach. We achieve state-of-the-art results with an average PSNR performance over 1.9dB greater than comparable existing video restoration methods. Our source code will be made available at this https URL

Title: MIRAI: Evaluating LLM Agents for Event Forecasting

  • Authors: Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] MIRAI: Evaluating LLM Agents for Event Forecasting(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information, over which to conduct reasoning to solve complex problems. Given this capability, there is increasing interest in employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite such a growing interest, there is a lack of a rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code using domain-specific APIs and libraries for tool-use; and 3) jointly reasoning over historical knowledge from diverse formats and time to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relation analysis.

Title: A Fingerprint for Large Language Models

  • Authors: Zhiguang Yang, Hanzhou Wu
  • Subjects: cs.CR
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] A Fingerprint for Large Language Models(https://arxiv.org/abs/)
  • Keywords: protect, attack, robust, large language model
  • Abstract: Recent advances show that scaling a pre-trained language model could achieve state-of-the-art performance on many downstream tasks, prompting large language models (LLMs) to become a hot research topic in the field of artificial intelligence. However, due to the resource-intensive nature of training LLMs from scratch, it is urgent and crucial to protect the intellectual property of LLMs against infringement. This has motivated the authors in this paper to propose a novel black-box fingerprinting technique for LLMs, which requires neither model training nor model fine-tuning. We first demonstrate that the outputs of LLMs span a unique vector space associated with each model. We model the problem of ownership authentication as the task of evaluating the similarity between the victim model's space and the output space of the suspect model. To deal with this problem, we propose two solutions, where the first solution involves verifying whether the outputs of the suspected large model are in the same space as those of the victim model, enabling rapid identification of model infringement, and the second one reconstructs the union of the vector spaces for LLM outputs and the victim model to address situations where the victim model has undergone Parameter-Efficient Fine-Tuning (PEFT) attacks. Experimental results indicate that the proposed technique achieves superior performance in ownership verification and robustness against PEFT attacks. This work reveals inherent characteristics of LLMs and provides a promising solution for ownership verification of LLMs in black-box scenarios, ensuring efficiency, generality and practicality.

Title: SGCCNet: Single-Stage 3D Object Detector With Saliency-Guided Data Augmentation and Confidence Correction Mechanism

  • Authors: Ao Liang, Wenyu Chen, Jian Fang, Huaici Zhao
  • Subjects: cs.CV, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] SGCCNet: Single-Stage 3D Object Detector With Saliency-Guided Data Augmentation and Confidence Correction Mechanism(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: The single-stage point-based 3D object detectors have attracted widespread research interest due to their lightweight design and fast inference speed. However, they still face challenges such as inadequate learning of low-quality objects (ILQ) and misalignment between localization accuracy and classification confidence (MLC). In this paper, we propose SGCCNet to alleviate these two issues. For ILQ, SGCCNet adopts a Saliency-Guided Data Augmentation (SGDA) strategy to enhance the robustness of the model on low-quality objects by reducing its reliance on salient features. Specifically, we construct a classification task and then approximate the saliency scores of points by moving points towards the point cloud centroid in a differentiable process. During the training process, SGCCNet will be forced to learn from low saliency features through dropping points. Meanwhile, to avoid internal covariate shift and contextual features forgetting caused by dropping points, we add a geometric normalization module and skip connection block in each stage. For MLC, we design a Confidence Correction Mechanism (CCM) specifically for point-based multi-class detectors. This mechanism corrects the confidence of the current proposal by utilizing the predictions of other key points within the local region in the post-processing stage. Extensive experiments on the KITTI dataset demonstrate the generality and effectiveness of our SGCCNet. On the KITTI test set, SGCCNet achieves 80.82% for the metric of $AP_{3D}$ on the Moderate level, outperforming all other point-based detectors, surpassing IA-SSD and Fast Point R-CNN by 2.35% and 3.42%, respectively. Additionally, SGCCNet demonstrates excellent portability for other point-based detectors.

Title: CLHOP: Combined Audio-Video Learning for Horse 3D Pose and Shape Estimation

  • Authors: Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] CLHOP: Combined Audio-Video Learning for Horse 3D Pose and Shape Estimation(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: In the monocular setting, predicting 3D pose and shape of animals typically relies solely on visual information, which is highly under-constrained. In this work, we explore using audio to enhance 3D shape and motion recovery of horses from monocular video. We test our approach on two datasets: an indoor treadmill dataset for 3D evaluation and an outdoor dataset capturing diverse horse movements, the latter being a contribution to this study. Our results show that incorporating sound with visual data leads to more accurate and robust motion regression. This study is the first to investigate audio's role in 3D animal motion recovery.

Title: QUEEN: Query Unlearning against Model Extraction

  • Authors: Huajie Chen, Tianqing Zhu, Lefeng Zhang, Bo Liu, Derui Wang, Wanlei Zhou, Minhui Xue
  • Subjects: cs.CR, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] QUEEN: Query Unlearning against Model Extraction(https://arxiv.org/abs/)
  • Keywords: security, privacy, protect, defense, attack, steal, extraction, watermark
  • Abstract: Model extraction attacks currently pose a non-negligible threat to the security and privacy of deep learning models. By querying the model with a small dataset and using the query results as the ground-truth labels, an adversary can steal a piracy model with performance comparable to the original model. Two key issues cause the threat: on the one hand, the adversary can obtain accurate and unlimited queries; on the other hand, the adversary can aggregate the query results to train the model step by step. Existing defenses usually employ model watermarking or fingerprinting to protect ownership. However, these methods cannot proactively prevent the violation from happening. To mitigate the threat, we propose QUEEN (QUEry unlEarNing), which proactively launches counterattacks on potential model extraction attacks from the very beginning. To limit the potential threat, QUEEN combines sensitivity measurement with output perturbation, which prevents the adversary from training a piracy model with high performance. In sensitivity measurement, QUEEN measures the sensitivity of a single query by its distance from the center of its cluster in the feature space. To reduce the learning accuracy of attacks, for a highly sensitive query batch, QUEEN applies query unlearning, which is implemented by gradient reversal to perturb the softmax output such that the piracy model will generate reverse gradients and unknowingly worsen its own performance. Experiments show that QUEEN outperforms the state-of-the-art defenses against various model extraction attacks with a relatively low cost to the model accuracy. The artifact is publicly available at https://anonymous.4open.science/r/queen implementation-5408/.
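
The sensitivity measurement described in the abstract relies on the distance between a query's feature vector and the center of its cluster. The following is a minimal sketch of that distance computation only, assuming Euclidean distance and a mean-based cluster center; the function names and toy feature vectors are hypothetical, and the abstract does not say how QUEEN maps this distance to a sensitivity score.

```python
import math


def cluster_center(points):
    """Mean of the feature vectors in one cluster (assumed definition of 'center')."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def distance_to_cluster_center(query_feature, cluster_points):
    """Euclidean distance from a query's feature vector to its cluster center."""
    return math.dist(query_feature, cluster_center(cluster_points))


# Toy 2-D features; a real system would use the model's feature space.
cluster = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
near = distance_to_cluster_center([1.0, 1.0], cluster)
far = distance_to_cluster_center([4.0, 5.0], cluster)
print(near, far)
```

Both the choice of metric and the clustering step that assigns a query to a cluster are left open here.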

Title: DeepiSign-G: Generic Watermark to Stamp Hidden DNN Parameters for Self-contained Tracking

  • Authors: Alsharif Abuadbba, Nicholas Rhodes, Kristen Moore, Bushra Sabir, Shuo Wang, Yansong Gao
  • Subjects: cs.CR
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] DeepiSign-G: Generic Watermark to Stamp Hidden DNN Parameters for Self-contained Tracking(https://arxiv.org/abs/)
  • Keywords: security, defense, attack, robust, watermark
  • Abstract: Deep learning solutions in critical domains like autonomous vehicles, facial recognition, and sentiment analysis require caution due to the severe consequences of errors. Research shows these models are vulnerable to adversarial attacks, such as data poisoning and neural trojaning, which can covertly manipulate model behavior, compromising reliability and safety. Current defense strategies like watermarking have limitations: they fail to detect all model modifications and primarily focus on attacks on CNNs in the image domain, neglecting other critical architectures like RNNs. To address these gaps, we introduce DeepiSign-G, a versatile watermarking approach designed for comprehensive verification of leading DNN architectures, including CNNs and RNNs. DeepiSign-G enhances model security by embedding an invisible watermark within the Walsh-Hadamard transform coefficients of the model's parameters. This watermark is highly sensitive and fragile, ensuring prompt detection of any modifications. Unlike traditional hashing techniques, DeepiSign-G allows substantial metadata incorporation directly within the model, enabling detailed, self-contained tracking and verification. We demonstrate DeepiSign-G's applicability across various architectures, including CNN models (VGG, ResNets, DenseNet) and RNNs (Text sentiment classifier). We experiment with four popular datasets: VGG Face, CIFAR10, GTSRB Traffic Sign, and Large Movie Review. We also evaluate DeepiSign-G under five potential attacks. Our comprehensive evaluation confirms that DeepiSign-G effectively detects these attacks without compromising CNN and RNN model performance, highlighting its efficacy as a robust security measure for deep learning applications. Detection of integrity breaches is nearly perfect, while hiding only a bit in approximately 1% of the Walsh-Hadamard coefficients.
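
The abstract states that the watermark lives in the Walsh-Hadamard transform coefficients of the model's parameters. As a hedged illustration of the transform itself (not of DeepiSign-G's embedding or verification scheme, which the abstract does not detail), the sketch below implements an unnormalised fast Walsh-Hadamard transform and shows that changing a single parameter alters the coefficients. The function names and the toy weight vector are assumptions made for this example.

```python
def fwht(values):
    """Unnormalised fast Walsh-Hadamard transform; length must be a power of two."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y
        h *= 2
    return a


def inverse_fwht(coeffs):
    """Inverse transform: apply fwht again and divide by the length."""
    n = len(coeffs)
    return [c / n for c in fwht(coeffs)]


weights = [1.0, 0.0, 1.0, 0.0]           # toy stand-in for a block of parameters
coeffs = fwht(weights)
restored = inverse_fwht(coeffs)

tampered = [1.0, 0.0, 1.0, 0.5]          # one parameter modified
tampered_coeffs = fwht(tampered)
print(coeffs, restored, tampered_coeffs)
```

Because each coefficient mixes every input value, a local parameter change spreads across the coefficient vector, which is consistent with the fragility the abstract describes.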

Title: Complementary Fusion of Deep Network and Tree Model for ETA Prediction

  • Authors: YuRui Huang, Jie Zhang, HengDa Bao, Yang Yang, Jian Yang
  • Subjects: cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Complementary Fusion of Deep Network and Tree Model for ETA Prediction(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: Estimated time of arrival (ETA) is a very important factor in the transportation system. It has attracted increasing attention and has been widely used as a basic service in navigation systems and intelligent transportation systems. In this paper, we propose a novel solution to the ETA estimation problem, which is an ensemble of tree models and neural networks. We validated the accuracy and robustness of the solution on the A/B list and finally won first place in the SIGSPATIAL 2021 GISCUP competition.

Title: The African Woman is Rhythmic and Soulful: Evaluation of Open-ended Generation for Implicit Biases

  • Authors: Serene Lim
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] The African Woman is Rhythmic and Soulful: Evaluation of Open-ended Generation for Implicit Biases(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: This study investigates the subtle and often concealed biases present in Large Language Models (LLMs), which, despite passing explicit bias tests, can still exhibit implicit biases akin to those observed in humans who profess egalitarian beliefs yet demonstrate underlying prejudices. The challenge of measuring such biases is exacerbated as LLMs become increasingly proprietary, restricting access to their internal mechanisms such as embeddings, which are crucial for applying traditional bias measures. To tackle these issues, this study introduces innovative measures of bias inspired by psychological methodologies: the LLM Implicit Association Test (IAT) Bias and the LLM Decision Bias. The LLM IAT Bias is a prompt-based method designed to unearth implicit biases by simulating the well-known psychological IAT but adapted for use with LLMs. The LLM Decision Bias measure is developed to detect subtle discrimination in decision-making tasks, focusing on how LLMs choose between individuals in various scenarios. Open-ended generation is also utilised through thematic analysis of word generations and storytelling. The experiments revealed biases across gender and racial domains, from discriminatory categorisations to exoticisation. Our findings indicate that the prompt-based measure of implicit bias not only correlates with traditional embedding-based methods but also more effectively predicts downstream behaviors, which are crucially measured by the LLM Decision Bias. This relationship underscores the importance of relative, rather than absolute, evaluations in assessing implicit biases, reflecting psychological insights into human bias assessment. This research contributes to the broader understanding of AI ethics and provides suggestions for continually assessing and mitigating biases in advanced AI systems, emphasising the need for more qualitative and downstream focus.

Title: Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER

  • Authors: Andrew Zamai, Andrea Zugarini, Leonardo Rigutini, Marco Ernandes, Marco Maggini
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER(https://arxiv.org/abs/)
  • Keywords: robust, large language model
  • Abstract: Recently, several specialized instruction-tuned Large Language Models (LLMs) for Named Entity Recognition (NER) have emerged. Compared to traditional NER approaches, these models have strong generalization capabilities. Existing LLMs mainly focus on zero-shot NER in out-of-domain distributions, being fine-tuned on an extensive number of entity classes that often highly or completely overlap with test sets. In this work instead, we propose SLIMER, an approach designed to tackle never-seen-before named entity tags by instructing the model on fewer examples, and by leveraging a prompt enriched with definition and guidelines. Experiments demonstrate that definition and guidelines yield better performance, faster and more robust learning, particularly when labelling unseen Named Entities. Furthermore, SLIMER performs comparably to state-of-the-art approaches in out-of-domain zero-shot NER, while being trained on a reduced tag set.

Title: Small Aerial Target Detection for Airborne Infrared Detection Systems using LightGBM and Trajectory Constraints

  • Authors: Xiaoliang Sun, Liangchao Guo, Wenlong Zhang, Zi Wang, Qifeng Yu
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Small Aerial Target Detection for Airborne Infrared Detection Systems using LightGBM and Trajectory Constraints(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: Factors, such as rapid relative motion, clutter background, etc., make robust small aerial target detection for airborne infrared detection systems a challenge. Existing methods are facing difficulties when dealing with such cases. We consider that a continuous and smooth trajectory is critical in boosting small infrared aerial target detection performance. A simple and effective small aerial target detection method for airborne infrared detection system using light gradient boosting model (LightGBM) and trajectory constraints is proposed in this article. First, we simply formulate target candidate detection as a binary classification problem. Target candidates in every individual frame are detected via interesting pixel detection and a trained LightGBM model. Then, the local smoothness and global continuous characteristic of the target trajectory are modeled as short-strict and long-loose constraints. The trajectory constraints are used efficiently for detecting the true small infrared aerial targets from numerous target candidates. Experiments on public datasets demonstrate that the proposed method performs better than other existing methods. Furthermore, a public dataset for small aerial target detection in airborne infrared detection systems is constructed. To the best of our knowledge, this dataset has the largest data scale and richest scene types within this field.

Title: Hypformer: Exploring Efficient Hyperbolic Transformer Fully in Hyperbolic Space

  • Authors: Menglin Yang, Harshit Verma, Delvin Ce Zhang, Jiahong Liu, Irwin King, Rex Ying
  • Subjects: cs.LG, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Hypformer: Exploring Efficient Hyperbolic Transformer Fully in Hyperbolic Space(https://arxiv.org/abs/)
  • Keywords: transformer
  • Abstract: Hyperbolic geometry has shown significant potential in modeling complex structured data, particularly data with underlying tree-like and hierarchical structures. Despite the impressive performance of various hyperbolic neural networks across numerous domains, research on adapting the Transformer to hyperbolic space remains limited. Previous attempts have mainly focused on modifying self-attention modules in the Transformer. However, these efforts have fallen short of developing a complete hyperbolic Transformer. This stems primarily from: (i) the absence of well-defined modules in hyperbolic space, including linear transformation layers, LayerNorm layers, activation functions, dropout operations, etc.; (ii) the quadratic time complexity of the existing hyperbolic self-attention module w.r.t. the number of input tokens, which hinders its scalability. To address these challenges, we propose Hypformer, a novel hyperbolic Transformer based on the Lorentz model of hyperbolic geometry. In Hypformer, we introduce two foundational blocks that define the essential modules of the Transformer in hyperbolic space. Furthermore, we develop a linear self-attention mechanism in hyperbolic space, enabling a hyperbolic Transformer to process billion-scale graph data and long-sequence inputs for the first time. Our experimental results confirm the effectiveness and efficiency of Hypformer across various datasets, demonstrating its potential as an effective and scalable solution for large-scale data representation and large models.
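
For readers unfamiliar with the Lorentz model mentioned in the abstract, here is a minimal sketch of its basic operations, assuming curvature -1: the Lorentzian inner product, lifting a Euclidean vector onto the hyperboloid, and the geodesic distance. This is standard hyperbolic geometry rather than any Hypformer module; the helper names are hypothetical.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining coordinate products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to (sqrt(1 + |v|^2), v), so that <x, x> = -1."""
    x0 = math.sqrt(1.0 + sum(a * a for a in v))
    return [x0, *v]


def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid (curvature -1 assumed)."""
    # Clamp to 1.0 to guard against rounding pushing the argument below acosh's domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
dist = lorentz_distance(origin, point)
print(lorentz_inner(point, point), dist)
```

The point is chosen so that its distance from the origin is 1 up to floating-point error, which makes the example easy to check by hand.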

Title: Formal Verification of Object Detection

  • Authors: Avraham Raviv, Yizhak Y. Elboher, Michelle Aluf-Medina, Yael Leibovich Weiss, Omer Cohen, Roy Assa, Guy Katz, Hillel Kugler
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Formal Verification of Object Detection(https://arxiv.org/abs/)
  • Keywords: attack, robust
  • Abstract: Deep Neural Networks (DNNs) are ubiquitous in real-world applications, yet they remain vulnerable to errors and adversarial attacks. This work tackles the challenge of applying formal verification to ensure the safety of computer vision models, extending verification beyond image classification to object detection. We propose a general formulation for certifying the robustness of object detection models using formal verification and outline implementation strategies compatible with state-of-the-art verification tools. Our approach enables the application of these tools, originally designed for verifying classification models, to object detection. We define various attacks for object detection, illustrating the diverse ways adversarial inputs can compromise neural network outputs. Our experiments, conducted on several common datasets and networks, reveal potential errors in object detection models, highlighting system vulnerabilities and emphasizing the need for expanding formal verification to these new domains. This work paves the way for further research in integrating formal verification across a broader range of computer vision applications.

Title: Preserving Full Degradation Details for Blind Image Super-Resolution

  • Authors: Hongda Liu, Longguang Wang, Ye Zhang, Kaiwen Xue, Shunbo Zhou, Yulan Guo
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Preserving Full Degradation Details for Blind Image Super-Resolution(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: The performance of image super-resolution relies heavily on the accuracy of degradation information, especially under blind settings. Due to the absence of true degradation models in real-world scenarios, previous methods learn distinct representations by distinguishing different degradations in a batch. However, the most significant degradation differences may provide shortcuts for the learning of representations such that subtle differences may be discarded. In this paper, we propose an alternative that learns degradation representations by reproducing degraded low-resolution (LR) images. By guiding the degrader to reconstruct input LR images, full degradation information can be encoded into the representations. In addition, we develop an energy distance loss to facilitate the learning of the degradation representations by introducing a bounded constraint. Experiments show that our representations can extract accurate and highly robust degradation information. Moreover, evaluations on both synthetic and real images demonstrate that our ReDSR achieves state-of-the-art performance for the blind SR tasks.

Title: Collaborative Performance Prediction for Large Language Models

  • Authors: Qiyuan Zhang, Fuyuan Lyu, Xue Liu, Chen Ma
  • Subjects: cs.CL, cs.AI, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Collaborative Performance Prediction for Large Language Models(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. Pioneering work on scaling laws for downstream tasks demonstrated intrinsic similarities within model families and utilized such similarities for performance prediction. However, these works tend to overlook the similarities between model families and only consider design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks and other design factors for both model and task. We also collect collaborative data sourced from online platforms containing both historical performance and additional design factors. With the support of the collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance, an area previously overlooked.

Title: GaussianStego: A Generalizable Stenography Pipeline for Generative 3D Gaussians Splatting

  • Authors: Chenxin Li, Hengyu Liu, Zhiwen Fan, Wuyang Li, Yifan Liu, Panwang Pan, Yixuan Yuan
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] GaussianStego: A Generalizable Stenography Pipeline for Generative 3D Gaussians Splatting(https://arxiv.org/abs/)
  • Keywords: extraction, generative
  • Abstract: Recent advancements in large generative models and real-time neural rendering using point-based techniques pave the way for a future of widespread visual data distribution through sharing synthesized 3D assets. However, while standardized methods for embedding proprietary or copyright information, either overtly or subtly, exist for conventional visual content such as images and videos, this issue remains unexplored for emerging generative 3D formats like Gaussian Splatting. We present GaussianStego, a method for embedding steganographic information in the rendering of generated 3D assets. Our approach employs an optimization framework that enables the accurate extraction of hidden information from images rendered using Gaussian assets derived from large models, while maintaining their original visual quality. We conduct preliminary evaluations of our method across several potential deployment scenarios and discuss issues identified through analysis. GaussianStego represents an initial exploration into the novel challenge of embedding customizable, imperceptible, and recoverable information within the renders produced by current 3D generative models, while ensuring minimal impact on the rendered content's quality.

Title: Robot Instance Segmentation with Few Annotations for Grasping

  • Authors: Moshe Kimhi, David Vainshtein, Chaim Baskin, Dotan Di Castro
  • Subjects: cs.CV, cs.AI, cs.RO
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Robot Instance Segmentation with Few Annotations for Grasping(https://arxiv.org/abs/)
  • Keywords: segmentation
  • Abstract: The ability of robots to manipulate objects relies heavily on their aptitude for visual perception. In domains characterized by cluttered scenes and high object variability, most methods call for vast labeled datasets, laboriously hand-annotated, with the aim of training capable models. Once deployed, the challenge of generalizing to unfamiliar objects implies that the model must evolve alongside its domain. To address this, we propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI), allowing a model to learn by observing scene alterations and leverage visual consistency despite temporal gaps without requiring curated data of interaction sequences. As a result, our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance. Notably, on ARMBench, we attain an $\text{AP}_{50}$ of $86.37$, almost a $20\%$ improvement over existing work, and obtain remarkable results in scenarios with extremely low annotation, achieving an $\text{AP}_{50}$ score of $84.89$ with just $1 \%$ of annotated data compared to $72$ presented in ARMBench on the fully annotated counterpart.
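
The "almost 20%" figure in the abstract can be checked with a short calculation, assuming it is a relative improvement over the baseline value of 72 that the abstract attributes to ARMBench (the abstract does not state the base explicitly). The function name is hypothetical.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# AP50 values quoted in the abstract; 72 is assumed to be the comparison baseline.
full_annotation_gain = relative_improvement(86.37, 72.0)
one_percent_annotation_gain = relative_improvement(84.89, 72.0)
print(round(full_annotation_gain, 2), round(one_percent_annotation_gain, 2))
```

Under that assumption the fully annotated result is just under a 20% relative gain, and the 1% annotation setting is close to 18%.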

Title: Unveiling the Unseen: Exploring Whitebox Membership Inference through the Lens of Explainability

  • Authors: Chenxi Li, Abhinav Kumar, Zhen Guo, Jie Hou, Reza Tourani
  • Subjects: cs.LG, cs.CR
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Unveiling the Unseen: Exploring Whitebox Membership Inference through the Lens of Explainability(https://arxiv.org/abs/)
  • Keywords: privacy, attack, membership infer, explainability
  • Abstract: The increasing prominence of deep learning applications and reliance on personalized data underscore the urgent need to address privacy vulnerabilities, particularly Membership Inference Attacks (MIAs). Despite numerous MIA studies, significant knowledge gaps persist, particularly regarding the impact of hidden features (in isolation) on attack efficacy and insufficient justification for the root causes of attacks based on raw data features. In this paper, we aim to address these knowledge gaps by first exploring statistical approaches to identify the most informative neurons and quantifying the significance of the hidden activations from the selected neurons on attack accuracy, in isolation and combination. Additionally, we propose an attack-driven explainable framework by integrating the target and attack models to identify the most influential features of raw data that lead to successful membership inference attacks. Our proposed MIA shows an improvement of up to 26% on state-of-the-art MIA.

Title: Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Action Spaces

  • Authors: Perusha Moodley, Pramod Kaushik, Dhillu Thambi, Mark Trovinger, Praveen Paruchuri, Xia Hong, Benjamin Rosman
  • Subjects: cs.LG, cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Action Spaces(https://arxiv.org/abs/)
  • Keywords: interpretability, transformer
  • Abstract: Decision Transformers, in their vanilla form, struggle to perform on image-based environments with multi-discrete action spaces. Although enhanced Decision Transformer architectures have been developed to improve performance, these methods have not specifically addressed this problem of multi-discrete action spaces which hampers existing Decision Transformer architectures from learning good representations. To mitigate this, we propose Multi-State Action Tokenisation (M-SAT), an approach for tokenising actions in multi-discrete action spaces that enhances the model's performance in such environments. Our approach involves two key changes: disentangling actions to the individual action level and tokenising the actions with auxiliary state information. These two key changes also improve individual action level interpretability and visibility within the attention layers. We demonstrate the performance gains of M-SAT on challenging ViZDoom environments with multi-discrete action spaces and image-based state spaces, including the Deadly Corridor and My Way Home scenarios, where M-SAT outperforms the baseline Decision Transformer without any additional data or heavy computational overheads. Additionally, we find that removing positional encoding does not adversely affect M-SAT's performance and, in some cases, even improves it.

Title: Evaluating Model Performance Under Worst-case Subpopulations

  • Authors: Mike Li, Hongseok Namkoong, Shangzhou Xia
  • Subjects: cs.LG, cs.CY, stat.ML
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Evaluating Model Performance Under Worst-case Subpopulations(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
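
One way to make the "worst-case performance over all subpopulations of a given size" concrete is the conditional value-at-risk (CVaR) of the per-example loss. The sketch below assumes that characterisation and assumes the inputs are already estimates of the loss conditional on Z (the first stage of a two-stage procedure); neither assumption is confirmed by the abstract, and the function name and toy numbers are hypothetical.

```python
def worst_case_subpopulation_loss(conditional_losses, alpha):
    """Empirical CVaR via the dual form min over eta of eta + mean(max(l - eta, 0)) / alpha.

    `alpha` is the minimum subpopulation proportion, with 0 < alpha <= 1.
    The objective is convex and piecewise linear in eta, so for an empirical
    sample the minimum is attained at one of the observed loss values.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    n = len(conditional_losses)

    def objective(eta):
        excess = sum(max(loss - eta, 0.0) for loss in conditional_losses) / n
        return eta + excess / alpha

    return min(objective(eta) for eta in conditional_losses)


losses = [0.1, 0.2, 0.3, 0.9]   # toy estimates of the loss given Z
worst_half = worst_case_subpopulation_loss(losses, alpha=0.5)
whole_population = worst_case_subpopulation_loss(losses, alpha=1.0)
print(worst_half, whole_population)
```

With alpha = 0.5 the result is the average of the two largest losses, and with alpha = 1.0 it reduces to the ordinary mean, which gives a quick sanity check on the formula.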

Title: Gradient-based Class Weighting for Unsupervised Domain Adaptation in Dense Prediction Visual Tasks

  • Authors: Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescós
  • Subjects: cs.CV, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Gradient-based Class Weighting for Unsupervised Domain Adaptation in Dense Prediction Visual Tasks(https://arxiv.org/abs/)
  • Keywords: transformer, segmentation
  • Abstract: In unsupervised domain adaptation (UDA), where models are trained on source data (e.g., synthetic) and adapted to target data (e.g., real-world) without target annotations, addressing the challenge of significant class imbalance remains an open issue. Despite considerable progress in bridging the domain gap, existing methods often experience performance degradation when confronted with highly imbalanced dense prediction visual tasks like semantic and panoptic segmentation. This discrepancy becomes especially pronounced due to the lack of equivalent priors between the source and target domains, rendering class-imbalance techniques used in other areas (e.g., image classification) ineffective in UDA scenarios. This paper proposes a class-imbalance mitigation strategy that incorporates class weights into the UDA learning losses, with the novelty of estimating these weights dynamically through the loss gradient, defining Gradient-based class weighting (GBW) learning. GBW naturally increases the contribution of classes whose learning is hindered by highly represented classes, and has the advantage of adapting automatically and quickly to the training outcomes of each iteration, avoiding the explicit curricular learning patterns common in loss-weighting strategies. Extensive experimentation validates the effectiveness of GBW across architectures (convolutional and transformer), UDA strategies (adversarial, self-training and entropy minimization), tasks (semantic and panoptic segmentation), and datasets (GTA and Synthia). Analysing the source of the advantage, we find that GBW consistently increases the recall of under-represented classes.

Title: CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes

  • Authors: Danial Qashqai, Emad Mousavian, Shahriar Baradaran Shokouhi, Sattar Mirzakuchaki
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes(https://arxiv.org/abs/)
  • Keywords: segmentation
  • Abstract: Semantic segmentation, as a crucial component of complex visual interpretation, plays a fundamental role in autonomous vehicle vision systems. Recent studies have significantly improved the accuracy of semantic segmentation by exploiting complementary information and developing multimodal methods. Despite the gains in accuracy, multimodal semantic segmentation methods suffer from high computational complexity and low inference speed. Therefore, it is a challenging task to implement multimodal methods in driving applications. To address this problem, we propose the Cosine Similarity Fusion Network (CSFNet) as a real-time RGB-X semantic segmentation model. Specifically, we design a Cosine Similarity Attention Fusion Module (CS-AFM) that effectively rectifies and fuses features of two modalities. The CS-AFM module leverages cross-modal similarity to achieve high generalization ability. By enhancing the fusion of cross-modal features at lower levels, CS-AFM paves the way for the use of a single-branch network at higher levels. Therefore, we use dual and single-branch architectures in an encoder, along with an efficient context module and a lightweight decoder for fast and accurate predictions. To verify the effectiveness of CSFNet, we use the Cityscapes, MFNet, and ZJU datasets for the RGB-D/T/P semantic segmentation. According to the results, CSFNet has competitive accuracy with state-of-the-art methods while being state-of-the-art in terms of speed among multimodal semantic segmentation models. It also achieves high efficiency due to its low parameter count and computational complexity. The source code for CSFNet will be available at this https URL.

Title: Learning Unsigned Distance Fields from Local Shape Functions for 3D Surface Reconstruction

  • Authors: Jiangbei Hu, Yanggeng Li, Fei Hou, Junhui Hou, Zhebin Zhang, Shengfa Wang, Na Lei, Ying He
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Learning Unsigned Distance Fields from Local Shape Functions for 3D Surface Reconstruction(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: Unsigned distance fields (UDFs) provide a versatile framework for representing a diverse array of 3D shapes, encompassing both watertight and non-watertight geometries. Traditional UDF learning methods typically require extensive training on large datasets of 3D shapes, which is costly and often necessitates hyperparameter adjustments for new datasets. This paper presents a novel neural framework, LoSF-UDF, for reconstructing surfaces from 3D point clouds by leveraging local shape functions to learn UDFs. We observe that 3D shapes manifest simple patterns within localized areas, prompting us to create a training dataset of point cloud patches characterized by mathematical functions that represent a continuum from smooth surfaces to sharp edges and corners. Our approach learns features within a specific radius around each query point and utilizes an attention mechanism to focus on the crucial features for UDF estimation. This method enables efficient and robust surface reconstruction from point clouds without the need for shape-specific training. Additionally, our method exhibits enhanced resilience to noise and outliers in point clouds compared to existing methods. We present comprehensive experiments and comparisons across various datasets, including synthetic and real-scanned point clouds, to validate our method's efficacy.

Title: Restyling Unsupervised Concept Based Interpretable Networks with Generative Models

  • Authors: Jayneel Parekh, Quentin Bouniot, Pavlo Mozharovskyi, Alasdair Newson, Florence d'Alché-Buc
  • Subjects: cs.CV, cs.AI, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Restyling Unsupervised Concept Based Interpretable Networks with Generative Models(https://arxiv.org/abs/)
  • Keywords: generative
  • Abstract: Developing inherently interpretable models for prediction has gained prominence in recent years. A subclass of these models, wherein the interpretable network relies on learning high-level concepts, is valued because of the closeness of concept representations to human communication. However, the visualization and understanding of the learnt unsupervised dictionary of concepts encounters major limitations, especially for large-scale images. We propose here a novel method that relies on mapping the concept features to the latent space of a pretrained generative model. The use of a generative model enables high quality visualization, and naturally lays out an intuitive and interactive procedure for better interpretation of the learnt concepts. Furthermore, leveraging pretrained generative models has the additional advantage of making the training of the system more efficient. We quantitatively ascertain the efficacy of our method in terms of accuracy of the interpretable prediction network, fidelity of reconstruction, as well as faithfulness and consistency of learnt concepts. The experiments are conducted on multiple image recognition benchmarks for large-scale images. Project page available at this https URL

Title: Protecting Privacy in Classifiers by Token Manipulation

  • Authors: Re'em Harel, Yair Elboher, Yuval Pinter
  • Subjects: cs.CL, cs.CR
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Protecting Privacy in Classifiers by Token Manipulation(https://arxiv.org/abs/)
  • Keywords: privacy, protect, attack
  • Abstract: Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the prospects of avoiding such data exposure at the level of text manipulation. We focus on text classification models, examining various token mapping and contextualized manipulation functions in order to see whether classifier accuracy may be maintained while keeping the original text unrecoverable. We find that although some token mapping functions are easy and straightforward to implement, they heavily influence performance on the downstream task, and the original text can be reconstructed by a sophisticated attacker. In comparison, the contextualized manipulation provides an improvement in performance.

Title: PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction

  • Authors: Xuan Yu, Yili Liu, Chenrui Han, Sitong Mao, Shunbo Zhou, Rong Xiong, Yiyi Liao, Yue Wang
  • Subjects: cs.CV, cs.RO
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction(https://arxiv.org/abs/)
  • Keywords: segmentation
  • Abstract: Panoptic reconstruction is a challenging task in 3D scene understanding. However, most existing methods heavily rely on pre-trained semantic segmentation models and known 3D object bounding boxes for 3D panoptic segmentation, which is not available for in-the-wild scenes. In this paper, we propose a novel zero-shot panoptic reconstruction method from RGB-D images of scenes. For zero-shot segmentation, we leverage open-vocabulary instance segmentation, but it has to face partial labeling and instance association challenges. We tackle both challenges by propagating partial labels with the aid of dense generalized features and building a 3D instance graph for associating 2D instance IDs. Specifically, we exploit partial labels to learn a classifier for generalized semantic features to provide complete labels for scenes with dense distilled features. Moreover, we formulate instance association as a 3D instance graph segmentation problem, allowing us to fully utilize the scene geometry prior and all 2D instance masks to infer global unique pseudo 3D instance ID. Our method outperforms state-of-the-art methods on the indoor dataset ScanNet V2 and the outdoor dataset KITTI-360, demonstrating the effectiveness of our graph segmentation method and reconstruction network.

Title: Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models

  • Authors: Xiaolin Xing, Zhiwei He, Haoyu Xu, Xing Wang, Rui Wang, Yu Hong
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models(https://arxiv.org/abs/)
  • Keywords: robust, interpretability, large language model
  • Abstract: This paper investigates the cross-lingual inconsistencies observed in Large Language Models (LLMs), such as ChatGPT, Llama, and Baichuan, which have shown exceptional performance in various Natural Language Processing (NLP) tasks. Despite their successes, these models often exhibit significant inconsistencies when processing the same concepts across different languages. This study focuses on three primary questions: the existence of cross-lingual inconsistencies in LLMs, the specific aspects in which these inconsistencies manifest, and the correlation between cross-lingual consistency and multilingual capabilities of LLMs. To address these questions, we propose an innovative evaluation method for Cross-lingual Semantic Consistency (xSC) using the LaBSE model. We further introduce metrics for Cross-lingual Accuracy Consistency (xAC) and Cross-lingual Timeliness Consistency (xTC) to comprehensively assess the models' performance regarding semantic, accuracy, and timeliness inconsistencies. By harmonizing these metrics, we provide a holistic measurement of LLMs' cross-lingual consistency. Our findings aim to enhance the understanding and improvement of multilingual capabilities and interpretability in LLMs, contributing to the development of more robust and reliable multilingual language models.

Title: TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation

  • Authors: André Sacilotti, Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation(https://arxiv.org/abs/)
  • Keywords: transformer
  • Abstract: Unsupervised domain adaptation (UDA) in videos is a challenging task that remains not well explored compared to image-based UDA techniques. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has still been little explored. Our key idea is to use the transformer layers as a feature encoder and incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge from different backbones. To improve the transferability of ViT, we introduce a novel and effective module named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by changing the self-attention mechanism to a transferability attention mechanism. Extensive experiments on UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC Drone datasets with different backbones, like ResNet101, I3D, and STAM, verify the effectiveness of TransferAttn compared with state-of-the-art approaches. Also, we demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both video and image domains. The code will be made freely available.

Title: Badllama 3: removing safety finetuning from Llama 3 in minutes

  • Authors: Dmitrii Volkov
  • Subjects: cs.LG, cs.AI, cs.CL, cs.CR
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Badllama 3: removing safety finetuning from Llama 3 in minutes(https://arxiv.org/abs/)
  • Keywords: attack
  • Abstract: We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

Title: Free-text Rationale Generation under Readability Level Control

  • Authors: Yi-Sheng Hsu, Nils Feldhus, Sherzod Hakimov
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Free-text Rationale Generation under Readability Level Control(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Free-text rationales justify model decisions in natural language and thus become likable and accessible among approaches to explanation across many tasks. However, their effectiveness can be hindered by misinterpretation and hallucination. As a perturbation test, we investigate how large language models (LLMs) perform the task of natural language explanation (NLE) under the effects of readability level control, i.e., being prompted for a rationale targeting a specific expertise level, such as sixth grade or college. We find that explanations are adaptable to such instruction, but the requested readability is often misaligned with the measured text complexity according to traditional readability metrics. Furthermore, the quality assessment shows that LLMs' ratings of rationales across text complexity exhibit a similar pattern of preference as observed in natural language generation (NLG). Finally, our human evaluation suggests a generally satisfactory impression on rationales at all readability levels, with high-school-level readability being most commonly perceived and favored.
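
The abstract does not name the readability metrics used. As one example of a traditional metric, the sketch below computes the Flesch-Kincaid grade level, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The choice of this metric and the vowel-run syllable heuristic are assumptions made for illustration, and the function names are hypothetical.

```python
import re

def count_syllables(word):
    """Heuristic syllable count: runs of vowels, at least one per word.
    Real syllabification is more involved; this is an approximation."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level of a non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Comparing such a score with the grade level requested in the prompt is one way to check whether a generated rationale actually lands at the requested readability.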

Title: Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

  • Authors: Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann
  • Subjects: cs.LG, cs.CV, cs.RO
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion(https://arxiv.org/abs/)
  • Keywords: diffusion, generative
  • Abstract: This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling-out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution. Project website: https://boyuan.space/diffusion-forcing
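
The central idea, independent per-token noise levels, can be sketched as follows. Each token in a sequence receives its own noise level rather than one level shared by the whole sequence. The linear schedule `alpha_bar`, the scalar tokens, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def alpha_bar(level, num_levels):
    """Hypothetical linear schedule: 1.0 at level 0 (clean token),
    0.0 at level num_levels (pure noise)."""
    return 1.0 - level / num_levels

def sample_independent_levels(seq_len, num_levels, rng):
    """Draw one noise level per token, independently of the others."""
    return [rng.randint(0, num_levels) for _ in range(seq_len)]

def noise_tokens(tokens, levels, num_levels, rng):
    """Noise each scalar token according to its own level."""
    noisy = []
    for x, level in zip(tokens, levels):
        a = alpha_bar(level, num_levels)
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * eps)
    return noisy
```

Setting every past level to 0 and every future level to the maximum recovers a next-token style setup, while a single shared level recovers full-sequence diffusion; the per-token scheme sits between the two.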

Title: Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

  • Authors: Pooya Fayyazsanavi, Antonios Anastasopoulos, Jana Košecká
  • Subjects: cs.CV, cs.CL, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on the Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and a novel label-smoothing loss function that exploits gloss translation ambiguities, significantly improving on the performance of state-of-the-art approaches. Through extensive experiments and ablation studies on the PHOENIX Weather 2014T dataset, our approach surpasses state-of-the-art performance in Gloss2Text translation, indicating its efficacy in addressing sign language translation and suggesting promising avenues for future research and development.

Title: GalLoP: Learning Global and Local Prompts for Vision-Language Models

  • Authors: Marc Lafon, Elias Ramzi, Clément Rambour, Nicolas Audebert, Nicolas Thome
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] GalLoP: Learning Global and Local Prompts for Vision-Language Models(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade off between classification accuracy and robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning method that learns multiple diverse prompts leveraging both global and local visual features. The training of the local prompts relies on local features with an enhanced vision-text alignment. To focus only on pertinent features, this local alignment is coupled with a sparsity strategy in the selection of the local features. We enforce diversity on the set of prompts using a new "prompt dropout" technique and a multiscale strategy on the local prompts. GalLoP outperforms previous prompt learning methods on accuracy on eleven datasets in different few-shot settings and with various backbones. Furthermore, GalLoP shows strong robustness performance in both domain generalization and OOD detection, even outperforming dedicated OOD detection methods. Code and instructions to reproduce our results will be open-sourced.

Title: Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters

  • Authors: Daniil Gurgurov, Mareike Hartmann, Simon Ostermann
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: This paper explores the integration of graph knowledge from linguistic ontologies into multilingual Large Language Models (LLMs) using adapters to improve performance for low-resource languages (LRLs) in sentiment analysis (SA) and named entity recognition (NER). Building upon successful parameter-efficient fine-tuning techniques, such as K-ADAPTER and MAD-X, we propose a similar approach for incorporating knowledge from multilingual graphs, connecting concepts in various languages with each other through linguistic relationships, into multilingual LLMs for LRLs. Specifically, we focus on eight LRLs -- Maltese, Bulgarian, Indonesian, Nepali, Javanese, Uyghur, Tibetan, and Sinhala -- and employ language-specific adapters fine-tuned on data extracted from the language-specific section of ConceptNet, aiming to enable knowledge transfer across the languages covered by the knowledge graph. We compare various fine-tuning objectives, including standard Masked Language Modeling (MLM), MLM with full-word masking, and MLM with targeted masking, to analyse their effectiveness in learning and integrating the extracted graph data. Through empirical evaluation on language-specific tasks, we assess how structured graph knowledge affects the performance of multilingual LLMs for LRLs in SA and NER, providing insights into the potential benefits of adapting language models for low-resource scenarios.

Title: Dynamic Few-Shot Learning for Knowledge Graph Question Answering

  • Authors: Jacopo D'Abramo, Andrea Zugarini, Paolo Torroni
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Dynamic Few-Shot Learning for Knowledge Graph Question Answering(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Large language models present opportunities for innovative Question Answering over Knowledge Graphs (KGQA). However, they are not inherently designed for query generation. To bridge this gap, solutions have been proposed that rely on fine-tuning or ad-hoc architectures, achieving good results but limited out-of-domain distribution generalization. In this study, we introduce a novel approach called Dynamic Few-Shot Learning (DFSL). DFSL integrates the efficiency of in-context learning and semantic similarity and provides a generally applicable solution for KGQA with state-of-the-art performance. We run an extensive evaluation across multiple benchmark datasets and architecture configurations.
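
A minimal sketch of similarity-based in-context example selection follows: for each incoming question, the most similar stored examples are retrieved and placed in the prompt as few-shot demonstrations. Token-overlap (Jaccard) similarity stands in here for whatever semantic similarity DFSL uses, and the prompt layout, field names, and function names are hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set of a text."""
    return set(re.findall(r"\w+", text.lower()))

def similarity(a, b):
    """Jaccard overlap, a stand-in for embedding-based similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def select_examples(question, pool, k):
    """Pick the k stored examples most similar to the question."""
    ranked = sorted(pool,
                    key=lambda ex: similarity(question, ex["question"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, pool, k=2):
    """Assemble a few-shot prompt from the selected examples."""
    shots = select_examples(question, pool, k)
    parts = [f"Question: {ex['question']}\nQuery: {ex['query']}"
             for ex in shots]
    parts.append(f"Question: {question}\nQuery:")
    return "\n\n".join(parts)
```

Because the demonstrations are chosen per question at inference time, no fine-tuning is involved; only the example pool and the similarity function need to be maintained.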

Title: HyperLoader: Integrating Hypernetwork-Based LoRA and Adapter Layers into Multi-Task Transformers for Sequence Labelling

  • Authors: Jesus-German Ortiz-Barajas, Helena Gomez-Adorno, Thamar Solorio
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] HyperLoader: Integrating Hypernetwork-Based LoRA and Adapter Layers into Multi-Task Transformers for Sequence Labelling(https://arxiv.org/abs/)
  • Keywords: transformer
  • Abstract: We present HyperLoader, a simple approach that combines different parameter-efficient fine-tuning methods in a multi-task setting. To achieve this goal, our model uses a hypernetwork to generate the weights of these modules based on the task, the transformer layer, and its position within this layer. Our method combines the benefits of multi-task learning by capturing the structure of all tasks while reducing the task interference problem by encapsulating the task-specific knowledge in the generated weights and the benefits of combining different parameter-efficient methods to outperform full-fine tuning. We provide empirical evidence that HyperLoader outperforms previous approaches in most datasets and obtains the best average performance across tasks in high-resource and low-resource scenarios.

Title: A Global-Local Attention Mechanism for Relation Classification

  • Authors: Yiping Sun
  • Subjects: cs.CL, cs.IR
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] A Global-Local Attention Mechanism for Relation Classification(https://arxiv.org/abs/)
  • Keywords: extraction
  • Abstract: Relation classification, a crucial component of relation extraction, involves identifying connections between two entities. Previous studies have predominantly focused on integrating the attention mechanism into relation classification at a global scale, overlooking the importance of the local context. To address this gap, this paper introduces a novel global-local attention mechanism for relation classification, which enhances global attention with a localized focus. Additionally, we propose innovative hard and soft localization mechanisms to identify potential keywords for local attention. By incorporating both hard and soft localization strategies, our approach offers a more nuanced and comprehensive understanding of the contextual cues that contribute to effective relation classification. Our experimental results on the SemEval-2010 Task 8 dataset highlight the superior performance of our method compared to previous attention-based approaches in relation classification.

Title: FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

  • Authors: Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Luming Liang
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] FORA: Fast-Forward Caching in Diffusion Transformer Acceleration(https://arxiv.org/abs/)
  • Keywords: diffusion, transformer
  • Abstract: Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos, largely due to their scalability, which enables the construction of larger models for enhanced performance. However, the increased size of these models leads to higher inference costs, making them less attractive for real-time applications. We present Fast-FORward CAching (FORA), a simple yet effective approach designed to accelerate DiT by exploiting the repetitive nature of the diffusion process. FORA implements a caching mechanism that stores and reuses intermediate outputs from the attention and MLP layers across denoising steps, thereby reducing computational overhead. This approach does not require model retraining and seamlessly integrates with existing transformer-based diffusion models. Experiments show that FORA can speed up diffusion transformers several times over while only minimally affecting performance metrics such as the IS Score and FID. By enabling faster processing with minimal trade-offs in quality, FORA represents a significant advancement in deploying diffusion transformers for real-time applications. Code will be made publicly available at: this https URL.
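
The caching idea can be sketched as a wrapper that recomputes a layer's output only on some denoising steps and reuses the stored output on the others. The fixed recompute interval, the class name `CachedLayer`, and the `calls` counter are assumptions made for illustration; the abstract does not specify the caching schedule.

```python
class CachedLayer:
    """Wraps a layer function and reuses its last output between
    recomputation steps (hypothetical fixed-interval schedule)."""

    def __init__(self, fn, interval):
        self.fn = fn              # e.g. an attention or MLP block
        self.interval = interval  # recompute every `interval` steps
        self.cache = None
        self.calls = 0            # number of real computations performed

    def __call__(self, x, step):
        if self.cache is None or step % self.interval == 0:
            self.cache = self.fn(x)
            self.calls += 1
        return self.cache
```

With an interval of 3, six denoising steps trigger only two real computations of the wrapped layer, which is the source of the speed-up; the price is that steps in between see a slightly stale output.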

Title: Maximizing Blockchain Performance: Mitigating Conflicting Transactions through Parallelism and Dependency Management

  • Authors: Faisal Haque Bappy, Tariqul Islam, Tarannum Shaila Zaman, Md Sajidul Islam Sajid, Mir Mehedi Ahsan Pritom
  • Subjects: cs.CR, cs.DC
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Maximizing Blockchain Performance: Mitigating Conflicting Transactions through Parallelism and Dependency Management(https://arxiv.org/abs/)
  • Keywords: secure, security
  • Abstract: While blockchains initially gained popularity in the realm of cryptocurrencies, their widespread adoption is expanding beyond conventional applications, driven by the imperative need for enhanced data security. Despite providing a secure network, blockchains come with certain tradeoffs, including high latency, lower throughput, and an increased number of transaction failures. A pivotal issue contributing to these challenges is the improper management of "conflicting transactions", commonly referred to as "contention". When a number of pending transactions within a blockchain collide with each other, this results in a state of contention. This situation worsens network latency, leads to the wastage of system resources, and ultimately contributes to reduced throughput and higher transaction failures. In response to this issue, in this work, we present a novel blockchain scheme that integrates transaction parallelism and an intelligent dependency manager aiming to reduce the occurrence of conflicting transactions within blockchain networks. In terms of effectiveness and efficiency, experimental results show that our scheme not only mitigates the challenges posed by conflicting transactions, but also outperforms both existing parallel and non-parallel Hyperledger Fabric blockchain networks achieving higher transaction success rate, throughput, and latency. The integration of our scheme with Hyperledger Fabric appears to be a promising solution for improving the overall performance and stability of blockchain networks in real-world applications.
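
One generic way to combine parallelism with dependency tracking is to group transactions into batches whose members do not touch conflicting keys, so each batch can run in parallel while batches run in order. The sketch below illustrates that general idea only; it is not the paper's scheme, and the read/write-set representation and function names are hypothetical.

```python
def conflicts(tx_a, tx_b):
    """Two transactions conflict if one writes a key the other
    reads or writes."""
    return bool(
        tx_a["writes"] & (tx_b["reads"] | tx_b["writes"])
        or tx_b["writes"] & tx_a["reads"]
    )

def schedule(transactions):
    """Place each transaction in the earliest batch that comes after
    every batch holding a conflicting earlier transaction."""
    batches = []
    for tx in transactions:
        target = 0
        for i, batch in enumerate(batches):
            if any(conflicts(tx, other) for other in batch):
                target = i + 1
        if target == len(batches):
            batches.append([])
        batches[target].append(tx)
    return batches
```

Conflicting transactions keep their submission order across batches, while independent ones share a batch instead of colliding and failing.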

Title: POST: Email Archival, Processing and Flagging Stack for Incident Responders

  • Authors: Jeffrey Fairbanks
  • Subjects: cs.CR, cs.IR, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] POST: Email Archival, Processing and Flagging Stack for Incident Responders(https://arxiv.org/abs/)
  • Keywords: security
  • Abstract: Phishing is one of the main points of compromise, with email security and awareness being estimated at $50-100B in 2022. There is great need for email forensics capability to quickly search for malicious content. A novel solution POST is proposed. POST is an API driven serverless email archival, processing, and flagging workflow for both large and small organizations that collects and parses all email, flags emails using state-of-the-art Natural Language Processing and Machine Learning, allows full email searching on every aspect of an email, and provides a cost savings of up to 68.6%.

Title: Scarecrow monitoring system:employing mobilenet ssd for enhanced animal supervision

  • Authors: Balaji VS, Mahi AR, Anirudh Ganapathy PS, Manju M
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Scarecrow monitoring system:employing mobilenet ssd for enhanced animal supervision(https://arxiv.org/abs/)
  • Keywords: protect, robust
  • Abstract: Agriculture faces a growing challenge with wildlife wreaking havoc on crops, threatening sustainability. The project employs advanced object detection: the system utilizes the MobileNet SSD model for real-time animal classification. The methodology initiates with the creation of a dataset, where each animal is represented by annotated images. The MobileNet SSD architecture facilitates the use of a model for image classification and object detection. The model undergoes fine-tuning and optimization during training, enhancing accuracy for precise animal classification. Real-time detection is achieved through a webcam and the OpenCV library, enabling prompt identification and categorization of approaching animals. By seamlessly integrating intelligent scarecrow technology with object detection, this system offers a robust solution to field protection, minimizing crop damage and promoting precision farming. It represents a valuable contribution to agricultural sustainability, addressing the challenge of wildlife interference with crops. The implementation of the Intelligent Scarecrow Monitoring System stands as a progressive tool for proactive field management and protection, empowering farmers with an advanced solution for precision agriculture. Keywords: Machine learning, Deep Learning, Computer Vision, MobileNet SSD

Title: AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction

  • Authors: Dubing Chen, Wencheng Han, Jin Fang, Jianbing Shen
  • Subjects: cs.CV, cs.RO
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: In this technical report, we present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ Dataset Challenge at CVPR 2024. Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling. Initially, we independently train the occupancy model, followed by flow prediction using sequential frame integration. Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth. Experimental results on the nuScenes dataset demonstrate significant improvements in accuracy and robustness, showcasing the effectiveness of our approach in real-world scenarios. Our single model based on Swin-Base ranks second on the public leaderboard, validating the potential of our method in advancing autonomous car perception systems.

Title: Needle in the Haystack for Memory Based Large Language Models

  • Authors: Subhajit Chaudhury, Soham Dan, Payel Das, Georgios Kollias, Elliot Nelson
  • Subjects: cs.CL, cs.AI, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Needle in the Haystack for Memory Based Large Language Models(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: In this paper, we demonstrate the benefits of using a memory-augmented Large Language Model (LLM) architecture in improving the recall of facts from a potentially long context. As a case study we test LARIMAR, a recently proposed LLM architecture which augments an LLM decoder with an external associative memory, on several long-context recall tasks, including passkey and needle-in-the-haystack tests. We demonstrate that the external memory can be adapted at test time to handle contexts much longer than those seen during training, while keeping readouts from the memory recognizable to the trained decoder and without increasing GPU memory footprint. Compared to alternative architectures for long-context recall tasks with models of a comparable parameter count, LARIMAR is able to maintain strong performance without any task-specific training.

Title: TimeToM: Temporal Space is the Key to Unlocking the Door of Large Language Models' Theory-of-Mind

  • Authors: Guiyang Hou, Wenqi Zhang, Yongliang Shen, Linjuan Wu, Weiming Lu
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] TimeToM: Temporal Space is the Key to Unlocking the Door of Large Language Models' Theory-of-Mind(https://arxiv.org/abs/)
  • Keywords: robust, large language model
  • Abstract: Theory of Mind (ToM), the cognitive ability to reason about mental states of ourselves and others, is the foundation of social interaction. Although ToM comes naturally to humans, it poses a significant challenge to even the most advanced Large Language Models (LLMs). Due to the complex logical chains in ToM reasoning, especially in higher-order ToM questions, simply utilizing reasoning methods like Chain of Thought (CoT) will not improve the ToM capabilities of LLMs. We present TimeToM, which constructs a temporal space and uses it as the foundation to improve the ToM capabilities of LLMs in multiple scenarios. Specifically, within the temporal space, we construct a Temporal Belief State Chain (TBSC) for each character and, inspired by the cognition perspective of the social world model, we divide TBSC into self-world beliefs and social world beliefs, aligning with first-order ToM (first-order beliefs) and higher-order ToM (higher-order beliefs) questions, respectively. Moreover, we design a novel tool-belief solver that, by considering belief communication between characters in temporal space, can transform a character's higher-order beliefs into another character's first-order beliefs under the belief communication period. Experimental results indicate that TimeToM can dramatically improve the reasoning performance of LLMs on ToM questions while taking a big step towards coherent and robust ToM reasoning.

Title: Contractual Reinforcement Learning: Pulling Arms with Invisible Hands

  • Authors: Jibang Wu, Siyu Chen, Mengdi Wang, Huazheng Wang, Haifeng Xu
  • Subjects: cs.LG, cs.AI, cs.GT, econ.TH
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Contractual Reinforcement Learning: Pulling Arms with Invisible Hands(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: The agency problem emerges in today's large scale machine learning tasks, where the learners are unable to direct content creation or enforce data collection. In this work, we propose a theoretical framework for aligning economic interests of different stakeholders in the online learning problems through contract design. The problem, termed contractual reinforcement learning, naturally arises from the classic model of Markov decision processes, where a learning principal seeks to optimally influence the agent's action policy for their common interests through a set of payment rules contingent on the realization of the next state. For the planning problem, we design an efficient dynamic programming algorithm to determine the optimal contracts against the far-sighted agent. For the learning problem, we introduce a generic design of no-regret learning algorithms to untangle the challenges from robust design of contracts to the balance of exploration and exploitation, reducing the complexity analysis to the construction of efficient search algorithms. For several natural classes of problems, we design tailored search algorithms that provably achieve $\tilde{O}(\sqrt{T})$ regret. We also present an algorithm with $\tilde{O}(T^{2/3})$ regret for the general problem that improves the existing analysis in online contract design with mild technical assumptions.

Title: Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement

  • Authors: Zisu Huang, Xiaohua Wang, Feiran Zhang, Zhibo Xu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement(https://arxiv.org/abs/)
  • Keywords: attack, robust, large language model
  • Abstract: The capacity of large language models (LLMs) to generate honest, harmless, and helpful responses heavily relies on the quality of user prompts. However, these prompts often tend to be brief and vague, thereby significantly limiting the full potential of LLMs. Moreover, harmful prompts can be meticulously crafted and manipulated by adversaries to jailbreak LLMs, inducing them to produce potentially toxic content. To enhance the capabilities of LLMs while maintaining strong robustness against harmful jailbreak inputs, this study proposes a transferable and pluggable framework that refines user prompts before they are input into LLMs. This strategy improves the quality of the queries, empowering LLMs to generate more truthful, benign and useful responses. Specifically, a lightweight query refinement model is introduced and trained using a specially designed reinforcement learning approach that incorporates multiple objectives to enhance particular capabilities of LLMs. Extensive experiments demonstrate that the refinement model not only improves the quality of responses but also strengthens their robustness against jailbreak attacks. Code is available at: this https URL .

Title: Retrieval-augmented generation in multilingual settings

  • Authors: Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Retrieval-augmented generation in multilingual settings(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e. with user queries and the datastore in 13 languages, and investigate which components, and which adjustments to them, are needed to build a well-performing mRAG pipeline that can be used as a strong baseline in future work. Our findings highlight that despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for the multilingual setting to account for variations in the spelling of named entities. The main limitations to be addressed in future work include frequent code-switching in non-Latin alphabet languages, occasional fluency errors, wrong reading of the provided documents, and irrelevant retrieval. We release the code for the resulting mRAG baseline pipeline at this https URL.

Title: DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

  • Authors: Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. Experiments demonstrate that DogeRM enhances performance across different benchmarks, and a detailed analysis of the effects of model merging shows its potential for facilitating model alignment.
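
The abstract does not say how the merge is performed. As a rough illustration only, one common form of model merging is linear interpolation of matching parameters. The sketch below assumes that form; the function name `merge_state_dicts`, the weight `lam`, and the toy parameter names are hypothetical and are not taken from the paper.

```python
import numpy as np

def merge_state_dicts(general, domain, lam=0.5):
    """Linearly interpolate parameters shared by two models.

    general, domain: dicts mapping parameter names to numpy arrays.
    lam: weight on the domain model (assumed hyperparameter).
    Parameters missing from `domain` (or with a different shape) are
    copied from `general` unchanged, e.g. a reward head.
    """
    merged = {}
    for name, g in general.items():
        d = domain.get(name)
        if d is not None and d.shape == g.shape:
            merged[name] = (1.0 - lam) * g + lam * d
        else:
            merged[name] = g.copy()
    return merged

# Toy example: two-element "backbone" weight plus a reward head
# that only the general reward model has.
general_rm = {"backbone.w": np.ones(2), "reward_head.w": np.array([1.0])}
domain_lm = {"backbone.w": 3.0 * np.ones(2)}
merged_rm = merge_state_dicts(general_rm, domain_lm, lam=0.25)
```

With `lam=0.25`, each shared weight moves a quarter of the way from the general model toward the domain model, while the reward head is kept as is.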

Title: Survey and Analysis of IoT Operating Systems: A Comparative Study on the Effectiveness and Acquisition Time of Open Source Digital Forensics Tools

  • Authors: Jeffrey Fairbanks, Md Mashrur Arifin, Sadia Afreen, Alex Curtis
  • Subjects: cs.CR
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Survey and Analysis of IoT Operating Systems: A Comparative Study on the Effectiveness and Acquisition Time of Open Source Digital Forensics Tools(https://arxiv.org/abs/)
  • Keywords: security
  • Abstract: The main goal of this research project is to evaluate the effectiveness and speed of open-source forensic tools for collecting digital evidence from various Internet-of-Things (IoT) devices. To accomplish this goal, the project will create and configure many IoT environments across popular IoT operating systems and run common forensics tasks. To validate these forensic analysis operations, a variety of open-source forensic tools covering four standard digital forensics tasks will be used. These tasks will be run on each sample IoT operating system, and the time spent on each will be carefully recorded and examined, allowing for a thorough evaluation of the effectiveness and speed of performing forensics on each type of IoT device. The research also aims to offer recommendations to IoT security experts and digital forensic practitioners about the most efficient open-source tools for forensic investigations with IoT devices while maintaining the integrity of gathered evidence and identifying challenges that exist with these new device types. The results will be shared widely and well-documented in order to provide significant contributions to IoT device makers and the field of digital forensics.

Title: LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

  • Authors: Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, Sara Hooker
  • Subjects: cs.CL, cs.AI, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to date of how the source of synthetic data shapes models' internal biases, calibration and generations' textual attributes and preferences. We find that models are surprisingly sensitive towards certain attributes even when the synthetic data prompts appear "neutral", which invites the question of whether this sensitivity can be exploited for good. Our findings invite the question: can we explicitly steer the models towards the properties we want at test time by exploiting the data generation process? This would have historically been considered infeasible due to the cost of collecting data with a specific characteristic or objective in mind. However, improvement in the quality of synthetic data, as well as a shift towards general-purpose models designed to follow a diverse range of instructions, means this question is timely. We propose active inheritance as a term to describe intentionally constraining synthetic data according to a non-differentiable objective. We demonstrate how active inheritance can steer the generation profiles of models towards desirable non-differentiable attributes, e.g. high lexical diversity or low toxicity.

Title: Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

  • Authors: Siwei Li, Yifan Yang, Yifei Shen, Fangyun Wei, Zongqing Lu, Lili Qiu, Yuqing Yang
  • Subjects: cs.CL, cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning(https://arxiv.org/abs/)
  • Keywords: robust
  • Abstract: Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA's expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model's ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. Extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be released at this https URL very soon.

Title: RegMix: Data Mixture as Regression for Language Model Pre-training

  • Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] RegMix: Data Mixture as Regression for Language Model Pre-training(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance given their respective mixtures. With the fitted regression model, we simulate the top-ranked mixture and use it to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens of different mixtures to fit the regression model and find the optimal mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Further, our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi, while utilizing only 10% of the compute budget. Our experiments also show that (1) Data mixtures significantly impact performance with single-task performance variations of up to 14.6%; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws, and our approach captures the complexity by considering all domains together. Our code is available at this https URL.
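
The fit-then-search loop described in this abstract can be sketched as follows. This is a minimal illustration on synthetic data, assuming a plain linear regressor and random Dirichlet candidates; the function names, the three-domain setup, and the "true" per-domain losses are made up for the example and are not from the paper.

```python
import numpy as np

def fit_mixture_regressor(mixtures, losses):
    """Least-squares fit of loss ~ mixtures @ w.

    No intercept is needed: mixture weights sum to 1, so any affine
    function of the mixture is already linear on the simplex.
    """
    coef, *_ = np.linalg.lstsq(mixtures, losses, rcond=None)
    return coef

def best_mixture(coef, n_candidates=10000, seed=0):
    """Sample candidate mixtures and return the one with lowest predicted loss."""
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(len(coef)), size=n_candidates)
    predicted = candidates @ coef
    return candidates[np.argmin(predicted)]

# Synthetic stand-in for "train many small models on different mixtures":
# each row is a mixture over 3 domains, each loss is a noiseless linear
# function of it (purely illustrative numbers).
rng = np.random.default_rng(42)
true_w = np.array([3.0, 2.0, 4.0])            # hypothetical per-domain loss
small_run_mixtures = rng.dirichlet(np.ones(3), size=64)
small_run_losses = small_run_mixtures @ true_w

coef = fit_mixture_regressor(small_run_mixtures, small_run_losses)
chosen = best_mixture(coef)
```

In this toy setup the search concentrates weight on the second domain, which has the lowest synthetic loss. A real pipeline would replace the synthetic losses with measured validation losses from the small proxy runs.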

Title: Self-Cognition in Large Language Models: An Exploratory Study

  • Authors: Dongping Chen, Jiawen Shi, Yao Wan, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
  • Subjects: cs.CL, cs.AI
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Self-Cognition in Large Language Models: An Exploratory Study(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: While Large Language Models (LLMs) have achieved remarkable success across various applications, they also raise concerns regarding self-cognition. In this paper, we perform a pioneering study to explore self-cognition in LLMs. Specifically, we first construct a pool of self-cognition instruction prompts to evaluate where an LLM exhibits self-cognition and four well-designed principles to quantify LLMs' self-cognition. Our study reveals that 4 of the 48 models on Chatbot Arena--specifically Command R, Claude3-Opus, Llama-3-70b-Instruct, and Reka-core--demonstrate some level of detectable self-cognition. We observe a positive correlation between model size, training data quality, and self-cognition level. Additionally, we also explore the utility and trustworthiness of LLM in the self-cognition state, revealing that the self-cognition state enhances some specific tasks such as creative writing and exaggeration. We believe that our work can serve as an inspiration for further research to study the self-cognition in LLMs.

Title: MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

  • Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
  • Subjects: cs.CV, cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

Title: E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness

  • Authors: Robin Courant, Nicolas Dufour, Xi Wang, Marc Christie, Vicky Kalogeiton
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness(https://arxiv.org/abs/)
  • Keywords: robust, diffusion
  • Abstract: Stories and emotions in movies emerge through the effect of well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera trajectories remains a complex iterative process, even for skilful artists. To tackle this, in this paper, we propose a dataset called the Exceptional Trajectories (E.T.) with camera trajectories along with character information and textual captions encompassing descriptions of both camera and character. To our knowledge, this is the first dataset of its kind. To show the potential applications of the E.T. dataset, we propose a diffusion-based approach, named DIRECTOR, which generates complex camera trajectories from textual captions that describe the relation and synchronisation between the camera and characters. To ensure robust and accurate evaluations, we train on the E.T. dataset CLaTr, a Contrastive Language-Trajectory embedding for evaluation metrics. We posit that our proposed dataset and method significantly advance the democratization of cinematography, making it more accessible to common users.

Title: DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

  • Authors: Chang-Han Yeh, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Ting-Hsuan Chen, Yu-Lun Liu
  • Subjects: cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models(https://arxiv.org/abs/)
  • Keywords: diffusion
  • Abstract: This paper introduces a method for zero-shot video restoration using pre-trained image restoration diffusion models. Traditional video restoration methods often need retraining for different settings and struggle with limited generalization across various degradation types and datasets. Our approach uses a hierarchical token merging strategy for keyframes and local frames, combined with a hybrid correspondence mechanism that blends optical flow and feature-based nearest neighbor matching (latent merging). We show that our method not only achieves top performance in zero-shot video restoration but also significantly surpasses trained models in generalization across diverse datasets and extreme degradations (8$\times$ super-resolution and high-standard deviation video denoising). We present evidence through quantitative metrics and visual comparisons on various challenging datasets. Additionally, our technique works with any 2D restoration diffusion model, offering a versatile and powerful tool for video enhancement tasks without extensive retraining. This research leads to more efficient and widely applicable video restoration technologies, supporting advancements in fields that require high-quality video output. See our project page for video results at this https URL.

Title: Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

  • Authors: Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, Yang Song
  • Subjects: cs.LG, cs.AI, cs.CV
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing(https://arxiv.org/abs/)
  • Keywords: diffusion
  • Abstract: Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems. For example, we achieve a PSNR of 30.72dB on the FFHQ 256 dataset for phase retrieval, which is an improvement of 9.12dB compared to existing methods.

Title: Empowering 3D Visual Grounding with Reasoning Capabilities

  • Authors: Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
  • Subjects: cs.CV, cs.AI, cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Empowering 3D Visual Grounding with Reasoning Capabilities(https://arxiv.org/abs/)
  • Keywords: large language model
  • Abstract: Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs from five reasoning types that require the synergy of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back to the enhanced geometry and fine-grained details from the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our proposed approach.

Title: Scalable Nested Optimization for Deep Learning

  • Authors: Jonathan Lorraine
  • Subjects: cs.LG, cs.AI, cs.NE, math.OC, stat.ML
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] Scalable Nested Optimization for Deep Learning(https://arxiv.org/abs/)
  • Keywords: generative
  • Abstract: Gradient-based optimization has been critical to the success of machine learning, updating a single set of parameters to minimize a single loss. A growing number of applications rely on a generalization of this, where we have a bilevel or nested optimization in which subsets of parameters update on different objectives nested inside each other. We focus on motivating examples of hyperparameter optimization and generative adversarial networks. However, naively applying classical methods often fails when we look at solving these nested problems on a large scale. In this thesis, we build tools for nested optimization that scale to deep learning setups.

Title: KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

  • Authors: Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu
  • Subjects: cs.CL
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches(https://arxiv.org/abs/)
  • Keywords: transformer, large language model
  • Abstract: Long context capability is a crucial competency for large language models (LLMs) as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many more tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs. Multiple schools of efficiency-driven approaches, such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures, have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this gap by providing a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks. Our work reveals numerous previously unknown phenomena and offers insights -- as well as a friendly workbench -- for the future development of long context-capable LLMs. The source code will be available at this https URL

Title: On the Abuse and Detection of Polyglot Files

  • Authors: Luke Koch, Sean Oesch, Amul Chaulagain, Jared Dixon, Matthew Dixon, Mike Huettal, Amir Sadovnik, Cory Watson, Brian Weber, Jacob Hartman, Richard Patulski
  • Subjects: cs.CR, cs.LG
  • Abstract URL: https://arxiv.org/abs/
  • Pdf URL: https://arxiv.org/pdf/
  • Copy Paste: [[]] On the Abuse and Detection of Polyglot Files(https://arxiv.org/abs/)
  • Keywords: attack, robust
  • Abstract: A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild -- the first of its kind -- we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20% for polyglot detection and 99.47% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
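
To make the notion of a polyglot concrete, the sketch below flags byte strings that contain the magic bytes of more than one format. It is a deliberately naive heuristic for illustration only, not PolyConv or ImSan: a plain substring scan produces both false positives and false negatives, which is consistent with the abstract's point that simple signature checks are unreliable. The function names and the small signature table are assumptions made for the example.

```python
# Magic-byte signatures for a few common formats (well-known constants).
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "pdf": b"%PDF-",
    "zip": b"PK\x03\x04",   # ZIP local file header
    "gif": b"GIF89a",
}

def formats_present(data: bytes) -> set:
    """Return the names of all formats whose signature occurs anywhere in data."""
    return {name for name, magic in SIGNATURES.items() if magic in data}

def looks_like_polyglot(data: bytes) -> bool:
    """Naive check: signatures of two or more formats are present."""
    return len(formats_present(data)) >= 2

# Toy sample: a PNG header followed later by a ZIP local file header,
# mimicking an archive appended to an image.
png_zip_sample = SIGNATURES["png"] + b"\x00" * 16 + SIGNATURES["zip"] + b"payload"
plain_pdf_sample = b"%PDF-1.7\n% minimal stub\n"
```

Here `looks_like_polyglot(png_zip_sample)` is true and `looks_like_polyglot(plain_pdf_sample)` is false. A production detector would need to parse each format rather than rely on byte signatures alone.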