2024-12-04

Title: Explainable Artificial Intelligence for Medical Applications: A Review

Authors: Qiyang Sun, Alican Akman, Björn W. Schuller
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.01829
Pdf URL: https://arxiv.org/pdf/2412.01829
Copy Paste: [[2412.01829]] Explainable Artificial Intelligence for Medical Applications: A Review(https://arxiv.org/abs/2412.01829)
Keywords: robust
Abstract: The continuous development of artificial intelligence (AI) theory has propelled this field to unprecedented heights, owing to the relentless efforts of scholars and researchers. In the medical realm, AI plays a pivotal role, leveraging robust machine learning (ML) algorithms. AI technology in medical imaging aids physicians in X-ray, computed tomography (CT), and magnetic resonance imaging (MRI) diagnoses, conducts pattern recognition and disease prediction based on acoustic data, delivers prognoses on disease types and developmental trends for patients, and employs intelligent health management wearable devices with human-computer interaction technology, to name but a few. While these well-established applications have significantly assisted in medical diagnoses, clinical decision-making, and management, collaboration between the medical and AI sectors faces an urgent challenge: how can the reliability of decision-making be substantiated? The underlying issue stems from the conflict between the demand for accountability and result transparency in medical scenarios and the black-box model traits of AI. This article reviews recent research grounded in explainable artificial intelligence (XAI), with an emphasis on medical practices within the visual, audio, and multimodal perspectives. We endeavour to categorise and synthesise these practices, aiming to provide support and guidance for future researchers and healthcare professionals.

Title: Data Augmentation through Background Removal for Apple Leaf Disease Classification Using the MobileNetV2 Model

Authors: Youcef Ferdi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01854
Pdf URL: https://arxiv.org/pdf/2412.01854
Copy Paste: [[2412.01854]] Data Augmentation through Background Removal for Apple Leaf Disease Classification Using the MobileNetV2 Model(https://arxiv.org/abs/2412.01854)
Keywords: robust
Abstract: The advances in computer vision made possible by deep learning technology are increasingly being used in precision agriculture to automate the detection and classification of plant diseases. Symptoms of plant diseases are often seen on their leaves. The leaf images in existing datasets have been collected either under controlled conditions or in the field. The majority of previous studies have focused on identifying leaf diseases using images captured in controlled laboratory settings, often achieving high performance. However, methods aimed at detecting and classifying leaf diseases in field images have generally exhibited lower performance. The objective of this study is to evaluate the impact of a data augmentation approach that involves removing complex backgrounds from leaf images on the classification performance of apple leaf diseases in images captured under real world conditions. To achieve this objective, the lightweight pre-trained MobileNetV2 deep learning model was fine-tuned and subsequently used to evaluate the impact of expanding the training dataset with background-removed images on classification performance. Experimental results show that this augmentation strategy enhances classification accuracy. Specifically, using the Adam optimizer, the proposed method achieved a classification accuracy of 98.71% on the Plant Pathology database, representing an approximately 3% improvement and outperforming state-of-the-art methods. This demonstrates the effectiveness of background removal as a data augmentation technique for improving the robustness of disease classification models in real-world conditions.

Title: Composition of Experts: A Modular Compound AI System Leveraging Large Language Models

Authors: Swayambhoo Jain, Ravi Raju, Bo Li, Zoltan Csaki, Jonathan Li, Kaizhao Liang, Guoyao Feng, Urmish Thakkar, Anand Sampat, Raghu Prabhakar, Sumati Jairath
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01868
Pdf URL: https://arxiv.org/pdf/2412.01868
Copy Paste: [[2412.01868]] Composition of Experts: A Modular Compound AI System Leveraging Large Language Models(https://arxiv.org/abs/2412.01868)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved remarkable advancements, but their monolithic nature presents challenges in terms of scalability, cost, and customization. This paper introduces the Composition of Experts (CoE), a modular compound AI system leveraging multiple expert LLMs. CoE leverages a router to dynamically select the most appropriate expert for a given input, enabling efficient utilization of resources and improved performance. We formulate the general problem of training a CoE and discuss inherent complexities associated with it. We propose a two-step routing approach to address these complexities that first uses a router to classify the input into distinct categories followed by a category-to-expert mapping to obtain desired experts. CoE offers a flexible and cost-effective solution to build compound AI systems. Our empirical evaluation demonstrates the effectiveness of CoE in achieving superior performance with reduced computational overhead. Given that CoE comprises many expert LLMs, it has unique system requirements for cost-effective serving. We present an efficient implementation of CoE leveraging the SambaNova SN40L RDU's unique three-tiered memory architecture. CoEs obtained using open-weight LLMs Qwen/Qwen2-7B-Instruct, google/gemma-2-9b-it, google/gemma-2-27b-it, meta-llama/Llama-3.1-70B-Instruct and Qwen/Qwen2-72B-Instruct achieve a score of $59.4$ with merely $31$ billion average active parameters on Arena-Hard and a score of $9.06$ with $54$ billion average active parameters on MT-Bench.
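Note: the two-step routing described in the abstract above (classify the input into a category, then map the category to an expert) can be sketched roughly as follows. This is only an illustration under stated assumptions: the category names, the keyword-based router, and the assignment of models to categories are hypothetical and are not taken from the paper; only the model names come from the abstract.

```python
from typing import Callable, Dict

# Hypothetical category-to-expert mapping. The model names appear in the
# abstract; which model serves which category is an arbitrary assumption.
CATEGORY_TO_EXPERT: Dict[str, str] = {
    "code": "Qwen/Qwen2-72B-Instruct",
    "math": "meta-llama/Llama-3.1-70B-Instruct",
    "general": "google/gemma-2-9b-it",
}


def keyword_router(prompt: str) -> str:
    """Toy stand-in for a learned router: classify a prompt into a category."""
    text = prompt.lower()
    if "def " in text or "compile" in text:
        return "code"
    if any(token in text for token in ("integral", "prove", "equation")):
        return "math"
    return "general"


def select_expert(
    prompt: str,
    router: Callable[[str], str] = keyword_router,
    mapping: Dict[str, str] = CATEGORY_TO_EXPERT,
) -> str:
    """Two-step routing: step 1 picks a category, step 2 looks up its expert."""
    category = router(prompt)
    return mapping[category]
```

Keeping the mapping separate from the router means an expert can be swapped by editing the table alone, without retraining the classifier; whether the paper exploits this in the same way is not stated in the abstract.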

Title: Planar Gaussian Splatting

Authors: Farhad G. Zanjani, Hong Cai, Hanno Ackermann, Leila Mirvakhabova, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01931
Pdf URL: https://arxiv.org/pdf/2412.01931
Copy Paste: [[2412.01931]] Planar Gaussian Splatting(https://arxiv.org/abs/2412.01931)
Keywords: segmentation
Abstract: This paper presents Planar Gaussian Splatting (PGS), a novel neural rendering approach to learn the 3D geometry and parse the 3D planes of a scene, directly from multiple RGB images. PGS leverages Gaussian primitives to model the scene and employs a hierarchical Gaussian mixture approach to group them. Similar Gaussians are progressively merged probabilistically in the tree-structured Gaussian mixtures to identify distinct 3D plane instances and form the overall 3D scene geometry. In order to enable the grouping, the Gaussian primitives contain additional parameters, such as plane descriptors derived by lifting 2D masks from a general 2D segmentation model and surface normals. Experiments show that the proposed PGS achieves state-of-the-art performance in 3D planar reconstruction without requiring either 3D plane labels or depth supervision. In contrast to existing supervised methods that have limited generalizability and struggle under domain shift, PGS maintains its performance across datasets thanks to its neural rendering and scene-specific optimization mechanism, while also being significantly faster than existing optimization-based approaches.

Title: Global Average Feature Augmentation for Robust Semantic Segmentation with Transformers

Authors: Alberto Gonzalo Rodriguez Salgado, Maying Schen, Philipp Harzig, Peter Mayer, Jose M. Alvarez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01941
Pdf URL: https://arxiv.org/pdf/2412.01941
Copy Paste: [[2412.01941]] Global Average Feature Augmentation for Robust Semantic Segmentation with Transformers(https://arxiv.org/abs/2412.01941)
Keywords: robust, transformer, segmentation
Abstract: Robustness to out-of-distribution data is crucial for deploying modern neural networks. Recently, Vision Transformers, such as SegFormer for semantic segmentation, have shown impressive robustness to visual corruptions like blur or noise affecting the acquisition device. In this paper, we propose Channel Wise Feature Augmentation (CWFA), a simple yet efficient feature augmentation technique to improve the robustness of Vision Transformers for semantic segmentation. CWFA applies a globally estimated perturbation per encoder with minimal compute overhead during training. Extensive evaluations on Cityscapes and ADE20K, with three state-of-the-art Vision Transformer architectures: SegFormer, Swin Transformer, and Twins, demonstrate that CWFA-enhanced models significantly improve robustness without affecting clean data performance. For instance, on Cityscapes, a CWFA-augmented SegFormer-B1 model yields up to 27.7% mIoU robustness gain on impulse noise compared to the non-augmented SegFormer-B1. Furthermore, CWFA-augmented SegFormer-B5 achieves a new state-of-the-art 84.3% retention rate, a 0.7% improvement over the recently published FAN+STL.

Title: Enhancing Crop Segmentation in Satellite Image Time Series with Transformer Networks

Authors: Ignazio Gallo, Mattia Gatti, Nicola Landro, Christian Loschiavo, Mirco Boschetti, Riccardo La Grassa
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.01944
Pdf URL: https://arxiv.org/pdf/2412.01944
Copy Paste: [[2412.01944]] Enhancing Crop Segmentation in Satellite Image Time Series with Transformer Networks(https://arxiv.org/abs/2412.01944)
Keywords: transformer, segmentation
Abstract: Recent studies have shown that Convolutional Neural Networks (CNNs) achieve impressive results in crop segmentation of Satellite Image Time Series (SITS). However, the emergence of transformer networks in various vision tasks raises the question of whether they can outperform CNNs in this task as well. This paper presents a revised version of the Transformer-based Swin UNETR model, specifically adapted for crop segmentation of SITS. The proposed model demonstrates significant advancements, achieving a validation accuracy of 96.14% and a test accuracy of 95.26% on the Munich dataset, surpassing the previous best results of 93.55% for validation and 92.94% for the test. Additionally, the model's performance on the Lombardia dataset is comparable to UNet3D and superior to FPN and DeepLabV3. Experiments of this study indicate that the model will likely achieve comparable or superior accuracy to CNNs while requiring significantly less training time. These findings highlight the potential of transformer-based architectures for crop segmentation in SITS, opening new avenues for remote sensing applications.

Title: A Novel Generative Multi-Task Representation Learning Approach for Predicting Postoperative Complications in Cardiac Surgery Patients

Authors: Junbo Shen, Bing Xue, Thomas Kannampallil, Chenyang Lu, Joanna Abraham
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.01950
Pdf URL: https://arxiv.org/pdf/2412.01950
Copy Paste: [[2412.01950]] A Novel Generative Multi-Task Representation Learning Approach for Predicting Postoperative Complications in Cardiac Surgery Patients(https://arxiv.org/abs/2412.01950)
Keywords: interpretability, generative
Abstract: Early detection of surgical complications allows for timely therapy and proactive risk mitigation. Machine learning (ML) can be leveraged to identify and predict patient risks for postoperative complications. We developed and validated the effectiveness of predicting postoperative complications using a novel surgical Variational Autoencoder (surgVAE) that uncovers intrinsic patterns via cross-task and cross-cohort representation learning. This retrospective cohort study used data from the electronic health records of adult surgical patients over four years (2018-2021). Six key postoperative complications for cardiac surgery were assessed: acute kidney injury, atrial fibrillation, cardiac arrest, deep vein thrombosis or pulmonary embolism, blood transfusion, and other intraoperative cardiac events. We compared prediction performances of surgVAE against widely-used ML models and advanced representation learning and generative models under 5-fold cross-validation. 89,246 surgeries (49% male, median (IQR) age: 57 (45-69)) were included, with 6,502 in the targeted cardiac surgery cohort (61% male, median (IQR) age: 60 (53-70)). surgVAE demonstrated superior performance over existing ML solutions across all postoperative complications of cardiac surgery patients, achieving macro-averaged AUPRC of 0.409 and macro-averaged AUROC of 0.831, which were 3.4% and 3.7% higher, respectively, than the best alternative method (by AUPRC scores). Model interpretation using Integrated Gradients highlighted key risk factors based on preoperative variable importance. surgVAE showed excellent discriminatory performance for predicting postoperative complications and addressing the challenges of data complexity, small cohort sizes, and low-frequency positive events. surgVAE enables data-driven predictions of patient risks and prognosis while enhancing the interpretability of patient risk profiles.

Title: The use of large language models to enhance cancer clinical trial educational materials

Authors: Mingye Gao, Aman Varshney, Shan Chen, Vikram Goddla, Jack Gallifant, Patrick Doyle, Claire Novack, Maeve Dillon-Martin, Teresia Perkins, Xinrong Correia, Erik Duhaime, Howard Isenstein, Elad Sharon, Lisa Soleymani Lehmann, David Kozono, Brian Anthony, Dmitriy Dligach, Danielle S. Bitterman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01955
Pdf URL: https://arxiv.org/pdf/2412.01955
Copy Paste: [[2412.01955]] The use of large language models to enhance cancer clinical trial educational materials(https://arxiv.org/abs/2412.01955)
Keywords: large language model
Abstract: Cancer clinical trials often face challenges in recruitment and engagement due to a lack of participant-facing informational and educational resources. This study investigated the potential of Large Language Models (LLMs), specifically GPT4, in generating patient-friendly educational content from clinical trial informed consent forms. Using data from this http URL, we employed zero-shot learning for creating trial summaries and one-shot learning for developing multiple-choice questions, evaluating their effectiveness through patient surveys and crowdsourced annotation. Results showed that GPT4-generated summaries were both readable and comprehensive, and may improve patients' understanding and interest in clinical trials. The multiple-choice questions demonstrated high accuracy and agreement with crowdsourced annotators. For both resource types, hallucinations were identified that require ongoing human oversight. The findings demonstrate the potential of LLMs "out-of-the-box" to support the generation of clinical trial education materials with minimal trial-specific engineering, but implementation with a human-in-the-loop is still needed to avoid misinformation risks.

Title: Enhancing Deep Learning Model Robustness through Metamorphic Re-Training

Authors: Said Togru, Youssef Sameh Mostafa, Karim Lotfy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01958
Pdf URL: https://arxiv.org/pdf/2412.01958
Copy Paste: [[2412.01958]] Enhancing Deep Learning Model Robustness through Metamorphic Re-Training(https://arxiv.org/abs/2412.01958)
Keywords: robust
Abstract: This paper evaluates the use of metamorphic relations to enhance the robustness and real-world performance of machine learning models. We propose a Metamorphic Retraining Framework, which applies metamorphic relations to data and utilizes semi-supervised learning algorithms in an iterative and adaptive multi-cycle process. The framework integrates multiple semi-supervised retraining algorithms, including FixMatch, FlexMatch, MixMatch, and FullMatch, to automate the retraining, evaluation, and testing of models with specified configurations. To assess the effectiveness of this approach, we conducted experiments on CIFAR-10, CIFAR-100, and MNIST datasets using a variety of image processing models, both pretrained and non-pretrained. Our results demonstrate the potential of metamorphic retraining to significantly improve model robustness: on average, each model gained an additional flat 17 percent in our robustness metric.

Title: FGATT: A Robust Framework for Wireless Data Imputation Using Fuzzy Graph Attention Networks and Transformer Encoders

Authors: Jinming Xing, Ruilin Xing, Yan Sun
Subjects: cs.LG, cs.IR, cs.NE
Abstract URL: https://arxiv.org/abs/2412.01979
Pdf URL: https://arxiv.org/pdf/2412.01979
Copy Paste: [[2412.01979]] FGATT: A Robust Framework for Wireless Data Imputation Using Fuzzy Graph Attention Networks and Transformer Encoders(https://arxiv.org/abs/2412.01979)
Keywords: robust, transformer
Abstract: Missing data is a pervasive challenge in wireless networks and many other domains, often compromising the performance of machine learning and deep learning models. To address this, we propose a novel framework, FGATT, that combines the Fuzzy Graph Attention Network (FGAT) with the Transformer encoder to perform robust and accurate data imputation. FGAT leverages fuzzy rough sets and graph attention mechanisms to capture spatial dependencies dynamically, even in scenarios where predefined spatial information is unavailable. The Transformer encoder is employed to model temporal dependencies, utilizing its self-attention mechanism to focus on significant time-series patterns. A self-adaptive graph construction method is introduced to enable dynamic connectivity learning, ensuring the framework's applicability to a wide range of wireless datasets. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in imputation accuracy and robustness, particularly in scenarios with substantial missing data. The proposed model is well-suited for applications in wireless sensor networks and IoT environments, where data integrity is critical.

Title: Smart Parking with Pixel-Wise ROI Selection for Vehicle Detection Using YOLOv8, YOLOv9, YOLOv10, and YOLOv11

Authors: Gustavo P. C. P. da Luz, Gabriel Massuyoshi Sato, Luis Fernando Gomez Gonzalez, Juliana Freitag Borin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01983
Pdf URL: https://arxiv.org/pdf/2412.01983
Copy Paste: [[2412.01983]] Smart Parking with Pixel-Wise ROI Selection for Vehicle Detection Using YOLOv8, YOLOv9, YOLOv10, and YOLOv11(https://arxiv.org/abs/2412.01983)
Keywords: privacy
Abstract: The increasing urbanization and the growing number of vehicles in cities have underscored the need for efficient parking management systems. Traditional smart parking solutions often rely on sensors or cameras for occupancy detection, each with its limitations. Recent advancements in deep learning have introduced new YOLO models (YOLOv8, YOLOv9, YOLOv10, and YOLOv11), but these models have not been extensively evaluated in the context of smart parking systems, particularly when combined with Region of Interest (ROI) selection for object detection. Existing methods still rely on fixed polygonal ROI selections or simple pixel-based modifications, which limit flexibility and precision. This work introduces a novel approach that integrates Internet of Things, Edge Computing, and Deep Learning concepts, by using the latest YOLO models for vehicle detection. By exploring both edge and cloud computing, it was found that inference times on edge devices ranged from 1 to 92 seconds, depending on the hardware and model version. Additionally, a new pixel-wise post-processing ROI selection method is proposed for accurately identifying regions of interest to count vehicles in parking lot images. The proposed system achieved 99.68% balanced accuracy on a custom dataset of 3,484 images, offering a cost-effective smart parking solution that ensures precise vehicle detection while preserving data privacy.
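Note: the abstract above reports balanced accuracy. Assuming the usual definition (the unweighted mean of per-class recall, which the abstract itself does not spell out), a minimal sketch looks like this. The function name and the sample labels are illustrative only.

```python
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    recalls = []
    for cls in set(y_true):
        # Positions whose ground-truth label is this class.
        idx = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

Unlike plain accuracy, this metric weights each class equally, so a dataset dominated by one class cannot inflate the score.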

Title: ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

Authors: Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, Josef Sivic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01987
Pdf URL: https://arxiv.org/pdf/2412.01987
Copy Paste: [[2412.01987]] ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions(https://arxiv.org/abs/2412.01987)
Keywords: diffusion
Abstract: The goal of this work is to generate step-by-step visual instructions in the form of a sequence of images, given an input image that provides the scene context and the sequence of textual instructions. This is a challenging problem as it requires generating multi-step image sequences to achieve a complex goal while being grounded in a specific environment. Part of the challenge stems from the lack of large-scale training data for this problem. The contribution of this work is thus three-fold. First, we introduce an automatic approach for collecting large step-by-step visual instruction training data from instructional videos. We apply this approach to one million videos and create a large-scale, high-quality dataset of 0.6M sequences of image-text pairs. Second, we develop and train ShowHowTo, a video diffusion model capable of generating step-by-step visual instructions consistent with the provided input image. Third, we evaluate the generated image sequences across three dimensions of accuracy (step, scene, and task) and show our model achieves state-of-the-art results on all of them. Our code, dataset, and trained models are publicly available.

Title: Generalized EXTRA stochastic gradient Langevin dynamics

Authors: Mert Gurbuzbalaban, Mohammad Rafiqul Islam, Xiaoyu Wang, Lingjiong Zhu
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2412.01993
Pdf URL: https://arxiv.org/pdf/2412.01993
Copy Paste: [[2412.01993]] Generalized EXTRA stochastic gradient Langevin dynamics(https://arxiv.org/abs/2412.01993)
Keywords: privacy
Abstract: Langevin algorithms are popular Markov Chain Monte Carlo methods for Bayesian learning, particularly when the aim is to sample from the posterior distribution of a parametric model, given the input data and the prior distribution over the model parameters. Their stochastic versions such as stochastic gradient Langevin dynamics (SGLD) allow iterative learning based on randomly sampled mini-batches of large datasets and are scalable to large datasets. However, when data is decentralized across a network of agents subject to communication and privacy constraints, standard SGLD algorithms cannot be applied. Instead, we employ decentralized SGLD (DE-SGLD) algorithms, where Bayesian learning is performed collaboratively by a network of agents without sharing individual data. Nonetheless, existing DE-SGLD algorithms induce a bias at every agent that can negatively impact performance; this bias persists even when using full batches and is attributable to network effects. Motivated by the EXTRA algorithm and its generalizations for decentralized optimization, we propose the generalized EXTRA stochastic gradient Langevin dynamics, which eliminates this bias in the full-batch setting. Moreover, we show that, in the mini-batch setting, our algorithm provides performance bounds that significantly improve upon those of standard DE-SGLD algorithms in the literature. Our numerical results also demonstrate the efficiency of the proposed approach.
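Note: for orientation, here is a minimal sketch of plain single-agent SGLD, the baseline that the abstract above builds on. It is not the decentralized or generalized EXTRA variant proposed in the paper. The toy model (unit-variance Gaussian likelihood with a flat prior on the mean) and all parameter values are assumptions chosen for illustration.

```python
import math
import random
from typing import List, Sequence


def sgld_samples(
    data: Sequence[float],
    step: float,
    n_steps: int,
    batch_size: int,
    rng: random.Random,
) -> List[float]:
    """Sample the posterior of a Gaussian mean with standard SGLD.

    Potential: f(theta) = sum_i (theta - x_i)^2 / 2 (unit variance, flat prior).
    Update:    theta <- theta - step * grad_estimate + sqrt(2 * step) * noise.
    """
    n = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        batch = rng.sample(list(data), batch_size)
        # Unbiased mini-batch estimate of the full gradient of f.
        grad = (n / batch_size) * sum(theta - x for x in batch)
        theta = theta - step * grad + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        samples.append(theta)
    return samples
```

With a small enough step size, the chain's long-run average should land close to the mean of the data, which is the posterior mean for this toy model.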

Title: Unveiling Interpretability in Self-Supervised Speech Representations for Parkinson's Diagnosis

Authors: David Gimeno-Gómez, Catarina Botelho, Anna Pompili, Alberto Abad, Carlos-D. Martínez-Hinarejos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02006
Pdf URL: https://arxiv.org/pdf/2412.02006
Copy Paste: [[2412.02006]] Unveiling Interpretability in Self-Supervised Speech Representations for Parkinson's Diagnosis(https://arxiv.org/abs/2412.02006)
Keywords: robust, interpretability
Abstract: Recent works in pathological speech analysis have increasingly relied on powerful self-supervised speech representations, leading to promising results. However, the complex, black-box nature of these embeddings and the limited research on their interpretability significantly restrict their adoption for clinical diagnosis. To address this gap, we propose a novel, interpretable framework specifically designed to support Parkinson's Disease (PD) diagnosis. Through the design of simple yet effective cross-attention mechanisms for both embedding- and temporal-level analysis, the proposed framework offers interpretability from two distinct but complementary perspectives. Experimental findings across five well-established speech benchmarks for PD detection demonstrate the framework's capability to identify meaningful speech patterns within self-supervised representations for a wide range of assessment tasks. Fine-grained temporal analyses further underscore its potential to enhance the interpretability of deep-learning pathological speech models, paving the way for the development of more transparent, trustworthy, and clinically applicable computer-assisted diagnosis systems in this domain. Moreover, in terms of classification accuracy, our method achieves results competitive with state-of-the-art approaches, while also demonstrating robustness in cross-lingual scenarios when applied to spontaneous speech production.

Title: Explore Reinforced: Equilibrium Approximation with Reinforcement Learning

Authors: Ryan Yu, Mateusz Nowak, Qintong Xie, Michelle Yilin Feng, Peter Chin
Subjects: cs.LG, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2412.02016
Pdf URL: https://arxiv.org/pdf/2412.02016
Copy Paste: [[2412.02016]] Explore Reinforced: Equilibrium Approximation with Reinforcement Learning(https://arxiv.org/abs/2412.02016)
Keywords: security
Abstract: Current approximate Coarse Correlated Equilibria (CCE) algorithms struggle with equilibrium approximation for games in large stochastic environments but are theoretically guaranteed to converge to a strong solution concept. In contrast, modern Reinforcement Learning (RL) algorithms provide faster training yet yield weaker solutions. We introduce Exp3-IXrl - a blend of RL and a game-theoretic approach, separating the RL agent's action selection from the equilibrium computation while preserving the integrity of the learning process. We demonstrate that our algorithm expands the application of equilibrium approximation algorithms to new environments. Specifically, we show improved performance in a complex and adversarial cybersecurity network environment - the Cyber Operations Research Gym - and in classical multi-armed bandit settings.

Title: NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

Authors: Dar-Yen Chen, Hmrishav Bandyopadhyay, Kai Zou, Yi-Zhe Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02030
Pdf URL: https://arxiv.org/pdf/2412.02030
Copy Paste: [[2412.02030]] NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training(https://arxiv.org/abs/2412.02030)
Keywords: diffusion
Abstract: We introduce NitroFusion, a fundamentally different approach to single-step diffusion that achieves high-quality generation through a dynamic adversarial framework. While one-step methods offer dramatic speed advantages, they typically suffer from quality degradation compared to their multi-step counterparts. Just as a panel of art critics provides comprehensive feedback by specializing in different aspects like composition, color, and technique, our approach maintains a large pool of specialized discriminator heads that collectively guide the generation process. Each discriminator group develops expertise in specific quality aspects at different noise levels, providing diverse feedback that enables high-fidelity one-step generation. Our framework combines: (i) a dynamic discriminator pool with specialized discriminator groups to improve generation quality, (ii) strategic refresh mechanisms to prevent discriminator overfitting, and (iii) global-local discriminator heads for multi-scale quality assessment, and unconditional/conditional training for balanced generation. Additionally, our framework uniquely supports flexible deployment through bottom-up refinement, allowing users to dynamically choose between 1-4 denoising steps with the same model for direct quality-speed trade-offs. Through comprehensive experiments, we demonstrate that NitroFusion significantly outperforms existing single-step methods across multiple evaluation metrics, particularly excelling in preserving fine details and global consistency.

Title: Mutli-View 3D Reconstruction using Knowledge Distillation

Authors: Aditya Dutt, Ishikaa Lunawat, Manpreet Kaur
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02039
Pdf URL: https://arxiv.org/pdf/2412.02039
Copy Paste: [[2412.02039]] Mutli-View 3D Reconstruction using Knowledge Distillation(https://arxiv.org/abs/2412.02039)
Keywords: transformer
Abstract: Large Foundation Models like Dust3r can produce high quality outputs such as pointmaps, camera intrinsics, and depth estimation, given stereo-image pairs as input. However, the application of these outputs on tasks like Visual Localization requires a large amount of inference time and compute resources. To address these limitations, in this paper, we propose the use of a knowledge distillation pipeline, where we aim to build a student-teacher model with Dust3r as the teacher and explore multiple architectures of student models that are trained using the 3D reconstructed points output by Dust3r. Our goal is to build student models that can learn scene-specific representations and output 3D points with performance comparable to Dust3r's. The dataset we used to train our models is 12Scenes. We test two main architectures of models: a CNN-based architecture and a Vision Transformer-based architecture. For each architecture, we also compare the use of pre-trained models against models built from scratch. We qualitatively compare the reconstructed 3D points output by the student model against Dust3r's and discuss the various features learned by the student model. We also perform ablation studies on the models through hyperparameter tuning. Overall, we observe that the Vision Transformer presents the best performance visually and quantitatively.

Title: Predicting the Impact of Scope Changes on Project Cost and Schedule Using Machine Learning Techniques

Authors: Soheila Sadeghi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02041
Pdf URL: https://arxiv.org/pdf/2412.02041
Copy Paste: [[2412.02041]] Predicting the Impact of Scope Changes on Project Cost and Schedule Using Machine Learning Techniques(https://arxiv.org/abs/2412.02041)
Keywords: robust
Abstract: In the dynamic landscape of project management, scope changes are an inevitable reality that can significantly impact project performance. These changes, whether initiated by stakeholders, external factors, or internal project dynamics, can lead to cost overruns and schedule delays. Accurately predicting the consequences of these changes is crucial for effective project control and informed decision-making. This study aims to develop predictive models to estimate the impact of scope changes on project cost and schedule using machine learning techniques. The research utilizes a comprehensive dataset containing detailed information on project tasks, including the Work Breakdown Structure (WBS), task type, productivity rate, estimated cost, actual cost, duration, task dependencies, scope change magnitude, and scope change timing. Multiple machine learning models are developed and evaluated to predict the impact of scope changes on project cost and schedule. These models include Linear Regression, Decision Tree, Ridge Regression, Random Forest, Gradient Boosting, and XGBoost. The dataset is split into training and testing sets, and the models are trained using the preprocessed data. Model robustness and generalization are assessed using cross-validation techniques. To evaluate the performance of models, we use Mean Squared Error (MSE) and R². Residual plots are generated to assess the goodness of fit and identify any patterns or outliers. Hyperparameter tuning is performed to optimize the XGBoost model and improve its predictive accuracy. The study identifies the most influential project attributes in determining the magnitude of cost and schedule deviations caused by scope modifications. It is identified that productivity rate, scope change magnitude, task dependencies, estimated cost, actual cost, duration, and specific WBS elements are powerful predictors.
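Note: the two evaluation metrics named in the abstract above, Mean Squared Error and the coefficient of determination, can be written out in a few lines using their standard textbook definitions. The function names and sample values are illustrative and are not drawn from the paper.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(mean_squared_error(actual, predicted), r2_score(actual, predicted))
```

MSE is in squared units of the target, so it is hard to compare across cost and schedule models, whereas the coefficient of determination is unitless; that is a common reason to report both.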

Title: Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code

Authors: Gary A. McCully, John D. Hastings, Shengjie Xu
Subjects: cs.CR, cs.CL, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2412.02048
Pdf URL: https://arxiv.org/pdf/2412.02048
Copy Paste: [[2412.02048]] Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code(https://arxiv.org/abs/2412.02048)
Keywords: robust, transformer
Abstract: This study examines the impact of data snooping on neural networks for vulnerability detection in lifted code, building on previous research which used word2vec, and unidirectional and bidirectional transformer-based embeddings. The research specifically focuses on how model performance is affected when embedding models are trained on datasets that include samples also used for neural network training and validation. The results show that introducing data snooping did not significantly alter model performance, suggesting that data snooping had a minimal impact or that samples randomly dropped as part of the methodology contained hidden features critical to achieving optimal performance. In addition, the findings reinforce the conclusions of previous research, which found that models trained with GPT-2 embeddings consistently outperformed neural networks trained with other embeddings. The fact that this holds even when data snooping is introduced into the embedding model indicates GPT-2's robustness in representing complex code features, even under less-than-ideal conditions.

Title: Comparative Analysis of Multi-Agent Reinforcement Learning Policies for Crop Planning Decision Support

Authors: Anubha Mahajan, Shreya Hegde, Ethan Shay, Daniel Wu, Aviva Prins
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2412.02057
Pdf URL: https://arxiv.org/pdf/2412.02057
Copy Paste: [[2412.02057]] Comparative Analysis of Multi-Agent Reinforcement Learning Policies for Crop Planning Decision Support(https://arxiv.org/abs/2412.02057)
Keywords: fair
Abstract: In India, the majority of farmers are classified as small or marginal, making their livelihoods particularly vulnerable to economic losses due to market saturation and climate risks. Effective crop planning can significantly impact their expected income, yet existing decision support systems (DSS) often provide generic recommendations that fail to account for real-time market dynamics and the interactions among multiple farmers. In this paper, we evaluate the viability of three multi-agent reinforcement learning (MARL) approaches for optimizing total farmer income and promoting fairness in crop planning: Independent Q-Learning (IQL), where each farmer acts independently without coordination, Agent-by-Agent (ABA), which sequentially optimizes each farmer's policy in relation to the others, and the Multi-agent Rollout Policy, which jointly optimizes all farmers' actions for global reward maximization. Our results demonstrate that while IQL offers computational efficiency with linear runtime, it struggles with coordination among agents, leading to lower total rewards and an unequal distribution of income. Conversely, the Multi-agent Rollout policy achieves the highest total rewards and promotes equitable income distribution among farmers but requires significantly more computational resources, making it less practical for large numbers of agents. ABA strikes a balance between runtime efficiency and reward optimization, offering reasonable total rewards with acceptable fairness and scalability. These findings highlight the importance of selecting appropriate MARL approaches in DSS to provide personalized and equitable crop planning recommendations, advancing the development of more adaptive and farmer-centric agricultural decision-making systems.
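
As a rough illustration of the Independent Q-Learning baseline described above, the sketch below lets each farmer keep a private Q-table and update it from its own reward only, with no coordination. The two-crop market, the prices, the saturation rule, and the single-state (bandit-style) simplification are all assumptions made for this example; they are not the paper's environment.

```python
import random

CROPS = ["rice", "wheat"]
BASE_PRICE = {"rice": 10.0, "wheat": 8.0}  # hypothetical prices


def rewards(actions):
    """Toy market saturation: a crop's price is split among the farmers who chose it."""
    counts = {crop: actions.count(crop) for crop in CROPS}
    return [BASE_PRICE[a] / counts[a] for a in actions]


def independent_q_learning(n_farmers=2, episodes=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Each farmer learns its own Q-values and ignores the other farmers' choices."""
    rng = random.Random(seed)
    q_tables = [{crop: 0.0 for crop in CROPS} for _ in range(n_farmers)]
    for _ in range(episodes):
        actions = []
        for table in q_tables:
            if rng.random() < epsilon:
                actions.append(rng.choice(CROPS))           # explore
            else:
                actions.append(max(CROPS, key=table.get))   # exploit
        # Independent updates: every farmer sees only its own reward.
        for table, action, reward in zip(q_tables, actions, rewards(actions)):
            table[action] += alpha * (reward - table[action])
    return q_tables


if __name__ == "__main__":
    for i, table in enumerate(independent_q_learning()):
        print(f"farmer {i}: {table}")
```

Because no farmer models the others, the learned values depend on what the other farmers happened to plant, which is the coordination weakness the abstract attributes to IQL.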

Title: BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling on Social Media Texts

Authors: Raisa Tasnim, Mehanaz Chowdhury, Md Ataur Rahman
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2412.02058
Pdf URL: https://arxiv.org/pdf/2412.02058
Copy Paste: [[2412.02058]] BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling on Social Media Texts(https://arxiv.org/abs/2412.02058)
Keywords: security, privacy
Abstract: Author profiling, the analysis of texts to uncover attributes such as gender and age of the author, has become essential with the widespread use of social media platforms. This paper focuses on author profiling in the Bangla language, aiming to extract valuable insights about anonymous authors based on their writing style on social media. The primary objective is to introduce and benchmark the performance of machine learning approaches on a newly created Bangla Author Profiling dataset, BN-AuthProf. The dataset comprises 30,131 social media posts from 300 authors, labeled by their age and gender. Authors' identities and sensitive information were anonymized to ensure privacy. Various classical machine learning and deep learning techniques were employed to evaluate the dataset. For gender classification, the best accuracy achieved was 80% using Support Vector Machine (SVM), while a Multinomial Naive Bayes (MNB) classifier achieved the best F1 score of 0.756. For age classification, MNB attained a maximum accuracy score of 91% with an F1 score of 0.905. This research highlights the effectiveness of machine learning in gender and age classification for Bangla author profiling, with practical implications spanning marketing, security, forensic linguistics, education, and criminal investigations, considering privacy and biases.
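
The abstract reports that one classifier achieved the best accuracy for gender classification while a different one achieved the best F1 score. The sketch below, on made-up labels, shows how that can happen. It computes binary F1 for a single positive class; the paper may use a macro- or weighted-averaged F1, so this is an assumption for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # cautious: misses half the positives
model_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # eager: finds all positives, 3 false alarms

print(accuracy(y_true, model_a), round(f1_score(y_true, model_a), 3))  # 0.8 0.667
print(accuracy(y_true, model_b), round(f1_score(y_true, model_b), 3))  # 0.7 0.727
```

Model A wins on accuracy and model B wins on F1, so reporting both metrics, as the abstract does, gives a fuller picture.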

Title: CLERF: Contrastive LEaRning for Full Range Head Pose Estimation

Authors: Ting-Ruen Wei, Haowei Liu, Huei-Chung Hu, Xuyang Wu, Yi Fang, Hsin-Tai Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02066
Pdf URL: https://arxiv.org/pdf/2412.02066
Copy Paste: [[2412.02066]] CLERF: Contrastive LEaRning for Full Range Head Pose Estimation(https://arxiv.org/abs/2412.02066)
Keywords: generative
Abstract: We introduce a novel framework for representation learning in head pose estimation (HPE). Previously, such a scheme was difficult due to head pose data sparsity, making triplet sampling infeasible. Recent progress in 3D generative adversarial networks (3D-aware GAN) has opened the door for easily sampling triplets (anchor, positive, negative). We perform contrastive learning on extensively augmented data including geometric transformations and demonstrate that contrastive learning allows networks to learn genuine features that contribute to accurate HPE. On the other hand, we observe that existing HPE works struggle to predict head poses as accurately when test image rotation matrices are slightly out of the training dataset distribution. Experiments show that our methodology performs on par with state-of-the-art models on standard test datasets and outperforms them when images are slightly rotated or flipped, or show full range head poses. To the best of our knowledge, we are the first to deliver a true full range HPE model capable of accurately predicting any head pose including upside-down pose. Furthermore, we compared with other existing full-yaw range models and demonstrated superior results.
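
The (anchor, positive, negative) triplets mentioned above are commonly trained with a margin-based triplet loss. The sketch below shows that generic loss on toy 2D embeddings. The abstract does not state the exact contrastive objective used, so the loss form, the margin value, and the embeddings are assumptions for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]            # embedding of a reference head pose (hypothetical)
positive = [0.1, 0.0]          # embedding of a similar pose
easy_negative = [1.0, 0.0]     # clearly different pose, already far away
hard_negative = [0.2, 0.0]     # different pose that still lies close to the anchor

print(triplet_loss(anchor, positive, easy_negative))             # 0.0
print(round(triplet_loss(anchor, positive, hard_negative), 3))   # 0.1
```

Only the hard negative produces a non-zero loss, which is why being able to sample many informative triplets matters for this style of training.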

Title: Performance Comparison of Deep Learning Techniques in Naira Classification

Authors: Ismail Ismail Tijjani, Ahmad Abubakar Mustapha, Isma'il Tijjani Idris
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02072
Pdf URL: https://arxiv.org/pdf/2412.02072
Copy Paste: [[2412.02072]] Performance Comparison of Deep Learning Techniques in Naira Classification(https://arxiv.org/abs/2412.02072)
Keywords: security
Abstract: The Naira is Nigeria's official currency in daily transactions. This study presents the deployment and evaluation of Deep Learning (DL) models to classify Currency Notes (Naira) by denomination. Using a diverse dataset of 1,808 images of Naira notes captured under different conditions, we trained models employing different architectures and obtained the highest accuracy with MobileNetV2, which achieved a training accuracy of 90.75% and a validation accuracy of 87.04% in classification tasks and demonstrated substantial performance across various scenarios. This model holds significant potential for practical applications, including automated cash handling systems, sorting systems, and assistive technology for the visually impaired. The results demonstrate how the model could boost the security and efficiency of financial transactions in the Nigerian economy.

Title: Topology-Preserving Image Segmentation with Spatial-Aware Persistent Feature Matching

Authors: Bo Wen, Haochen Zhang, Dirk-Uwe G. Bartsch, William R. Freeman, Truong Q. Nguyen, Cheolhong An
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02076
Pdf URL: https://arxiv.org/pdf/2412.02076
Copy Paste: [[2412.02076]] Topology-Preserving Image Segmentation with Spatial-Aware Persistent Feature Matching(https://arxiv.org/abs/2412.02076)
Keywords: segmentation
Abstract: Topological correctness is critical for segmentation of tubular structures. Existing topological segmentation loss functions are primarily based on the persistent homology of the image. They match the persistent features from the segmentation with the persistent features from the ground truth and minimize the difference between them. However, these methods suffer from an ambiguous matching problem since the matching only relies on the information in the topological space. In this work, we propose an effective and efficient Spatial-Aware Topological Loss Function that further leverages the information in the original spatial domain of the image to assist the matching of persistent features. Extensive experiments on images of various types of tubular structures show that the proposed method has superior performance in improving the topological accuracy of the segmentation compared with state-of-the-art methods.

Title: Let's Think Var-by-Var: Large Language Models Enable Ad Hoc Probabilistic Reasoning

Authors: Shepard Xia, Brian Lu, Jason Eisner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.02081
Pdf URL: https://arxiv.org/pdf/2412.02081
Copy Paste: [[2412.02081]] Let's Think Var-by-Var: Large Language Models Enable Ad Hoc Probabilistic Reasoning(https://arxiv.org/abs/2412.02081)
Keywords: robust, large language model
Abstract: A hallmark of intelligence is the ability to flesh out underspecified situations using "common sense." We propose to extract that common sense from large language models (LLMs), in a form that can feed into probabilistic inference. We focus our investigation on $\textit{guesstimation}$ questions such as "How much are Airbnb listings in Newark, NJ?" Formulating a sensible answer without access to data requires drawing on, and integrating, bits of common knowledge about how $\texttt{Price}$ and $\texttt{Location}$ may relate to other variables, such as $\texttt{Property Type}$. Our framework answers such a question by synthesizing an $\textit{ad hoc}$ probabilistic model. First we prompt an LLM to propose a set of random variables relevant to the question, followed by moment constraints on their joint distribution. We then optimize the joint distribution $p$ within a log-linear family to maximize the overall constraint satisfaction. Our experiments show that LLMs can successfully be prompted to propose reasonable variables, and while the proposed numerical constraints can be noisy, jointly optimizing for their satisfaction reconciles them. When evaluated on probabilistic questions derived from three real-world tabular datasets, we find that our framework performs comparably to a direct prompting baseline in terms of total variation distance from the dataset distribution, and is similarly robust to noise.
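
The evaluation metric named above, total variation distance, is half the L1 distance between two discrete distributions. A minimal sketch follows; the property-type probabilities are invented for illustration and do not come from the paper's datasets.

```python
def total_variation_distance(p, q):
    """Half the sum of absolute probability differences over the joint support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


# Hypothetical distributions over a "Property Type" variable.
dataset_dist = {"Apartment": 0.6, "House": 0.3, "Room": 0.1}
model_dist = {"Apartment": 0.5, "House": 0.3, "Room": 0.2}

print(round(total_variation_distance(dataset_dist, model_dist), 3))  # 0.1
```

A value of 0 means the two distributions are identical, and 1 means they place their mass on disjoint outcomes.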

Title: Comparative Analysis of Black-Box and White-Box Machine Learning Model in Phishing Detection

Authors: Abdullah Fajar, Setiadi Yazid, Indra Budi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02084
Pdf URL: https://arxiv.org/pdf/2412.02084
Copy Paste: [[2412.02084]] Comparative Analysis of Black-Box and White-Box Machine Learning Model in Phishing Detection(https://arxiv.org/abs/2412.02084)
Keywords: attack, interpretability, explainability
Abstract: Background: Explainability in phishing detection models can support further phishing attack mitigation by increasing trust and clarifying how phishing is detected. Objective: This study aims to determine and recommend the best approach, one whose components can fulfil these critical needs. Methods: The methodology starts by analyzing both black-box and white-box models to identify their pros and cons specifically in phishing detection. The conclusions of the analysis are then validated experimentally using a set of well-known algorithms and public phishing datasets. The experimental metrics cover three measurements, including predictive accuracy and explainability metrics. Conclusion: Both models are comparable in terms of interpretability and consistency, with room for improvement on diverse datasets. EBM, as an example of a white-box model, is generally better suited for applications requiring explainability and actionable insights. Finally, both white-box and black-box models have positive and negative aspects with respect to performance metrics and explainability metrics. It is important to consider the objective of model usage.

Title: Offline Stochastic Optimization of Black-Box Objective Functions

Authors: Juncheng Dong, Zihao Wu, Hamid Jafarkhani, Ali Pezeshki, Vahid Tarokh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02089
Pdf URL: https://arxiv.org/pdf/2412.02089
Copy Paste: [[2412.02089]] Offline Stochastic Optimization of Black-Box Objective Functions(https://arxiv.org/abs/2412.02089)
Keywords: robust
Abstract: Many challenges in science and engineering, such as drug discovery and communication network design, involve optimizing complex and expensive black-box functions across vast search spaces. Thus, it is essential to leverage existing data to avoid costly active queries of these black-box functions. To this end, while Offline Black-Box Optimization (BBO) is effective for deterministic problems, it may fall short in capturing the stochasticity of real-world scenarios. To address this, we introduce Stochastic Offline BBO (SOBBO), which tackles both black-box objectives and uncontrolled uncertainties. We propose two solutions: for large-data regimes, a differentiable surrogate allows for gradient-based optimization, while for scarce-data regimes, we directly estimate gradients under conservative field constraints, improving robustness, convergence, and data efficiency. Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world tasks.

Title: Crash Severity Risk Modeling Strategies under Data Imbalance

Authors: Abdullah Al Mamun (1), Abyad Enan (1), Debbie A. Indah (2), Judith Mwakalonge (3), Gurcan Comert (4), Mashrur Chowdhury (5) ((1) Graduate Student, Glenn Department of Civil Engineering, Clemson University, (2) Graduate Student, Department of Engineering, South Carolina State University, (3) Professor, Department of Engineering, South Carolina State University, (4) Associate Professor, Computational Data Science and Engineering Department, North Carolina A&T State University, (5) Professor, Glenn Department of Civil Engineering, Clemson University)
Subjects: cs.LG, cs.CY, stat.AP
Abstract URL: https://arxiv.org/abs/2412.02094
Pdf URL: https://arxiv.org/pdf/2412.02094
Copy Paste: [[2412.02094]] Crash Severity Risk Modeling Strategies under Data Imbalance(https://arxiv.org/abs/2412.02094)
Keywords: robust
Abstract: This study investigates crash severity risk modeling strategies for work zones involving large vehicles (i.e., trucks, buses, and vans) when crash data are imbalanced between low-severity (LS) and high-severity (HS) crashes. We utilized crash data involving large vehicles in South Carolina work zones for the period between 2014 and 2018, which included 4 times more LS crashes than HS crashes. The objective of this study is to explore the crash severity prediction performance of various models under different feature selection and data balancing techniques. The findings of this study highlight a disparity between LS and HS predictions, with less accurate prediction of HS crashes compared to LS crashes due to class imbalance and feature overlaps between LS and HS crashes. Combining features from multiple feature selection techniques (statistical correlation, feature importance, recursive elimination, statistical tests, and mutual information) slightly improves HS crash prediction performance. Data balancing techniques such as NearMiss-1 and RandomUnderSampler maximize HS recall when paired with certain prediction models, such as Bayesian Mixed Logit (BML), NeuralNet, and RandomForest, making them suitable for HS crash prediction. Conversely, RandomOverSampler, HS Class Weighting, and Kernel-based Synthetic Minority Oversampling (K-SMOTE), used with certain prediction models such as BML, CatBoost, and LightGBM, achieve a balanced performance, defined as achieving an equitable trade-off between LS and HS prediction performance metrics. These insights provide safety analysts with guidance to select models, feature selection techniques, and data balancing techniques that align with their specific safety objectives, offering a robust foundation for enhancing work-zone crash severity prediction.
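
Two of the balancing ideas named above, random undersampling and class weighting, can be sketched with the standard library alone. These functions are simplified stand-ins written for this note, not the imbalanced-learn RandomUnderSampler or any library API. The 80 LS / 20 HS split is a hypothetical dataset chosen only to give a 4:1 imbalance for illustration.

```python
import random
from collections import Counter


def random_undersample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


def balanced_class_weights(labels):
    """Inverse-frequency weights, n / (k * count), so rare classes count more in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Hypothetical crash records: (record_id, severity), 80 LS and 20 HS.
crashes = [(i, "LS") for i in range(80)] + [(i, "HS") for i in range(80, 100)]

balanced = random_undersample(crashes, label_of=lambda row: row[1])
print(Counter(severity for _, severity in balanced))              # 20 LS and 20 HS
print(balanced_class_weights([severity for _, severity in crashes]))  # LS: 0.625, HS: 2.5
```

Undersampling discards majority-class data, while class weighting keeps every record and rescales its influence, which is one reason the two families can trade off HS recall and overall balance differently.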

Title: AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation

Authors: Zhihang Lin, Mingbao Lin, Wengyi Zhan, Rongrong Ji
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02099
Pdf URL: https://arxiv.org/pdf/2412.02099
Copy Paste: [[2412.02099]] AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation(https://arxiv.org/abs/2412.02099)
Keywords: diffusion
Abstract: Diffusion models suffer severe object repetition and local distortion when the inference resolution differs from their pre-trained resolution. We propose AccDiffusion v2, an accurate method for patch-wise higher-resolution diffusion extrapolation without training. Our in-depth analysis in this paper shows that using an identical text prompt for different patches leads to repetitive generation, while the absence of a prompt undermines image details. In response, our AccDiffusion v2 decouples the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of a patch. Further analysis reveals that local distortion arises from inaccurate descriptions in prompts about the local structure of higher-resolution images. To address this issue, AccDiffusion v2, for the first time, introduces auxiliary local structural information through ControlNet during higher-resolution diffusion extrapolation, aiming to mitigate the local distortions. Finally, our analysis indicates that global semantic information is conducive to suppressing both repetitive generation and local distortion. Hence, our AccDiffusion v2 further proposes dilated sampling with window interaction for better global semantic information during higher-resolution diffusion extrapolation. We conduct extensive experiments, including both quantitative and qualitative comparisons, to demonstrate the efficacy of our AccDiffusion v2. The quantitative comparison shows that AccDiffusion v2 achieves state-of-the-art performance in image generation extrapolation without training. The qualitative comparison intuitively illustrates that AccDiffusion v2 effectively suppresses the issues of repetitive generation and local distortion in image generation extrapolation. Our code is available at this https URL.

Title: Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

Authors: Yunkai Dang, Kaichen Huang, Jiahao Huo, Yibo Yan, Sirui Huang, Dongrui Liu, Mengxi Gao, Jie Zhang, Chen Qian, Kun Wang, Yong Liu, Jing Shao, Hui Xiong, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.02104
Pdf URL: https://arxiv.org/pdf/2412.02104
Copy Paste: [[2412.02104]] Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey(https://arxiv.org/abs/2412.02104)
Keywords: robust, interpretability, explainability, large language model
Abstract: The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in natural language understanding and visual processing, respectively. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities. Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability, essential for establishing transparency, trustworthiness, and reliability in high-stakes applications. This paper provides a comprehensive survey on the interpretability and explainability of MLLMs, proposing a novel framework that categorizes existing research across three perspectives: (I) Data, (II) Model, (III) Training & Inference. We systematically analyze interpretability from token-level to embedding-level representations, assess approaches related to both architecture analysis and design, and explore training and inference strategies that enhance transparency. By comparing various methodologies, we identify their strengths and limitations and propose future research directions to address unresolved challenges in multimodal explainability. This survey offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward developing more accountable and robust multimodal AI systems.

Title: Retrofitting XoM for Stripped Binaries without Embedded Data Relocation

Authors: Chenke Luo, Jiang Ming, Mengfei Xie, Guojun Peng, Jianming Fu
Subjects: cs.CR, cs.OS
Abstract URL: https://arxiv.org/abs/2412.02110
Pdf URL: https://arxiv.org/pdf/2412.02110
Copy Paste: [[2412.02110]] Retrofitting XoM for Stripped Binaries without Embedded Data Relocation(https://arxiv.org/abs/2412.02110)
Keywords: security, protect
Abstract: In this paper, we present PXoM, a practical technique to seamlessly retrofit XoM into stripped binaries on the x86-64 platform. As handling the mixture of code and data is a well-known challenge for XoM, most existing methods require the strict separation of code and data areas via either compile-time transformation or binary patching, so that the unreadable permission can be safely enforced at the granularity of memory pages. In contrast to previous approaches, we provide a fine-grained memory permission control mechanism to restrict the read permission of code while allowing legitimate data reads within code pages. This novelty enables PXoM to harden stripped binaries without resorting to error-prone embedded data relocation. We leverage Intel's hardware feature, Memory Protection Keys, to offer efficient fine-grained permission control. We measure PXoM's performance with both micro- and macro-benchmarks, and it introduces only negligible runtime overhead. Our security evaluation shows that PXoM leaves adversaries with little wiggle room to harvest all of the required gadgets, suggesting PXoM is practical for real-world deployment.

Title: OmniCreator: Self-Supervised Unified Generation with Universal Editing

Authors: Haodong Chen, Lan Wang, Harry Yang, Ser-Nam Lim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02114
Pdf URL: https://arxiv.org/pdf/2412.02114
Copy Paste: [[2412.02114]] OmniCreator: Self-Supervised Unified Generation with Universal Editing(https://arxiv.org/abs/2412.02114)
Keywords: generative
Abstract: We introduce OmniCreator, a novel framework that can conduct text-prompted unified (image+video) generation as well as editing all in one place. OmniCreator acquires generative and universal editing capabilities in a self-supervised manner, taking original text-video pairs as conditions while utilizing the same video as a denoising target to learn the semantic correspondence between video and text. During inference, when presented with a text prompt and a video, OmniCreator is capable of generating a target that is faithful to both, achieving a universal editing effect that is unconstrained as opposed to existing editing work that primarily focuses on certain editing types or relies on additional controls (e.g., structural conditions, attention features, or DDIM inversion). On the other hand, when presented with a text prompt only, OmniCreator becomes generative, producing high-quality video as a result of the semantic correspondence learned. Importantly, we found that the same capabilities extend to images as is, making OmniCreator a truly unified framework. Further, due to the lack of existing generative video editing benchmarks, we introduce the OmniBench-99 dataset, designed to evaluate the performance of generative video editing models comprehensively. Extensive experiments demonstrate that OmniCreator exhibits substantial superiority over all other models.

Title: Streamlining Video Analysis for Efficient Violence Detection

Authors: Gourang Pathak, Abhay Kumar, Sannidhya Rawat, Shikha Gupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02127
Pdf URL: https://arxiv.org/pdf/2412.02127
Copy Paste: [[2412.02127]] Streamlining Video Analysis for Efficient Violence Detection(https://arxiv.org/abs/2412.02127)
Keywords: security, extraction
Abstract: This paper addresses the challenge of automated violence detection in video frames captured by surveillance cameras, specifically focusing on classifying scenes as "fight" or "non-fight." This task is critical for enhancing unmanned security systems, online content filtering, and related applications. We propose an approach using a 3D Convolutional Neural Network (3D CNN)-based model named X3D to tackle this problem. Our approach incorporates pre-processing steps such as tube extraction, volume cropping, and frame aggregation, combined with clustering techniques, to accurately localize and classify fight scenes. Extensive experimentation demonstrates the effectiveness of our method in distinguishing violent from non-violent events, providing valuable insights for advancing practical violence detection systems.

Title: GSOT3D: Towards Generic 3D Single Object Tracking in the Wild

Authors: Yifan Jiao, Yunhao Li, Junhua Ding, Qing Yang, Song Fu, Heng Fan, Libo Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02129
Pdf URL: https://arxiv.org/pdf/2412.02129
Copy Paste: [[2412.02129]] GSOT3D: Towards Generic 3D Single Object Tracking in the Wild(https://arxiv.org/abs/2412.02129)
Keywords: robust
Abstract: In this paper, we present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames, and covers a wide selection of 54 object categories. Each sequence is offered with multiple modalities, including the point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking. To provide high-quality per-frame 3D annotations, all sequences are labeled manually with multiple rounds of meticulous inspection and refinement. To the best of our knowledge, GSOT3D is the largest benchmark dedicated to various generic 3D object tracking tasks. To understand how existing 3D trackers perform and to provide comparisons for future research on GSOT3D, we assess eight representative point cloud-based tracking models. Our evaluation results show that these models degrade heavily on GSOT3D, and more effort is required for robust and generic 3D object tracking. Besides, to encourage future research, we present a simple yet effective generic 3D tracker, named PROT3D, that localizes the target object via a progressive spatial-temporal network and outperforms all current solutions by a large margin. By releasing GSOT3D, we expect to advance further 3D tracking in future research and applications. Our benchmark and model as well as the evaluation results will be publicly released at our webpage this https URL.

Title: WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Authors: Yuci Liang, Xinheng Lyu, Meidan Ding, Wenting Chen, Jipeng Zhang, Yuexiang Ren, Xiangjian He, Song Wu, Sen Yang, Xiyue Wang, Xiaohan Xing, Linlin Shen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.02141
Pdf URL: https://arxiv.org/pdf/2412.02141
Copy Paste: [[2412.02141]] WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image(https://arxiv.org/abs/2412.02141)
Keywords: large language model
Abstract: Recent advancements in computational pathology have produced patch-level Multi-modal Large Language Models (MLLMs), but these models are limited by their inability to analyze whole slide images (WSIs) comprehensively and their tendency to bypass crucial morphological features that pathologists rely on for diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, designed to evaluate MLLMs' understanding of morphological characteristics crucial for accurate diagnosis. Building upon this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI understanding that employs a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning. To better assess model performance in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that WSI-LLaVA outperforms existing models across all capability dimensions, with a significant improvement in morphological analysis, establishing a clear correlation between morphological understanding and diagnostic accuracy.

Title: Personalized Multimodal Large Language Models: A Survey

Authors: Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Namyong Park, Sungchul Kim, Huanrui Yang, Subrata Mitra, Zhengmian Hu, Nedim Lipka, Dang Nguyen, Yue Zhao, Jiebo Luo, Julian McAuley
Subjects: cs.CV, cs.AI, cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.02142
Pdf URL: https://arxiv.org/pdf/2412.02142
Copy Paste: [[2412.02142]] Personalized Multimodal Large Language Models: A Survey(https://arxiv.org/abs/2412.02142)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities, such as text, images, and audio, to perform complex tasks with high accuracy. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications. We propose an intuitive taxonomy for categorizing the techniques used to personalize MLLMs to individual users, and discuss the techniques accordingly. Furthermore, we discuss how such techniques can be combined or adapted when appropriate, highlighting their advantages and underlying rationale. We also provide a succinct summary of personalization tasks investigated in existing research, along with the evaluation metrics commonly used. Additionally, we summarize the datasets that are useful for benchmarking personalized MLLMs. Finally, we outline critical open challenges. This survey aims to serve as a valuable resource for researchers and practitioners seeking to understand and advance the development of personalized multimodal large language models.

Title: Leveraging Large Language Models for Comparative Literature Summarization with Reflective Incremental Mechanisms

Authors: Fernando Gabriela Garcia, Spencer Burns, Harrison Fuller
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.02149
Pdf URL: https://arxiv.org/pdf/2412.02149
Copy Paste: [[2412.02149]] Leveraging Large Language Models for Comparative Literature Summarization with Reflective Incremental Mechanisms(https://arxiv.org/abs/2412.02149)
Keywords: large language model
Abstract: In this paper, we introduce ChatCite, a novel method leveraging large language models (LLMs) for generating comparative literature summaries. The ability to summarize research papers with a focus on key comparisons between studies is an essential task in academic research. Existing summarization models, while effective at generating concise summaries, fail to provide deep comparative insights. ChatCite addresses this limitation by incorporating a multi-step reasoning mechanism that extracts critical elements from papers, incrementally builds a comparative summary, and refines the output through a reflective memory process. We evaluate ChatCite on a custom dataset, CompLit-LongContext, consisting of 1000 research papers with annotated comparative summaries. Experimental results show that ChatCite outperforms several baseline methods, including GPT-4, BART, T5, and CoT, across various automatic evaluation metrics such as ROUGE and the newly proposed G-Score. Human evaluation further confirms that ChatCite generates more coherent, insightful, and fluent summaries compared to these baseline models. Our method provides a significant advancement in automatic literature review generation, offering researchers a powerful tool for efficiently comparing and synthesizing scientific research.

Title: Revisiting the Initial Steps in Adaptive Gradient Descent Optimization

Authors: Abulikemu Abuduweili, Changliu Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02153
Pdf URL: https://arxiv.org/pdf/2412.02153
Copy Paste: [[2412.02153]] Revisiting the Initial Steps in Adaptive Gradient Descent Optimization(https://arxiv.org/abs/2412.02153)
Keywords: transformer
Abstract: Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from suboptimal generalization compared to stochastic gradient descent (SGD) and exhibit instability, particularly when training Transformer models. In this work, we identify the standard initialization of the second-order moment estimation ($v_0 = 0$) as a significant factor contributing to these limitations. We introduce simple yet effective solutions: initializing the second-order moment estimation with non-zero values, using either data-driven or random initialization strategies. Empirical evaluations demonstrate that our approach not only stabilizes convergence but also enhances the final performance of adaptive gradient optimizers. Furthermore, by adopting the proposed initialization strategies, Adam achieves performance comparable to many recently proposed variants of adaptive gradient optimization methods, highlighting the practical impact of this straightforward modification.
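
The role of $v_0$ described in this abstract can be illustrated with a minimal NumPy sketch of the standard Adam update in which the initial second-moment value is a parameter. The function name `adam_steps`, the toy quadratic objective, and the constant `v0=1.0` are illustrative assumptions, not the authors' data-driven or random strategies. The sketch also keeps Adam's usual bias correction, which the abstract does not discuss.

```python
import numpy as np

def adam_steps(grad_fn, x0, v0=0.0, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, n_steps=1):
    """Run n_steps of textbook Adam, starting the second moment at v0 (assumed scalar)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)                   # first moment, standard zero init
    v = np.full_like(x, v0, dtype=float)   # second moment, configurable init
    for t in range(1, n_steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)          # standard bias correction (kept as an assumption)
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = sum(x^2), gradient 2x, with two very different gradient scales.
grad = lambda x: 2.0 * x
x0 = np.array([100.0, 0.01])

step_zero = x0 - adam_steps(grad, x0, v0=0.0)     # first update with v_0 = 0
step_nonzero = x0 - adam_steps(grad, x0, v0=1.0)  # first update with a non-zero v_0
```

With `v0=0.0` the first update has magnitude close to `lr` in every coordinate, whatever the gradient scale, because the step reduces to roughly `lr * sign(g)`. A non-zero `v0` damps that first step in this sketch.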

Title: CausalMob: Causal Human Mobility Prediction with LLMs-derived Human Intentions toward Public Events

Authors: Xiaojie Yang, Hangli Ge, Jiawei Wang, Zipei Fan, Renhe Jiang, Ryosuke Shibasaki, Noboru Koshizuka
Subjects: cs.LG, cs.AI, cs.IR, cs.SI
Abstract URL: https://arxiv.org/abs/2412.02155
Pdf URL: https://arxiv.org/pdf/2412.02155
Copy Paste: [[2412.02155]] CausalMob: Causal Human Mobility Prediction with LLMs-derived Human Intentions toward Public Events(https://arxiv.org/abs/2412.02155)
Keywords: large language model
Abstract: Large-scale human mobility exhibits spatial and temporal patterns that can assist policymakers in decision making. Although traditional prediction models attempt to capture these patterns, they are often disrupted by non-periodic public events, such as disasters and occasional celebrations. Since regular human mobility patterns are heavily affected by these events, estimating their causal effects is critical to accurate mobility predictions. Although news articles provide unique perspectives on these events in an unstructured format, processing them is a challenge. In this study, we propose a causality-augmented prediction model, called CausalMob, to analyze the causal effects of public events. We first utilize large language models (LLMs) to extract human intentions from news articles and transform them into features that act as causal treatments. Next, the model learns representations of spatio-temporal regional covariates from multiple data sources to serve as confounders for causal inference. Finally, we present a causal effect estimation framework to ensure event features remain independent of confounders during prediction. Based on large-scale real-world data, the experimental results show that the proposed model excels in human mobility prediction, outperforming state-of-the-art models.

Title: Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Authors: Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
Subjects: cs.LG, cs.AI, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.02159
Pdf URL: https://arxiv.org/pdf/2412.02159
Copy Paste: [[2412.02159]] Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach(https://arxiv.org/abs/2412.02159)
Keywords: defense, large language model
Abstract: Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.

Title: Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis

Authors: Yu Yuan, Xijun Wang, Yichen Sheng, Prateek Chennuri, Xingguang Zhang, Stanley Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02168
Pdf URL: https://arxiv.org/pdf/2412.02168
Copy Paste: [[2412.02168]] Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis(https://arxiv.org/abs/2412.02168)
Keywords: diffusion, generative
Abstract: Image generation today can produce somewhat realistic images from text prompts. However, if one asks the generator to synthesize a particular camera setting such as creating different fields of view using a 24mm lens versus a 70mm lens, the generator will not be able to interpret and generate scene-consistent images. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue of bridging the gap between the data-driven models and the physical world. In this paper, we introduce the concept of Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions for different camera settings. Experimental results show that our method produces significantly more scene-consistent photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.

Title: Underload: Defending against Latency Attacks for Object Detectors on Edge Devices

Authors: Tianyi Wang, Zichen Wang, Cong Wang, Yuanchao Shu, Ruilong Deng, Peng Cheng, Jiming Chen (Zhejiang University, Hangzhou, China)
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2412.02171
Pdf URL: https://arxiv.org/pdf/2412.02171
Copy Paste: [[2412.02171]] Underload: Defending against Latency Attacks for Object Detectors on Edge Devices(https://arxiv.org/abs/2412.02171)
Keywords: defense, attack, robust
Abstract: Object detection is a fundamental enabler for many real-time downstream applications such as autonomous driving, augmented reality and supply chain management. However, the algorithmic backbone of neural networks is brittle to imperceptible perturbations in the system inputs, which are generally known as misclassifying attacks. By targeting the real-time processing capability, a new class of latency attacks has been reported recently. They exploit new attack surfaces in object detectors by creating a computational bottleneck in the post-processing module, which leads to cascading failure and puts the real-time downstream tasks at risk. In this work, we make an initial attempt to defend against this attack via background-attentive adversarial training that is also cognizant of the underlying hardware capabilities. We first draw system-level connections between latency attacks and hardware capacity across heterogeneous GPU devices. Based on the particular adversarial behaviors, we utilize objectness loss as a proxy and build background attention into the adversarial training pipeline, achieving a reasonable balance between clean and robust accuracy. Extensive experiments demonstrate the effectiveness of the defense, restoring real-time processing capability from 13 FPS to 43 FPS on Jetson Orin NX, with a better trade-off between clean and robust accuracy.
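
The abstract does not name the post-processing step that becomes the bottleneck. In many detectors that step is non-maximum suppression (NMS), and the sketch below assumes so. It is a plain greedy NMS in NumPy, not the authors' pipeline, with an added counter of pairwise IoU evaluations. The helper names and the synthetic boxes are hypothetical and serve only to show how the work grows when many candidate boxes survive suppression.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS that also returns how many pairwise IoU values were computed."""
    order = np.argsort(-scores)            # highest score first
    keep, comparisons = [], 0
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        comparisons += rest.size
        if rest.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]   # drop boxes suppressed by `best`
    return keep, comparisons

n = 50
scores = np.linspace(1.0, 0.5, n)
offsets = 10.0 * np.arange(n)
# Case 1: n mutually disjoint boxes, so nothing is ever suppressed.
disjoint = np.stack([offsets, np.zeros(n), offsets + 5.0, np.full(n, 5.0)], axis=1)
# Case 2: n identical boxes, so one pass suppresses all but the first.
stacked = np.tile(np.array([0.0, 0.0, 5.0, 5.0]), (n, 1))

keep_disjoint, cost_disjoint = greedy_nms(disjoint, scores)
keep_stacked, cost_stacked = greedy_nms(stacked, scores)
```

In this toy setting the disjoint case needs n(n-1)/2 IoU evaluations while the fully overlapping case needs n-1, which is one way to see why inputs that yield many non-overlapping candidates can slow post-processing.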

Title: Deep Learning, Machine Learning, Advancing Big Data Analytics and Management

Authors: Weiche Hsieh, Ziqian Bi, Keyu Chen, Benji Peng, Sen Zhang, Jiawei Xu, Jinlang Wang, Caitlyn Heqi Yin, Yichao Zhang, Pohsun Feng, Yizhu Wen, Tianyang Wang, Ming Li, Chia Xin Liang, Jintao Ren, Qian Niu, Silin Chen, Lawrence K.Q. Yan, Han Xu, Hong-Ming Tseng, Xinyuan Song, Bowen Jing, Junjie Yang, Junhao Song, Junyu Liu, Ming Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02187
Pdf URL: https://arxiv.org/pdf/2412.02187
Copy Paste: [[2412.02187]] Deep Learning, Machine Learning, Advancing Big Data Analytics and Management(https://arxiv.org/abs/2412.02187)
Keywords: privacy
Abstract: Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive, high-dimensional datasets. The study presents a systematic overview of data preprocessing techniques, including data cleaning, normalization, integration, and dimensionality reduction, to prepare raw data for analysis. Core analytics methodologies such as classification, clustering, regression, and anomaly detection are examined, with a focus on algorithmic innovation and scalability. Furthermore, the text delves into state-of-the-art frameworks for data mining and predictive modeling, highlighting the role of neural networks, support vector machines, and ensemble methods in tackling complex analytical challenges. Special emphasis is placed on the convergence of big data with distributed computing paradigms, including cloud and edge computing, to address challenges in storage, computation, and real-time analytics. The integration of ethical considerations, including data privacy and compliance with global standards, ensures a holistic perspective on data management. Practical applications across healthcare, finance, marketing, and policy-making illustrate the real-world impact of these technologies. Through comprehensive case studies and Python-based implementations, this work equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics. It bridges the gap between theory and practice, fostering the development of innovative solutions for managing and leveraging data in the era of artificial intelligence.

Title: LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Authors: Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02193
Pdf URL: https://arxiv.org/pdf/2412.02193
Copy Paste: [[2412.02193]] LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models(https://arxiv.org/abs/2412.02193)
Keywords: large language model
Abstract: Open-universe 3D layout generation arranges unlabeled 3D assets conditioned on language instruction. Large language models (LLMs) struggle with generating physically plausible 3D scenes and adhering to input instructions, particularly in cluttered scenes. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve performance.

Title: Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images

Authors: Xiangyong Lu, Masanori Suganuma, Takayuki Okatani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02197
Pdf URL: https://arxiv.org/pdf/2412.02197
Copy Paste: [[2412.02197]] Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images(https://arxiv.org/abs/2412.02197)
Keywords: extraction
Abstract: In real-world applications of image recognition tasks, such as human pose estimation, cameras often capture objects, like human bodies, at low resolutions. This scenario poses a challenge in extracting and leveraging multi-scale features, which is often essential for precise inference. To address this challenge, we propose a new attention mechanism, named cascaded multi-scale attention (CMSA), tailored for use in CNN-ViT hybrid architectures, to handle low-resolution inputs effectively. The design of CMSA enables the extraction and seamless integration of features across various scales without necessitating the downsampling of the input image or feature maps. This is achieved through a novel combination of grouped multi-head self-attention mechanisms with window-based local attention and cascaded fusion of multi-scale features over different scales. This architecture allows for the effective handling of features across different scales, enhancing the model's ability to perform tasks such as human pose estimation, head pose estimation, and more with low-resolution images. Our experimental results show that the proposed method outperforms existing state-of-the-art methods in these areas with fewer parameters, showcasing its potential for broad application in real-world scenarios where capturing high-resolution images is not feasible. Code is available at this https URL.

Title: Transformer-Metric Loss for CNN-Based Face Recognition

Authors: Pritesh Prakash, Ashish Jacob Sam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02198
Pdf URL: https://arxiv.org/pdf/2412.02198
Copy Paste: [[2412.02198]] Transformer-Metric Loss for CNN-Based Face Recognition(https://arxiv.org/abs/2412.02198)
Keywords: transformer
Abstract: In deep learning, the loss function plays a crucial role in optimizing the network. Many recent innovations in loss techniques have been made, and various margin-based angular loss functions (metric loss) have been designed particularly for face recognition. The concept of transformers is already well-researched and applied in many facets of machine vision. This paper presents a technique for loss evaluation that uses a transformer network as an additive loss in the face recognition domain. The standard metric loss function typically takes the final embedding of the main CNN backbone as its input. Here, we employ a transformer-metric loss, a combined approach that integrates both transformer-loss and metric-loss. This research intends to analyze the transformer behavior on the convolution output when the CNN outcome is arranged in a sequential vector. The transformer encoder takes input from the contextual vectors obtained from the final convolution layer of the network. With this technique, we use transformer loss with various base metric-loss functions to evaluate the effect of the combined loss functions. We observe that such a configuration allows the network to achieve SoTA results on various validation datasets with some limitations. This research expands the role of transformers in the machine vision domain and opens new possibilities for exploring transformers as a loss function.

Title: 3D representation in 512-Byte:Variational tokenizer is the key for autoregressive 3D generation

Authors: Jinzhi Zhang, Feng Xiong, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02202
Pdf URL: https://arxiv.org/pdf/2412.02202
Copy Paste: [[2412.02202]] 3D representation in 512-Byte:Variational tokenizer is the key for autoregressive 3D generation(https://arxiv.org/abs/2412.02202)
Keywords: transformer, large language model
Abstract: Autoregressive transformers have revolutionized high-fidelity image generation. One crucial ingredient lies in the tokenizer, which compresses high-resolution image patches into manageable discrete tokens with a scanning or hierarchical order suitable for large language models. Extending these tokenizers to 3D generation, however, presents a significant challenge: unlike image patches that naturally exhibit spatial sequence and multi-scale relationships, 3D data lacks an inherent order, making it difficult to compress into fewer tokens while preserving structural details. To address this, we introduce the Variational Tokenizer (VAT), which transforms unordered 3D data into compact latent tokens with an implicit hierarchy, suited for efficient and high-fidelity coarse-to-fine autoregressive modeling. VAT begins with an in-context transformer, which compresses numerous unordered 3D features into a reduced token set with minimal information loss. This latent space is then mapped to a Gaussian distribution for residual quantization, with token counts progressively increasing across scales. In this way, tokens at different scales naturally establish the interconnections by allocating themselves into different subspaces within the same Gaussian distribution, facilitating discrete modeling of token relationships across scales. During the decoding phase, a high-resolution triplane is utilized to convert these compact latent tokens into detailed 3D shapes. Extensive experiments demonstrate that VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization. Remarkably, VAT achieves up to a 250x compression, reducing a 1MB mesh to just 3.9KB with a 96% F-score, and can further compress to 256 int8 tokens, achieving a 2000x reduction while maintaining a 92% F-score.

Title: CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Authors: Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Yuliang Liu, LianWen Jin, Xiang Bai, Shuai Bai, Junyang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02210
Pdf URL: https://arxiv.org/pdf/2412.02210
Copy Paste: [[2412.02210]] CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy(https://arxiv.org/abs/2412.02210)
Keywords: extraction
Abstract: Large Multimodal Models (LMMs) have demonstrated impressive performance on recognizing document images with natural language instructions. However, the extent of their capabilities in literacy with rich structure and fine-grained visual challenges remains unclear. The current landscape lacks a comprehensive benchmark to effectively measure the literate capabilities of LMMs. Existing benchmarks are often limited by narrow scenarios and specified tasks. To this end, we introduce CC-OCR, a comprehensive benchmark that possesses a diverse range of scenarios, tasks, and challenges. CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications, being released for the first time. Furthermore, we evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition. CC-OCR aims to comprehensively evaluate the capabilities of LMMs on OCR-centered tasks, driving advancement in LMMs.

Title: An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction

Authors: Yaxin Liang, Xinshi Li, Xin Huang, Ziqi Zhang, Yue Yao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02211
Pdf URL: https://arxiv.org/pdf/2412.02211
Copy Paste: [[2412.02211]] An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction(https://arxiv.org/abs/2412.02211)
Keywords: extraction, generative
Abstract: This study proposes an automated data mining framework based on autoencoders and experimentally verifies its effectiveness in feature extraction and data dimensionality reduction. Through the encoding-decoding structure, the autoencoder can capture the data's potential characteristics and achieve noise reduction and anomaly detection, providing an efficient and stable solution for the data mining process. The experiment compared the performance of the autoencoder with traditional dimensionality reduction methods (such as PCA, FA, T-SNE, and UMAP). The results showed that the autoencoder performed best in terms of reconstruction error and root mean square error and could better retain data structure and enhance the generalization ability of the model. The autoencoder-based framework not only reduces manual intervention but also significantly improves the automation of data processing. In the future, with the advancement of deep learning and big data technology, the autoencoder method combined with a generative adversarial network (GAN) or graph neural network (GNN) is expected to be more widely used in the fields of complex data processing, real-time data analysis and intelligent decision-making.
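
The abstract compares methods by reconstruction error and root mean square error but does not give the evaluation protocol. As a hedged illustration only, the sketch below shows how such a score might be computed for the PCA baseline using NumPy's SVD. The function name `pca_reconstruction_rmse` and the synthetic data are assumptions made for this example, not material from the paper.

```python
import numpy as np

def pca_reconstruction_rmse(X, k):
    """Project X onto its top-k principal components, reconstruct, and return the RMSE."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # (k, d) principal directions
    codes = Xc @ components.T                      # low-dimensional representation
    X_hat = codes @ components + mean              # reconstruction in the original space
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

# Synthetic data: a 2-D latent signal mixed into 5 features plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

rmse_by_k = {k: pca_reconstruction_rmse(X, k) for k in range(1, 6)}
```

An autoencoder's output could be scored with the same final RMSE expression by substituting its reconstruction for `X_hat`, which would put both methods on a common footing.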

Title: Recovering implicit physics model under real-world constraints

Authors: Ayan Banerjee, Sandeep K.S. Gupta
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02215
Pdf URL: https://arxiv.org/pdf/2412.02215
Copy Paste: [[2412.02215]] Recovering implicit physics model under real-world constraints(https://arxiv.org/abs/2412.02215)
Keywords: extraction
Abstract: Recovering a physics-driven model, i.e. a governing set of equations of the underlying dynamical systems, from real-world data has been of recent interest. Most existing methods either operate on simulation data with unrealistically high sampling rates or require explicit measurements of all system variables, which is impractical in real-world deployments. Moreover, they assume the timestamps of external perturbations to the physical system are known a priori, without uncertainty, implicitly discounting any sensor time-synchronization or human reporting errors. In this paper, we propose a novel liquid time constant neural network (LTC-NN) based architecture to recover the underlying model of physical dynamics from real-world data. The automatic differentiation property of LTC-NN nodes overcomes problems associated with low sampling rates; the input-dependent time constant in the forward pass of the hidden layer of LTC-NN nodes creates a massive search space of implicit physical dynamics; the physics model solver based data reconstruction loss guides the search for the correct set of implicit dynamics; and the use of dropout regularization in the dense layer ensures extraction of the sparsest model. Further, to account for the perturbation timing error, we utilize dense layer nodes to search through input shifts that result in the lowest reconstruction loss. Experiments on four benchmark dynamical systems, three with simulation data and one with real-world data, show that the LTC-NN architecture is more accurate in recovering implicit physics model coefficients than the state-of-the-art sparse model recovery approaches. We also introduce four additional case studies (eight in total) on real-life medical examples in simulation and with real-world clinical data to show the effectiveness of our approach in recovering the underlying model in practice.

Title: Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

Authors: Zixuan Hu, Yongxian Wei, Li Shen, Chun Yuan, Dacheng Tao
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02220
Pdf URL: https://arxiv.org/pdf/2412.02220
Copy Paste: [[2412.02220]] Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs(https://arxiv.org/abs/2412.02220)
Keywords: large language model
Abstract: Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot adaptability without requiring fine-tuning, positioning them as ideal for data-limited and real-time applications. However, this adaptability has not yet been replicated in current Visual Foundation Models (VFMs), which require explicit fine-tuning with sufficient tuning data. Besides, the pretraining-finetuning paradigm has led to the surge of numerous task-specific modular components, such as Low-Rank Adaptation (LoRA). For the first time, we explore the potential of reusing diverse pre-tuned LoRAs without accessing their original training data, to achieve tuning-free few-shot adaptation in VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective, using surrogate data generated inversely from pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is empowered to solve new few-shot tasks in a single forward pass, akin to the in-context learning of LLMs. Additionally, we incorporate a double-efficient mechanism tailored to our framework, significantly accelerating the meta-training process while maintaining or even improving performance. Extensive experiments across various few-shot classification benchmarks across both in- and cross-domain scenarios demonstrate the superiority of our framework.

Title: How to Use Diffusion Priors under Sparse Views?

Authors: Qisen Wang, Yifan Zhao, Jiawei Ma, Jia Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02225
Pdf URL: https://arxiv.org/pdf/2412.02225
Copy Paste: [[2412.02225]] How to Use Diffusion Priors under Sparse Views?(https://arxiv.org/abs/2412.02225)
Keywords: diffusion
Abstract: Novel view synthesis under sparse views has been a long-term important challenge in 3D reconstruction. Existing works mainly rely on introducing external semantic or depth priors to supervise the optimization of 3D representations. However, the diffusion model, as an external prior that can directly provide visual supervision, has always underperformed in sparse-view 3D reconstruction using Score Distillation Sampling (SDS) due to the low information entropy of sparse views compared to text, leading to optimization challenges caused by mode deviation. To this end, we present a thorough analysis of SDS from the mode-seeking perspective and propose Inline Prior Guided Score Matching (IPSM), which leverages visual inline priors provided by pose relationships between viewpoints to rectify the rendered image distribution and decomposes the original optimization objective of SDS, thereby offering effective diffusion visual guidance without any fine-tuning or pre-training. Furthermore, we propose the IPSM-Gaussian pipeline, which adopts 3D Gaussian Splatting as the backbone and supplements depth and geometry consistency regularization based on IPSM to further improve inline priors and rectified distribution. Experimental results on different public datasets show that our method achieves state-of-the-art reconstruction quality. The code is released at this https URL.

Title: Learning from Concealed Labels

Authors: Zhongnian Li, Meng Wei, Peng Ying, Tongfeng Sun, Xinzheng Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02230
Pdf URL: https://arxiv.org/pdf/2412.02230
Copy Paste: [[2412.02230]] Learning from Concealed Labels(https://arxiv.org/abs/2412.02230)
Keywords: privacy, protect
Abstract: Annotating data for sensitive labels (e.g., disease, smoking) poses a potential threat to individual privacy in many real-world scenarios. To cope with this problem, we propose a novel setting to protect the privacy of each instance, namely learning from concealed labels for multi-class classification. Concealed labels prevent sensitive labels from appearing in the label set during the label collection stage, which specifies none and some randomly sampled insensitive labels as the concealed label set for annotating sensitive data. In this paper, an unbiased estimator can be established from concealed data under mild assumptions, and the learned multi-class classifier can not only classify the instance from insensitive labels accurately but also recognize the instance from the sensitive labels. Moreover, we bound the estimation error and show that the multi-class classifier achieves the optimal parametric convergence rate. Experiments demonstrate the significance and effectiveness of the proposed method for concealed labels in synthetic and real-world datasets.

Title: Blockchain-Enabled Device-Enhanced Multi-Access Edge Computing in Open Adversarial Environments

Authors: Muhammad Islam, Niroshinie Fernando, Seng W. Loke, Azadeh Ghari Neiat, Pubudu N. Pathirana
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.02233
Pdf URL: https://arxiv.org/pdf/2412.02233
Copy Paste: [[2412.02233]] Blockchain-Enabled Device-Enhanced Multi-Access Edge Computing in Open Adversarial Environments(https://arxiv.org/abs/2412.02233)
Keywords: secure, security
Abstract: We propose Blockchain-enabled Device-enhanced Multi-access Edge Computing (BdMEC). BdMEC extends the Honeybee framework for on-demand resource pooling with blockchain technology to ensure trust, security, and accountability among devices (even when they are owned by different parties). BdMEC mitigates risks from malicious devices by making computations traceable. Our prototype and results demonstrate BdMEC's ability to manage distributed computing tasks efficiently and securely across multiple devices.

Title: CubeFormer: A Simple yet Effective Baseline for Lightweight Image Super-Resolution

Authors: Jikai Wang, Huan Zheng, Jianbing Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02234
Pdf URL: https://arxiv.org/pdf/2412.02234
Copy Paste: [[2412.02234]] CubeFormer: A Simple yet Effective Baseline for Lightweight Image Super-Resolution(https://arxiv.org/abs/2412.02234)
Keywords: extraction, transformer
Abstract: Lightweight image super-resolution (SR) methods aim at increasing the resolution and restoring the details of an image using a lightweight neural network. However, current lightweight SR methods still suffer from inferior performance and unpleasant details. Our analysis reveals that these methods are hindered by constrained feature diversity, which adversely impacts feature representation and detail recovery. To address this issue, we propose a simple yet effective baseline called CubeFormer, designed to enhance feature richness by completing holistic information aggregation. To be specific, we introduce cube attention, which expands 2D attention to 3D space, facilitating exhaustive information interactions, further encouraging comprehensive information extraction and promoting feature variety. In addition, we inject block and grid sampling strategies to construct intra-cube transformer blocks (Intra-CTB) and inter-cube transformer blocks (Inter-CTB), which perform local and global modeling, respectively. Extensive experiments show that our CubeFormer achieves state-of-the-art performance on commonly used SR benchmarks. Our source code and models will be publicly available.

Title: Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models

Authors: Jungwon Park, Jungmin Ko, Dongnam Byun, Jangwon Suh, Wonjong Rhee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02237
Pdf URL: https://arxiv.org/pdf/2412.02237
Copy Paste: [[2412.02237]] Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models(https://arxiv.org/abs/2412.02237)
Keywords: diffusion, generative
Abstract: Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains somewhat limited. In this study, we present a method for constructing Head Relevance Vectors (HRVs) that align with useful visual concepts. An HRV for a given visual concept is a vector with a length equal to the total number of cross-attention heads, where each element represents the importance of the corresponding head for the given visual concept. We develop and employ an ordered weakening analysis to demonstrate the effectiveness of HRVs as interpretable features. To demonstrate the utility of HRVs, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. We show that misinterpretations of polysemous words in image generation can be corrected in most cases, five challenging attributes in image editing can be successfully modified, and catastrophic neglect in multi-concept generation can be mitigated. Overall, our work provides an advancement in understanding cross-attention layers and introduces new approaches for fine-controlling these layers at the head level.

Title: Fast LiDAR Data Generation with Rectified Flows

Authors: Kazuto Nakashima, Xiaowen Liu, Tomoya Miyawaki, Yumi Iwashita, Ryo Kurazume
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.02241
Pdf URL: https://arxiv.org/pdf/2412.02241
Copy Paste: [[2412.02241]] Fast LiDAR Data Generation with Rectified Flows(https://arxiv.org/abs/2412.02241)
Keywords: diffusion, transformer, generative
Abstract: Building LiDAR generative models holds promise as powerful data priors for restoration, scene manipulation, and scalable simulation in autonomous mobile robots. In recent years, approaches using diffusion models have emerged, significantly improving training stability and generation quality. Despite the success of diffusion models, generating high-quality samples requires numerous iterations of running neural networks, and the increasing computational cost can pose a barrier to robotics applications. To address this challenge, this paper presents R2Flow, a fast and high-fidelity generative model for LiDAR data. Our method is based on rectified flows that learn straight trajectories, simulating data generation with far fewer sampling steps than diffusion models. We also propose an efficient Transformer-based model architecture for processing the image representation of LiDAR range and reflectance measurements. Our experiments on the unconditional generation of the KITTI-360 dataset demonstrate the effectiveness of our approach in terms of both efficiency and quality.
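
Illustrative note: the abstract's point that straight trajectories need fewer sampling steps can be seen with a toy fixed-step Euler sampler. This is a minimal sketch, not R2Flow itself: the velocity fields below are hand-written stand-ins for a learned network, and every name (`euler_sample`, `straight_velocity`, `curved_velocity`) is hypothetical.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Toy endpoints (assumed for illustration only).
x_start = np.array([0.0, 0.0])
x_target = np.array([3.0, -1.0])

def straight_velocity(x, t):
    # A perfectly straight path has constant velocity, so Euler is exact.
    return x_target - x_start

def curved_velocity(x, t):
    # dx/dt = x has the exact solution x(1) = e * x(0): a curved path in time.
    return x

one_step = euler_sample(straight_velocity, x_start, 1)
many_steps = euler_sample(straight_velocity, x_start, 50)

x_init = np.array([1.0])
exact = np.e * x_init
err_1 = abs(euler_sample(curved_velocity, x_init, 1) - exact)[0]
err_50 = abs(euler_sample(curved_velocity, x_init, 50) - exact)[0]
print(one_step, many_steps, err_1, err_50)
```

With the straight field a single step already lands on the target, while the curved field still carries a visible error after one step and needs many more steps to approach the exact solution.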

Title: Vision Transformers for Weakly-Supervised Microorganism Enumeration

Authors: Javier Ureña Santiago, Thomas Ströhle, Antonio Rodríguez-Sánchez, Ruth Breu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02250
Pdf URL: https://arxiv.org/pdf/2412.02250
Copy Paste: [[2412.02250]] Vision Transformers for Weakly-Supervised Microorganism Enumeration(https://arxiv.org/abs/2412.02250)
Keywords: extraction, transformer, segmentation
Abstract: Microorganism enumeration is an essential task in many applications, such as assessing contamination levels or ensuring health standards when evaluating surface cleanliness. However, it's traditionally performed by human-supervised methods that often require manual counting, making it tedious and time-consuming. Previous research suggests automating this task using computer vision and machine learning methods, primarily through instance segmentation or density estimation techniques. This study conducts a comparative analysis of vision transformers (ViTs) for weakly-supervised counting in microorganism enumeration, contrasting them with traditional architectures such as ResNet and investigating ViT-based models such as TransCrowd. We trained different versions of ViTs as the architectural backbone for feature extraction using four microbiology datasets to determine potential new approaches for total microorganism enumeration in images. Results indicate that while ResNets perform better overall, ViTs deliver competent results across all datasets, opening up promising lines of research in microorganism enumeration. This comparative study contributes to the field of microbial image analysis by presenting innovative approaches to the recurring challenge of microorganism enumeration and by highlighting the capabilities of ViTs in the task of regression counting.

Title: Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

Authors: Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.02252
Pdf URL: https://arxiv.org/pdf/2412.02252
Copy Paste: [[2412.02252]] Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity(https://arxiv.org/abs/2412.02252)
Keywords: large language model
Abstract: The increasing context window size in Large Language Models (LLMs), such as the GPT and LLaMA series, has improved their ability to tackle complex, long-text tasks, but at the cost of inference efficiency, particularly regarding memory and computational complexity. Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation. In this paper, we propose an approach that enhances LLM efficiency without token loss by reducing the memory and computational load of less important tokens, rather than discarding them. We address two challenges: 1) investigating the distribution of important tokens in the context, discovering recent tokens are more important than distant tokens in context, and 2) optimizing resources for distant tokens by sharing attention scores across layers. The experiments show that our method saves $35\%$ KV cache without compromising the performance.
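
Illustrative note: to give a sense of scale for the reported saving, the sketch below computes the size of a standard (uncompressed) KV cache and applies a 35% reduction. The model configuration is a made-up example, not one taken from the paper, and treating the saving as a uniform 35% cut in bytes is an assumption; the abstract does not say how the figure is measured.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of a plain KV cache: one key and one value tensor per layer."""
    # The leading factor 2 accounts for keys plus values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical configuration: 32 layers, 32 KV heads of width 128,
# a 32,768-token context, 16-bit values.
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=32768)

# Assumed reading of the abstract's figure: 35% fewer bytes overall.
reduced = full * (1 - 0.35)

gib = 2 ** 30
print(f"full cache:    {full / gib:.1f} GiB")
print(f"after saving:  {reduced / gib:.1f} GiB")
```

Under these assumptions the cache shrinks from 16.0 GiB to 10.4 GiB for a single sequence.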

Title: ProbPose: A Probabilistic Approach to 2D Human Pose Estimation

Authors: Miroslav Purkrabek, Jiri Matas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02254
Pdf URL: https://arxiv.org/pdf/2412.02254
Copy Paste: [[2412.02254]] ProbPose: A Probabilistic Approach to 2D Human Pose Estimation(https://arxiv.org/abs/2412.02254)
Keywords: robust
Abstract: Current Human Pose Estimation methods have achieved significant improvements. However, state-of-the-art models ignore out-of-image keypoints and use uncalibrated heatmaps as keypoint location representations. To address these limitations, we propose ProbPose, which predicts for each keypoint: a calibrated probability of keypoint presence at each location in the activation window, the probability of being outside of it, and its predicted visibility. To address the lack of evaluation protocols for out-of-image keypoints, we introduce the CropCOCO dataset and the Extended OKS (Ex-OKS) metric, which extends OKS to out-of-image points. Tested on COCO, CropCOCO, and OCHuman, ProbPose shows significant gains in out-of-image keypoint localization while also improving in-image localization through data augmentation. Additionally, the model improves robustness along the edges of the bounding box and offers better flexibility in keypoint evaluation. The code and models are available on this https URL for research purposes.
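
Illustrative note: the abstract does not define Ex-OKS, so the sketch below covers only the standard COCO-style Object Keypoint Similarity (OKS) that it extends. The function name and the toy inputs are hypothetical; `k` holds per-keypoint constants and `area` is the object's area in pixels, following the usual OKS formulation.

```python
import numpy as np

def oks(pred, gt, visibility, area, k):
    """Standard OKS: mean over labeled keypoints of exp(-d^2 / (2 * area * k^2))."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    visibility = np.asarray(visibility)
    k = np.asarray(k, dtype=float)

    labeled = visibility > 0            # only labeled keypoints count
    if not labeled.any():
        return 0.0
    d2 = np.sum((pred - gt) ** 2, axis=1)          # squared pixel distances
    sim = np.exp(-d2 / (2.0 * area * k ** 2))      # per-keypoint similarity
    return float(sim[labeled].mean())

# Toy example: two labeled keypoints and one unlabeled keypoint.
gt = [[10.0, 10.0], [20.0, 20.0], [0.0, 0.0]]
pred = [[10.0, 10.0], [21.0, 21.0], [50.0, 50.0]]
vis = [2, 2, 0]
k = [0.1, 0.1, 0.1]

score = oks(pred, gt, vis, area=100.0, k=k)
perfect = oks(gt, gt, vis, area=100.0, k=k)
print(score, perfect)
```

The unlabeled third keypoint is ignored entirely, which is the gap for out-of-image points that the abstract says Ex-OKS is meant to close.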

Title: Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

Authors: Jingyu Gong, Chong Zhang, Fengqi Liu, Ke Fan, Qianyu Zhou, Xin Tan, Zhizhong Zhang, Yuan Xie, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02261
Pdf URL: https://arxiv.org/pdf/2412.02261
Copy Paste: [[2412.02261]] Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis(https://arxiv.org/abs/2412.02261)
Keywords: diffusion
Abstract: Human motion generation is a long-standing problem, and scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data whose quantity is limited. Meanwhile, it is difficult to generalize to diverse scenes when trained only on a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this framework, we disentangle human-scene interaction from motion synthesis during training and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion can be derived through iterative diffusion denoising and implicit policy optimization, thus motion naturalness and interaction plausibility can be maintained simultaneously. The proposed implicit policy optimizes the intermediate noised motion in a GAN Inversion manner to maintain motion continuity and control keyframe poses through the ControlNet branch and motion inpainting. For long-term motion synthesis, we introduce motion blending for stable transitions between multiple sub-tasks, where motions are fused in rotation power space and translation linear space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture, and real scenes from PROX and Replica. Results show that our framework presents better motion naturalness and interaction plausibility than cutting-edge methods. This also indicates the feasibility of utilizing the DIP for motion synthesis in more general tasks and versatile scenes. this https URL

Title: GSGTrack: Gaussian Splatting-Guided Object Pose Tracking from RGB Videos

Authors: Zhiyuan Chen, Fan Lu, Guo Yu, Bin Li, Sanqing Qu, Yuan Huang, Changhong Fu, Guang Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.02267
Pdf URL: https://arxiv.org/pdf/2412.02267
Copy Paste: [[2412.02267]] GSGTrack: Gaussian Splatting-Guided Object Pose Tracking from RGB Videos(https://arxiv.org/abs/2412.02267)
Keywords: robust
Abstract: Tracking the 6DoF pose of unknown objects in monocular RGB video sequences is crucial for robotic manipulation. However, existing approaches typically rely on accurate depth information, which is non-trivial to obtain in real-world scenarios. Although depth estimation algorithms can be employed, geometric inaccuracy can lead to failures in RGBD-based pose tracking methods. To address this challenge, we introduce GSGTrack, a novel RGB-based pose tracking framework that jointly optimizes geometry and pose. Specifically, we adopt 3D Gaussian Splatting to create an optimizable 3D representation, which is learned simultaneously with a graph-based geometry optimization to capture the object's appearance features and refine its geometry. However, the joint optimization process is susceptible to perturbations from noisy pose and geometry data. Thus, we propose an object silhouette loss to address the issue of pixel-wise loss being overly sensitive to pose noise during tracking. To mitigate the geometric ambiguities caused by inaccurate depth information, we propose a geometry-consistent image pair selection strategy, which filters out low-confidence pairs and ensures robust geometric optimization. Extensive experiments on the OnePose and HO3D datasets demonstrate the effectiveness of GSGTrack in both 6DoF pose tracking and object reconstruction.

Title: Sustainable Self-evolution Adversarial Training

Authors: Wenxuan Wang, Chenglei Wang, Huihui Qi, Menghao Ye, Xuelin Qian, Peng Wang, Yanning Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02270
Pdf URL: https://arxiv.org/pdf/2412.02270
Copy Paste: [[2412.02270]] Sustainable Self-evolution Adversarial Training(https://arxiv.org/abs/2412.02270)
Keywords: security, defense, attack
Abstract: With the wide application of deep neural network models in various computer vision tasks, there has been a proliferation of adversarial example generation strategies aimed at deeply exploring model security. However, existing adversarial training defense models, which rely on single or limited types of attacks under a one-time learning process, struggle to adapt to the dynamic and evolving nature of attack methods. Therefore, to achieve defense performance improvements for models in long-term applications, we propose a novel Sustainable Self-Evolution Adversarial Training (SSEAT) framework. Specifically, we introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples across multiple stages. Additionally, to address the issue of model catastrophic forgetting caused by continual learning from ongoing novel attacks, we propose an adversarial data replay module to better select more diverse and key relearning data. Furthermore, we design a consistency regularization strategy to encourage current defense models to learn more from previously trained ones, guiding them to retain more past knowledge and maintain accuracy on clean samples. Extensive experiments have been conducted to verify the efficacy of the proposed SSEAT defense method, which demonstrates superior defense performance and classification accuracy compared to competitors.

Title: MediaSpin: Exploring Media Bias Through Fine-Grained Analysis of News Headlines

Authors: Preetika Verma, Kokil Jaidka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.02271
Pdf URL: https://arxiv.org/pdf/2412.02271
Copy Paste: [[2412.02271]] MediaSpin: Exploring Media Bias Through Fine-Grained Analysis of News Headlines(https://arxiv.org/abs/2412.02271)
Keywords: large language model
Abstract: In this paper, we introduce the MediaSpin dataset aiming to help in the development of models that can detect different forms of media bias present in news headlines, developed through human-supervised and -validated Large Language Model (LLM) labeling of media bias. This corpus comprises 78,910 pairs of news headlines and annotations with explanations of the 13 distinct types of media bias categories assigned. We demonstrate the usefulness of our dataset for automated bias detection in news edits.

Title: PCIM: Learning Pixel Attributions via Pixel-wise Channel Isolation Mixing in High Content Imaging

Authors: Daniel Siegismund, Mario Wieser, Stephan Heyse, Stephan Steigele
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02275
Pdf URL: https://arxiv.org/pdf/2412.02275
Copy Paste: [[2412.02275]] PCIM: Learning Pixel Attributions via Pixel-wise Channel Isolation Mixing in High Content Imaging(https://arxiv.org/abs/2412.02275)
Keywords: interpretability
Abstract: Deep Neural Networks (DNNs) have shown remarkable success in various computer vision tasks. However, their black-box nature often leads to difficulty in interpreting their decisions, creating an unfilled need for methods to explain the decisions, and ultimately forming a barrier to their wide acceptance, especially in biomedical applications. This work introduces a novel method, Pixel-wise Channel Isolation Mixing (PCIM), to calculate pixel attribution maps, highlighting the image parts most crucial for a classification decision without the need to extract internal network states or gradients. Unlike existing methods, PCIM treats each pixel as a distinct input channel and trains a blending layer to mix these pixels, reflecting specific classifications. This unique approach allows the generation of pixel attribution maps for each image while remaining agnostic to the choice of the underlying classification network. Benchmark testing on three application-relevant, diverse High Content Imaging datasets shows state-of-the-art performance, particularly for model fidelity and localization ability in both fluorescence and bright-field High Content Imaging. PCIM contributes as a unique and effective method for creating pixel-level attribution maps from arbitrary DNNs, enabling interpretability and trust.

Title: A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis

Authors: Changzhi Zhou, Dandan Song, Yuhang Tian, Zhijing Wu, Hao Wang, Xinyu Zhang, Jun Yang, Ziyi Yang, Shuhao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02279
Pdf URL: https://arxiv.org/pdf/2412.02279
Copy Paste: [[2412.02279]] A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2412.02279)
Keywords: large language model
Abstract: Recently, Large Language Models (LLMs) have garnered increasing attention in the field of natural language processing, revolutionizing numerous downstream tasks with powerful reasoning and generation abilities. For example, In-Context Learning (ICL) introduces a fine-tuning-free paradigm, allowing out-of-the-box LLMs to execute downstream tasks by analogy learning without any fine-tuning. Besides, in a fine-tuning-dependent paradigm where substantial training data exists, Parameter-Efficient Fine-Tuning (PEFT) methods, being cost-effective, enable LLMs to achieve excellent performance comparable to full fine-tuning. However, these fascinating techniques employed by LLMs have not been fully exploited in the ABSA field. Previous works probe LLMs in ABSA by merely using randomly selected input-output pairs as demonstrations in ICL, resulting in an incomplete and superficial evaluation. In this paper, we shed light on a comprehensive evaluation of LLMs in the ABSA field, involving 13 datasets, 8 ABSA subtasks, and 6 LLMs. Specifically, we design a unified task formulation to unify "multiple LLMs for multiple ABSA subtasks in multiple paradigms." For the fine-tuning-dependent paradigm, we efficiently fine-tune LLMs using instruction-based multi-task learning. For the fine-tuning-free paradigm, we propose 3 demonstration selection strategies to stimulate the few-shot abilities of LLMs. Our extensive experiments demonstrate that LLMs achieve a new state-of-the-art performance compared to fine-tuned Small Language Models (SLMs) in the fine-tuning-dependent paradigm. More importantly, in the fine-tuning-free paradigm where SLMs are ineffective, LLMs with ICL still showcase impressive potential and even compete with fine-tuned SLMs on some ABSA subtasks.

Title: GQWformer: A Quantum-based Transformer for Graph Representation Learning

Authors: Lei Yu, Hongyang Chen, Jingsong Lv, Linyao Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02285
Pdf URL: https://arxiv.org/pdf/2412.02285
Copy Paste: [[2412.02285]] GQWformer: A Quantum-based Transformer for Graph Representation Learning(https://arxiv.org/abs/2412.02285)
Keywords: transformer
Abstract: Graph Transformers (GTs) have demonstrated significant advantages in graph representation learning through their global attention mechanisms. However, the self-attention mechanism in GTs tends to neglect the inductive biases inherent in graph structures, making it challenging to effectively capture essential structural information. To address this issue, we propose a novel approach that integrates graph inductive bias into self-attention mechanisms by leveraging quantum technology for structural encoding. In this paper, we introduce the Graph Quantum Walk Transformer (GQWformer), a groundbreaking GNN framework that utilizes quantum walks on attributed graphs to generate node quantum states. These quantum states encapsulate rich structural attributes and serve as inductive biases for the transformer, thereby enabling the generation of more meaningful attention scores. By subsequently incorporating a recurrent neural network, our design amplifies the model's ability to focus on both local and global information. We conducted comprehensive experiments across five publicly available datasets to evaluate the effectiveness of our model. These results clearly indicate that GQWformer outperforms existing state-of-the-art graph classification algorithms. These findings highlight the significant potential of integrating quantum computing methodologies with traditional GNNs to advance the field of graph representation learning, providing a promising direction for future research and applications.

Title: Viewpoint Consistency in 3D Generation via Attention and CLIP Guidance

Authors: Qing Zhang, Zehao Chen, Jinguang Tong, Jing Zhang, Jie Hong, Xuesong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02287
Pdf URL: https://arxiv.org/pdf/2412.02287
Copy Paste: [[2412.02287]] Viewpoint Consistency in 3D Generation via Attention and CLIP Guidance(https://arxiv.org/abs/2412.02287)
Keywords: diffusion
Abstract: Despite recent advances in text-to-3D generation techniques, current methods often suffer from geometric inconsistencies, commonly referred to as the Janus Problem. This paper identifies the root cause of the Janus Problem: viewpoint generation bias in diffusion models, which creates a significant gap between the actual generated viewpoint and the expected one required for optimizing the 3D model. To address this issue, we propose a tuning-free approach called the Attention and CLIP Guidance (ACG) mechanism. ACG enhances desired viewpoints by adaptively controlling cross-attention maps, employs CLIP-based view-text similarities to filter out erroneous viewpoints, and uses a coarse-to-fine optimization strategy with staged prompts to progressively refine 3D generation. Extensive experiments demonstrate that our method significantly reduces the Janus Problem without compromising generation speed, establishing ACG as an efficient, plug-and-play component for existing text-to-3D frameworks.

Title: Learn More by Using Less: Distributed Learning with Energy-Constrained Devices

Authors: Roberto Pereira, Cristian J. Vaca-Rubio, Luis Blanco
Subjects: cs.LG, cs.DC, eess.SP
Abstract URL: https://arxiv.org/abs/2412.02289
Pdf URL: https://arxiv.org/pdf/2412.02289
Copy Paste: [[2412.02289]] Learn More by Using Less: Distributed Learning with Energy-Constrained Devices(https://arxiv.org/abs/2412.02289)
Keywords: privacy, robust, federate
Abstract: Federated Learning (FL) has emerged as a solution for distributed model training across decentralized, privacy-preserving devices, but the different energy capacities of participating devices (system heterogeneity) constrain real-world implementations. These energy limitations not only reduce model accuracy but also increase dropout rates, impacting convergence in practical FL deployments. In this work, we propose LeanFed, an energy-aware FL framework designed to optimize client selection and training workloads on battery-constrained devices. LeanFed leverages adaptive data usage by dynamically adjusting the fraction of local data each device utilizes during training, thereby maximizing device participation across communication rounds while ensuring they do not run out of battery during the process. We rigorously evaluate LeanFed against traditional FedAvg on CIFAR-10 and CIFAR-100 datasets, simulating various levels of data heterogeneity and device participation rates. Results show that LeanFed consistently enhances model accuracy and stability, particularly in settings with high data heterogeneity and limited battery life, by mitigating client dropout and extending device availability. This approach demonstrates the potential of energy-efficient, privacy-preserving FL in real-world, large-scale applications, setting a foundation for robust and sustainable pervasive AI on resource-constrained networks.
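
Illustrative note: the idea of letting each device train on only the fraction of local data its battery can afford, then aggregating with FedAvg, can be sketched as follows. The aggregation is the standard sample-weighted FedAvg average. The budget rule in `affordable_fraction` is a hypothetical stand-in chosen for illustration; the abstract does not specify how LeanFed actually sets the fraction, and the energy figures are invented.

```python
import numpy as np

def affordable_fraction(battery_joules, joules_per_sample, num_local_samples):
    """Hypothetical rule: use as much local data as the battery budget allows."""
    if num_local_samples == 0:
        return 0.0
    return min(1.0, battery_joules / (joules_per_sample * num_local_samples))

def fedavg(client_weights, client_num_samples):
    """Standard FedAvg: average parameters weighted by samples trained on."""
    total = sum(client_num_samples)
    return sum(w * (n / total) for w, n in zip(client_weights, client_num_samples))

# Two toy clients with 100 local samples each and different battery budgets.
local_sizes = [100, 100]
batteries = [50.0, 200.0]          # joules available (made-up numbers)
cost = 1.0                          # joules per training sample (made-up)

fractions = [affordable_fraction(b, cost, n) for b, n in zip(batteries, local_sizes)]
used = [int(f * n) for f, n in zip(fractions, local_sizes)]

# Stand-ins for locally trained parameter vectors.
client_weights = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
global_weights = fedavg(client_weights, used)
print(fractions, used, global_weights)
```

Here the low-battery client still participates with half of its data instead of dropping out, and its update is down-weighted accordingly in the average.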

Title: Enhanced Photovoltaic Power Forecasting: An iTransformer and LSTM-Based Model Integrating Temporal and Covariate Interactions

Authors: Guang Wu, Yun Wang, Qian Zhou, Ziyang Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02302
Pdf URL: https://arxiv.org/pdf/2412.02302
Copy Paste: [[2412.02302]] Enhanced Photovoltaic Power Forecasting: An iTransformer and LSTM-Based Model Integrating Temporal and Covariate Interactions(https://arxiv.org/abs/2412.02302)
Keywords: extraction, transformer
Abstract: Accurate photovoltaic (PV) power forecasting is critical for integrating renewable energy sources into the grid, optimizing real-time energy management, and ensuring energy reliability amidst increasing demand. However, existing models often struggle with effectively capturing the complex relationships between target variables and covariates, as well as the interactions between temporal dynamics and multivariate data, leading to suboptimal forecasting accuracy. To address these challenges, we propose a novel model architecture that leverages the iTransformer for feature extraction from target variables and employs long short-term memory (LSTM) to extract features from covariates. A cross-attention mechanism is integrated to fuse the outputs of both models, followed by a Kolmogorov-Arnold network (KAN) mapping for enhanced representation. The effectiveness of the proposed model is validated using publicly available datasets from Australia, with experiments conducted across four seasons. Results demonstrate that the proposed model effectively captures seasonal variations in PV power generation and improves forecasting accuracy.

Title: Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods

Authors: Jiamian Hu, Yuanyuan Hong, Yihua Chen, He Wang, Moriaki Yasuhara
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.02313
Pdf URL: https://arxiv.org/pdf/2412.02313
Copy Paste: [[2412.02313]] Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods(https://arxiv.org/abs/2412.02313)
Keywords: robust
Abstract: We present the Noisy Ostracods, a noisy dataset for genus and species classification of crustacean ostracods with specialists' annotations. Of the 71466 specimens collected, 5.58% are estimated to be noisy (possibly problematic) at genus level. The dataset is created to address a real-world challenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods dataset has diverse noises from multiple sources. Firstly, the noise is open-set, including new classes discovered during curation that were not part of the original annotation. The dataset has pseudo-classes, where annotators misclassified samples that should belong to an existing class into a new pseudo-class. The Noisy Ostracods dataset is highly imbalanced with an imbalance factor $\rho$ = 22429. This presents a unique challenge for robust machine learning methods, as existing approaches have not been extensively evaluated on fine-grained classification tasks with such diverse real-world noise. Initial experiments using current robust learning techniques have not yielded significant performance improvements on the Noisy Ostracods dataset compared to cross-entropy training on the raw, noisy data. On the other hand, noise detection methods have underperformed in error hit rate compared to naive cross-validation ensembling for identifying problematic labels. These findings suggest that the fine-grained, imbalanced nature, and complex noise characteristics of the dataset present considerable challenges for existing noise-robust algorithms. By openly releasing the Noisy Ostracods dataset, our goal is to encourage further research into the development of noise-resilient machine learning methods capable of effectively handling diverse, real-world noise in fine-grained classification tasks. The dataset, along with its evaluation protocols, can be accessed at this https URL.
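
Illustrative note: the abstract reports an imbalance factor $\rho$ without defining it. A common convention, assumed here, is the ratio of the largest class size to the smallest. The sketch below computes that ratio on toy labels; the function name and data are hypothetical and unrelated to the actual dataset.

```python
from collections import Counter

def imbalance_factor(labels):
    """Assumed definition: size of the largest class divided by the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy label list: 90 samples of class "a", 9 of "b", 1 of "c".
labels = ["a"] * 90 + ["b"] * 9 + ["c"] * 1
rho = imbalance_factor(labels)
print(rho)
```

Under this convention, a value in the tens of thousands would mean the most common class has tens of thousands of specimens for every specimen of the rarest class.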

Title: LoCo: Low-Contrast-Enhanced Contrastive Learning for Semi-Supervised Endoscopic Image Segmentation

Authors: Lingcong Cai, Yun Li, Xiaomao Fan, Kaixuan Song, Yongcheng Li, Yixuan Yuan, Ruxin Wang, Wenbin Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02314
Pdf URL: https://arxiv.org/pdf/2412.02314
Copy Paste: [[2412.02314]] LoCo: Low-Contrast-Enhanced Contrastive Learning for Semi-Supervised Endoscopic Image Segmentation(https://arxiv.org/abs/2412.02314)
Keywords: robust, segmentation
Abstract: The segmentation of endoscopic images plays a vital role in computer-aided diagnosis and treatment. The advancements in deep learning have led to the employment of numerous models for endoscopic tumor segmentation, achieving promising segmentation performance. Despite recent advancements, precise segmentation remains challenging due to limited annotations and the issue of low contrast. To address these issues, we propose a novel semi-supervised segmentation framework termed LoCo via low-contrast-enhanced contrastive learning (LCC). This innovative approach effectively harnesses the vast amounts of unlabeled data available for endoscopic image segmentation, improving both accuracy and robustness in the segmentation process. Specifically, LCC incorporates two advanced strategies to enhance the distinctiveness of low-contrast pixels: inter-class contrast enhancement (ICE) and boundary contrast enhancement (BCE), enabling models to segment low-contrast pixels among malignant tumors, benign tumors, and normal tissues. Additionally, a confidence-based dynamic filter (CDF) is designed for pseudo-label selection, enhancing the utilization of generated pseudo-labels for unlabeled data with a specific focus on minority classes. Extensive experiments conducted on two public datasets, as well as a large proprietary dataset collected over three years, demonstrate that LoCo achieves state-of-the-art results, significantly outperforming previous methods. The source code of LoCo is available at this https URL.

Title: HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset

Authors: Zedong Chu, Feng Xiong, Meiduo Liu, Jinzhi Zhang, Mingqi Shao, Zhaoxu Sun, Di Wang, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02317
Pdf URL: https://arxiv.org/pdf/2412.02317
Copy Paste: [[2412.02317]] HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset(https://arxiv.org/abs/2412.02317)
Keywords: robust, transformer
Abstract: With the rapid evolution of 3D generation algorithms, the cost of producing 3D humanoid character models has plummeted, yet the field is impeded by the lack of a comprehensive dataset for automatic rigging, which is a pivotal step in character animation. Addressing this gap, we present HumanRig, the first large-scale dataset specifically designed for 3D humanoid character rigging, encompassing 11,434 meticulously curated T-posed meshes adhering to a uniform skeleton topology. Capitalizing on this dataset, we introduce an innovative, data-driven automatic rigging framework, which overcomes the limitations of GNN-based methods in handling complex AI-generated meshes. Our approach integrates a Prior-Guided Skeleton Estimator (PGSE) module, which uses 2D skeleton joints to provide a preliminary 3D skeleton, and a Mesh-Skeleton Mutual Attention Network (MSMAN) that fuses skeleton features with 3D mesh features extracted by a U-shaped point transformer. This enables a coarse-to-fine 3D skeleton joint regression and a robust skinning estimation, surpassing previous methods in quality and versatility. This work not only remedies the dataset deficiency in rigging research but also propels the animation industry towards more efficient and automated character rigging pipelines.

Title: Controlling the Latent Diffusion Model for Generative Image Shadow Removal via Residual Generation

Authors: Xinjie Li, Yang Zhao, Dong Wang, Yuan Chen, Li Cao, Xiaoping Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02322
Pdf URL: https://arxiv.org/pdf/2412.02322
Copy Paste: [[2412.02322]] Controlling the Latent Diffusion Model for Generative Image Shadow Removal via Residual Generation(https://arxiv.org/abs/2412.02322)
Keywords: robust, diffusion, generative
Abstract: Large-scale generative models have achieved remarkable advancements in various visual tasks, yet their application to shadow removal in images remains challenging. These models often generate diverse, realistic details without adequate focus on fidelity, failing to meet the crucial requirements of shadow removal, which necessitates precise preservation of image content. In contrast to prior approaches that aimed to regenerate shadow-free images from scratch, this paper utilizes diffusion models to generate and refine image residuals. This strategy fully uses the inherent detailed information within shadowed images, resulting in a more efficient and faithful reconstruction of shadow-free content. Additionally, to prevent the accumulation of errors during the generation process, a cross-timestep self-enhancement training strategy is proposed. This strategy leverages the network itself to augment the training data, not only increasing the volume of data but also enabling the network to dynamically correct its generation trajectory, ensuring a more accurate and robust output. In addition, to address the loss of original details in the process of image encoding and decoding of large generative models, a content-preserved encoder-decoder structure is designed with a control mechanism and multi-scale skip connections to achieve high-fidelity shadow-free image reconstruction. Experimental results demonstrate that the proposed method can reproduce high-quality results based on a large latent diffusion prior and faithfully preserve the original contents in shadow regions.

Title: Pay Attention to the Robustness of Chinese Minority Language Models! Syllable-level Textual Adversarial Attack on Tibetan Script

Authors: Xi Cao, Dolma Dawa, Nuo Qun, Trashi Nyima
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.02323
Pdf URL: https://arxiv.org/pdf/2412.02323
Copy Paste: [[2412.02323]] Pay Attention to the Robustness of Chinese Minority Language Models! Syllable-level Textual Adversarial Attack on Tibetan Script(https://arxiv.org/abs/2412.02323)
Keywords: attack, robust
Abstract: The textual adversarial attack refers to an attack method in which the attacker adds imperceptible perturbations to the original texts by elaborate design so that the NLP (natural language processing) model produces false judgments. This method is also used to evaluate the robustness of NLP models. Currently, most of the research in this field focuses on English, and there is also a certain amount of research on Chinese. However, to the best of our knowledge, there is little research targeting Chinese minority languages. Textual adversarial attacks are a new challenge for the information processing of Chinese minority languages. In response to this situation, we propose a Tibetan syllable-level black-box textual adversarial attack called TSAttacker based on syllable cosine distance and a scoring mechanism. We then apply TSAttacker to six models generated by fine-tuning two PLMs (pre-trained language models) for three downstream tasks. The experimental results show that TSAttacker is effective and generates high-quality adversarial samples. In addition, the robustness of the involved models still has much room for improvement.

Title: GRAND : Graph Reconstruction from potential partial Adjacency and Neighborhood Data

Authors: Sofiane Azogagh, Zelma Aubin Birba, Josée Desharnais, Sébastien Gambs, Marc-Olivier Killijian, Nadia Tawbi
Subjects: cs.CR, cs.SI
Abstract URL: https://arxiv.org/abs/2412.02329
Pdf URL: https://arxiv.org/pdf/2412.02329
Copy Paste: [[2412.02329]] GRAND : Graph Reconstruction from potential partial Adjacency and Neighborhood Data(https://arxiv.org/abs/2412.02329)
Keywords: secure, privacy, protect
Abstract: Cryptographic approaches, such as secure multiparty computation, can be used to compute in a secure manner the function of a distributed graph without centralizing the data of each participant. However, the output of the protocol itself can leak sensitive information about the structure of the original graph. In particular, in this work we propose an approach by which an adversary observing the result of a private protocol for the computation of the number of common neighbors between all pairs of vertices, can reconstruct the adjacency matrix of the graph. In fact, this can only be done up to co-squareness, a notion we introduce, as two different graphs can have the same matrix of common neighbors. We consider two models of adversary, one who observes the common neighbors matrix only, and a knowledgeable one, that has a partial knowledge of the original graph. Our results demonstrate that secure multiparty protocols are not enough for privacy protection, especially in the context of highly structured data such as graphs. The reconstruction that we propose is interesting in itself from the point of view of graph theory.
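
The ambiguity described in this abstract can be illustrated with a small sketch. It assumes that the "common neighbors matrix" is the square of the adjacency matrix (entry (i, j) counts vertices adjacent to both i and j, so the diagonal holds degrees); this reading is suggested by the term co-squareness but is not spelled out in the abstract. The helper names and the two example graphs are hypothetical and are not taken from the paper.

```python
def from_edges(n, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected graph."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    return adj


def common_neighbors(adj):
    """Return C with C[i][j] = number of vertices adjacent to both i and j.

    Under the assumption stated above, this is the matrix product A * A.
    """
    n = len(adj)
    return [
        [sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Two different perfect matchings on four vertices (illustrative only).
g1 = from_edges(4, [(0, 1), (2, 3)])
g2 = from_edges(4, [(0, 2), (1, 3)])

c1 = common_neighbors(g1)
c2 = common_neighbors(g2)

# The graphs differ, yet an observer of the common-neighbors matrix
# cannot tell them apart.
print(g1 != g2, c1 == c2)
```

Both matchings yield the identity matrix, so any reconstruction from this output alone can succeed at best up to such an equivalence class.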

Title: SimuScope: Realistic Endoscopic Synthetic Dataset Generation through Surgical Simulation and Diffusion Models

Authors: Sabina Martyniak, Joanna Kaleta, Diego Dall'Alba, Michał Naskręt, Szymon Płotka, Przemysław Korzeniowski
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.02332
Pdf URL: https://arxiv.org/pdf/2412.02332
Copy Paste: [[2412.02332]] SimuScope: Realistic Endoscopic Synthetic Dataset Generation through Surgical Simulation and Diffusion Models(https://arxiv.org/abs/2412.02332)
Keywords: diffusion
Abstract: Computer-assisted surgical (CAS) systems enhance surgical execution and outcomes by providing advanced support to surgeons. These systems often rely on deep learning models trained on complex, challenging-to-annotate data. While synthetic data generation can address these challenges, enhancing the realism of such data is crucial. This work introduces a multi-stage pipeline for generating realistic synthetic data, featuring a fully-fledged surgical simulator that automatically produces all necessary annotations for modern CAS systems. This simulator generates a wide set of annotations that surpass those available in public synthetic datasets. Additionally, it offers a more complex and realistic simulation of surgical interactions, including the dynamics between surgical instruments and deformable anatomical environments, outperforming existing approaches. To further bridge the visual gap between synthetic and real data, we propose a lightweight and flexible image-to-image translation method based on Stable Diffusion (SD) and Low-Rank Adaptation (LoRA). This method leverages a limited amount of annotated data, enables efficient training, and maintains the integrity of annotations generated by our simulator. The proposed pipeline is experimentally validated and can translate synthetic images into images with real-world characteristics, which can generalize to real-world context, thereby improving both training and CAS guidance. The code and the dataset are available at this https URL.

Title: Amodal Depth Anything: Amodal Depth Estimation in the Wild

Authors: Zhenyu Li, Mykola Lavreniuk, Jian Shi, Shariq Farooq Bhat, Peter Wonka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02336
Pdf URL: https://arxiv.org/pdf/2412.02336
Copy Paste: [[2412.02336]] Amodal Depth Anything: Amodal Depth Estimation in the Wild(https://arxiv.org/abs/2412.02336)
Keywords: generative, segmentation
Abstract: Amodal depth estimation aims to predict the depth of occluded (invisible) parts of objects in a scene. This task addresses the question of whether models can effectively perceive the geometry of occluded regions based on visible cues. Prior methods primarily rely on synthetic datasets and focus on metric depth estimation, limiting their generalization to real-world settings due to domain shifts and scalability challenges. In this paper, we propose a novel formulation of amodal depth estimation in the wild, focusing on relative depth prediction to improve model generalization across diverse natural images. We introduce a new large-scale dataset, Amodal Depth In the Wild (ADIW), created using a scalable pipeline that leverages segmentation datasets and compositing techniques. Depth maps are generated using large pre-trained depth models, and a scale-and-shift alignment strategy is employed to refine and blend depth predictions, ensuring consistency in ground-truth annotations. To tackle the amodal depth task, we present two complementary frameworks: Amodal-DAV2, a deterministic model based on Depth Anything V2, and Amodal-DepthFM, a generative model that integrates conditional flow matching principles. Our proposed frameworks effectively leverage the capabilities of large pre-trained models with minimal modifications to achieve high-quality amodal depth predictions. Experiments validate our design choices, demonstrating the flexibility of our models in generating diverse, plausible depth structures for occluded regions. Our method achieves a 69.5% improvement in accuracy over the previous SoTA on the ADIW dataset.
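
As a rough illustration of scale-and-shift alignment for relative depth, the sketch below fits a scale s and shift t so that s * pred + t matches a reference in the least-squares sense. The abstract does not say how the alignment is solved, so the closed-form least-squares fit, the function name, and the sample values are all assumptions made for illustration.

```python
def fit_scale_shift(pred, target):
    """Least-squares scale and shift mapping pred onto target.

    Minimizes sum((scale * p + shift - t) ** 2) over paired samples.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("pred is constant; scale is undetermined")
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


# Hypothetical relative depths and a reference that differs by scale and shift.
pred = [0.0, 1.0, 2.0, 3.0]
target = [1.0, 3.0, 5.0, 7.0]

scale, shift = fit_scale_shift(pred, target)
aligned = [scale * p + shift for p in pred]
print(scale, shift)
```

In practice such a fit would typically be restricted to pixels where both depth maps are valid, for example the visible overlap between two predictions.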

Title: Federated Analytics in Practice: Engineering for Privacy, Scalability and Practicality

Authors: Harish Srinivas, Graham Cormode, Mehrdad Honarkhah, Samuel Lurye, Jonathan Hehir, Lunwen He, George Hong, Ahmed Magdy, Dzmitry Huba, Kaikai Wang, Shen Guo, Shoubhik Bhattacharya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02340
Pdf URL: https://arxiv.org/pdf/2412.02340
Copy Paste: [[2412.02340]] Federated Analytics in Practice: Engineering for Privacy, Scalability and Practicality(https://arxiv.org/abs/2412.02340)
Keywords: security, privacy, protect, robust, federate
Abstract: Cross-device Federated Analytics (FA) is a distributed computation paradigm designed to answer analytics queries about and derive insights from data held locally on users' devices. On-device computations combined with other privacy and security measures ensure that only minimal data is transmitted off-device, achieving a high standard of data protection. Despite FA's broad relevance, the applicability of existing FA systems is limited by compromised accuracy; lack of flexibility for data analytics; and an inability to scale effectively. In this paper, we describe our approach to combine privacy, scalability, and practicality to build and deploy a system that overcomes these limitations. Our FA system leverages trusted execution environments (TEEs) and optimizes the use of on-device computing resources to facilitate federated data processing across large fleets of devices, while ensuring robust, defensible, and verifiable privacy safeguards. We focus on federated analytics (statistics and monitoring), in contrast to systems for federated learning (ML workloads), and we flag the key differences.

Title: Multi-Granularity Tibetan Textual Adversarial Attack Method Based on Masked Language Model

Authors: Xi Cao, Nuo Qun, Quzong Gesang, Yulei Zhu, Trashi Nyima
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.02343
Pdf URL: https://arxiv.org/pdf/2412.02343
Copy Paste: [[2412.02343]] Multi-Granularity Tibetan Textual Adversarial Attack Method Based on Masked Language Model(https://arxiv.org/abs/2412.02343)
Keywords: attack, robust
Abstract: In social media, neural network models have been applied to hate speech detection, sentiment analysis, etc., but neural network models are susceptible to adversarial attacks. For instance, in a text classification task, the attacker elaborately introduces perturbations to the original texts that hardly alter the original semantics in order to trick the model into making different predictions. By studying textual adversarial attack methods, the robustness of language models can be evaluated and then improved. Currently, most of the research in this field focuses on English, and there is also a certain amount of research on Chinese. However, there is little research targeting Chinese minority languages. With the rapid development of artificial intelligence technology and the emergence of Chinese minority language models, textual adversarial attacks become a new challenge for the information processing of Chinese minority languages. In response to this situation, we propose a multi-granularity Tibetan textual adversarial attack method based on masked language models called TSTricker. We utilize the masked language models to generate candidate substitution syllables or words, adopt the scoring mechanism to determine the substitution order, and then conduct the attack method on several fine-tuned victim models. The experimental results show that TSTricker reduces the accuracy of the classification models by more than 28.70% and makes the classification models change the predictions of more than 90.60% of the samples, which has an evidently higher attack effect than the baseline method.

Title: UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices

Authors: Seul-Ki Yeom, Tae-Ho Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02344
Pdf URL: https://arxiv.org/pdf/2412.02344
Copy Paste: [[2412.02344]] UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices(https://arxiv.org/abs/2412.02344)
Keywords: transformer
Abstract: Transformer-based architectures have demonstrated remarkable success across various domains, but their deployment on edge devices remains challenging due to high memory and computational demands. In this paper, we introduce a novel Reuse Attention mechanism, tailored for efficient memory access and computational optimization, enabling seamless operation on resource-constrained platforms without compromising performance. Unlike traditional multi-head attention (MHA), which redundantly computes separate attention matrices for each head, Reuse Attention consolidates these computations into a shared attention matrix, significantly reducing memory overhead and computational complexity. Comprehensive experiments on ImageNet-1K and downstream tasks show that the proposed UniForm models leveraging Reuse Attention achieve state-of-the-art ImageNet classification accuracy while outperforming existing attention mechanisms, such as Linear Attention and Flash Attention, in inference speed and memory scalability. Notably, UniForm-l achieves a 76.7% Top-1 accuracy on ImageNet-1K with 21.8ms inference time on edge devices like the Jetson AGX Orin, representing up to a 5x speedup over competing benchmark methods. These results demonstrate the versatility of Reuse Attention across high-performance GPUs and edge platforms, paving the way for broader real-time applications.
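
The idea of sharing one attention matrix across heads can be sketched as follows. This is only a reading of the abstract, not the UniForm implementation: it assumes a single softmax(QK^T / sqrt(d)) matrix is computed once and then applied to each head's value projection, and every function name here is hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(q, k):
    """Scaled dot-product attention weights, one row per query token."""
    d = len(q[0])
    return [
        softmax([sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
                 for k_row in k])
        for q_row in q
    ]


def apply_attention(attn, v):
    """Weighted sum of value rows for each query token."""
    return [
        [sum(w * v_row[c] for w, v_row in zip(weights, v))
         for c in range(len(v[0]))]
        for weights in attn
    ]


def shared_attention(q, k, values_per_head):
    """Compute the attention matrix once and reuse it for every head."""
    attn = attention_matrix(q, k)  # single shared matrix (assumption)
    return [apply_attention(attn, v) for v in values_per_head]


# Two tokens, zero queries and keys give uniform attention weights.
q = [[0.0, 0.0], [0.0, 0.0]]
k = [[0.0, 0.0], [0.0, 0.0]]
values_per_head = [[[1.0], [3.0]], [[2.0], [4.0]]]

outputs = shared_attention(q, k, values_per_head)
print(outputs)
```

Compared with standard multi-head attention, this variant evaluates the softmax over the token-by-token score matrix once rather than once per head, which is where the memory saving described in the abstract would come from under this reading.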

Title: CTRAPS: CTAP Client Impersonation and API Confusion on FIDO2

Authors: Marco Casagrande, Daniele Antonioli
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.02349
Pdf URL: https://arxiv.org/pdf/2412.02349
Copy Paste: [[2412.02349]] CTRAPS: CTAP Client Impersonation and API Confusion on FIDO2(https://arxiv.org/abs/2412.02349)
Keywords: security, privacy, attack
Abstract: FIDO2 is the standard technology for single-factor and second-factor authentication. It is specified in an open standard, including the WebAuthn and CTAP application layer protocols. We focus on CTAP, which allows FIDO2 clients and hardware authenticators to communicate. No prior work has explored the CTAP Authenticator API, a critical protocol-level attack surface. We address this gap by presenting the first security and privacy evaluation of the CTAP Authenticator API. We uncover two classes of protocol-level attacks on CTAP that we call CTRAPS. The client impersonation (CI) attacks exploit the lack of client authentication to tamper with FIDO2 authenticators. They include zero-click attacks capable of deleting FIDO2 credentials, including passkeys, without user interaction. The API confusion (AC) attacks abuse the lack of protocol API enforcements and confound FIDO2 authenticators, clients, and unaware users into calling unwanted CTAP APIs while thinking they are calling legitimate ones. The presented eleven attacks are conducted either in proximity or remotely and are effective regardless of the underlying CTAP transport. We detail the eight vulnerabilities in the CTAP specification, enabling the CTRAPS attacks. Six are novel and include unauthenticated CTAP clients and trackable FIDO2 credentials. We release CTRAPS, an original toolkit, to analyze CTAP and conduct the CTRAPS attacks. We confirm the attacks practicality on a large scale by exploiting six popular authenticators, including a FIPS-certified one from Yubico, Feitian, SoloKeys, and Google, and ten widely used relying parties, such as Microsoft, Apple, GitHub, and Facebook. We present eight practical and backward-compliant countermeasures to fix the attacks and their root causes. We responsibly disclosed our findings to the FIDO alliance and the affected vendors.

Title: Dual Exposure Stereo for Extended Dynamic Range 3D Imaging

Authors: Juhyung Choi, Jinnyeong Kim, Seokjun Choi, Jinwoo Lee, Samuel Brucker, Mario Bijelic, Felix Heide, Seung-Hwan Baek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02351
Pdf URL: https://arxiv.org/pdf/2412.02351
Copy Paste: [[2412.02351]] Dual Exposure Stereo for Extended Dynamic Range 3D Imaging(https://arxiv.org/abs/2412.02351)
Keywords: robust
Abstract: Achieving robust stereo 3D imaging under diverse illumination conditions is an important yet challenging task, due to the limited dynamic ranges (DRs) of cameras, which are significantly smaller than real-world DR. As a result, the accuracy of existing stereo depth estimation methods is often compromised by under- or over-exposed images. Here, we introduce dual-exposure stereo for extended dynamic range 3D imaging. We develop an automatic dual-exposure control method that adjusts the dual exposures, diverging them when the scene DR exceeds the camera DR, thereby providing information about a broader DR. From the captured dual-exposure stereo images, we estimate depth using a motion-aware dual-exposure stereo network. To validate our method, we develop a robot-vision system, collect stereo video datasets, and generate a synthetic dataset. Our method outperforms other exposure control methods.

Title: LoRA Diffusion: Zero-Shot LoRA Synthesis for Diffusion Model Personalization

Authors: Ethan Smith, Rami Seid, Alberto Hojel, Paramita Mishra, Jianbo Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02352
Pdf URL: https://arxiv.org/pdf/2412.02352
Copy Paste: [[2412.02352]] LoRA Diffusion: Zero-Shot LoRA Synthesis for Diffusion Model Personalization(https://arxiv.org/abs/2412.02352)
Keywords: diffusion
Abstract: Low-Rank Adaptation (LoRA) and other parameter-efficient fine-tuning (PEFT) methods provide low-memory, storage-efficient solutions for personalizing text-to-image models. However, these methods offer little to no improvement in wall-clock training time or the number of steps needed for convergence compared to full model fine-tuning. While PEFT methods assume that shifts in generated distributions (from base to fine-tuned models) can be effectively modeled through weight changes in a low-rank subspace, they fail to leverage knowledge of common use cases, which typically focus on capturing specific styles or identities. Observing that desired outputs often comprise only a small subset of the possible domain covered by LoRA training, we propose reducing the search space by incorporating a prior over regions of interest. We demonstrate that training a hypernetwork model to generate LoRA weights can achieve competitive quality for specific domains while enabling near-instantaneous conditioning on user input, in contrast to traditional training methods that require thousands of steps.
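
For readers unfamiliar with the low-rank subspace mentioned above, the sketch below shows the standard LoRA update W' = W + scale * (B A), with B of shape d x r and A of shape r x k, and compares the number of trainable parameters with full fine-tuning. The function names, the `scale` argument, and the layer sizes are illustrative assumptions; the hypernetwork proposed in the paper is not shown.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b)))
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_update(w, b, a, scale=1.0):
    """Return w + scale * (b @ a), the low-rank adapted weight matrix."""
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_param_count(d, k, r):
    """Trainable parameters of a rank-r adapter for a d x k weight matrix."""
    return r * (d + k)


# Rank-1 update of a 2 x 2 identity weight (toy values).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
adapted = lora_update(w, b, a)

# Hypothetical 768 x 768 layer with a rank-4 adapter.
full_params = 768 * 768
adapter_params = lora_param_count(768, 768, 4)
print(adapted, full_params, adapter_params)
```

The adapter is small to store, but as the abstract notes, training it by gradient descent still takes many optimization steps, which is the cost a weight-generating hypernetwork aims to avoid.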

Title: GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing

Authors: Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood, Karthik Nandakumar, Naveed Akhtar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02366
Pdf URL: https://arxiv.org/pdf/2412.02366
Copy Paste: [[2412.02366]] GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing(https://arxiv.org/abs/2412.02366)
Keywords: robust, diffusion, generative
Abstract: Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, due to their inability to address domain gaps. This paper introduces GenMix, a generalizable prompt-guided generative data augmentation approach that enhances both in-domain and cross-domain image classification. Our technique leverages image editing to generate augmented images based on custom conditional prompts, designed specifically for each problem type. By blending portions of the input image with its edited generative counterpart and incorporating fractal patterns, our approach mitigates unrealistic images and label ambiguity, improving the performance and adversarial robustness of the resulting models. Efficacy of our method is established with extensive experiments on eight public datasets for general and fine-grained classification, in both in-domain and cross-domain settings. Additionally, we demonstrate performance improvements for self-supervised learning, learning with data scarcity, and adversarial robustness. As compared to the existing state-of-the-art methods, our technique achieves stronger performance across the board.

Title: Trajectory-based Road Autolabeling with Lidar-Camera Fusion in Winter Conditions

Authors: Eerik Alamikkotervo, Henrik Toikka, Kari Tammi, Risto Ojala
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02370
Pdf URL: https://arxiv.org/pdf/2412.02370
Copy Paste: [[2412.02370]] Trajectory-based Road Autolabeling with Lidar-Camera Fusion in Winter Conditions(https://arxiv.org/abs/2412.02370)
Keywords: robust, segmentation
Abstract: Robust road segmentation in all road conditions is required for safe autonomous driving and advanced driver assistance systems. Supervised deep learning methods provide accurate road segmentation in the domain of their training data but cannot be trusted in out-of-distribution scenarios. Including the whole distribution in the trainset is challenging as each sample must be labeled by hand. Trajectory-based self-supervised methods offer a potential solution as they can learn from the traversed route without manual labels. However, existing trajectory-based methods use learning schemes that rely only on the camera or only on the lidar. In this paper, trajectory-based learning is implemented jointly with lidar and camera for increased performance. Our method outperforms recent standalone camera- and lidar-based methods when evaluated with a challenging winter driving dataset including countryside and suburb driving scenes. The source code is available at this https URL

Title: TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual Similarity

Authors: Xi Cao, Quzong Gesang, Yuan Sun, Nuo Qun, Tashi Nyima
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.02371
Pdf URL: https://arxiv.org/pdf/2412.02371
Copy Paste: [[2412.02371]] TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual Similarity(https://arxiv.org/abs/2412.02371)
Keywords: attack, robust
Abstract: Language models based on deep neural networks are vulnerable to textual adversarial attacks. While rich-resource languages like English are receiving focused attention, Tibetan, a cross-border language, is gradually being studied due to its abundant ancient literature and critical language strategy. Currently, there are several Tibetan adversarial text generation methods, but they do not fully consider the textual features of Tibetan script and overestimate the quality of generated adversarial texts. To address this issue, we propose a novel Tibetan adversarial text generation method called TSCheater, which considers the characteristic of Tibetan encoding and the feature that visually similar syllables have similar semantics. This method can also be transferred to other abugidas, such as Devanagari script. We utilize a self-constructed Tibetan syllable visual similarity database called TSVSDB to generate substitution candidates and adopt a greedy algorithm-based scoring mechanism to determine substitution order. After that, we conduct the method on eight victim language models. Experimentally, TSCheater outperforms existing methods in attack effectiveness, perturbation magnitude, semantic similarity, visual similarity, and human acceptance. Finally, we construct the first Tibetan adversarial robustness evaluation benchmark called AdvTS, which is generated by existing methods and proofread by humans.

Title: Active Negative Loss: A Robust Framework for Learning with Noisy Labels

Authors: Xichen Ye, Yifan Wu, Yiwen Xu, Xiaoqiang Li, Weizhong Zhang, Yifan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02373
Pdf URL: https://arxiv.org/pdf/2412.02373
Copy Paste: [[2412.02373]] Active Negative Loss: A Robust Framework for Learning with Noisy Labels(https://arxiv.org/abs/2412.02373)
Keywords: robust, segmentation
Abstract: Deep supervised learning has achieved remarkable success across a wide range of tasks, yet it remains susceptible to overfitting when confronted with noisy labels. To address this issue, noise-robust loss functions offer an effective solution for enhancing learning in the presence of label noise. In this work, we systematically investigate the limitation of the recently proposed Active Passive Loss (APL), which employs Mean Absolute Error (MAE) as its passive loss function. Despite the robustness brought by MAE, one of its key drawbacks is that it pays equal attention to clean and noisy samples; this feature slows down convergence and potentially makes training difficult, particularly in large-scale datasets. To overcome these challenges, we introduce a novel loss function class, termed Normalized Negative Loss Functions (NNLFs), which serve as passive loss functions within the APL framework. NNLFs effectively address the limitations of MAE by concentrating more on memorized clean samples. By replacing MAE in APL with our proposed NNLFs, we enhance APL and present a new framework called Active Negative Loss (ANL). Moreover, in non-symmetric noise scenarios, we propose an entropy-based regularization technique to mitigate the vulnerability to the label imbalance. Extensive experiments demonstrate that the new loss functions adopted by our ANL framework can achieve better or comparable performance to state-of-the-art methods across various label noise types and in image segmentation tasks. The source code is available at: this https URL.
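
The robustness of MAE referred to above is commonly attributed to its being a symmetric loss: summed over all possible labels, it is a constant that does not depend on the prediction. The sketch below checks this for one-hot targets and contrasts it with cross-entropy. It illustrates only this background property, not the proposed NNLF or ANL losses, and the probability vectors are made up.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of a probability vector against a one-hot label."""
    return -math.log(probs[label])


def mae(probs, label):
    """Mean absolute error (L1 distance) against a one-hot label."""
    return sum(
        abs(p - (1.0 if i == label else 0.0)) for i, p in enumerate(probs)
    )


def total_over_labels(loss, probs):
    """Sum a loss over every possible label for one prediction."""
    return sum(loss(probs, k) for k in range(len(probs)))


confident = [0.7, 0.2, 0.1]
peaked = [0.05, 0.05, 0.9]

# For K classes, MAE sums to 2 * (K - 1) regardless of the prediction.
mae_totals = [total_over_labels(mae, p) for p in (confident, peaked)]
# Cross-entropy has no such constant sum.
ce_totals = [total_over_labels(cross_entropy, p) for p in (confident, peaked)]
print(mae_totals, ce_totals)
```

For a single sample, MAE reduces to 2 * (1 - p_y), so it is bounded by 2, whereas cross-entropy grows without bound as p_y approaches zero.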

Title: Who Walks With You Matters: Perceiving Social Interactions with Groups for Pedestrian Trajectory Prediction

Authors: Ziqian Zou, Conghao Wong, Beihao Xia, Qinmu Peng, Xinge You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02395
Pdf URL: https://arxiv.org/pdf/2412.02395
Copy Paste: [[2412.02395]] Who Walks With You Matters: Perceiving Social Interactions with Groups for Pedestrian Trajectory Prediction(https://arxiv.org/abs/2412.02395)
Keywords: explainability
Abstract: Understanding and anticipating human movement has become more critical and challenging in diverse applications such as autonomous driving and surveillance. The complex interactions brought by different relations between agents are a crucial reason that poses challenges to this task. Researchers have put much effort into designing a system using rule-based or data-based models to extract and validate the patterns between pedestrian trajectories and these interactions, which has not been adequately addressed yet. Inspired by how humans perceive social interactions with different levels of relations to themselves, this work proposes the GrouP ConCeption (short for GPCC) model composed of the Group method, which categorizes nearby agents into either group members or non-group members based on a long-term distance kernel function, and the Conception module, which perceives both visual and acoustic information surrounding the target agent. Evaluated across multiple datasets, the GPCC model demonstrates significant improvements in trajectory prediction accuracy, validating its effectiveness in modeling both social and individual dynamics. The qualitative analysis also indicates that the GPCC framework leverages grouping and perception cues in an intuitive, human-like way, supporting the proposed model's explainability in pedestrian trajectory forecasting.

Title: RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation

Authors: Changli Wu, Qi Chen, Jiayi Ji, Haowei Wang, Yiwei Ma, You Huang, Gen Luo, Hao Fei, Xiaoshuai Sun, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02402
Pdf URL: https://arxiv.org/pdf/2412.02402
Copy Paste: [[2412.02402]] RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation(https://arxiv.org/abs/2412.02402)
Keywords: robust, segmentation
Abstract: 3D Referring Expression Segmentation (3D-RES) aims to segment 3D objects by correlating referring expressions with point clouds. However, traditional approaches frequently encounter issues like over-segmentation or mis-segmentation, due to insufficient emphasis on spatial information of instances. In this paper, we introduce a Rule-Guided Spatial Awareness Network (RG-SAN) by utilizing solely the spatial information of the target instance for supervision. This approach enables the network to accurately depict the spatial relationships among all entities described in the text, thus enhancing the reasoning capabilities. The RG-SAN consists of the Text-driven Localization Module (TLM) and the Rule-guided Weak Supervision (RWS) strategy. The TLM initially locates all mentioned instances and iteratively refines their positional information. The RWS strategy, acknowledging that only target objects have supervised positional information, employs dependency tree rules to precisely guide the core instance's positioning. Extensive testing on the ScanRefer benchmark has shown that RG-SAN not only establishes new performance benchmarks, with an mIoU increase of 5.1 points, but also exhibits significant improvements in robustness when processing descriptions with spatial ambiguity. All codes are available at this https URL.

Title: VISTA: A Panoramic View of Neural Representations

Authors: Tom White
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.02412
Pdf URL: https://arxiv.org/pdf/2412.02412
Copy Paste: [[2412.02412]] VISTA: A Panoramic View of Neural Representations(https://arxiv.org/abs/2412.02412)
Keywords: interpretability
Abstract: We present VISTA (Visualization of Internal States and Their Associations), a novel pipeline for visually exploring and interpreting neural network representations. VISTA addresses the challenge of analyzing vast multidimensional spaces in modern machine learning models by mapping representations into a semantic 2D space. The resulting collages visually reveal patterns and relationships within internal representations. We demonstrate VISTA's utility by applying it to sparse autoencoder latents, uncovering new properties and interpretations. We review the VISTA methodology, present findings from our case study (this https URL), and discuss implications for neural network interpretability across various domains of machine learning.

Title: GerPS-Compare: Comparing NER methods for legal norm analysis

Authors: Sarah T. Bachinger, Christoph Unger, Robin Erd, Leila Feddoul, Clara Lachenmaier, Sina Zarrieß, Birgitta König-Ries
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02427
Pdf URL: https://arxiv.org/pdf/2412.02427
Copy Paste: [[2412.02427]] GerPS-Compare: Comparing NER methods for legal norm analysis(https://arxiv.org/abs/2412.02427)
Keywords: generative
Abstract: We apply NER to a particular sub-genre of legal texts in German: the genre of legal norms regulating administrative processes in public service administration. The analysis of such texts involves identifying stretches of text that instantiate one of ten classes identified by public service administration professionals. We investigate and compare three methods for performing Named Entity Recognition (NER) to detect these classes: a Rule-based system, deep discriminative models, and a deep generative model. Our results show that Deep Discriminative models outperform both the Rule-based system and the Deep Generative model; the latter two perform roughly equally well, each outperforming the other in different classes. The main cause for this somewhat surprising result is arguably the fact that the classes used in the analysis are semantically and syntactically heterogeneous, in contrast to the classes used in more standard NER tasks. Deep Discriminative models appear to be better equipped for dealing with this heterogeneity than both generic LLMs and human linguists designing rule-based NER systems.

Title: Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining

Authors: Zongru Wu, Pengzhou Cheng, Lingyong Fang, Zhuosheng Zhang, Gongshen Liu
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2412.02454
Pdf URL: https://arxiv.org/pdf/2412.02454
Copy Paste: [[2412.02454]] Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining(https://arxiv.org/abs/2412.02454)
Keywords: security, defense, attack, generative, large language model
Abstract: Backdoor attacks remain significant security threats to generative large language models (LLMs). Since generative LLMs output sequences of high-dimensional token logits instead of low-dimensional classification logits, most existing backdoor defense methods designed for discriminative models like BERT are ineffective for generative LLMs. Inspired by the observed differences in learning behavior between backdoor and clean mapping in the frequency space, we transform the gradients of each training sample, which directly influence parameter updates, into the frequency space. Our findings reveal a distinct separation between the gradients of backdoor and clean samples in the frequency space. Based on this phenomenon, we propose Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul), which leverages sample-wise gradients in the frequency space to effectively identify backdoor samples without requiring LLMs to be retrained. Experimental results show that GraCeFul outperforms baselines significantly. Notably, GraCeFul exhibits remarkable computational efficiency, achieving nearly 100% recall and F1 scores in identifying backdoor samples, reducing the average success rate of various backdoor attacks to 0% with negligible drops in clean accuracy across multiple free-style question answering datasets. Additionally, GraCeFul generalizes to Llama-2 and Vicuna. The code is publicly available at this https URL.

Title: DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Authors: Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2412.02467
Pdf URL: https://arxiv.org/pdf/2412.02467
Copy Paste: [[2412.02467]] DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators(https://arxiv.org/abs/2412.02467)
Keywords: privacy, protect, large language model
Abstract: Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs) -- even those at the scale of GPT-2 -- have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at this https URL.

Title: OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations

Authors: Caixin Kang, Yubo Chen, Shouwei Ruan, Shiji Zhao, Ruochen Zhang, Jiayi Wang, Shan Fu, Xingxing Wei
Subjects: cs.CV, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02479
Pdf URL: https://arxiv.org/pdf/2412.02479
Copy Paste: [[2412.02479]] OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations(https://arxiv.org/abs/2412.02479)
Keywords: defense, robust
Abstract: With the rise of deep learning, facial recognition technology has seen extensive research and rapid development. Although facial recognition is considered a mature technology, we find that existing open-source models and commercial algorithms lack robustness in certain real-world Out-of-Distribution (OOD) scenarios, raising concerns about the reliability of these systems. In this paper, we introduce OODFace, which explores the OOD challenges faced by facial recognition models from two perspectives: common corruptions and appearance variations. We systematically design 30 OOD scenarios across 9 major categories tailored for facial recognition. By simulating these challenges on public datasets, we establish three robustness benchmarks: LFW-C/V, CFP-FP-C/V, and YTF-C/V. We then conduct extensive experiments on 19 different facial recognition models and 3 commercial APIs, along with extended experiments on face masks, Vision-Language Models (VLMs), and defense strategies to assess their robustness. Based on the results, we draw several key insights, highlighting the vulnerability of facial recognition systems to OOD data and suggesting possible solutions. Additionally, we offer a unified toolkit that includes all corruption and variation types, easily extendable to other datasets. We hope that our benchmarks and findings can provide guidance for future improvements in facial recognition model robustness.

Title: LLMForecaster: Improving Seasonal Event Forecasts with Unstructured Textual Data

Authors: Hanyu Zhang, Chuck Arvin, Dmitry Efimov, Michael W. Mahoney, Dominique Perrault-Joncas, Shankar Ramasubramanian, Andrew Gordon Wilson, Malcolm Wolff
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.02525
Pdf URL: https://arxiv.org/pdf/2412.02525
Copy Paste: [[2412.02525]] LLMForecaster: Improving Seasonal Event Forecasts with Unstructured Textual Data(https://arxiv.org/abs/2412.02525)
Keywords: large language model
Abstract: Modern time-series forecasting models often fail to make full use of rich unstructured information about the time series themselves. This lack of proper conditioning can lead to obvious model failures; for example, models may be unaware of the details of a particular product, and hence fail to anticipate seasonal surges in customer demand in the lead-up to major exogenous events like holidays for clearly relevant products. To address this shortcoming, this paper introduces a novel forecast post-processor -- which we call LLMForecaster -- that fine-tunes large language models (LLMs) to incorporate unstructured semantic and contextual information and historical data to improve the forecasts from an existing demand forecasting pipeline. In an industry-scale retail application, we demonstrate that our technique yields statistically significant forecast improvements across several sets of products subject to holiday-driven demand surges.

Title: Defending Against Diverse Attacks in Federated Learning Through Consensus-Based Bi-Level Optimization

Authors: Nicolás García Trillos, Aditya Kumar Akash, Sixu Li, Konstantin Riedl, Yuhua Zhu
Subjects: cs.LG, cs.CR, cs.MA, math.AP
Abstract URL: https://arxiv.org/abs/2412.02535
Pdf URL: https://arxiv.org/pdf/2412.02535
Copy Paste: [[2412.02535]] Defending Against Diverse Attacks in Federated Learning Through Consensus-Based Bi-Level Optimization(https://arxiv.org/abs/2412.02535)
Keywords: attack, robust, federate
Abstract: Adversarial attacks pose significant challenges in many machine learning applications, particularly in the setting of distributed training and federated learning, where malicious agents seek to corrupt the training process with the goal of compromising the performance and reliability of the final models. In this paper, we address the problem of robust federated learning in the presence of such attacks by formulating the training task as a bi-level optimization problem. We conduct a theoretical analysis of the resilience of consensus-based bi-level optimization (CB$^2$O), an interacting multi-particle metaheuristic optimization method, in adversarial settings. Specifically, we provide a global convergence analysis of CB$^2$O in mean-field law in the presence of malicious agents, demonstrating the robustness of CB$^2$O against a diverse range of attacks. We thereby offer insights into how specific hyperparameter choices help mitigate adversarial effects. On the practical side, we extend CB$^2$O to the clustered federated learning setting by proposing FedCB$^2$O, a novel interacting multi-particle system, and design a practical algorithm that addresses the demands of real-world applications. Extensive experiments demonstrate the robustness of the FedCB$^2$O algorithm against label-flipping attacks in decentralized clustered federated learning scenarios, showcasing its effectiveness in practical contexts.

Title: Automatic State Machine Inference for Binary Protocol Reverse Engineering

Authors: Junhai Yang, Fenghua Li, Yixuan Zhang, Junhao Zhang, Liang Fang, Yunchuan Guo
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.02540
Pdf URL: https://arxiv.org/pdf/2412.02540
Copy Paste: [[2412.02540]] Automatic State Machine Inference for Binary Protocol Reverse Engineering(https://arxiv.org/abs/2412.02540)
Keywords: attack
Abstract: Protocol Reverse Engineering (PRE) is used to analyze protocols by inferring their structure and behavior. However, current PRE methods mainly focus on field identification within a single protocol and neglect Protocol State Machine (PSM) analysis in mixed protocol environments. This results in insufficient analysis of protocols' abnormal behavior and potential vulnerabilities, which are crucial for detecting and defending against new attack patterns. To address these challenges, we propose an automatic PSM inference framework for unknown protocols, including a fuzzy membership-based auto-converging DBSCAN algorithm for protocol format clustering, followed by a session clustering algorithm based on Needleman-Wunsch and K-Medoids algorithms to classify sessions by protocol type. Finally, we refine a probabilistic PSM algorithm to infer protocol states and the transition conditions between these states. Experimental results show that, compared with existing PRE techniques, our method can infer PSMs while enabling more precise classification of protocols.
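
The session clustering step above relies on Needleman-Wunsch alignment. The following is a minimal sketch of the textbook Needleman-Wunsch global alignment score, not the authors' implementation. The scoring values (match=1, mismatch=-1, gap=-1) and the `session_distance` helper are illustrative assumptions; the abstract does not say how alignment scores are turned into distances for K-Medoids.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (Needleman-Wunsch)."""
    # At the start of iteration i, prev[j] is the best score for
    # aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def session_distance(a, b):
    """Hypothetical dissimilarity: 1 minus the score normalised by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - needleman_wunsch_score(a, b) / longest
```

Because the function only indexes and compares elements, it works on strings, byte strings, or lists of message-type labels alike.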

Title: Unveiling Concept Attribution in Diffusion Models

Authors: Quang H. Nguyen, Hoang Phan, Khoa D. Doan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02542
Pdf URL: https://arxiv.org/pdf/2412.02542
Copy Paste: [[2412.02542]] Unveiling Concept Attribution in Diffusion Models(https://arxiv.org/abs/2412.02542)
Keywords: interpretability, diffusion, generative
Abstract: Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains a black box; little do we know about the role of its components in exhibiting a concept such as objects or styles. Recent works employ causal tracing to localize layers storing knowledge in generative models without showing how those layers contribute to the target concept. In this work, we approach the model interpretability problem from a more general perspective and pose a question: "How do model components work jointly to demonstrate knowledge?". We adapt component attribution to decompose diffusion models, unveiling how a component contributes to a concept. Our framework allows effective model editing; in particular, we can erase a concept from diffusion models by removing positive components while retaining knowledge of other concepts. Surprisingly, we also show there exist components that contribute negatively to a concept, which has not been discovered in the knowledge localization approach. Experimental results confirm the role of positive and negative components pinpointed by our framework, depicting a complete view of interpreting generative models. Our code is available at this https URL.

Title: Fractional Order Distributed Optimization

Authors: Andrei Lixandru, Marcel van Gerven, Sergio Pequito
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02546
Pdf URL: https://arxiv.org/pdf/2412.02546
Copy Paste: [[2412.02546]] Fractional Order Distributed Optimization(https://arxiv.org/abs/2412.02546)
Keywords: federate
Abstract: Distributed optimization is fundamental to modern machine learning applications like federated learning, but existing methods often struggle with ill-conditioned problems and face stability-versus-speed tradeoffs. We introduce fractional order distributed optimization (FrODO), a theoretically grounded framework that incorporates fractional-order memory terms to enhance convergence properties in challenging optimization landscapes. Our approach achieves provable linear convergence for any strongly connected network. Through empirical validation, our results suggest that FrODO achieves up to 4 times faster convergence versus baselines on ill-conditioned problems and 2-3 times speedup in federated neural network training, while maintaining stability and theoretical guarantees.

Title: Patent-CR: A Dataset for Patent Claim Revision

Authors: Lekang Jiang, Pascal A Scherz, Stephan Goetz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.02549
Pdf URL: https://arxiv.org/pdf/2412.02549
Copy Paste: [[2412.02549]] Patent-CR: A Dataset for Patent Claim Revision(https://arxiv.org/abs/2412.02549)
Keywords: robust, large language model
Abstract: This paper presents Patent-CR, the first dataset created for the patent claim revision task in English. It includes both initial patent applications rejected by patent examiners and the final granted versions. Unlike normal text revision tasks that predominantly focus on enhancing sentence quality, such as grammar correction and coherence improvement, patent claim revision aims at ensuring the claims meet stringent legal criteria. These criteria go beyond novelty and inventiveness and include clarity of scope, technical accuracy, language precision, and legal robustness. We assess various large language models (LLMs) through professional human evaluation, including general LLMs with different sizes and architectures, text revision models, and domain-specific models. Our results indicate that LLMs often produce ineffective edits that deviate from the target revisions. In addition, domain-specific models and the method of fine-tuning show promising results. Notably, GPT-4 outperforms other tested LLMs, but further revisions are still necessary to reach the examination standard. Furthermore, we demonstrate the inconsistency between automated and human evaluation results, and find that GPT-4-based automated evaluation has the highest correlation with human judgment. This dataset, along with our preliminary empirical research, offers invaluable insights for further exploration in patent claim revision.

Title: Semantic Tokens in Retrieval Augmented Generation

Authors: Joel Suro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02563
Pdf URL: https://arxiv.org/pdf/2412.02563
Copy Paste: [[2412.02563]] Semantic Tokens in Retrieval Augmented Generation(https://arxiv.org/abs/2412.02563)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) architectures have recently garnered significant attention for their ability to improve truth grounding and coherence in natural language processing tasks. However, the reliability of RAG systems in producing accurate answers diminishes as the volume of data they access increases. Even with smaller datasets, these systems occasionally fail to address simple queries. This issue arises from their dependence on state-of-the-art large language models (LLMs), which can introduce uncertainty into the system's outputs. In this work, I propose a novel Comparative RAG system that introduces an evaluator module to bridge the gap between probabilistic RAG systems and deterministically verifiable responses. The evaluator compares external recommendations with the retrieved document chunks, adding a decision-making layer that enhances the system's reliability. This approach ensures that the chunks retrieved are both semantically relevant and logically consistent with deterministic insights, thereby improving the accuracy and overall efficiency of RAG systems. This framework paves the way for more reliable and scalable question-answering applications in domains requiring high precision and verifiability.

Title: SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection

Authors: Joongwon Chae, Zhenyu Wang, Peiwu Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02565
Pdf URL: https://arxiv.org/pdf/2412.02565
Copy Paste: [[2412.02565]] SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection(https://arxiv.org/abs/2412.02565)
Keywords: segmentation
Abstract: Despite advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, demonstrate inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU: Spatial Judgments in multimodal models - Towards Unified segmentation through coordinate detection, a novel framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework proposes a novel approach for integrating segmentation techniques with vision-language models based on multimodal spatial inference. By leveraging normalized coordinate detection for bounding boxes and translating it into actionable segmentation outputs, we explore the possibility of integrating multimodal spatial and language representations. Based on the proposed technical approach, the framework demonstrates superior performance on various benchmark datasets as well as accurate object segmentation. Results on the COCO 2017 dataset for general object detection and Pascal VOC datasets for semantic segmentation demonstrate the generalization capabilities of the framework.
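
The abstract above mentions turning normalized bounding-box coordinates into segmentation inputs. As a rough illustration, the sketch below shows one common way to map a normalized box to pixel coordinates. The `(x_min, y_min, x_max, y_max)` ordering, the [0, 1] range, and the function name `denormalize_box` are assumptions made for illustration; the abstract does not specify the coordinate format used by the framework.

```python
def denormalize_box(box, width, height):
    """Map a normalized (x_min, y_min, x_max, y_max) box to pixel coordinates.

    Assumes each coordinate is a fraction of the image size in [0, 1];
    out-of-range values are clamped before scaling.
    """
    def clamp(v):
        return min(1.0, max(0.0, v))

    x_min, y_min, x_max, y_max = (clamp(v) for v in box)
    if x_min > x_max or y_min > y_max:
        raise ValueError("box corners are not ordered (min, max)")
    return (
        round(x_min * width),
        round(y_min * height),
        round(x_max * width),
        round(y_max * height),
    )
```

A pixel-space box of this kind could then be passed as a prompt to a downstream segmentation model, though which model the framework uses is not stated in the abstract.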

Title: Copy-Move Forgery Detection and Question Answering for Remote Sensing Image

Authors: Ze Zhang, Enyuan Zhao, Ziyi Wan, Jie Nie, Xinyue Liang, Lei Huang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.02575
Pdf URL: https://arxiv.org/pdf/2412.02575
Copy Paste: [[2412.02575]] Copy-Move Forgery Detection and Question Answering for Remote Sensing Image(https://arxiv.org/abs/2412.02575)
Keywords: security, defense
Abstract: This paper introduces the task of Remote Sensing Copy-Move Question Answering (RSCMQA). Unlike traditional Remote Sensing Visual Question Answering (RSVQA), RSCMQA focuses on interpreting complex tampering scenarios and inferring relationships between objects. Based on the practical needs of national defense security and land resource monitoring, we have developed an accurate and comprehensive global dataset for remote sensing image copy-move question answering, named RS-CMQA-2.1M. These images were collected from 29 different regions across 14 countries. Additionally, we have refined a balanced dataset, RS-CMQA-B, to address the long-standing issue of long-tail data in the remote sensing field. Furthermore, we propose a region-discriminative guided multimodal CMQA model, which enhances the accuracy of answering questions about tampered images by leveraging prompts about the differences and connections between the source and tampered domains. Extensive experiments demonstrate that our method provides a stronger benchmark for RS-CMQA compared to general VQA and RSVQA models. Our dataset and code are available at this https URL.

Title: The Efficacy of Transfer-based No-box Attacks on Image Watermarking: A Pragmatic Analysis

Authors: Qilong Wu, Varun Chandrasekaran
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.02576
Pdf URL: https://arxiv.org/pdf/2412.02576
Copy Paste: [[2412.02576]] The Efficacy of Transfer-based No-box Attacks on Image Watermarking: A Pragmatic Analysis(https://arxiv.org/abs/2412.02576)
Keywords: attack, robust, watermark
Abstract: Watermarking approaches are widely used to identify if images being circulated are authentic or AI-generated. Determining the robustness of image watermarking methods in the "no-box" setting, where the attacker is assumed to have no knowledge about the watermarking model, is an interesting problem. Our main finding is that evading the no-box setting is challenging: the success of optimization-based transfer attacks (involving training surrogate models) proposed in prior work [hu2024transfer] depends on impractical assumptions, including (i) aligning the architecture and training configurations of both the victim and attacker's surrogate watermarking models, as well as (ii) a large number of surrogate models with potentially large computational requirements. Relaxing these assumptions, i.e., moving to a more pragmatic threat model, results in a failed attack, with an evasion rate of at most 21.1%. We show that when the configuration is mostly aligned, a simple non-optimization attack we propose, OFT, with one single surrogate model can already exceed the success of optimization-based efforts. Under the same $\ell_\infty$ norm perturbation budget of 0.25, prior work [hu2024transfer] is comparable to or worse than OFT in 11 out of 12 configurations and has a limited advantage on the remaining one. The code used for all our experiments is available at this https URL.
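
To make the perturbation budget mentioned above concrete: an ℓ∞ budget of 0.25 means no single pixel value may move by more than 0.25 from the original. The sketch below shows the standard projection onto that ℓ∞ ball for flat lists of pixel values. It is a generic illustration, not the OFT attack or the prior optimization-based attack, and the function name `project_linf` is an assumption.

```python
def project_linf(original, perturbed, eps=0.25):
    """Clip each perturbed value so it stays within eps of the original value."""
    if len(original) != len(perturbed):
        raise ValueError("inputs must have the same length")
    return [
        o + max(-eps, min(eps, p - o))  # clamp the per-pixel change to [-eps, eps]
        for o, p in zip(original, perturbed)
    ]


def linf_distance(a, b):
    """Largest absolute per-element difference between two equal-length lists."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)
```

This sketch does not additionally clip the result to a valid pixel range such as [0, 1]; an actual attack pipeline would typically do so as well.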

Title: Private Linear Regression with Differential Privacy and PAC Privacy

Authors: Hillary Yang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.02578
Pdf URL: https://arxiv.org/pdf/2412.02578
Copy Paste: [[2412.02578]] Private Linear Regression with Differential Privacy and PAC Privacy(https://arxiv.org/abs/2412.02578)
Keywords: privacy
Abstract: Linear regression is a fundamental tool for statistical analysis, which has motivated the development of linear regression methods that satisfy provable privacy guarantees so that the learned model reveals little about any one data point used to construct it. Most existing privacy-preserving linear regression methods rely on the well-established framework of differential privacy, while the newly proposed PAC Privacy has not yet been explored in this context. In this paper, we systematically compare linear regression models trained with differential privacy and PAC privacy across three real-world datasets, observing several key findings that impact the performance of privacy-preserving linear regression.

Title: OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Authors: Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02592
Pdf URL: https://arxiv.org/pdf/2412.02592
Copy Paste: [[2412.02592]] OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation(https://arxiv.org/abs/2412.02592)
Keywords: large language model
Abstract: Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various kinds of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: this https URL

Title: CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs

Authors: Abhas Kumar, Kapil Pathak, Rajesh Kavuru, Prabhakar Srinivasan
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02602
Pdf URL: https://arxiv.org/pdf/2412.02602
Copy Paste: [[2412.02602]] CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs(https://arxiv.org/abs/2412.02602)
Keywords: large language model
Abstract: This paper analyzes the performance of Small Language Models (SLMs) and Vision Language Models (VLMs) and evaluates the trade-off between model performance and carbon emissions across 4 essential tasks: Image Captioning, Visual Question Answering (VQA), Dialogue Summarization and Text-to-SQL conversion. Various SLMs and VLMs belonging to the Qwen and LLaMA architecture families are chosen, and variants based on model size in terms of the number of parameters, quantization level and fine-tuning parameters are evaluated. Each model variant's performance and carbon emissions are calculated. To quantify the trade-off between model performance and carbon emissions, we introduce a novel metric called CEGI (Carbon Efficient Gain Index). This metric represents the carbon emission per unit percentage gain per million trainable parameters. This metric provides a normalized measure to compare models' efficiency in terms of performance improvement relative to their environmental cost. The experiment's outcome demonstrates that fine-tuning SLMs and VLMs can achieve performance levels comparable to Large Language Models (LLMs) while producing significantly less carbon emissions. Our findings suggest that the marginal gains in accuracy from larger models do not justify the substantial increase in carbon emissions. Leveraging lower-bit quantization levels, the proposed metric further enhances energy efficiency without compromising performance. This study highlights the importance of balancing high performance and environmental sustainability. It offers a valuable metric for selecting models suitable for environmentally-friendly AI development.
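
Read literally, the description of CEGI above ("carbon emission per unit percentage gain per million trainable parameters") suggests a ratio along the following lines. This is a sketch of one plausible reading, not the paper's exact formula: the units (kg CO2-equivalent, percentage points of gain) and the function and argument names are assumptions.

```python
def cegi(emissions_kg, gain_pct, trainable_params):
    """One plausible reading of CEGI: emissions / percentage gain / millions of trainable parameters.

    emissions_kg      -- carbon emitted by fine-tuning (assumed unit: kg CO2e)
    gain_pct          -- performance gain over the baseline, in percentage points
    trainable_params  -- number of trainable parameters (raw count, not millions)
    """
    if gain_pct <= 0:
        raise ValueError("the index is undefined without a positive performance gain")
    if trainable_params <= 0:
        raise ValueError("trainable_params must be positive")
    params_millions = trainable_params / 1e6
    return emissions_kg / gain_pct / params_millions
```

Under this reading, a lower value means less carbon was spent per unit of improvement, so for example `cegi(2.0, 4.0, 5_000_000)` evaluates to 0.1.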

Title: Interpretable Company Similarity with Sparse Autoencoders

Authors: Marco Molinari, Vladimir Tregubiak, Victor Shao, Abhimanyu Pandey, Mateusz Mikolajczak, Sebastião Kuznetsov Ryder Torres Pereira
Subjects: cs.CL, cs.LG, econ.GN
Abstract URL: https://arxiv.org/abs/2412.02605
Pdf URL: https://arxiv.org/pdf/2412.02605
Copy Paste: [[2412.02605]] Interpretable Company Similarity with Sparse Autoencoders(https://arxiv.org/abs/2412.02605)
Keywords: interpretability, large language model
Abstract: Determining company similarity is a vital task in finance, underpinning hedging, risk management, portfolio diversification, and more. Practitioners often rely on sector and industry classifications to gauge similarity, such as SIC-codes and GICS-codes, the former being used by the U.S. Securities and Exchange Commission (SEC), and the latter widely used by the investment community. Clustering embeddings of company descriptions has been proposed as a potential technique for determining company similarity, but the lack of interpretability in token embeddings poses a significant barrier to adoption in high-stakes contexts. Sparse Autoencoders have shown promise in enhancing the interpretability of Large Language Models by decomposing LLM activations into interpretable features. In this paper, we explore the use of SAE features in measuring company similarity and benchmark them against (1) SIC codes and (2) Major Group codes. We conclude that SAE features can reproduce and even surpass sector classifications in quantifying fundamental characteristics of companies, evaluated by the correlation of monthly returns, a proxy for similarity, and PnL from cointegration.

Title: Wasserstein Markets for Differentially-Private Data

Authors: Saurab Chhachhi, Fei Teng
Subjects: cs.LG, cs.CE, cs.CR, cs.GT, econ.GN
Abstract URL: https://arxiv.org/abs/2412.02609
Pdf URL: https://arxiv.org/pdf/2412.02609
Copy Paste: [[2412.02609]] Wasserstein Markets for Differentially-Private Data(https://arxiv.org/abs/2412.02609)
Keywords: privacy
Abstract: Data is an increasingly vital component of decision making processes across industries. However, data access raises privacy concerns motivating the need for privacy-preserving techniques such as differential privacy. Data markets provide a means to enable wider access as well as determine the appropriate privacy-utility trade-off. Existing data market frameworks either require a trusted third party to perform computationally expensive valuations or are unable to capture the combinatorial nature of data value and do not endogenously model the effect of differential privacy. This paper addresses these shortcomings by proposing a valuation mechanism based on the Wasserstein distance for differentially-private data, and corresponding procurement mechanisms by leveraging incentive mechanism design theory, for task-agnostic data procurement, and task-specific procurement co-optimisation. The mechanisms are reformulated into tractable mixed-integer second-order cone programs, which are validated with numerical studies.

Title: AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Authors: Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue
Subjects: cs.CV, cs.AI, cs.CL, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.02611
Pdf URL: https://arxiv.org/pdf/2412.02611
Copy Paste: [[2412.02611]] AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?(https://arxiv.org/abs/2412.02611)
Keywords: large language model
Abstract: Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information. This benchmark encompasses 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we have structured the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize the observations. By revealing the limitations of current models, we aim to provide useful insight for future dataset collection and model development.

Title: Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

Authors: Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.02617
Pdf URL: https://arxiv.org/pdf/2412.02617
Copy Paste: [[2412.02617]] Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback(https://arxiv.org/abs/2412.02617)
Keywords: large language model
Abstract: Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.

Title: Time-Reversal Provides Unsupervised Feedback to LLMs

Authors: Yerram Varun, Rahul Madhavan, Sravanti Addepalli, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02626
Pdf URL: https://arxiv.org/pdf/2412.02626
Copy Paste: [[2412.02626]] Time-Reversal Provides Unsupervised Feedback to LLMs(https://arxiv.org/abs/2412.02626)
Keywords: attack, generative, large language model
Abstract: Large Language Models (LLMs) are typically trained to predict in the forward direction of time. However, recent works have shown that prompting these models to look back and critique their own generations can produce useful feedback. Motivated by this, we explore the question of whether LLMs can be empowered to think (predict and score) backwards to provide unsupervised feedback that complements forward LLMs. Towards this, we introduce Time Reversed Language Models (TRLMs), which can score and generate queries when conditioned on responses, effectively functioning in the reverse direction of time. Further, to effectively infer in the response to query direction, we pre-train and fine-tune a language model (TRLM-Ba) in the reverse token order from scratch. We show empirically (and theoretically in a stylized setting) that time-reversed models can indeed complement forward model predictions when used to score the query given response for re-ranking multiple forward generations. We obtain up to 5% improvement on the widely used AlpacaEval Leaderboard over the competent baseline of best-of-N re-ranking using self log-perplexity scores. We further show that TRLM scoring outperforms conventional forward scoring of response given query, resulting in significant gains in applications such as citation generation and passage retrieval. We next leverage the generative ability of TRLM to augment or provide unsupervised feedback to input safety filters of LLMs, demonstrating a drastic reduction in false negative rate with negligible impact on false positive rates against several attacks published on the popular JailbreakBench leaderboard.
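The re-ranking idea described above (score the query given each candidate response, then keep the best candidate) can be sketched as follows. This is an illustration only: `rerank_by_reverse_score` is a hypothetical helper, and `toy_reverse_score` is a word-overlap stand-in for a real reverse model's log-probability of the query given the response, not the TRLM scorer.

```python
from typing import Callable, Sequence


def rerank_by_reverse_score(
    query: str,
    responses: Sequence[str],
    reverse_score: Callable[[str, str], float],
) -> str:
    """Return the response under which the query scores highest (best-of-N)."""
    if not responses:
        raise ValueError("need at least one candidate response")
    return max(responses, key=lambda r: reverse_score(query, r))


def toy_reverse_score(query: str, response: str) -> float:
    """Stand-in scorer: number of query words that also appear in the response."""
    q_words = set(query.lower().split())
    r_words = set(response.lower().split())
    return float(len(q_words & r_words))


best = rerank_by_reverse_score(
    "capital of france",
    ["I like cheese", "Paris is the capital of France"],
    toy_reverse_score,
)
```

In practice the scorer would be replaced by a model that assigns a likelihood to the query conditioned on the response; the selection logic stays the same.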

Title: Continual Learning of Personalized Generative Face Models with Experience Replay

Authors: Annie N. Wang, Luchao Qi, Roni Sengupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02627
Pdf URL: https://arxiv.org/pdf/2412.02627
Copy Paste: [[2412.02627]] Continual Learning of Personalized Generative Face Models with Experience Replay(https://arxiv.org/abs/2412.02627)
Keywords: generative
Abstract: We introduce a novel continual learning problem: how to sequentially update the weights of a personalized 2D and 3D generative face model as new batches of photos in different appearances, styles, poses, and lighting are captured regularly. We observe that naive sequential fine-tuning of the model leads to catastrophic forgetting of past representations of the individual's face. We then demonstrate that a simple random sampling-based experience replay method is effective at mitigating catastrophic forgetting when a relatively large number of images can be stored and replayed. However, for long-term deployment of these models with relatively smaller storage, this simple random sampling-based replay technique also forgets past representations. Thus, we introduce a novel experience replay algorithm that combines random sampling with StyleGAN's latent space to represent the buffer as an optimal convex hull. We observe that our proposed convex hull-based experience replay is more effective in preventing forgetting than a random sampling baseline and the lower bound.

Title: Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

Authors: Yiftach Edelstein, Or Patashnik, Dana Cohen-Bar, Lihi Zelnik-Manor
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02631
Pdf URL: https://arxiv.org/pdf/2412.02631
Copy Paste: [[2412.02631]] Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation(https://arxiv.org/abs/2412.02631)
Keywords: diffusion, generative
Abstract: Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object, and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable method for high-quality 3D content creation. We demonstrate that Sharp-It enables various 3D applications, such as fast synthesis, editing, and controlled generation, while attaining high-quality assets.

Title: Liquefaction: Privately Liquefying Blockchain Assets

Authors: James Austgen, Andrés Fábrega, Mahimna Kelkar, Dani Vilardell, Sarah Allen, Kushal Babel, Jay Yu, Ari Juels
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.02634
Pdf URL: https://arxiv.org/pdf/2412.02634
Copy Paste: [[2412.02634]] Liquefaction: Privately Liquefying Blockchain Assets(https://arxiv.org/abs/2412.02634)
Keywords: security, privacy, attack
Abstract: Inherent in the world of cryptocurrency systems and their security models is the notion that private keys, and thus assets, are controlled by individuals or individual entities. We present Liquefaction, a wallet platform that demonstrates the dangerous fragility of this foundational assumption by systemically breaking it. Liquefaction uses trusted execution environments (TEEs) to encumber private keys, i.e., attach rich, multi-user policies to their use. In this way, it enables the cryptocurrency credentials and assets of a single end-user address to be freely rented, shared, or pooled. It accomplishes these things privately, with no direct on-chain traces. Liquefaction demonstrates the sweeping consequences of TEE-based key encumbrance for the cryptocurrency landscape. Liquefaction can undermine the security and economic models of many applications and resources, such as locked tokens, DAO voting, airdrops, loyalty points, soulbound tokens, and quadratic voting. It can do so with no on-chain and minimal off-chain visibility. Conversely, we also discuss beneficial applications of Liquefaction, such as privacy-preserving, cost-efficient DAOs and a countermeasure to dusting attacks. Importantly, we describe an existing TEE-based tool that applications can use as a countermeasure to Liquefaction. Our work prompts a wholesale rethinking of existing models and enforcement of key and asset ownership in the cryptocurrency ecosystem.

Title: Robust soybean seed yield estimation using high-throughput ground robot videos

Authors: Jiale Feng, Samuel W. Blair, Timilehin Ayanlade, Aditya Balu, Baskar Ganapathysubramanian, Arti Singh, Soumik Sarkar, Asheesh K Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02642
Pdf URL: https://arxiv.org/pdf/2412.02642
Copy Paste: [[2412.02642]] Robust soybean seed yield estimation using high-throughput ground robot videos(https://arxiv.org/abs/2412.02642)
Keywords: robust, extraction
Abstract: We present a novel method for soybean (Glycine max (L.) Merr.) yield estimation leveraging high throughput seed counting via computer vision and deep learning techniques. Traditional methods for collecting yield data are labor-intensive, costly, prone to equipment failures at critical data collection times, and require transportation of equipment across field sites. Computer vision, the field of teaching computers to interpret visual data, allows us to extract detailed yield information directly from images. By treating it as a computer vision task, we report a more efficient alternative, employing a ground robot equipped with fisheye cameras to capture comprehensive videos of soybean plots from which images are extracted in a variety of development programs. These images are processed through the P2PNet-Yield model, a deep learning framework where we combined a Feature Extraction Module (the backbone of the P2PNet-Soy) and a Yield Regression Module to estimate seed yields of soybean plots. Our results are built on three years of yield testing plot data - 8500 in 2021, 2275 in 2022, and 650 in 2023. With these datasets, our approach incorporates several innovations to further improve the accuracy and generalizability of the seed counting and yield estimation architecture, such as the fisheye image correction and data augmentation with random sensor effects. The P2PNet-Yield model achieved a genotype ranking accuracy score of up to 83%. It demonstrates up to a 32% reduction in time to collect yield data as well as costs associated with traditional yield estimation, offering a scalable solution for breeding programs and agricultural productivity enhancement.

Title: A Bidirectional Long Short Term Memory Approach for Infrastructure Health Monitoring Using On-board Vibration Response

Authors: R. R. Samani, A. Nunez, B. De Schutter
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02643
Pdf URL: https://arxiv.org/pdf/2412.02643
Copy Paste: [[2412.02643]] A Bidirectional Long Short Term Memory Approach for Infrastructure Health Monitoring Using On-board Vibration Response(https://arxiv.org/abs/2412.02643)
Keywords: extraction
Abstract: The growing volume of available infrastructural monitoring data enables the development of powerful data-driven approaches to estimate infrastructure health conditions using direct measurements. This paper proposes a deep learning methodology to estimate infrastructure physical parameters, such as railway track stiffness, using drive-by vibration response signals. The proposed method employs a Long Short-term Memory (LSTM) feature extractor accounting for temporal dependencies in the feature extraction phase, and a bidirectional Long Short-term Memory (BiLSTM) network to leverage bidirectional temporal dependencies in both the forward and backward paths of the drive-by vibration response in the condition estimation phase. Additionally, a framing approach is employed to enhance the resolution of the monitoring task to the beam level by segmenting the vibration signal into frames equal to the distance between individual beams, centering the frames over the beam nodes. The proposed LSTM-BiLSTM model offers a versatile tool for monitoring various bridge and railway infrastructure conditions using direct drive-by vibration response measurements. The results demonstrate the potential of incorporating temporal analysis in the feature extraction phase and emphasize the pivotal role of bidirectional temporal information in infrastructure health condition estimation. The proposed methodology can accurately and automatically estimate railway track stiffness and identify local stiffness reductions in the presence of noise using drive-by measurements. An illustrative case study of vehicle-track interaction simulation is used to demonstrate the performance of the proposed model, achieving a maximum mean absolute percentage error of 1.7% and 0.7% in estimating railpad and ballast stiffness, respectively.
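The framing step described above (frames as long as the beam spacing, centered over beam nodes) might look roughly like the sketch below. It assumes each sample comes with a known position along the track and uses a half-open window around each node; `frame_signal` and its parameters are hypothetical names, and the paper's actual preprocessing may differ.

```python
from typing import List, Sequence


def frame_signal(
    signal: Sequence[float],
    positions: Sequence[float],
    beam_nodes: Sequence[float],
    beam_spacing: float,
) -> List[List[float]]:
    """Split a drive-by signal into one frame per beam node.

    signal[i] is assumed to be measured at positions[i]. Each frame keeps the
    samples whose position lies in [node - spacing/2, node + spacing/2).
    """
    if len(signal) != len(positions):
        raise ValueError("signal and positions must have the same length")
    half = beam_spacing / 2
    frames = []
    for node in beam_nodes:
        frame = [s for s, x in zip(signal, positions) if node - half <= x < node + half]
        frames.append(frame)
    return frames


# Ten samples at positions 0..9, beams every 4 units with nodes at 2 and 6.
frames = frame_signal(list(range(10, 20)), list(range(10)), [2, 6], 4)
```

Each frame can then be fed to a sequence model separately, which is what gives a per-beam resolution in this reading of the abstract.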

Title: Interpretable Generalized Additive Models for Datasets with Missing Values

Authors: Hayden McTavish, Jon Donnelly, Margo Seltzer, Cynthia Rudin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.02646
Pdf URL: https://arxiv.org/pdf/2412.02646
Copy Paste: [[2412.02646]] Interpretable Generalized Additive Models for Datasets with Missing Values(https://arxiv.org/abs/2412.02646)
Keywords: interpretability
Abstract: Many important datasets contain samples that are missing one or more feature values. Maintaining the interpretability of machine learning models in the presence of such missing data is challenging. Singly or multiply imputing missing values complicates the model's mapping from features to labels. On the other hand, reasoning on indicator variables that represent missingness introduces a potentially large number of additional terms, sacrificing sparsity. We solve these problems with M-GAM, a sparse, generalized, additive modeling approach that incorporates missingness indicators and their interaction terms while maintaining sparsity through l0 regularization. We show that M-GAM provides similar or superior accuracy to prior methods while significantly improving sparsity relative to either imputation or naive inclusion of indicator variables.
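For readers unfamiliar with missingness indicators, the sketch below shows the naive encoding the abstract contrasts against: one extra binary column per feature that flags a missing value. It is not M-GAM itself (no interaction terms, no l0 regularization); the helper name `add_missingness_indicators`, the `_missing` suffix, and the choice to fill missing values with 0.0 are assumptions for illustration.

```python
from typing import Dict, List, Optional, Sequence


def add_missingness_indicators(
    rows: Sequence[Dict[str, Optional[float]]],
    feature_names: Sequence[str],
) -> List[Dict[str, float]]:
    """Add a 0/1 '<name>_missing' column for every feature.

    Missing values (None or absent keys) are filled with 0.0 so that the
    indicator term alone carries the effect of missingness.
    """
    encoded = []
    for row in rows:
        new_row: Dict[str, float] = {}
        for name in feature_names:
            value = row.get(name)
            is_missing = value is None
            new_row[name] = 0.0 if is_missing else float(value)
            new_row[f"{name}_missing"] = 1.0 if is_missing else 0.0
        encoded.append(new_row)
    return encoded


encoded = add_missingness_indicators(
    [{"age": 40.0, "bmi": None}, {"age": None, "bmi": 22.5}],
    ["age", "bmi"],
)
```

This doubles the number of columns, which illustrates the sparsity cost that the abstract says motivates a regularized approach.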

Title: Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

Authors: Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, Udaya Ghai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02674
Pdf URL: https://arxiv.org/pdf/2412.02674
Copy Paste: [[2412.02674]] Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models(https://arxiv.org/abs/2412.02674)
Keywords: large language model
Abstract: Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights data based on this verification, and distills the filtered data. Despite several empirical successes, a fundamental understanding is still lacking. In this work, we initiate a comprehensive, modular and controlled study on LLM self-improvement. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap. Through experiments with various model families and tasks, we discover a scaling phenomenon of self-improvement -- a variant of the generation-verification gap scales monotonically with the model pre-training flops. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance. Our findings not only advance understanding of LLM self-improvement with practical implications, but also open numerous avenues for future research into its capabilities and boundaries.

Title: AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction

Authors: Lingteng Qiu, Shenhao Zhu, Qi Zuo, Xiaodong Gu, Yuan Dong, Junfei Zhang, Chao Xu, Zhe Li, Weihao Yuan, Liefeng Bo, Guanying Chen, Zilong Dong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.02684
Pdf URL: https://arxiv.org/pdf/2412.02684
Copy Paste: [[2412.02684]] AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction(https://arxiv.org/abs/2412.02684)
Keywords: robust, transformer, generative
Abstract: Generating animatable human avatars from a single image is essential for various digital human modeling applications. Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and computational inefficiencies. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. We then propose a robust method for 3D reconstruction of inconsistent images, enabling real-time rendering during inference. Specifically, we adapt a transformer-based video generation model to generate multi-view canonical pose images and normal maps, pretraining on a large-scale video dataset to improve generalization. To handle view inconsistencies, we recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian Splatting. Experiments demonstrate that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images, showcasing its effectiveness and generalization capability.

Title: T-REG: Preference Optimization with Token-Level Reward Regularization

Authors: Wenxuan Zhou, Shujian Zhang, Lingxiao Zhao, Tao Meng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.02685
Pdf URL: https://arxiv.org/pdf/2412.02685
Copy Paste: [[2412.02685]] T-REG: Preference Optimization with Token-Level Reward Regularization(https://arxiv.org/abs/2412.02685)
Keywords: large language model
Abstract: Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models (LLMs) with human values. Traditionally, RLHF involves generating responses to a query and using a reward model to assign a reward to the entire response. However, this approach faces challenges due to its reliance on a single, sparse reward, which makes it challenging for the model to identify which parts of the sequence contribute most significantly to the final reward. Recent methods have attempted to address this limitation by introducing token-level rewards. However, these methods often rely on either a trained credit assignment model or AI annotators, raising concerns about the quality and reliability of the rewards. In this paper, we propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization. Harnessing the self-refinement capabilities of LLMs, our method uses contrastive prompting to enable LLMs to self-generate token-level rewards. These self-generated rewards then act as reward regularization, guiding the model to more effectively distribute sequence-level rewards across tokens. This facilitates better token-level credit assignment and enhances alignment performance. Experiments on the instruction following benchmarks, including Alpaca Eval 2 and Arena-Hard, show that our method consistently outperforms baseline methods by up to 3.8% and 4.4%, respectively. We will release the code and models at this https URL.

Title: SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance

Authors: Viet Nguyen, Anh Aengus Nguyen, Trung Dao, Khoi Nguyen, Cuong Pham, Toan Tran, Anh Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02687
Pdf URL: https://arxiv.org/pdf/2412.02687
Copy Paste: [[2412.02687]] SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance(https://arxiv.org/abs/2412.02687)
Keywords: robust, diffusion
Abstract: Recent approaches have yielded promising results in distilling multi-step text-to-image diffusion models into one-step ones. The state-of-the-art efficient distillation technique, i.e., SwiftBrushv2 (SBv2), even surpasses the teacher model's performance with limited resources. However, our study reveals its instability when handling different diffusion model backbones due to using a fixed guidance scale within the Variational Score Distillation (VSD) loss. Another weakness of the existing one-step diffusion models is the missing support for negative prompt guidance, which is crucial in practical image generation. This paper presents SNOOPI, a novel framework designed to address these limitations by enhancing the guidance in one-step diffusion models during both training and inference. First, we effectively enhance training stability through Proper Guidance-SwiftBrush (PG-SB), which employs a random-scale classifier-free guidance approach. By varying the guidance scale of both teacher models, we broaden their output distributions, resulting in a more robust VSD loss that enables SB to perform effectively across diverse backbones while maintaining competitive performance. Second, we propose a training-free method called Negative-Away Steer Attention (NASA), which integrates negative prompts into one-step diffusion models via cross-attention to suppress undesired elements in generated images. Our experimental results show that our proposed methods significantly improve baseline models across various metrics. Remarkably, we achieve an HPSv2 score of 31.08, setting a new state-of-the-art benchmark for one-step diffusion models.

Title: FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Authors: Kefan Chen, Chaerin Min, Linguang Zhang, Shreyas Hampali, Cem Keskin, Srinath Sridhar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02690
Pdf URL: https://arxiv.org/pdf/2412.02690
Copy Paste: [[2412.02690]] FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation(https://arxiv.org/abs/2412.02690)
Keywords: diffusion, segmentation
Abstract: Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single and dual hand images. To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations. Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Our model exhibits core capabilities that include the ability to repose hands, transfer hand appearance, and even synthesize novel views. This leads to zero-shot capabilities for fixing malformed hands in previously generated images, or synthesizing hand video sequences. We present extensive experiments and evaluations that demonstrate state-of-the-art performance of our method.

Title: Diffusion-based Visual Anagram as Multi-task Learning

Authors: Zhiyuan Xu, Yinhe Chen, Huan-ang Gao, Weiyan Zhao, Guiyu Zhang, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02693
Pdf URL: https://arxiv.org/pdf/2412.02693
Copy Paste: [[2412.02693]] Diffusion-based Visual Anagram as Multi-task Learning(https://arxiv.org/abs/2412.02693)
Keywords: diffusion
Abstract: Visual anagrams are images that change appearance upon transformation, like flipping or rotation. With the advent of diffusion models, generating such optical illusions can be achieved by averaging noise across multiple views during the reverse denoising process. However, we observe two critical failure modes in this approach: (i) concept segregation, where concepts in different views are independently generated, which cannot be considered a true anagram, and (ii) concept domination, where certain concepts overpower others. In this work, we cast the visual anagram generation problem in a multi-task learning setting, where different viewpoint prompts are analogous to different tasks, and derive denoising trajectories that align well across tasks simultaneously. At the core of our designed framework are two newly introduced techniques: (i) an anti-segregation optimization strategy that promotes overlap in cross-attention maps between different concepts, and (ii) a noise vector balancing method that adaptively adjusts the influence of different tasks. Additionally, we observe that directly averaging noise predictions yields suboptimal performance because statistical properties may not be preserved, prompting us to derive a noise variance rectification method. Extensive qualitative and quantitative experiments demonstrate our method's superior ability to generate visual anagrams spanning diverse concepts.
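One way to see why plain averaging may not preserve statistical properties is the simplified calculation below. It assumes the $N$ per-view noise predictions $\epsilon_1, \dots, \epsilon_N$ are independent with zero mean and unit variance per component, which is an idealization; the symbols and the rescaling factor are illustrative and are not taken from the paper, whose rectification may account for correlations between views.

```latex
\bar{\epsilon} = \frac{1}{N} \sum_{i=1}^{N} \epsilon_i
\quad\Longrightarrow\quad
\operatorname{Var}(\bar{\epsilon})
  = \frac{1}{N^{2}} \sum_{i=1}^{N} \operatorname{Var}(\epsilon_i)
  = \frac{1}{N},
\qquad
\operatorname{Var}\!\left(\sqrt{N}\,\bar{\epsilon}\right) = 1 .
```

Under these assumptions the averaged noise is under-dispersed by a factor of $N$, and rescaling by $\sqrt{N}$ restores unit variance. If the predictions are positively correlated, the variance of the average lies between $1/N$ and $1$, so the appropriate correction factor would be smaller.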

Title: Motion Prompting: Controlling Video Generation with Motion Trajectories

Authors: Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, Deqing Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.02700
Pdf URL: https://arxiv.org/pdf/2412.02700
Copy Paste: [[2412.02700]] Motion Prompting: Controlling Video Generation with Motion Trajectories(https://arxiv.org/abs/2412.02700)
Keywords: generative
Abstract: Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as motion prompts. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance. Video results are available on our webpage: this https URL